Database Optimization for Large-Scale Tables

This guide provides comprehensive strategies for optimizing PostgreSQL database performance in enterprise systems, specifically for tables with millions of records using soft-delete patterns.

Target Scenario:

  • Tables: resources, engagements, engagement_orders, audit_logs
  • Scale: 10+ million rows per table
  • Pattern: Soft delete using is_deleted or deleted_at column
  • Problem: 30% of data is flagged as deleted, causing full table scans

Slow Query Performance Due to Soft Deletes

-- Common query pattern (SLOW on large tables)
SELECT * FROM resources
WHERE reference_number = 'REF001234'
AND is_deleted = 0;
-- Execution plan shows:
Seq Scan on resources (cost=0.00..500000.00 rows=1 width=1024)
Filter: ((reference_number = 'REF001234'::text) AND (is_deleted = 0))
Rows Removed by Filter: 9999999

Standard B-tree indexes include all rows, including deleted ones:

-- Standard index (includes deleted records)
CREATE INDEX idx_resources_reference_number ON resources(reference_number);
-- Query still scans deleted rows
-- Index contains: REF001234 (is_deleted=0), REF001234 (is_deleted=1), ...
-- PostgreSQL must check EACH matching reference_number to verify is_deleted = 0

Impact:

  • Index size: 10M entries × entry size, with 30% of entries pointing at deleted rows
  • Query must scan ALL matching index entries, then filter each row by is_deleted
  • I/O overhead: reading millions of deleted records from disk
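The row-by-row filtering step is easy to see in miniature. A sketch using Python's sqlite3 (so it runs anywhere; SQLite B-tree indexes behave the same way for this purpose), with a two-row stand-in for the resources table:

```python
import sqlite3

# Two-row stand-in for the resources table: same reference number,
# one active row and one soft-deleted row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE resources (
        id INTEGER PRIMARY KEY,
        reference_number TEXT,
        is_deleted INTEGER
    )
""")
# Standard index: covers every row, deleted or not.
conn.execute(
    "CREATE INDEX idx_resources_reference_number ON resources(reference_number)"
)
conn.executemany(
    "INSERT INTO resources (reference_number, is_deleted) VALUES (?, ?)",
    [("REF001234", 0), ("REF001234", 1)],
)

# The index matches BOTH rows; is_deleted must be checked row by row afterwards.
all_matches = conn.execute(
    "SELECT COUNT(*) FROM resources WHERE reference_number = 'REF001234'"
).fetchone()[0]
active_only = conn.execute(
    "SELECT COUNT(*) FROM resources"
    " WHERE reference_number = 'REF001234' AND is_deleted = 0"
).fetchone()[0]
print(all_matches, active_only)  # 2 1
```

At 10M rows the same per-row check repeats for every matching index entry, which is where the I/O overhead comes from.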

With 30% deleted data:

Total rows: 10,000,000
Active rows (is_deleted=0): 7,000,000
Deleted rows (is_deleted=1): 3,000,000
-- PostgreSQL statistics may show:
-- "30% of data matches is_deleted=1"
-- "70% of data matches is_deleted=0"
-- For selective queries, planner may choose:
-- Seq Scan (if it thinks filtering is cheaper than index lookup)
-- Index grows over time
Initial size: 500 MB (7M active rows)
After 1 year: 750 MB (7M active + 3M deleted)
After 2 years: 1 GB (7M active + 6M deleted)
-- Index maintenance (VACUUM, REINDEX) takes longer
-- Query performance degrades proportionally
Metric                         | Before Optimization | Target After Optimization
-------------------------------|---------------------|--------------------------
Query Time (SELECT by ref_num) | 2-5 seconds         | < 100 ms
Index Size (resources)         | 1.2 GB              | 600 MB
VACUUM Time                    | 45 minutes          | 15 minutes
Disk I/O                       | 5000 IOPS           | 500 IOPS
Query Plan                     | Seq Scan            | Index Scan (Partial)

Proposed Solution: Partial Indexing Strategy


A Partial Index (also called a Filtered Index) only includes rows that match a specific condition.

-- Partial index: ONLY active resources
CREATE INDEX idx_resources_reference_number_active
ON resources(reference_number)
WHERE is_deleted = 0;
-- Index contains ONLY:
-- REF001234 (is_deleted=0)
-- REF005678 (is_deleted=0)
-- ...
-- EXCLUDES all is_deleted=1 rows
Standard Index: 10M rows × 50 bytes = 500 MB
Partial Index: 7M rows × 50 bytes = 350 MB (30% savings)
-- Query with partial index
EXPLAIN ANALYZE
SELECT * FROM resources
WHERE reference_number = 'REF001234' AND is_deleted = 0;
-- Execution plan:
Index Scan using idx_resources_reference_number_active on resources
(cost=0.42..8.44 rows=1 width=1024) (actual time=0.015..0.016 rows=1 loops=1)
Index Cond: (reference_number = 'REF001234'::text)
Planning Time: 0.082 ms
Execution Time: 0.035 ms -- 🎯 From 2000ms to 0.035ms

Maintenance benefits:

  • VACUUM only processes the 7M active index entries (not 10M)
  • REINDEX rebuilds a smaller index
  • Less cache pollution (only active data in memory)

PostgreSQL will automatically use a partial index if:

  1. The query’s WHERE clause implies the index predicate (an exact match is the simplest case)
-- Index: WHERE is_deleted = 0
-- ✅ WILL USE partial index
SELECT * FROM resources WHERE reference_number = 'REF001' AND is_deleted = 0;
-- ❌ WILL NOT USE partial index (different predicate)
SELECT * FROM resources WHERE reference_number = 'REF001' AND is_deleted = 1;
SELECT * FROM resources WHERE reference_number = 'REF001'; -- Missing is_deleted filter
  2. The query planner determines it’s more efficient than a seq scan
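Both rules can be observed directly. SQLite also supports partial indexes with the same "the query's WHERE must imply the index predicate" rule, so a minimal sketch with Python's sqlite3 (mini-schema mirroring the examples above) shows the planner picking or skipping the index:

```python
import sqlite3

# Mini-schema mirroring the guide's examples; SQLite applies the same
# implication rule for partial index eligibility.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources ("
    " id INTEGER PRIMARY KEY, reference_number TEXT, is_deleted INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_resources_reference_number_active"
    " ON resources(reference_number) WHERE is_deleted = 0"
)

def plan(sql: str) -> str:
    # Join the 'detail' column of the EXPLAIN QUERY PLAN rows.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Predicate implied by the query: partial index is eligible.
uses = plan(
    "SELECT * FROM resources"
    " WHERE reference_number = 'REF001' AND is_deleted = 0"
)
# Predicate not implied: planner falls back to a full table scan.
skips = plan("SELECT * FROM resources WHERE reference_number = 'REF001'")

print("idx_resources_reference_number_active" in uses)   # True
print("idx_resources_reference_number_active" in skips)  # False
```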

Phase 1: Create Partial Indexes (Active Records Only)


Step 1: Audit Existing Indexes

-- Check existing indexes
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS index_scans,
idx_tup_read AS tuples_read,
idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
WHERE tablename IN ('resources', 'engagements', 'engagement_orders', 'audit_logs')
ORDER BY pg_relation_size(indexrelid) DESC;

Expected Output:

tablename   | indexname                      | index_size | index_scans
------------+--------------------------------+------------+------------
resources   | idx_resources_reference_number | 1.2 GB     | 5,000,000
resources   | idx_resources_national_id      | 800 MB     | 2,000,000
engagements | idx_engagements_resource_id    | 2.5 GB     | 8,000,000

Step 2: Create Partial Indexes for resources Table

-- ==============================================================
-- RESOURCES TABLE: Partial Indexes for Active Records
-- ==============================================================
-- 1. Reference Number - Most frequently queried field
CREATE INDEX CONCURRENTLY idx_resources_reference_number_active
ON resources(reference_number)
WHERE is_deleted = 0;
-- 2. National ID - Unique identifier
CREATE INDEX CONCURRENTLY idx_resources_national_id_active
ON resources(national_id)
WHERE is_deleted = 0 AND national_id IS NOT NULL;
-- 3. Full Name - Search by name
CREATE INDEX CONCURRENTLY idx_resources_name_active
ON resources(first_name, last_name)
WHERE is_deleted = 0;
-- 4. Passport ID - For international records
CREATE INDEX CONCURRENTLY idx_resources_passport_id_active
ON resources(passport_id)
WHERE is_deleted = 0 AND passport_id IS NOT NULL;
-- 5. Created Date - For recent records filtering
CREATE INDEX CONCURRENTLY idx_resources_created_at_active
ON resources(created_at DESC)
WHERE is_deleted = 0;
-- 6. Composite: National ID + Birth Date (for verification)
CREATE INDEX CONCURRENTLY idx_resources_national_id_birth_date_active
ON resources(national_id, birth_date)
WHERE is_deleted = 0 AND national_id IS NOT NULL;

Why CONCURRENTLY?

  • Allows the index to be built without taking locks that block reads and writes
  • Production traffic can continue while the index is being created
  • Takes longer and cannot run inside a transaction block, but is safer for live systems
  • If the build fails or is cancelled, it leaves an INVALID index behind; drop it and retry

Verification:

-- Check index creation progress
SELECT
phase,
round(100.0 * blocks_done / nullif(blocks_total, 0), 2) AS "% complete",
blocks_done,
blocks_total
FROM pg_stat_progress_create_index;
-- After completion, verify index is being used
EXPLAIN ANALYZE
SELECT * FROM resources
WHERE reference_number = 'REF001234' AND is_deleted = 0;
-- Should show: "Index Scan using idx_resources_reference_number_active"

Step 3: Create Partial Indexes for engagements Table

-- ==============================================================
-- ENGAGEMENTS TABLE: Partial Indexes for Active Records
-- ==============================================================
-- 1. Resource ID - Foreign key lookup
CREATE INDEX CONCURRENTLY idx_engagements_resource_id_active
ON engagements(resource_id)
WHERE is_deleted = 0;
-- 2. Engagement Date - Filter by date range
CREATE INDEX CONCURRENTLY idx_engagements_date_active
ON engagements(engagement_date DESC)
WHERE is_deleted = 0;
-- 3. Engagement Number - Unique identifier
CREATE INDEX CONCURRENTLY idx_engagements_number_active
ON engagements(engagement_number)
WHERE is_deleted = 0;
-- 4. Composite: Resource + Engagement Date (common query pattern)
CREATE INDEX CONCURRENTLY idx_engagements_resource_date_active
ON engagements(resource_id, engagement_date DESC)
WHERE is_deleted = 0;
-- 5. Department + Engagement Date (for departmental reports)
CREATE INDEX CONCURRENTLY idx_engagements_dept_date_active
ON engagements(department_id, engagement_date DESC)
WHERE is_deleted = 0;
-- 6. Status + Engagement Date (for workflow queries)
CREATE INDEX CONCURRENTLY idx_engagements_status_date_active
ON engagements(status, engagement_date DESC)
WHERE is_deleted = 0 AND status IN ('PENDING', 'IN_PROGRESS', 'COMPLETED');

Step 4: Create Partial Indexes for engagement_orders Table

-- ==============================================================
-- ENGAGEMENT_ORDERS TABLE: Partial Indexes for Active Records
-- ==============================================================
-- 1. Engagement ID - Foreign key lookup
CREATE INDEX CONCURRENTLY idx_engagement_orders_engagement_id_active
ON engagement_orders(engagement_id)
WHERE is_deleted = 0;
-- 2. Order Number - Unique identifier
CREATE INDEX CONCURRENTLY idx_engagement_orders_order_number_active
ON engagement_orders(order_number)
WHERE is_deleted = 0;
-- 3. Order Type + Status (common filter)
CREATE INDEX CONCURRENTLY idx_engagement_orders_type_status_active
ON engagement_orders(order_type, status)
WHERE is_deleted = 0;
-- 4. Created Date - For recent orders
CREATE INDEX CONCURRENTLY idx_engagement_orders_created_at_active
ON engagement_orders(created_at DESC)
WHERE is_deleted = 0;
-- 5. Composite: Engagement + Order Type (common join pattern)
CREATE INDEX CONCURRENTLY idx_engagement_orders_engagement_type_active
ON engagement_orders(engagement_id, order_type)
WHERE is_deleted = 0;

Step 5: Create Partial Indexes for audit_logs Table


Note: Audit logs typically don’t use soft deletes, but may need time-based partitioning.

-- ==============================================================
-- AUDIT_LOGS TABLE: Time-Based Partial Indexes
-- ==============================================================
-- NOTE: PostgreSQL requires functions in an index predicate to be IMMUTABLE,
-- so NOW() cannot appear in the WHERE clause. Use a fixed cutoff date and
-- recreate the index periodically (e.g. monthly) to slide the window forward.
-- 1. Recent logs only (cutoff date is an example; adjust it on each rebuild)
CREATE INDEX CONCURRENTLY idx_audit_logs_recent
ON audit_logs(created_at DESC, action, user_id)
WHERE created_at >= DATE '2024-01-01';
-- 2. User activity (recent)
CREATE INDEX CONCURRENTLY idx_audit_logs_user_recent
ON audit_logs(user_id, created_at DESC)
WHERE created_at >= DATE '2024-01-01';
-- 3. Entity changes (recent)
CREATE INDEX CONCURRENTLY idx_audit_logs_entity_recent
ON audit_logs(entity_type, entity_id, created_at DESC)
WHERE created_at >= DATE '2024-01-01';
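Because PostgreSQL index predicates must use IMMUTABLE expressions, a time window like "last 90 days" is in practice a fixed cutoff that a scheduled job slides forward by rebuilding the index. A sketch of that rebuild cycle, using Python's sqlite3 (SQLite also supports partial indexes; names and dates are illustrative):

```python
import sqlite3

# Hypothetical rebuild cycle for a time-windowed partial index: the cutoff is
# a literal in the predicate, so sliding the window means recreating the index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_logs ("
    " id INTEGER PRIMARY KEY, created_at TEXT, action TEXT, user_id INTEGER)"
)

def rebuild_recent_index(cutoff: str) -> None:
    # In PostgreSQL this would be DROP INDEX CONCURRENTLY followed by
    # CREATE INDEX CONCURRENTLY with the new cutoff date.
    conn.execute("DROP INDEX IF EXISTS idx_audit_logs_recent")
    conn.execute(
        "CREATE INDEX idx_audit_logs_recent"
        " ON audit_logs(created_at, action, user_id)"
        f" WHERE created_at >= '{cutoff}'"
    )

rebuild_recent_index("2024-01-01")   # initial build (example date)
rebuild_recent_index("2024-04-01")   # a later scheduled run slides the window

# The surviving index definition carries only the newest cutoff.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'idx_audit_logs_recent'"
).fetchone()[0]
print("2024-04-01" in ddl, "2024-01-01" in ddl)  # True False
```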

Step 6: Drop Redundant Standard Indexes

After verifying the partial indexes are being used:

-- Drop old standard indexes (DO THIS CAREFULLY!)
DROP INDEX IF EXISTS idx_resources_reference_number;
DROP INDEX IF EXISTS idx_resources_national_id;
DROP INDEX IF EXISTS idx_engagements_resource_id;
DROP INDEX IF EXISTS idx_engagement_orders_engagement_id;
-- Monitor query performance for 24-48 hours before proceeding

Phase 2: Automated Archival & Cleanup Strategy


Over time, soft-deleted records accumulate:

Year 1: 3M deleted rows (30%)
Year 2: 6M deleted rows (46%)
Year 3: 9M deleted rows (56%)
Year 4: 12M deleted rows (63%)
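The percentages above follow directly from a fixed active population with a steady stream of soft deletes; a quick check in Python:

```python
# Reproducing the growth figures above: the active population stays at 7M
# while roughly 3M soft-deleted rows accumulate each year.
ACTIVE = 7_000_000
DELETED_PER_YEAR = 3_000_000

for year in range(1, 5):
    deleted = DELETED_PER_YEAR * year
    pct = 100 * deleted / (ACTIVE + deleted)
    print(f"Year {year}: {deleted // 1_000_000}M deleted rows ({pct:.0f}%)")
```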

Solution: Archive old deleted records to separate tables or external storage.

Step 1: Create Archive Tables

-- Archive table for old resources (read-only storage)
-- Note: INCLUDING ALL also copies the source table's defaults, constraints, and indexes
CREATE TABLE resources_archive (
LIKE resources INCLUDING ALL
);
-- Add metadata
ALTER TABLE resources_archive
ADD COLUMN archived_at TIMESTAMPTZ DEFAULT NOW();
-- Create additional indexes for archive queries
CREATE INDEX idx_resources_archive_ref ON resources_archive(reference_number);
CREATE INDEX idx_resources_archive_archived_at ON resources_archive(archived_at);
-- Repeat for other tables
CREATE TABLE engagements_archive (LIKE engagements INCLUDING ALL);
CREATE TABLE engagement_orders_archive (LIKE engagement_orders INCLUDING ALL);

Step 2: Monthly Archival Job (PostgreSQL Function)

-- ==============================================================
-- FUNCTION: Archive Old Deleted Records (5+ years old)
-- ==============================================================
CREATE OR REPLACE FUNCTION archive_old_deleted_records()
RETURNS TABLE(
    table_name TEXT,
    rows_archived BIGINT
)
LANGUAGE plpgsql
AS $$
DECLARE
    cutoff_date TIMESTAMPTZ := NOW() - INTERVAL '5 years';
    batch_size INT := 10000;
    batch_count BIGINT;
    resources_total BIGINT := 0;
    engagements_total BIGINT := 0;
    orders_total BIGINT := 0;
BEGIN
    -- Archive RESOURCES
    RAISE NOTICE 'Archiving resources deleted before %', cutoff_date;
    -- Move rows in batches to avoid one long-running transaction
    LOOP
        WITH archived AS (
            INSERT INTO resources_archive
            SELECT *, NOW() AS archived_at
            FROM resources
            WHERE is_deleted = 1
              AND deleted_at < cutoff_date
            LIMIT batch_size
            RETURNING id
        )
        DELETE FROM resources
        WHERE id IN (SELECT id FROM archived);
        GET DIAGNOSTICS batch_count = ROW_COUNT;
        EXIT WHEN batch_count = 0;
        resources_total := resources_total + batch_count;
        RAISE NOTICE 'Archived % resources so far', resources_total;
        PERFORM pg_sleep(1); -- Throttle to reduce load
    END LOOP;
    -- Archive ENGAGEMENTS
    RAISE NOTICE 'Archiving engagements deleted before %', cutoff_date;
    LOOP
        WITH archived AS (
            INSERT INTO engagements_archive
            SELECT *, NOW() AS archived_at
            FROM engagements
            WHERE is_deleted = 1
              AND deleted_at < cutoff_date
            LIMIT batch_size
            RETURNING id
        )
        DELETE FROM engagements
        WHERE id IN (SELECT id FROM archived);
        GET DIAGNOSTICS batch_count = ROW_COUNT;
        EXIT WHEN batch_count = 0;
        engagements_total := engagements_total + batch_count;
        PERFORM pg_sleep(1);
    END LOOP;
    -- Archive ENGAGEMENT_ORDERS
    RAISE NOTICE 'Archiving engagement orders deleted before %', cutoff_date;
    LOOP
        WITH archived AS (
            INSERT INTO engagement_orders_archive
            SELECT *, NOW() AS archived_at
            FROM engagement_orders
            WHERE is_deleted = 1
              AND deleted_at < cutoff_date
            LIMIT batch_size
            RETURNING id
        )
        DELETE FROM engagement_orders
        WHERE id IN (SELECT id FROM archived);
        GET DIAGNOSTICS batch_count = ROW_COUNT;
        EXIT WHEN batch_count = 0;
        orders_total := orders_total + batch_count;
        PERFORM pg_sleep(1);
    END LOOP;
    -- Return per-table totals
    RETURN QUERY VALUES
        ('resources'::TEXT, resources_total),
        ('engagements'::TEXT, engagements_total),
        ('engagement_orders'::TEXT, orders_total);
END;
$$;
-- VACUUM cannot run inside a function (it must run outside a transaction block),
-- so reclaim space in a separate step after the function completes:
VACUUM ANALYZE resources;
VACUUM ANALYZE engagements;
VACUUM ANALYZE engagement_orders;
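The batch-and-delete loop is the heart of the function. A miniature, runnable version of the same pattern using Python's sqlite3 (tiny batch size and a ten-row stand-in table, purely illustrative):

```python
import sqlite3

# Miniature batch archival: move soft-deleted rows from a live table to an
# archive table in small batches, committing after each batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, is_deleted INTEGER)")
conn.execute("CREATE TABLE resources_archive (id INTEGER PRIMARY KEY, is_deleted INTEGER)")
conn.executemany(
    "INSERT INTO resources (id, is_deleted) VALUES (?, ?)",
    [(i, 1 if i % 2 else 0) for i in range(1, 11)],  # 5 deleted, 5 active
)

BATCH_SIZE = 2
total_archived = 0
while True:
    # Each iteration is one short transaction, like the batched SQL loop.
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM resources WHERE is_deleted = 1 LIMIT ?", (BATCH_SIZE,)
    )]
    if not ids:
        break
    marks = ",".join("?" * len(ids))
    conn.execute(
        f"INSERT INTO resources_archive SELECT * FROM resources WHERE id IN ({marks})",
        ids,
    )
    conn.execute(f"DELETE FROM resources WHERE id IN ({marks})", ids)
    conn.commit()
    total_archived += len(ids)

print(total_archived)  # 5
print(conn.execute("SELECT COUNT(*) FROM resources").fetchone()[0])  # 5
```

Short transactions keep lock durations and WAL bursts small; the pg_sleep(1) throttle in the SQL version plays the same role as pacing the loop here.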
-- Install pg_cron extension (if not already installed)
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Schedule archival job: 1st day of each month at 2 AM
SELECT cron.schedule(
'archive-old-deleted-records',
'0 2 1 * *',
'SELECT archive_old_deleted_records();'
);
-- Verify scheduled jobs
SELECT * FROM cron.job;

Alternative: Shell Script (if pg_cron not available)

/scripts/monthly_archive.sh
#!/bin/bash
set -euo pipefail
# Run archival function
psql -U app_user -d app_core_db -c "SELECT archive_old_deleted_records();"
# Log results
echo "Archival completed at $(date)" >> /var/log/app/archival.log

Crontab entry:

# Run monthly archival on 1st day at 2 AM
0 2 1 * * /scripts/monthly_archive.sh

Step 3: Job Monitoring & Logging

-- Create monitoring table
CREATE TABLE archival_job_logs (
id SERIAL PRIMARY KEY,
job_name TEXT NOT NULL,
started_at TIMESTAMPTZ DEFAULT NOW(),
completed_at TIMESTAMPTZ,
rows_processed BIGINT,
status TEXT,
error_message TEXT
);
-- Modified archival function with logging
CREATE OR REPLACE FUNCTION archive_old_deleted_records_with_logging()
RETURNS VOID
LANGUAGE plpgsql
AS $$
DECLARE
job_id INT;
rows_count BIGINT;
BEGIN
-- Log start
INSERT INTO archival_job_logs (job_name, status)
VALUES ('monthly_archive', 'RUNNING')
RETURNING id INTO job_id;
-- Run archival
SELECT SUM(rows_archived) INTO rows_count
FROM archive_old_deleted_records();
-- Log completion
UPDATE archival_job_logs
SET completed_at = NOW(),
rows_processed = rows_count,
status = 'COMPLETED'
WHERE id = job_id;
EXCEPTION WHEN OTHERS THEN
-- Log error
UPDATE archival_job_logs
SET completed_at = NOW(),
status = 'FAILED',
error_message = SQLERRM
WHERE id = job_id;
RAISE;
END;
$$;
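The RUNNING/COMPLETED/FAILED bookkeeping generalizes to any job runner. A compact sketch of the same wrapper in Python with sqlite3 (table trimmed to the columns the logic touches):

```python
import sqlite3

# Sketch of the logging wrapper: record RUNNING, then COMPLETED or FAILED
# (with the error message), around an arbitrary job function.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE archival_job_logs (
        id INTEGER PRIMARY KEY,
        job_name TEXT NOT NULL,
        status TEXT,
        error_message TEXT
    )
""")

def run_logged(job_name, job):
    cur = conn.execute(
        "INSERT INTO archival_job_logs (job_name, status) VALUES (?, 'RUNNING')",
        (job_name,),
    )
    job_id = cur.lastrowid
    try:
        job()
        conn.execute(
            "UPDATE archival_job_logs SET status = 'COMPLETED' WHERE id = ?",
            (job_id,),
        )
    except Exception as exc:
        conn.execute(
            "UPDATE archival_job_logs SET status = 'FAILED', error_message = ?"
            " WHERE id = ?",
            (str(exc), job_id),
        )
        raise  # re-raise, as the plpgsql version does

run_logged("monthly_archive", lambda: None)       # succeeds
try:
    run_logged("monthly_archive", lambda: 1 / 0)  # fails: logged, then re-raised
except ZeroDivisionError:
    pass

statuses = [row[0] for row in
            conn.execute("SELECT status FROM archival_job_logs ORDER BY id")]
print(statuses)  # ['COMPLETED', 'FAILED']
```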

Performance Verification

-- Run EXPLAIN ANALYZE before and after to compare plans
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT id, reference_number, first_name, last_name
FROM resources
WHERE reference_number = 'REF001234'
AND is_deleted = 0;

Before (Standard Index):

Seq Scan on resources
Filter: (reference_number = 'REF001234' AND is_deleted = 0)
Rows Removed by Filter: 9,999,999
Execution Time: 2,345.67 ms

After (Partial Index):

Index Scan using idx_resources_reference_number_active on resources
Index Cond: (reference_number = 'REF001234'::text)
Execution Time: 0.035 ms

-- Check which indexes are being used (run after 24h in production)
SELECT
indexrelname AS index_name,
relname AS table_name,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname IN ('resources', 'engagements', 'engagement_orders')
ORDER BY idx_scan DESC;

Best Practices

Practice                   | Details
---------------------------|-------------------------------------------------------------
Always use CONCURRENTLY    | Build indexes without table locks on live systems
Monitor before dropping    | Verify new partial indexes are used before dropping old ones
Batch archival             | Archive in batches of 10,000 with pg_sleep(1) between batches
Schedule in off-peak hours | Run archival jobs at 2 AM when load is lowest
Keep archive tables        | Never permanently delete — archive to separate tables
VACUUM after archival      | Run VACUUM ANALYZE after large deletes to reclaim space