
BullMQ in Production: Concerns, Trade-Offs, and Mitigations

Most “BullMQ tutorial” articles end with a happy path: install, add a job, process it, done. But when your job queue is processing critical business operations for real users in production, the stakes are fundamentally different.

A duplicate job in a low-stakes system means someone gets two confirmation emails. Annoying, but harmless. A duplicate job in a high-stakes system means a critical operation executes twice — corrupting records, double-charging accounts, or creating phantom entries in your database. That’s not a bug report — that’s a production incident.

This article is a brutally honest assessment of every concern, every trade-off, and every mitigation for running BullMQ (backed by Redis) in a NestJS microservices-based enterprise system.


Before we dive into concerns, let’s acknowledge why BullMQ earned its place in the stack:

• Persistence: jobs survive worker crashes. They sit in Redis, waiting to be picked up — not lost in a dead process's memory.
• Automatic retries: failed jobs retry with configurable delays. No need to hand-roll retry logic for every worker.
• Priorities: emergency operations can jump ahead of batch reports. Lower number = higher priority.
• Flows: chain dependent jobs together (update inventory → record billing → notify user); the parent only runs when all children complete.
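The retry behavior above is worth making concrete. This is a minimal sketch of the delay math behind BullMQ's built-in `exponential` backoff (delay × 2^(attemptsMade − 1)); the interface and function names are ours, for illustration only:

```typescript
// Sketch of the retry/priority options described above. The backoff math mirrors
// BullMQ's built-in 'exponential' strategy; JobOpts and retryDelayMs are our own names.
interface JobOpts {
  priority: number; // lower number = higher priority
  attempts: number; // total tries before the job lands in "failed"
  backoff: { type: 'exponential'; delay: number }; // base delay in ms
}

// Delay before retry N (1-based), as computed by the exponential strategy.
function retryDelayMs(opts: JobOpts, attemptsMade: number): number {
  return opts.backoff.delay * 2 ** (attemptsMade - 1);
}

const emergencyOp: JobOpts = {
  priority: 1, // jumps ahead of batch reports
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
};

console.log(retryDelayMs(emergencyOp, 1)); // 1000 — first retry after 1s
console.log(retryDelayMs(emergencyOp, 3)); // 4000 — then 2s, then 4s
```

With `attempts: 3` the job is tried at most three times in total; after the third failure it moves to the failed set rather than retrying forever.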

But here’s the question nobody asks early enough: is a Redis-backed job queue the right foundation for a system where data loss or duplication causes direct business harm?

The answer is: yes, but only with rigorous safeguards. Let’s walk through each one.


Concern 1 — Data Loss (Redis Is an In-Memory Database)


The core tension: Redis is fast because it’s in-memory. But in-memory means volatile. How do you get durability without sacrificing the speed that made you choose Redis in the first place?

appendfsync everysec: Redis writes every operation to an append-only log file, flushing to disk once per second.

appendfsync always: Redis flushes to disk on every single write operation.

| Aspect | everysec | always |
| --- | --- | --- |
| Max data loss | Up to 1 second of writes | Zero (in theory) |
| Throughput impact | ~5-10% overhead | ~50-80% overhead |
| Good for | 99% of production workloads | Financial transactions, audit logs |
| Risk | Losing 1 second of jobs during power failure | Redis becomes so slow that queues back up |

Our recommendation: appendfsync everysec combined with RDB snapshots. Here’s why — if your Redis is processing 1,000 jobs/sec and you lose 1 second, you lose at most 1,000 jobs. But those jobs were also recorded in your source database (the operation request exists in PostgreSQL before it enters the queue). You can replay them. The cost of always — halving your throughput — creates a different kind of risk: queue backlog, which leads to delayed processing. Trading one risk for another is not an improvement.

# redis.conf — recommended production configuration
appendonly yes
appendfsync everysec

# RDB snapshots as a secondary safety net
# (comments must be on their own line — redis.conf rejects trailing comments)
# Snapshot every 15 min if >= 1 change
save 900 1
# Snapshot every 5 min if >= 10 changes
save 300 10
# Snapshot every 1 min if >= 10,000 changes
save 60 10000
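The replay path mentioned in the recommendation above can be sketched in a few lines. This is an illustrative in-memory version: `OperationRequest` and `jobIdFor` are our own names, standing in for the real PostgreSQL row and ID scheme, and the deterministic job ID (the same idea as Concern 2 below) keeps the replay itself idempotent:

```typescript
// Illustrative replay sketch — types and names are ours, not from a real schema.
interface OperationRequest {
  orderId: string;
  enqueuedAt: number; // epoch ms, from the source-of-truth row
}

// Deterministic job ID: re-adding an operation that is still queued is
// rejected by BullMQ instead of duplicated.
const jobIdFor = (op: OperationRequest): string => `process:${op.orderId}`;

// Compare the source of truth against what survived in Redis.
function jobsToReplay(
  requests: OperationRequest[],
  idsStillInRedis: Set<string>,
): string[] {
  return requests.map(jobIdFor).filter(id => !idsStillInRedis.has(id));
}

const recent: OperationRequest[] = [
  { orderId: 'A1', enqueuedAt: 1_700_000_000_000 },
  { orderId: 'A2', enqueuedAt: 1_700_000_000_500 },
];
// A1 survived the crash; A2 was lost in the final second → replay it
console.log(jobsToReplay(recent, new Set(['process:A1']))); // ['process:A2']
```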

Concern 2 — Duplicate Job Execution (Idempotency)


The core tension: BullMQ guarantees at-least-once delivery, not exactly-once. “At-least-once” is a polite way of saying “we might run your job twice.” In low-stakes systems, this means a duplicate email. In high-stakes systems, this means a duplicate operation.

BullMQ will reject a job if another job with the same ID already exists. Use IDs derived from business events, not random UUIDs. Keep the ID deterministic: appending a fresh timestamp gives every enqueue a unique ID and defeats the deduplication entirely:

await orderQueue.add(
  'process-order',
  { orderId, resourceId, items },
  {
    // BullMQ rejects this add if a job with the same ID is already queued.
    // Deterministic: derived from the business event — no timestamp, no UUID.
    jobId: `process:${orderId}`,
    priority: 1,
  },
);

Defense Layer 2: Database State Check in Workers


Even with deterministic IDs, network timing can still produce duplicates. Your worker must verify the current state in the database before acting:

import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

@Processor('order-queue')
export class OrderProcessor extends WorkerHost {
  async process(job: Job<ProcessPayload>): Promise<ProcessResult> {
    const order = await this.orderRepo.findOne({
      where: { id: job.data.orderId },
    });
    if (!order) {
      // Throw so BullMQ retries — the row may not be committed yet
      throw new Error(`Order not found: ${job.data.orderId}`);
    }

    // Idempotency guard — if already processed, skip immediately
    if (order.status === OrderStatus.PROCESSED) {
      this.logger.warn(`Duplicate job detected, skipping: ${job.id}`);
      return { skipped: true, reason: 'already_processed' };
    }

    // Proceed with processing...
    await this.executeDbOperation(async () => {
      await this.orderRepo.update(order.id, {
        status: OrderStatus.PROCESSED,
        processed_at: new Date(),
        processed_by: job.data.operatorId,
      });
    });
    return { success: true };
  }
}

Think about this: the deterministic jobId prevents the same job from being enqueued twice. The database state check prevents the same action from being executed twice. You need both layers. One protects the queue, the other protects the data.


Concern 3 — Complex Multi-Step Transactions


The core tension: traditional database transactions give you atomicity — all-or-nothing. But in a microservices world with queued jobs, each step might run on a different service, at a different time. There’s no global transaction manager.

Choice A: BullMQ Flows (Parent-Child Jobs)


BullMQ’s built-in flow system. The parent job only executes after all children complete successfully. If a child fails, the parent never runs.

import { FlowProducer } from 'bullmq';

const flow = new FlowProducer({ connection: redisConnection });

await flow.add({
  name: 'notify-user',
  queueName: 'notification-queue',
  data: { resourceId, message: 'Your order has been processed' },
  children: [
    {
      name: 'update-inventory',
      queueName: 'inventory-queue',
      data: { orderId, items },
    },
    {
      name: 'record-billing',
      queueName: 'billing-queue',
      data: { orderId, totalAmount },
    },
  ],
});
// Parent won't run until ALL children complete

Choice B: Saga Pattern (Compensating Transactions)


Each step has a corresponding “undo” step. If step 3 fails, you run the compensating actions for steps 2 and 1 in reverse order.
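A minimal synchronous sketch of that compensation ordering, with illustrative step names (a real saga would run async steps and persist progress between them):

```typescript
// In-memory Saga sketch: run steps in order; on the first failure, run the
// compensations of the already-completed steps in reverse. Names are illustrative.
interface SagaStep {
  name: string;
  execute: () => void;    // throws on failure
  compensate: () => void; // the "undo" for this step
}

function runSaga(steps: SagaStep[]): { ok: boolean; compensated: string[] } {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.execute();
      done.push(step);
    } catch {
      // Undo completed steps in reverse order. Note: a compensation can
      // itself fail — that weakness is inherent to the pattern.
      const compensated = done.reverse().map(s => (s.compensate(), s.name));
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}

// update-inventory succeeds, record-billing fails → inventory is reverted
const result = runSaga([
  { name: 'update-inventory', execute: () => {}, compensate: () => {} },
  { name: 'record-billing', execute: () => { throw new Error('billing down'); }, compensate: () => {} },
]);
console.log(result); // { ok: false, compensated: [ 'update-inventory' ] }
```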

| Aspect | BullMQ Flows | Saga Pattern |
| --- | --- | --- |
| Complexity | Low — BullMQ handles orchestration | High — you write every compensation handler |
| Rollback | Parent doesn’t run if child fails (but completed children aren’t reversed) | Full rollback via compensating transactions |
| Use when | Steps are independent (order doesn’t matter between children) | Steps must be strictly ordered and fully reversible |
| Example | Notify user only after inventory + billing both succeed | Update inventory → if billing fails → revert inventory |
| Weakness | Children that did succeed aren’t automatically undone | Compensation logic can itself fail (what if reverting inventory fails?) |

Our recommendation: Start with BullMQ Flows for most workflows. Move to Saga Pattern only when you have a strict ordering requirement and need full rollback capability. The Saga Pattern is more correct in theory but dramatically more complex to implement and test. In most scenarios, the “parent doesn’t run” behavior of Flows is sufficient — if billing fails, you simply don’t notify the user, and a human resolves the billing issue manually.


Concern 4 — WebSocket Gateway: Where BullMQ Meets Socket.io


When you extract a dedicated Notification Service that consumes BullMQ jobs and broadcasts via Socket.io, you’re bridging two very different paradigms: asynchronous batch processing and real-time push. This bridge has its own set of concerns.

Reference: Socket.io Communication Convention

graph LR
    A[Data Owner Service] -->|add job| B[BullMQ Queue]
    C[Order Service] -->|add job| B
    D[Results Service] -->|add job| B
    B -->|process| E[Notification Worker]
    E -->|emit event| F[Socket.io Gateway]
    F -->|to room| G[Frontend Clients]
    F -->|Redis Pub/Sub| H[Socket.io Gateway\nPod 2]

The concern: The Gateway must accept connections from every service domain in the system. Without structure, you end up with a spaghetti of event names that nobody can debug at 3 AM.

The mitigation: Follow Socket.io Convention - Naming (Section 1.1). Every namespace follows /{service_name} format. Every connection must pass through JWT authentication middleware to populate socket.data.user and socket.data.departmentId for room management.

@WebSocketGateway({ namespace: '/data_owner', cors: { origin: '*' } })
export class NotificationGateway implements OnGatewayInit {
  afterInit(server: Server): void {
    server.use(async (socket, next) => {
      const token = socket.handshake.auth.token;
      if (!token) return next(new Error('Authentication failed'));
      try {
        const user = await this.authService.verifyToken(token);
        socket.data.user = user;
        socket.data.departmentId = user.department;
        next();
      } catch {
        next(new Error('Invalid token'));
      }
    });
  }
}

The concern: Imagine your system dumps 10,000 batch report jobs into BullMQ at midnight. Now an operator updates a critical record at 00:01. That records:updated event — which needs sub-second delivery — is sitting behind 10,000 report jobs.

The core tension: Should you use priority within a single queue, or separate queues entirely?

Choice A: Single Queue with Priorities

// Same queue, different priorities
await eventQueue.add('critical-alert', payload, { priority: 1 }); // Processed first
await eventQueue.add('daily-report', payload, { priority: 100 }); // Processed last

Choice B: Separate Queues with Dedicated Workers

const QUEUES = {
  CRITICAL: 'critical-events', // records:updated, system:alert
  GENERAL: 'general-events', // resources:created, appointments:scheduled
  BATCH: 'batch-reports', // Daily reports, statistics
};
| Aspect | Single Queue + Priority | Separate Queues |
| --- | --- | --- |
| Isolation | None — batch jobs can still slow down critical events during peak Redis load | Full — each queue has its own workers, batch can’t block critical |
| Ops complexity | Low — one queue to monitor | Higher — multiple queues, multiple worker pools |
| Scaling | Scale workers for one queue | Scale critical workers to 5 pods, leave batch at 1 |
| When batch floods | Critical events wait until Redis processes priority sort | Critical events continue unaffected |

Our recommendation: Separate queues. The priority system within BullMQ works, but it doesn’t provide isolation. With separate queues, you can scale critical workers independently and even run them against a different Redis instance if needed.
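The routing this recommendation implies can be made explicit in a small helper. The prefix rules below are our own illustration of the comments in the QUEUES constant above, not a BullMQ feature:

```typescript
// Route an event name to one of the three queues. The event-name prefixes are
// an assumed convention for this sketch, mirroring the comments on QUEUES above.
const QUEUES = {
  CRITICAL: 'critical-events',
  GENERAL: 'general-events',
  BATCH: 'batch-reports',
} as const;

type QueueName = (typeof QUEUES)[keyof typeof QUEUES];

function queueFor(eventName: string): QueueName {
  if (eventName.startsWith('records:') || eventName.startsWith('system:')) {
    return QUEUES.CRITICAL; // sub-second delivery, dedicated workers
  }
  if (eventName.endsWith(':report')) {
    return QUEUES.BATCH; // can wait behind everything else
  }
  return QUEUES.GENERAL;
}

console.log(queueFor('records:updated'));   // 'critical-events'
console.log(queueFor('resources:created')); // 'general-events'
```

Centralizing the rule in one function means producers never pick a queue ad hoc, so a midnight flood of report jobs can never land in the critical queue by accident.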

Additionally, apply throttling per Socket.io Convention (Section 3.4) — records that update every millisecond from a sensor should be throttled to emit at most every 5 seconds:

// Worker method (inside a @Processor class extending WorkerHost) for the records queue
async process(job: Job<RecordUpdatePayload>): Promise<ThrottleResult> {
  const rateLimitKey = `records-ratelimit:${job.data.resourceId}`;
  const isLimited = await this.redis.get(rateLimitKey);
  if (isLimited) {
    return { throttled: true };
  }

  // Suppress further emits for this resource for 5 seconds
  await this.redis.setex(rateLimitKey, 5, '1');

  this.notificationGateway.server
    .to(`department:${job.data.departmentCode}`)
    .emit('records:updated', this.buildPayload(job.data));
  return { emitted: true };
}

4.3 Duplicate Notifications at the Frontend (At-Least-Once Delivery)


The concern: BullMQ guarantees at-least-once delivery. This means the frontend might receive the same resources:created event twice if a network hiccup causes a job retry.

The mitigation: Per Socket.io Convention — Payload Structure (Section 2), every event must include metadata.trace_id. The frontend maintains a local deduplication cache:

// Frontend — State Management (Zustand / Redux)
const processedTraceIds = new Set<string>();

socket.on('resources:created', (payload: SocketPayload<ResourceData>) => {
  // Idempotency check — skip if we've seen this trace_id before
  if (processedTraceIds.has(payload.metadata.trace_id)) {
    return; // Silently ignore the duplicate
  }
  processedTraceIds.add(payload.metadata.trace_id);

  // Prevent memory leak — cap the set size by evicting the oldest entry
  if (processedTraceIds.size > 1000) {
    const oldest = processedTraceIds.values().next().value;
    processedTraceIds.delete(oldest);
  }

  // Proceed normally
  updateResourceList(payload.data);
});

Think about this: the deduplication happens at two levels. BullMQ’s deterministic jobId (Concern 2) prevents the queue from processing duplicates. The trace_id on the frontend prevents the UI from displaying duplicates. Both are necessary because they operate in different failure domains.

The concern: A developer attaches a full object graph (50KB+ of JSON) to a BullMQ job payload. Multiply that by 10,000 concurrent jobs. You’ve just consumed 500MB of Redis memory for job payloads alone.

The mitigation: Follow the Socket.io Convention Payload Guidelines: “Don’t include the entire object if only ID is needed.” Send thin payloads — IDs and minimal status context. The frontend fetches full data via REST API when it receives the notification.

// ❌ Fat payload — Redis memory explodes
await notificationQueue.add('resource-updated', {
  resource: fullResourceObject, // 50KB+ per job
  history: allHistoryRecords, // Could be megabytes
  attachments: base64Files, // Absolutely not
});

// ✅ Thin payload — Redis stays lean
await notificationQueue.add('resource-updated', {
  id: resource.id,
  data: {
    reference_number: resource.reference_number,
    status: resource.status,
  },
  metadata: {
    timestamp: new Date().toISOString(),
    triggered_by: userId,
    trace_id: uuidv4(),
    department: departmentId,
  },
});
// Frontend receives notification → calls GET /resources/:id for full data

The trade-off is real: thin payloads mean the frontend needs a follow-up REST call. This adds ~50-100ms of latency to get the full data. But it keeps your Redis memory consumption predictable and your queue throughput high.
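The frontend side of that trade-off can be sketched briefly, assuming a GET /resources/:id endpoint as in the comment above; the store and function names are illustrative:

```typescript
// Thin payload as emitted by the worker (shape taken from the example above, trimmed).
interface ThinPayload {
  id: string;
  data: { reference_number: string; status: string };
}

// REST URL for the full record — matches the GET /resources/:id comment above.
const resourceUrl = (p: ThinPayload): string => `/resources/${p.id}`;

// Apply the cheap fields immediately (optimistic update), then return the URL
// the caller should GET for the full record. Our own convention, not a library API.
function applyThinUpdate(
  store: Map<string, { status: string }>,
  p: ThinPayload,
): string {
  store.set(p.id, { status: p.data.status });
  return resourceUrl(p);
}

const store = new Map<string, { status: string }>();
const url = applyThinUpdate(store, {
  id: '42',
  data: { reference_number: 'R-1', status: 'PROCESSED' },
});
console.log(url); // '/resources/42'
```

The status from the thin payload updates the UI instantly; the follow-up REST call fills in the rest when it resolves.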

4.5 Scaling the Notification Service Across Multiple Pods


The concern: You deploy 3 pods of the Notification Gateway for high availability. A user connected to Pod 1 should receive an event emitted by a worker processing on Pod 3. But Socket.io, by default, only broadcasts to sockets connected to its own process.

The mitigation: Per Socket.io Convention — Redis Adapter (Section 4.1), always install @socket.io/redis-adapter for the Notification Service:

import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';
const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);
// Now io.to('department:A').emit(...) reaches ALL pods
io.adapter(createAdapter(pubClient, subClient));

Think about this: without the Redis adapter, your system works perfectly in development (single process) and fails silently in production (multiple pods). Events get “lost” — but they’re not really lost; they’re just emitted to the wrong pod.


Part 2: Redis Sentinel for High Availability


A single Redis instance is a single point of failure. If it goes down, all queues stop, all sessions are lost, and your entire notification system goes dark. Redis Sentinel provides automatic failover.

| Component | Count | Role |
| --- | --- | --- |
| Master Node | 1 | Handles all Read/Write operations |
| Replica Nodes | 2 | Continuously replicate data from Master; ready to be promoted |
| Sentinel Nodes | 3 (odd number) | Monitor health; vote to promote a Replica when Master dies |
sequenceDiagram
    participant S1 as Sentinel 1
    participant S2 as Sentinel 2
    participant S3 as Sentinel 3
    participant M as Master (dies)
    participant R1 as Replica 1
    participant App as NestJS App

    M->>M: Process crashes / Network lost
    S1->>S1: Master unreachable (SDOWN)
    S2->>S2: Master unreachable (SDOWN)
    S1->>S2: "Is Master down?" → Vote
    S2->>S1: "Yes" → Quorum reached (ODOWN)
    S1->>R1: "You are the new Master"
    R1->>R1: Promote to Master
    S1->>App: Notify: new Master is Replica 1
    App->>R1: Reconnect (automatic via ioredis)

The key insight: your NestJS application doesn’t need to know which Redis node is the Master. It connects to the Sentinel cluster, and ioredis (BullMQ’s underlying Redis client) automatically discovers and follows the current Master — even through failovers.

version: '3.8'

services:
  redis-master:
    image: bitnami/redis:7.2
    container_name: app-redis-master
    environment:
      - REDIS_REPLICATION_MODE=master
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    ports:
      - '6379:6379'
    volumes:
      - redis_master_data:/bitnami/redis/data

  redis-replica-1:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-1
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  redis-replica-2:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-2
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  # Sentinel group (odd count to prevent split-brain during voting)
  redis-sentinel-1:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-1
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26379:26379'

  redis-sentinel-2:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-2
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26380:26379'

  redis-sentinel-3:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-3
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26381:26379'

volumes:
  redis_master_data:
import { BullModule } from '@nestjs/bullmq';
import { Module } from '@nestjs/common';

@Module({
  imports: [
    BullModule.forRoot({
      connection: {
        // Connect via Sentinels — ioredis discovers the current Master automatically
        sentinels: [
          { host: process.env.REDIS_SENTINEL_1_HOST, port: 26379 },
          { host: process.env.REDIS_SENTINEL_2_HOST, port: 26380 },
          { host: process.env.REDIS_SENTINEL_3_HOST, port: 26381 },
        ],
        name: 'app-master-set', // Must match REDIS_MASTER_SET in Docker Compose
        password: process.env.REDIS_PASSWORD,
        // Retry strategy during failover — don't give up too quickly
        sentinelRetryStrategy: (times: number) => {
          return Math.min(times * 500, 2000); // 500ms, 1s, 1.5s, 2s (cap)
        },
      },
      // Default job options
      defaultJobOptions: {
        removeOnComplete: 1000, // Keep last 1,000 completed jobs for debugging
        removeOnFail: 5000, // Keep last 5,000 failed jobs for analysis
        attempts: 3, // Retry failed jobs up to 3 times
        backoff: {
          type: 'exponential', // 1s → 2s → 4s between retries
          delay: 1000,
        },
      },
    }),
  ],
})
export class AppModule {}

As the system grows — more records, more services, more concurrent users — scale methodically:

  1. Application Level — Scale Workers Independently

    The beauty of BullMQ is that workers are just NestJS processes. Scale them independently from your API servers:

    # Scale file-processing workers (heavy CPU) to 5 pods
    kubectl scale deployment file-worker --replicas=5
    # Keep API servers at 3 pods (handle HTTP traffic)
    kubectl scale deployment data-owner-api --replicas=3
    # Critical notification workers always at 3+ for redundancy
    kubectl scale deployment notification-worker --replicas=3

    Think about this: if your batch report generation is slow, don’t scale the entire API. Scale the report worker. BullMQ’s architecture naturally supports this because producers (API servers) and consumers (workers) are decoupled.

  2. Infrastructure Level — Memory Management

    Redis is your system’s nervous system — sessions, queues, Socket.io Pub/Sub all flow through it. Managing its memory is critical.

    No-Eviction Policy: In most Redis use cases, allkeys-lru (evict least recently used keys) is fine. For a critical queue? Absolutely not. A pending operation must never be evicted because Redis ran out of memory.

    # redis.conf — non-negotiable for critical queues
    # (comments must be on their own line — redis.conf rejects trailing comments)
    # Redis returns errors instead of deleting jobs
    maxmemory-policy noeviction
    # Set based on your server's available RAM
    maxmemory 4gb

    Job Archiving: Redis should not become a permanent audit log. Write a scheduled worker that moves completed/failed jobs to PostgreSQL and frees Redis memory:

    @Cron(CronExpression.EVERY_DAY_AT_2AM)
    async archiveCompletedJobs(): Promise<void> {
      const completedJobs = await this.queue.getCompleted(0, 10000);
      if (completedJobs.length === 0) return;

      // Archive to PostgreSQL for audit requirements
      await this.auditLogRepo.save(
        completedJobs.map(job => ({
          job_id: job.id,
          queue_name: job.queueName,
          data: job.data,
          result: job.returnvalue,
          completed_at: new Date(job.finishedOn),
        })),
      );

      // Remove from Redis to reclaim memory
      await Promise.all(completedJobs.map(job => job.remove()));
      this.logger.log(`Archived ${completedJobs.length} jobs from Redis to PostgreSQL`);
    }
  3. Infrastructure Level — When Sentinel Isn’t Enough: Redis Cluster

    Redis Sentinel provides high availability (automatic failover), but it doesn’t provide horizontal scaling. You still have a single Master handling all writes. When your throughput exceeds what one Redis Master can handle, you need Redis Cluster.

    | Aspect | Redis Sentinel | Redis Cluster |
    | --- | --- | --- |
    | Scaling model | Vertical (bigger Master) | Horizontal (more shards) |
    | Write throughput | Limited to 1 Master | Distributed across shards |
    | Failover | Automatic (Sentinel votes) | Automatic (built-in) |
    | Complexity | Moderate | High |
    | BullMQ compatibility | Native — just works | Requires Hash Tags (see below) |
    | When to choose | < 100K ops/sec, most systems | > 100K ops/sec, massive scale |
    // ✅ Hash Tags ensure all keys for this queue land on the same shard
    BullModule.registerQueue({
      name: '{app-main}:order-queue',
    });
    BullModule.registerQueue({
      name: '{app-main}:notification-queue',
    });

    // ❌ Without Hash Tags — keys scatter across shards, Lua scripts fail
    BullModule.registerQueue({ name: 'order-queue' });

    Our recommendation: Stay with Sentinel until you’ve exhausted vertical scaling options. Most enterprise systems won’t generate enough Redis traffic to warrant a Cluster. Sentinel gives you HA with much less operational complexity.

  4. Architecture Level — Beyond BullMQ: When to Evaluate Kafka or RabbitMQ

    BullMQ excels at task processing: “do this thing, retry if it fails, tell me when it’s done.” But as your system matures, you may need capabilities that stretch beyond BullMQ’s design:

    | Need | BullMQ | Kafka | RabbitMQ |
    | --- | --- | --- | --- |
    | “Do this task” (command) | Excellent | Overkill | Good |
    | “This happened” (event) | Acceptable | Excellent | Good |
    | Event replay / audit trail | Not designed for this | Built-in (log retention) | Not designed for this |
    | Fan-out to many consumers | One consumer per job | Native (consumer groups) | Native (exchanges) |
    | Complex routing rules | No | Limited | Excellent (topic/header exchanges) |
    | Ops complexity | Low (just Redis) | High (ZooKeeper/KRaft, brokers, topics) | Moderate (Erlang runtime) |
    | Throughput ceiling | ~50K jobs/sec | 1M+ events/sec | ~100K msgs/sec |
    | Learning curve | Low | High | Moderate |

    The key question: has your requirement shifted from “do tasks” to “broadcast events”? If one service needs to announce “resource updated” and 5 different services need to react independently — that’s an event streaming problem. BullMQ’s one-consumer-per-job model doesn’t fit. That’s when Kafka earns its complexity cost.

    Our recommendation: use a hybrid approach.

    • BullMQ for task processing (process orders, generate reports, send notifications)
    • Kafka (when you need it) for event streaming and audit logs
    • Socket.io for real-time frontend push (the final mile to the browser)

    Don’t adopt Kafka “just in case.” Its operational complexity is significant. Add it when you have a concrete event streaming requirement that BullMQ cannot serve.


When a new feature requires async processing, use this decision tree:

Does the work need to happen asynchronously?
├── NO → Execute synchronously in the service layer
└── YES → Is it a "task" (do this thing) or an "event" (this happened)?
    ├── TASK → Use BullMQ
    │   ├── Is it business-critical? (billing, inventory, audit)
    │   │   ├── YES → Separate critical queue + Priority 1 + Idempotency guard
    │   │   └── NO → General queue + default priority
    │   │
    │   ├── Multi-step workflow?
    │   │   ├── Independent steps → BullMQ Flows (Parent-Child)
    │   │   └── Ordered + reversible → Saga Pattern
    │   │
    │   └── Frontend needs to know?
    │       ├── YES → BullMQ Worker → Socket.io emit (thin payload)
    │       └── NO → Worker completes silently
    └── EVENT (multiple consumers need it) → Evaluate Kafka

| Scenario | Approach | Queue | Socket.io Target |
| --- | --- | --- | --- |
| Critical system alert | BullMQ critical-events + Priority 1 | Separate critical queue | department:{deptCode} |
| Process completion | BullMQ general-events | General queue | user:{userId} |
| Order processed | BullMQ critical-events + Idempotency | Separate critical queue | user:{operatorId} |
| Daily statistical report | BullMQ batch-reports | Batch queue | Email (no Socket.io) |
| File/image processing | BullMQ processing-queue | Dedicated queue, scale workers | user:{userId} |
| Real-time chat / presence | Socket.io directly | N/A — no BullMQ needed | user:{userId} |
| Resource creation broadcast | Evaluate Kafka (fan-out to many services) | N/A | Multiple rooms |