BullMQ in Production: Concerns, Trade-Offs, and Mitigations
Why This Article Exists
Most “BullMQ tutorial” articles end with a happy path: install, add a job, process it, done. But when your job queue is processing critical business operations for real users in production, the stakes are fundamentally different.
A duplicate job in a low-stakes system means someone gets two confirmation emails. Annoying, but harmless. A duplicate job in a high-stakes system means a critical operation executes twice — corrupting records, double-charging accounts, or creating phantom entries in your database. That’s not a bug report — that’s a production incident.
This article is a brutally honest assessment of every concern, every trade-off, and every mitigation for running BullMQ (backed by Redis) in a NestJS microservices-based enterprise system.
Why BullMQ? (And Why Not Something Else?)
Before we dive into concerns, let’s acknowledge why BullMQ earned its place in the stack.
But here’s the question nobody asks early enough: is a Redis-backed job queue the right foundation for a system where data loss or duplication causes direct business harm?
The answer is: yes, but only with rigorous safeguards. Let’s walk through each one.
Part 1: Concerns and Mitigations
Concern 1 — Data Loss (Redis Is an In-Memory Database)
The core tension: Redis is fast because it’s in-memory. But in-memory means volatile. How do you get durability without sacrificing the speed that made you choose Redis in the first place?
Choice A: AOF with appendfsync everysec
Section titled “Choice A: AOF with appendfsync everysec”Redis writes every operation to an append-only log file, flushing to disk once per second.
Choice B: AOF with appendfsync always
Section titled “Choice B: AOF with appendfsync always”Redis flushes to disk on every single write operation.
| Aspect | everysec | always |
|---|---|---|
| Max data loss | Up to 1 second of writes | Zero (in theory) |
| Throughput impact | ~5-10% overhead | ~50-80% overhead |
| Good for | 99% of production workloads | Financial transactions, audit logs |
| Risk | Losing 1 second of jobs during power failure | Redis becomes so slow that queues back up |
Our recommendation: appendfsync everysec combined with RDB snapshots. Here’s why — if your Redis is processing 1,000 jobs/sec and you lose 1 second, you lose at most 1,000 jobs. But those jobs were also recorded in your source database (the operation request exists in PostgreSQL before it enters the queue). You can replay them. The cost of always — halving your throughput — creates a different kind of risk: queue backlog, which leads to delayed processing. Trading one risk for another is not an improvement.
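Because every queued operation also exists as a row in PostgreSQL before it enters the queue, that lost second of jobs can be re-derived and re-enqueued. A minimal replay sketch, where the `PendingOperation` shape and `jobsToReplay` helper are hypothetical, not part of BullMQ:

```typescript
// Hypothetical shape of a row in the source-of-truth table
interface PendingOperation {
  orderId: string;
  status: 'PENDING' | 'PROCESSED';
}

// Re-derive the deterministic job id the producer would have used
const jobIdFor = (op: PendingOperation): string => `process:${op.orderId}`;

// Given the rows still marked PENDING in PostgreSQL and the job ids
// currently present in Redis, return the ids that must be re-enqueued.
function jobsToReplay(
  pending: PendingOperation[],
  idsInQueue: Set<string>,
): string[] {
  return pending
    .filter(op => op.status === 'PENDING') // never replay finished work
    .map(jobIdFor)
    .filter(id => !idsInQueue.has(id)); // only the jobs Redis actually lost
}
```

In practice you would run something like this on startup or from a cron job, then `queue.add(...)` each missing job with the same deterministic jobId so a concurrent producer cannot double-enqueue it.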
```
# redis.conf — recommended production configuration
appendonly yes
appendfsync everysec

# RDB snapshots as a secondary safety net
save 900 1      # Snapshot every 15 min if >= 1 change
save 300 10     # Snapshot every 5 min if >= 10 changes
save 60 10000   # Snapshot every 1 min if >= 10,000 changes
```

Concern 2 — Duplicate Job Execution (Idempotency)
The core tension: BullMQ guarantees at-least-once delivery, not exactly-once. “At-least-once” is a polite way of saying “we might run your job twice.” In low-stakes systems, this means a duplicate email. In high-stakes systems, this means a duplicate operation.
Defense Layer 1: Deterministic Job IDs
BullMQ will reject a job if another job with the same ID already exists. Use IDs derived from business events, not random UUIDs:
```typescript
await orderQueue.add(
  'process-order',
  { orderId, resourceId, items },
  {
    // BullMQ rejects this if a job with the same ID is already queued.
    // The timestamp must come from the business event itself; a
    // producer-side Date.now() would make every ID unique and defeat dedup.
    jobId: `process:${orderId}:${timestamp}`,
    priority: 1,
  },
);
```

Defense Layer 2: Database State Check in Workers
Even with deterministic IDs, network timing can still produce duplicates. Your worker must verify the current state in the database before acting:
```typescript
@Processor('order-queue')
export class OrderProcessor {
  @Process('process-order')
  async handleProcess(job: Job<ProcessPayload>): Promise<ProcessResult> {
    const order = await this.orderRepo.findOne({
      where: { id: job.data.orderId },
    });

    // Idempotency guard — if already processed, skip immediately
    if (order.status === OrderStatus.PROCESSED) {
      this.logger.warn(`Duplicate job detected, skipping: ${job.id}`);
      return { skipped: true, reason: 'already_processed' };
    }

    // Proceed with processing...
    await this.executeDbOperation(async () => {
      await this.orderRepo.update(order.id, {
        status: OrderStatus.PROCESSED,
        processed_at: new Date(),
        processed_by: job.data.operatorId,
      });
    });

    return { success: true };
  }
}
```

Think about this: the deterministic jobId prevents the same job from being enqueued twice. The database state check prevents the same action from being executed twice. You need both layers. One protects the queue, the other protects the data.
Concern 3 — Complex Multi-Step Transactions
The core tension: traditional database transactions give you atomicity — all-or-nothing. But in a microservices world with queued jobs, each step might run on a different service, at a different time. There’s no global transaction manager.
Choice A: BullMQ Flows (Parent-Child Jobs)
BullMQ’s built-in flow system: the parent job executes only after all children complete successfully. If a child fails, the parent never runs.
```typescript
import { FlowProducer } from 'bullmq';

const flow = new FlowProducer({ connection: redisConnection });

await flow.add({
  name: 'notify-user',
  queueName: 'notification-queue',
  data: { resourceId, message: 'Your order has been processed' },
  children: [
    {
      name: 'update-inventory',
      queueName: 'inventory-queue',
      data: { orderId, items },
    },
    {
      name: 'record-billing',
      queueName: 'billing-queue',
      data: { orderId, totalAmount },
    },
  ],
});
// Parent won't run until ALL children complete
```

Choice B: Saga Pattern (Compensating Transactions)
Each step has a corresponding “undo” step. If step 3 fails, you run the compensating actions for steps 2 and 1 in reverse order.
| Aspect | BullMQ Flows | Saga Pattern |
|---|---|---|
| Complexity | Low — BullMQ handles orchestration | High — you write every compensation handler |
| Rollback | Parent doesn’t run if child fails (but completed children aren’t reversed) | Full rollback via compensating transactions |
| Use when | Steps are independent (order doesn’t matter between children) | Steps must be strictly ordered and fully reversible |
| Example | Notify user only after inventory + billing both succeed | Update inventory → if billing fails → revert inventory |
| Weakness | Children that did succeed aren’t automatically undone | Compensation logic can itself fail (what if reverting inventory fails?) |
Our recommendation: Start with BullMQ Flows for most workflows. Move to Saga Pattern only when you have a strict ordering requirement and need full rollback capability. The Saga Pattern is more correct in theory but dramatically more complex to implement and test. In most scenarios, the “parent doesn’t run” behavior of Flows is sufficient — if billing fails, you simply don’t notify the user, and a human resolves the billing issue manually.
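If you do reach for the Saga Pattern, the orchestration loop itself is small even though the compensation handlers are not. A stripped-down sketch, where the `SagaStep` interface and step names are illustrative rather than from any library:

```typescript
interface SagaStep {
  name: string;
  execute: () => Promise<void>;
  compensate: () => Promise<void>; // the "undo" for this step
}

// Run steps in order; on failure, compensate completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.execute();
      completed.push(step);
    } catch (err) {
      // Roll back in reverse order. Note: if a compensation itself
      // throws, it propagates and the remaining undos are skipped,
      // which is exactly the weakness the comparison table calls out.
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      throw err;
    }
  }
}
```

So `runSaga([updateInventory, recordBilling])` would revert the inventory update if billing fails, matching the “Update inventory → if billing fails → revert inventory” example in the table.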
Concern 4 — WebSocket Gateway: Where BullMQ Meets Socket.io
When you extract a dedicated Notification Service that consumes BullMQ jobs and broadcasts via Socket.io, you’re bridging two very different paradigms: asynchronous batch processing and real-time push. This bridge has its own set of concerns.
Reference: Socket.io Communication Convention
```mermaid
graph LR
    A[Data Owner Service] -->|add job| B[BullMQ Queue]
    C[Order Service] -->|add job| B
    D[Results Service] -->|add job| B
    B -->|process| E[Notification Worker]
    E -->|emit event| F[Socket.io Gateway]
    F -->|to room| G[Frontend Clients]
    F -->|Redis Pub/Sub| H[Socket.io Gateway\nPod 2]
```
4.1 Centralized Namespace Management
The concern: The Gateway must accept connections from every service domain in the system. Without structure, you end up with a spaghetti of event names that nobody can debug at 3 AM.
The mitigation: Follow Socket.io Convention - Naming (Section 1.1). Every namespace follows /{service_name} format. Every connection must pass through JWT authentication middleware to populate socket.data.user and socket.data.departmentId for room management.
```typescript
@WebSocketGateway({ namespace: '/data_owner', cors: { origin: '*' } })
export class NotificationGateway implements OnGatewayInit {
  afterInit(server: Server): void {
    server.use(async (socket, next) => {
      const token = socket.handshake.auth.token;
      if (!token) return next(new Error('Authentication failed'));

      try {
        const user = await this.authService.verifyToken(token);
        socket.data.user = user;
        socket.data.departmentId = user.department;
        next();
      } catch {
        next(new Error('Invalid token'));
      }
    });
  }
}
```

4.2 Queue Latency: Real-Time vs. Batch
The concern: Imagine your system dumps 10,000 batch report jobs into BullMQ at midnight. Now an operator updates a critical record at 00:01. That records:updated event — which needs sub-second delivery — is sitting behind 10,000 report jobs.
The core tension: Should you use priority within a single queue, or separate queues entirely?
Choice A: Single Queue with Priorities
Section titled “Choice A: Single Queue with Priorities”// Same queue, different prioritiesawait eventQueue.add('critical-alert', payload, { priority: 1 }); // Processed firstawait eventQueue.add('daily-report', payload, { priority: 100 }); // Processed lastChoice B: Separate Queues with Dedicated Workers
Section titled “Choice B: Separate Queues with Dedicated Workers”const QUEUES = { CRITICAL: 'critical-events', // records:updated, system:alert GENERAL: 'general-events', // resources:created, appointments:scheduled BATCH: 'batch-reports', // Daily reports, statistics};| Aspect | Single Queue + Priority | Separate Queues |
|---|---|---|
| Isolation | None — batch jobs can still slow down critical events during peak Redis load | Full — each queue has its own workers, batch can’t block critical |
| Ops complexity | Low — one queue to monitor | Higher — multiple queues, multiple worker pools |
| Scaling | Scale workers for one queue | Scale critical workers to 5 pods, leave batch at 1 |
| When batch floods | Critical events wait until Redis processes priority sort | Critical events continue unaffected |
Our recommendation: Separate queues. The priority system within BullMQ works, but it doesn’t provide isolation. With separate queues, you can scale critical workers independently and even run them against a different Redis instance if needed.
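One way to keep the queue choice explicit at every producer call site is a tiny routing helper. The tier names mirror the QUEUES map above, but the specific event-to-tier assignments here are illustrative assumptions, not a fixed convention:

```typescript
type EventTier = 'critical' | 'general' | 'batch';

const QUEUE_BY_TIER: Record<EventTier, string> = {
  critical: 'critical-events',
  general: 'general-events',
  batch: 'batch-reports',
};

// Illustrative mapping — extend per domain event
const TIER_BY_EVENT: Record<string, EventTier> = {
  'records:updated': 'critical',
  'system:alert': 'critical',
  'resources:created': 'general',
  'reports:daily': 'batch',
};

// Unknown events fall back to the general queue rather than
// polluting the critical one.
function queueFor(event: string): string {
  return QUEUE_BY_TIER[TIER_BY_EVENT[event] ?? 'general'];
}
```

A producer would then write something like `queues[queueFor('records:updated')].add(...)`, so the critical/general/batch decision lives in one greppable place instead of being scattered across services.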
Additionally, apply throttling per Socket.io Convention (Section 3.4) — records that update every millisecond from a sensor should be throttled to emit at most every 5 seconds:
```typescript
@Process('records-update')
async handleRecordUpdate(job: Job<RecordUpdatePayload>): Promise<ThrottleResult> {
  const rateLimitKey = `records-ratelimit:${job.data.resourceId}`;
  const isLimited = await this.redis.get(rateLimitKey);

  if (isLimited) {
    return { throttled: true };
  }

  await this.redis.setex(rateLimitKey, 5, '1'); // Lock for 5 seconds

  this.notificationGateway.server
    .to(`department:${job.data.departmentCode}`)
    .emit('records:updated', this.buildPayload(job.data));

  return { emitted: true };
}
```

4.3 Duplicate Notifications at the Frontend (At-Least-Once Delivery)
The concern: BullMQ guarantees at-least-once delivery. This means the frontend might receive the same resources:created event twice if a network hiccup causes a job retry.
The mitigation: Per Socket.io Convention — Payload Structure (Section 2), every event must include metadata.trace_id. The frontend maintains a local deduplication cache:
```typescript
// Frontend — State Management (Zustand / Redux)
const processedTraceIds = new Set<string>();

socket.on('resources:created', (payload: SocketPayload<ResourceData>) => {
  // Idempotency check — skip if we've seen this trace_id before
  if (processedTraceIds.has(payload.metadata.trace_id)) {
    return; // Silently ignore the duplicate
  }

  processedTraceIds.add(payload.metadata.trace_id);

  // Prevent memory leak — cap the set size
  if (processedTraceIds.size > 1000) {
    const oldest = processedTraceIds.values().next().value;
    processedTraceIds.delete(oldest);
  }

  // Proceed normally
  updateResourceList(payload.data);
});
```

Think about this: the deduplication happens at two levels. BullMQ’s deterministic jobId (Concern 2) prevents the queue from processing duplicates. The trace_id on the frontend prevents the UI from displaying duplicates. Both are necessary because they operate in different failure domains.
4.4 Payload Size Bloat in Redis
The concern: A developer attaches a full object graph (50KB+ of JSON) to a BullMQ job payload. Multiply that by 10,000 concurrent jobs. You’ve just consumed 500MB of Redis memory for job payloads alone.
The mitigation: Follow the Socket.io Convention Payload Guidelines: “Don’t include the entire object if only ID is needed.” Send thin payloads — IDs and minimal status context. The frontend fetches full data via REST API when it receives the notification.
```typescript
// ❌ Fat payload — Redis memory explodes
await notificationQueue.add('resource-updated', {
  resource: fullResourceObject, // 50KB+ per job
  history: allHistoryRecords,   // Could be megabytes
  attachments: base64Files,     // Absolutely not
});

// ✅ Thin payload — Redis stays lean
await notificationQueue.add('resource-updated', {
  id: resource.id,
  data: {
    reference_number: resource.reference_number,
    status: resource.status,
  },
  metadata: {
    timestamp: new Date().toISOString(),
    triggered_by: userId,
    trace_id: uuidv4(),
    department: departmentId,
  },
});
// Frontend receives notification → calls GET /resources/:id for full data
```

The trade-off is real: thin payloads mean the frontend needs a follow-up REST call. This adds ~50-100ms of latency to get the full data. But it keeps your Redis memory consumption predictable and your queue throughput high.
4.5 Scaling the Notification Service Across Multiple Pods
The concern: You deploy 3 pods of the Notification Gateway for high availability. A user connected to Pod 1 should receive an event emitted by a worker processing on Pod 3. But Socket.io, by default, only broadcasts to sockets connected to its own process.
The mitigation: Per Socket.io Convention — Redis Adapter (Section 4.1), always install @socket.io/redis-adapter for the Notification Service:
```typescript
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();

await Promise.all([pubClient.connect(), subClient.connect()]);

// Now io.to('department:A').emit(...) reaches ALL pods
io.adapter(createAdapter(pubClient, subClient));
```

Think about this: without the Redis adapter, your system works perfectly in development (single process) and fails silently in production (multiple pods). Events get “lost” — but they’re not really lost; they’re just emitted to the wrong pod.
Part 2: Redis Sentinel for High Availability
A single Redis instance is a single point of failure. If it goes down, all queues stop, all sessions are lost, and your entire notification system goes dark. Redis Sentinel provides automatic failover.
Architecture
| Component | Count | Role |
|---|---|---|
| Master Node | 1 | Handles all Read/Write operations |
| Replica Nodes | 2 | Continuously replicate data from Master; ready to be promoted |
| Sentinel Nodes | 3 (odd number) | Monitor health; vote to promote a Replica when Master dies |
What Actually Happens During a Failover?
```mermaid
sequenceDiagram
    participant S1 as Sentinel 1
    participant S2 as Sentinel 2
    participant S3 as Sentinel 3
    participant M as Master (dies)
    participant R1 as Replica 1
    participant App as NestJS App

    M->>M: Process crashes / Network lost
    S1->>S1: Master unreachable (SDOWN)
    S2->>S2: Master unreachable (SDOWN)
    S1->>S2: "Is Master down?" → Vote
    S2->>S1: "Yes" → Quorum reached (ODOWN)
    S1->>R1: "You are the new Master"
    R1->>R1: Promote to Master
    S1->>App: Notify: new Master is Replica 1
    App->>R1: Reconnect (automatic via ioredis)
```
The key insight: your NestJS application doesn’t need to know which Redis node is the Master. It connects to the Sentinel cluster, and ioredis (BullMQ’s underlying Redis client) automatically discovers and follows the current Master — even through failovers.
Docker Compose Setup
```yaml
version: '3.8'

services:
  redis-master:
    image: bitnami/redis:7.2
    container_name: app-redis-master
    environment:
      - REDIS_REPLICATION_MODE=master
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    ports:
      - '6379:6379'
    volumes:
      - redis_master_data:/bitnami/redis/data

  redis-replica-1:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-1
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  redis-replica-2:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-2
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  # Sentinel group (odd count to prevent split-brain during voting)
  redis-sentinel-1:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-1
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26379:26379'

  redis-sentinel-2:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-2
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26380:26379'

  redis-sentinel-3:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-3
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26381:26379'

volumes:
  redis_master_data:
```

NestJS BullMQ Configuration with Sentinel
```typescript
import { BullModule } from '@nestjs/bullmq';
import { Module } from '@nestjs/common';

@Module({
  imports: [
    BullModule.forRoot({
      connection: {
        // Connect via Sentinels — ioredis discovers the current Master automatically
        sentinels: [
          { host: process.env.REDIS_SENTINEL_1_HOST, port: 26379 },
          { host: process.env.REDIS_SENTINEL_2_HOST, port: 26380 },
          { host: process.env.REDIS_SENTINEL_3_HOST, port: 26381 },
        ],
        name: 'app-master-set', // Must match REDIS_MASTER_SET in Docker Compose
        password: process.env.REDIS_PASSWORD,

        // Retry strategy during failover — don't give up too quickly
        sentinelRetryStrategy: (times: number) => {
          return Math.min(times * 500, 2000); // 500ms, 1s, 1.5s, 2s (cap)
        },
      },

      // Default job options
      defaultJobOptions: {
        removeOnComplete: 1000, // Keep last 1,000 completed jobs for debugging
        removeOnFail: 5000,     // Keep last 5,000 failed jobs for analysis
        attempts: 3,            // Retry failed jobs up to 3 times
        backoff: {
          type: 'exponential',  // 1s → 2s → 4s between retries
          delay: 1000,
        },
      },
    }),
  ],
})
export class AppModule {}
```

Part 3: Scaling Strategy
As the system grows — more records, more services, more concurrent users — scale methodically:
1. Application Level — Scale Workers Independently
The beauty of BullMQ is that workers are just NestJS processes. Scale them independently from your API servers:
```shell
# Scale file-processing workers (heavy CPU) to 5 pods
kubectl scale deployment file-worker --replicas=5

# Keep API servers at 3 pods (handle HTTP traffic)
kubectl scale deployment data-owner-api --replicas=3

# Critical notification workers always at 3+ for redundancy
kubectl scale deployment notification-worker --replicas=3
```

Think about this: if your batch report generation is slow, don’t scale the entire API. Scale the report worker. BullMQ’s architecture naturally supports this because producers (API servers) and consumers (workers) are decoupled.
2. Infrastructure Level — Memory Management
Redis is your system’s nervous system — sessions, queues, Socket.io Pub/Sub all flow through it. Managing its memory is critical.
No-Eviction Policy: In most Redis use cases, allkeys-lru (evict least recently used keys) is fine. For a critical queue? Absolutely not. A pending operation must never be evicted because Redis ran out of memory.

```
# redis.conf — non-negotiable for critical queues
maxmemory-policy noeviction  # Redis returns errors instead of deleting jobs
maxmemory 4gb                # Set based on your server's available RAM
```

Job Archiving: Redis should not become a permanent audit log. Write a scheduled worker that moves completed/failed jobs to PostgreSQL and frees Redis memory:
```typescript
@Cron(CronExpression.EVERY_DAY_AT_2AM)
async archiveCompletedJobs(): Promise<void> {
  const completedJobs = await this.queue.getCompleted(0, 10000);
  if (completedJobs.length === 0) return;

  // Archive to PostgreSQL for audit requirements
  await this.auditLogRepo.save(
    completedJobs.map(job => ({
      job_id: job.id,
      queue_name: job.queueName,
      data: job.data,
      result: job.returnvalue,
      completed_at: new Date(job.finishedOn),
    })),
  );

  // Remove from Redis to reclaim memory
  await Promise.all(completedJobs.map(job => job.remove()));

  this.logger.log(`Archived ${completedJobs.length} jobs from Redis to PostgreSQL`);
}
```
3. Infrastructure Level — When Sentinel Isn’t Enough: Redis Cluster
Redis Sentinel provides high availability (automatic failover), but it doesn’t provide horizontal scaling. You still have a single Master handling all writes. When your throughput exceeds what one Redis Master can handle, you need Redis Cluster.
Sentinel vs. Cluster: When Do You Switch?
| Aspect | Redis Sentinel | Redis Cluster |
|---|---|---|
| Scaling model | Vertical (bigger Master) | Horizontal (more shards) |
| Write throughput | Limited to 1 Master | Distributed across shards |
| Failover | Automatic (Sentinel votes) | Automatic (built-in) |
| Complexity | Moderate | High |
| BullMQ compatibility | Native — just works | Requires Hash Tags (see below) |
| When to choose | < 100K ops/sec, most systems | > 100K ops/sec, massive scale |

```typescript
// ✅ Hash Tags ensure all keys for this queue land on the same shard
BullModule.registerQueue({
  name: '{app-main}:order-queue',
});
BullModule.registerQueue({
  name: '{app-main}:notification-queue',
});

// ❌ Without Hash Tags — keys scatter across shards, Lua scripts fail
BullModule.registerQueue({ name: 'order-queue' });
```

Our recommendation: Stay with Sentinel until you’ve exhausted vertical scaling options. Most enterprise systems won’t generate enough Redis traffic to warrant a Cluster. Sentinel gives you HA with much less operational complexity.
4. Architecture Level — Beyond BullMQ: When to Evaluate Kafka or RabbitMQ
BullMQ excels at task processing: “do this thing, retry if it fails, tell me when it’s done.” But as your system matures, you may need capabilities that stretch beyond BullMQ’s design:
| Need | BullMQ | Kafka | RabbitMQ |
|---|---|---|---|
| “Do this task” (command) | Excellent | Overkill | Good |
| “This happened” (event) | Acceptable | Excellent | Good |
| Event replay / audit trail | Not designed for this | Built-in (log retention) | Not designed for this |
| Fan-out to many consumers | One consumer per job | Native (consumer groups) | Native (exchanges) |
| Complex routing rules | No | Limited | Excellent (topic/header exchanges) |
| Ops complexity | Low (just Redis) | High (ZooKeeper/KRaft, brokers, topics) | Moderate (Erlang runtime) |
| Throughput ceiling | ~50K jobs/sec | 1M+ events/sec | ~100K msgs/sec |
| Learning curve | Low | High | Moderate |

The key question: has your requirement shifted from “do tasks” to “broadcast events”? If one service needs to announce “resource updated” and 5 different services need to react independently — that’s an event streaming problem. BullMQ’s one-consumer-per-job model doesn’t fit. That’s when Kafka earns its complexity cost.
Our recommendation: use a hybrid approach.
- BullMQ for task processing (process orders, generate reports, send notifications)
- Kafka (when you need it) for event streaming and audit logs
- Socket.io for real-time frontend push (the final mile to the browser)
Don’t adopt Kafka “just in case.” Its operational complexity is significant. Add it when you have a concrete event streaming requirement that BullMQ cannot serve.
Decision Framework
When a new feature requires async processing, use this decision tree:
```
Does the work need to happen asynchronously?
│
├── NO → Execute synchronously in the service layer
│
└── YES → Is it a "task" (do this thing) or an "event" (this happened)?
    │
    ├── TASK → Use BullMQ
    │   ├── Is it business-critical? (billing, inventory, audit)
    │   │   ├── YES → Separate critical queue + Priority 1 + Idempotency guard
    │   │   └── NO → General queue + default priority
    │   │
    │   ├── Multi-step workflow?
    │   │   ├── Independent steps → BullMQ Flows (Parent-Child)
    │   │   └── Ordered + reversible → Saga Pattern
    │   │
    │   └── Frontend needs to know?
    │       ├── YES → BullMQ Worker → Socket.io emit (thin payload)
    │       └── NO → Worker completes silently
    │
    └── EVENT (multiple consumers need it) → Evaluate Kafka
```

Quick Reference: Common Scenarios
| Scenario | Approach | Queue | Socket.io Target |
|---|---|---|---|
| Critical system alert | BullMQ critical-events + Priority 1 | Separate critical queue | department:{deptCode} |
| Process completion | BullMQ general-events | General queue | user:{userId} |
| Order processed | BullMQ critical-events + Idempotency | Separate critical queue | user:{operatorId} |
| Daily statistical report | BullMQ batch-reports | Batch queue | Email (no Socket.io) |
| File/image processing | BullMQ processing-queue | Dedicated queue, scale workers | user:{userId} |
| Real-time chat / presence | Socket.io directly | N/A — no BullMQ needed | user:{userId} |
| Resource creation broadcast | Evaluate Kafka (fan-out to many services) | N/A | Multiple rooms |
References
- Socket.io Communication Convention — Naming, Payload, Rooms, Throttling standards
- BullMQ Documentation — Flows, Priority, Rate Limiting
- Redis Sentinel Documentation — HA configuration
- Redis Cluster Documentation — Horizontal scaling
- Socket.io Redis Adapter — Cross-instance broadcasting