
BullMQ in Production: Concerns, Trade-Offs, and Mitigations

Most “BullMQ tutorial” articles end with a happy path: install, add a job, process it, done. But when your job queue is processing critical business operations for real users in production, the stakes are fundamentally different.

A duplicate job in a low-stakes system means someone gets two confirmation emails. Annoying, but harmless. A duplicate job in a high-stakes system means a critical operation executes twice — corrupting records, double-charging accounts, or creating phantom entries in your database. That’s not a bug report — that’s a production incident.

This article is a brutally honest assessment of every concern, every trade-off, and every mitigation for running BullMQ (backed by Redis) in a NestJS microservices-based enterprise system.


Before we dive into concerns, let’s acknowledge why BullMQ earned its place in the stack:

• Persistence: jobs survive worker crashes. They sit in Redis, waiting to be picked up — not lost in a dead process's memory.
• Automatic retries: failed jobs retry with configurable delays. No need to hand-roll retry logic for every worker.
• Priorities: emergency operations can jump ahead of batch reports. Lower number = higher priority.
• Flows: chain dependent jobs together (update inventory → record billing → notify user); the parent only runs when all children complete.
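The retry behavior above is worth making concrete. This is a minimal sketch of the delay math behind BullMQ's built-in `exponential` backoff (delay × 2^(attemptsMade − 1)); the interface and function names are ours, for illustration only:

```typescript
// Sketch of the retry/priority options described above. The backoff math mirrors
// BullMQ's built-in 'exponential' strategy; JobOpts and retryDelayMs are our own names.
interface JobOpts {
  priority: number; // lower number = higher priority
  attempts: number; // total tries before the job lands in "failed"
  backoff: { type: 'exponential'; delay: number }; // base delay in ms
}

// Delay before retry N (1-based), as computed by the exponential strategy.
function retryDelayMs(opts: JobOpts, attemptsMade: number): number {
  return opts.backoff.delay * 2 ** (attemptsMade - 1);
}

const emergencyOp: JobOpts = {
  priority: 1, // jumps ahead of batch reports
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
};

console.log(retryDelayMs(emergencyOp, 1)); // 1000 — first retry after 1s
console.log(retryDelayMs(emergencyOp, 3)); // 4000 — then 2s, then 4s
```

With `attempts: 3` the job is tried at most three times in total; after the third failure it moves to the failed set rather than retrying forever.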

But here’s the question nobody asks early enough: is a Redis-backed job queue the right foundation for a system where data loss or duplication causes direct business harm?

The answer is: yes, but only with rigorous safeguards. Let’s walk through each one.


Concern 1 — Data Loss (Redis Is an In-Memory Database)


The core tension: Redis is fast because it’s in-memory. But in-memory means volatile. How do you get durability without sacrificing the speed that made you choose Redis in the first place?

appendfsync everysec: Redis writes every operation to an append-only log file, flushing to disk once per second.

appendfsync always: Redis flushes to disk on every single write operation.

| Aspect | everysec | always |
| --- | --- | --- |
| Max data loss | Up to 1 second of writes | Zero (in theory) |
| Throughput impact | ~5-10% overhead | ~50-80% overhead |
| Good for | 99% of production workloads | Financial transactions, audit logs |
| Risk | Losing 1 second of jobs during power failure | Redis becomes so slow that queues back up |

Our recommendation: appendfsync everysec combined with RDB snapshots. Here’s why — if your Redis is processing 1,000 jobs/sec and you lose 1 second, you lose at most 1,000 jobs. But those jobs were also recorded in your source database (the operation request exists in PostgreSQL before it enters the queue). You can replay them. The cost of always — halving your throughput — creates a different kind of risk: queue backlog, which leads to delayed processing. Trading one risk for another is not an improvement.

# redis.conf — recommended production configuration
appendonly yes
appendfsync everysec

# RDB snapshots as a secondary safety net
# (comments must be on their own line — redis.conf rejects trailing comments)
# Snapshot every 15 min if >= 1 change
save 900 1
# Snapshot every 5 min if >= 10 changes
save 300 10
# Snapshot every 1 min if >= 10,000 changes
save 60 10000
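The replay path mentioned in the recommendation above can be sketched in a few lines. This is an illustrative in-memory version: `OperationRequest` and `jobIdFor` are our own names, standing in for the real PostgreSQL row and ID scheme, and the deterministic job ID (the same idea as Concern 2 below) keeps the replay itself idempotent:

```typescript
// Illustrative replay sketch — types and names are ours, not from a real schema.
interface OperationRequest {
  orderId: string;
  enqueuedAt: number; // epoch ms, from the source-of-truth row
}

// Deterministic job ID: re-adding an operation that is still queued is
// rejected by BullMQ instead of duplicated.
const jobIdFor = (op: OperationRequest): string => `process:${op.orderId}`;

// Compare the source of truth against what survived in Redis.
function jobsToReplay(
  requests: OperationRequest[],
  idsStillInRedis: Set<string>,
): string[] {
  return requests.map(jobIdFor).filter(id => !idsStillInRedis.has(id));
}

const recent: OperationRequest[] = [
  { orderId: 'A1', enqueuedAt: 1_700_000_000_000 },
  { orderId: 'A2', enqueuedAt: 1_700_000_000_500 },
];
// A1 survived the crash; A2 was lost in the final second → replay it
console.log(jobsToReplay(recent, new Set(['process:A1']))); // ['process:A2']
```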

Concern 2 — Duplicate Job Execution (Idempotency)


The core tension: BullMQ guarantees at-least-once delivery, not exactly-once. “At-least-once” is a polite way of saying “we might run your job twice.” In low-stakes systems, this means a duplicate email. In high-stakes systems, this means a duplicate operation.

BullMQ will reject a job if another job with the same ID already exists. Use IDs derived from business events, not random UUIDs. Keep the ID deterministic: appending a fresh timestamp gives every enqueue a unique ID and defeats the deduplication entirely:

await orderQueue.add(
  'process-order',
  { orderId, resourceId, items },
  {
    // BullMQ rejects this add if a job with the same ID is already queued.
    // Deterministic: derived from the business event — no timestamp, no UUID.
    jobId: `process:${orderId}`,
    priority: 1,
  },
);

Defense Layer 2: Database State Check in Workers


Even with deterministic IDs, network timing can still produce duplicates. Your worker must verify the current state in the database before acting:

import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

@Processor('order-queue')
export class OrderProcessor extends WorkerHost {
  async process(job: Job<ProcessPayload>): Promise<ProcessResult> {
    const order = await this.orderRepo.findOne({
      where: { id: job.data.orderId },
    });
    if (!order) {
      // Throw so BullMQ retries — the row may not be committed yet
      throw new Error(`Order not found: ${job.data.orderId}`);
    }

    // Idempotency guard — if already processed, skip immediately
    if (order.status === OrderStatus.PROCESSED) {
      this.logger.warn(`Duplicate job detected, skipping: ${job.id}`);
      return { skipped: true, reason: 'already_processed' };
    }

    // Proceed with processing...
    await this.executeDbOperation(async () => {
      await this.orderRepo.update(order.id, {
        status: OrderStatus.PROCESSED,
        processed_at: new Date(),
        processed_by: job.data.operatorId,
      });
    });
    return { success: true };
  }
}

Think about this: the deterministic jobId prevents the same job from being enqueued twice. The database state check prevents the same action from being executed twice. You need both layers. One protects the queue, the other protects the data.


Concern 3 — Complex Multi-Step Transactions


The core tension: traditional database transactions give you atomicity — all-or-nothing. But in a microservices world with queued jobs, each step might run on a different service, at a different time. There’s no global transaction manager.

Choice A: BullMQ Flows (Parent-Child Jobs)


BullMQ’s built-in flow system. The parent job only executes after all children complete successfully. If a child fails, the parent never runs.

import { FlowProducer } from 'bullmq';

const flow = new FlowProducer({ connection: redisConnection });

await flow.add({
  name: 'notify-user',
  queueName: 'notification-queue',
  data: { resourceId, message: 'Your order has been processed' },
  children: [
    {
      name: 'update-inventory',
      queueName: 'inventory-queue',
      data: { orderId, items },
    },
    {
      name: 'record-billing',
      queueName: 'billing-queue',
      data: { orderId, totalAmount },
    },
  ],
});
// Parent won't run until ALL children complete

Choice B: Saga Pattern (Compensating Transactions)


Each step has a corresponding “undo” step. If step 3 fails, you run the compensating actions for steps 2 and 1 in reverse order.
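A minimal synchronous sketch of that compensation ordering, with illustrative step names (a real saga would run async steps and persist progress between them):

```typescript
// In-memory Saga sketch: run steps in order; on the first failure, run the
// compensations of the already-completed steps in reverse. Names are illustrative.
interface SagaStep {
  name: string;
  execute: () => void;    // throws on failure
  compensate: () => void; // the "undo" for this step
}

function runSaga(steps: SagaStep[]): { ok: boolean; compensated: string[] } {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.execute();
      done.push(step);
    } catch {
      // Undo completed steps in reverse order. Note: a compensation can
      // itself fail — that weakness is inherent to the pattern.
      const compensated = done.reverse().map(s => (s.compensate(), s.name));
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}

// update-inventory succeeds, record-billing fails → inventory is reverted
const result = runSaga([
  { name: 'update-inventory', execute: () => {}, compensate: () => {} },
  { name: 'record-billing', execute: () => { throw new Error('billing down'); }, compensate: () => {} },
]);
console.log(result); // { ok: false, compensated: [ 'update-inventory' ] }
```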

| Aspect | BullMQ Flows | Saga Pattern |
| --- | --- | --- |
| Complexity | Low — BullMQ handles orchestration | High — you write every compensation handler |
| Rollback | Parent doesn’t run if child fails (but completed children aren’t reversed) | Full rollback via compensating transactions |
| Use when | Steps are independent (order doesn’t matter between children) | Steps must be strictly ordered and fully reversible |
| Example | Notify user only after inventory + billing both succeed | Update inventory → if billing fails → revert inventory |
| Weakness | Children that did succeed aren’t automatically undone | Compensation logic can itself fail (what if reverting inventory fails?) |

Our recommendation: Start with BullMQ Flows for most workflows. Move to Saga Pattern only when you have a strict ordering requirement and need full rollback capability. The Saga Pattern is more correct in theory but dramatically more complex to implement and test. In most scenarios, the “parent doesn’t run” behavior of Flows is sufficient — if billing fails, you simply don’t notify the user, and a human resolves the billing issue manually.


Concern 4 — WebSocket Gateway: Where BullMQ Meets Socket.io


When you extract a dedicated Notification Service that consumes BullMQ jobs and broadcasts via Socket.io, you’re bridging two very different paradigms: asynchronous batch processing and real-time push. This bridge has its own set of concerns.

Reference: Socket.io Communication Convention

graph LR
    A[Data Owner Service] -->|add job| B[BullMQ Queue]
    C[Order Service] -->|add job| B
    D[Results Service] -->|add job| B
    B -->|process| E[Notification Worker]
    E -->|emit event| F[Socket.io Gateway]
    F -->|to room| G[Frontend Clients]
    F -->|Redis Pub/Sub| H[Socket.io Gateway\nPod 2]

The concern: The Gateway must accept connections from every service domain in the system. Without structure, you end up with a spaghetti of event names that nobody can debug at 3 AM.

The mitigation: Follow Socket.io Convention - Naming (Section 1.1). Every namespace follows /{service_name} format. Every connection must pass through JWT authentication middleware to populate socket.data.user and socket.data.departmentId for room management.

@WebSocketGateway({ namespace: '/data_owner', cors: { origin: '*' } })
export class NotificationGateway implements OnGatewayInit {
  afterInit(server: Server): void {
    server.use(async (socket, next) => {
      const token = socket.handshake.auth.token;
      if (!token) return next(new Error('Authentication failed'));
      try {
        const user = await this.authService.verifyToken(token);
        socket.data.user = user;
        socket.data.departmentId = user.department;
        next();
      } catch {
        next(new Error('Invalid token'));
      }
    });
  }
}

The concern: Imagine your system dumps 10,000 batch report jobs into BullMQ at midnight. Now an operator updates a critical record at 00:01. That records:updated event — which needs sub-second delivery — is sitting behind 10,000 report jobs.

The core tension: Should you use priority within a single queue, or separate queues entirely?

Choice A: Single Queue with Priorities

// Same queue, different priorities
await eventQueue.add('critical-alert', payload, { priority: 1 }); // Processed first
await eventQueue.add('daily-report', payload, { priority: 100 }); // Processed last

Choice B: Separate Queues with Dedicated Workers

const QUEUES = {
  CRITICAL: 'critical-events', // records:updated, system:alert
  GENERAL: 'general-events', // resources:created, appointments:scheduled
  BATCH: 'batch-reports', // Daily reports, statistics
};
| Aspect | Single Queue + Priority | Separate Queues |
| --- | --- | --- |
| Isolation | None — batch jobs can still slow down critical events during peak Redis load | Full — each queue has its own workers, batch can’t block critical |
| Ops complexity | Low — one queue to monitor | Higher — multiple queues, multiple worker pools |
| Scaling | Scale workers for one queue | Scale critical workers to 5 pods, leave batch at 1 |
| When batch floods | Critical events wait until Redis processes priority sort | Critical events continue unaffected |

Our recommendation: Separate queues. The priority system within BullMQ works, but it doesn’t provide isolation. With separate queues, you can scale critical workers independently and even run them against a different Redis instance if needed.
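The routing this recommendation implies can be made explicit in a small helper. The prefix rules below are our own illustration of the comments in the QUEUES constant above, not a BullMQ feature:

```typescript
// Route an event name to one of the three queues. The event-name prefixes are
// an assumed convention for this sketch, mirroring the comments on QUEUES above.
const QUEUES = {
  CRITICAL: 'critical-events',
  GENERAL: 'general-events',
  BATCH: 'batch-reports',
} as const;

type QueueName = (typeof QUEUES)[keyof typeof QUEUES];

function queueFor(eventName: string): QueueName {
  if (eventName.startsWith('records:') || eventName.startsWith('system:')) {
    return QUEUES.CRITICAL; // sub-second delivery, dedicated workers
  }
  if (eventName.endsWith(':report')) {
    return QUEUES.BATCH; // can wait behind everything else
  }
  return QUEUES.GENERAL;
}

console.log(queueFor('records:updated'));   // 'critical-events'
console.log(queueFor('resources:created')); // 'general-events'
```

Centralizing the rule in one function means producers never pick a queue ad hoc, so a midnight flood of report jobs can never land in the critical queue by accident.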

Additionally, apply throttling per Socket.io Convention (Section 3.4) — records that update every millisecond from a sensor should be throttled to emit at most every 5 seconds:

// Worker method (inside a @Processor class extending WorkerHost) for the records queue
async process(job: Job<RecordUpdatePayload>): Promise<ThrottleResult> {
  const rateLimitKey = `records-ratelimit:${job.data.resourceId}`;
  const isLimited = await this.redis.get(rateLimitKey);
  if (isLimited) {
    return { throttled: true };
  }

  // Suppress further emits for this resource for 5 seconds
  await this.redis.setex(rateLimitKey, 5, '1');

  this.notificationGateway.server
    .to(`department:${job.data.departmentCode}`)
    .emit('records:updated', this.buildPayload(job.data));
  return { emitted: true };
}

4.3 Duplicate Notifications at the Frontend (At-Least-Once Delivery)


The concern: BullMQ guarantees at-least-once delivery. This means the frontend might receive the same resources:created event twice if a network hiccup causes a job retry.

The mitigation: Per Socket.io Convention — Payload Structure (Section 2), every event must include metadata.trace_id. The frontend maintains a local deduplication cache:

// Frontend — State Management (Zustand / Redux)
const processedTraceIds = new Set<string>();

socket.on('resources:created', (payload: SocketPayload<ResourceData>) => {
  // Idempotency check — skip if we've seen this trace_id before
  if (processedTraceIds.has(payload.metadata.trace_id)) {
    return; // Silently ignore the duplicate
  }
  processedTraceIds.add(payload.metadata.trace_id);

  // Prevent memory leak — cap the set size by evicting the oldest entry
  if (processedTraceIds.size > 1000) {
    const oldest = processedTraceIds.values().next().value;
    processedTraceIds.delete(oldest);
  }

  // Proceed normally
  updateResourceList(payload.data);
});

Think about this: the deduplication happens at two levels. BullMQ’s deterministic jobId (Concern 2) prevents the queue from processing duplicates. The trace_id on the frontend prevents the UI from displaying duplicates. Both are necessary because they operate in different failure domains.

The concern: A developer attaches a full object graph (50KB+ of JSON) to a BullMQ job payload. Multiply that by 10,000 concurrent jobs. You’ve just consumed 500MB of Redis memory for job payloads alone.

The mitigation: Follow the Socket.io Convention Payload Guidelines: “Don’t include the entire object if only ID is needed.” Send thin payloads — IDs and minimal status context. The frontend fetches full data via REST API when it receives the notification.

// ❌ Fat payload — Redis memory explodes
await notificationQueue.add('resource-updated', {
  resource: fullResourceObject, // 50KB+ per job
  history: allHistoryRecords, // Could be megabytes
  attachments: base64Files, // Absolutely not
});

// ✅ Thin payload — Redis stays lean
await notificationQueue.add('resource-updated', {
  id: resource.id,
  data: {
    reference_number: resource.reference_number,
    status: resource.status,
  },
  metadata: {
    timestamp: new Date().toISOString(),
    triggered_by: userId,
    trace_id: uuidv4(),
    department: departmentId,
  },
});
// Frontend receives notification → calls GET /resources/:id for full data

The trade-off is real: thin payloads mean the frontend needs a follow-up REST call. This adds ~50-100ms of latency to get the full data. But it keeps your Redis memory consumption predictable and your queue throughput high.
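The frontend side of that trade-off can be sketched briefly, assuming a GET /resources/:id endpoint as in the comment above; the store and function names are illustrative:

```typescript
// Thin payload as emitted by the worker (shape taken from the example above, trimmed).
interface ThinPayload {
  id: string;
  data: { reference_number: string; status: string };
}

// REST URL for the full record — matches the GET /resources/:id comment above.
const resourceUrl = (p: ThinPayload): string => `/resources/${p.id}`;

// Apply the cheap fields immediately (optimistic update), then return the URL
// the caller should GET for the full record. Our own convention, not a library API.
function applyThinUpdate(
  store: Map<string, { status: string }>,
  p: ThinPayload,
): string {
  store.set(p.id, { status: p.data.status });
  return resourceUrl(p);
}

const store = new Map<string, { status: string }>();
const url = applyThinUpdate(store, {
  id: '42',
  data: { reference_number: 'R-1', status: 'PROCESSED' },
});
console.log(url); // '/resources/42'
```

The status from the thin payload updates the UI instantly; the follow-up REST call fills in the rest when it resolves.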

4.5 Scaling the Notification Service Across Multiple Pods


The concern: You deploy 3 pods of the Notification Gateway for high availability. A user connected to Pod 1 should receive an event emitted by a worker processing on Pod 3. But Socket.io, by default, only broadcasts to sockets connected to its own process.

The mitigation: Per Socket.io Convention — Redis Adapter (Section 4.1), always install @socket.io/redis-adapter for the Notification Service:

import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';
const pubClient = createClient({ url: process.env.REDIS_URL });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);
// Now io.to('department:A').emit(...) reaches ALL pods
io.adapter(createAdapter(pubClient, subClient));

Think about this: without the Redis adapter, your system works perfectly in development (single process) and fails silently in production (multiple pods). Events get “lost” — but they’re not really lost; they’re just emitted to the wrong pod.


Part 2: Redis Sentinel for High Availability


A single Redis instance is a single point of failure. If it goes down, all queues stop, all sessions are lost, and your entire notification system goes dark. Redis Sentinel provides automatic failover.

| Component | Count | Role |
| --- | --- | --- |
| Master Node | 1 | Handles all Read/Write operations |
| Replica Nodes | 2 | Continuously replicate data from Master; ready to be promoted |
| Sentinel Nodes | 3 (odd number) | Monitor health; vote to promote a Replica when Master dies |
sequenceDiagram
    participant S1 as Sentinel 1
    participant S2 as Sentinel 2
    participant S3 as Sentinel 3
    participant M as Master (dies)
    participant R1 as Replica 1
    participant App as NestJS App

    M->>M: Process crashes / Network lost
    S1->>S1: Master unreachable (SDOWN)
    S2->>S2: Master unreachable (SDOWN)
    S1->>S2: "Is Master down?" → Vote
    S2->>S1: "Yes" → Quorum reached (ODOWN)
    S1->>R1: "You are the new Master"
    R1->>R1: Promote to Master
    S1->>App: Notify: new Master is Replica 1
    App->>R1: Reconnect (automatic via ioredis)

The key insight: your NestJS application doesn’t need to know which Redis node is the Master. It connects to the Sentinel cluster, and ioredis (BullMQ’s underlying Redis client) automatically discovers and follows the current Master — even through failovers.

version: '3.8'

services:
  redis-master:
    image: bitnami/redis:7.2
    container_name: app-redis-master
    environment:
      - REDIS_REPLICATION_MODE=master
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    ports:
      - '6379:6379'
    volumes:
      - redis_master_data:/bitnami/redis/data

  redis-replica-1:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-1
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  redis-replica-2:
    image: bitnami/redis:7.2
    container_name: app-redis-replica-2
    environment:
      - REDIS_REPLICATION_MODE=slave
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_PASSWORD=${REDIS_PASSWORD}
      - REDIS_AOF_ENABLED=yes
    depends_on:
      - redis-master

  # Sentinel group (odd count to prevent split-brain during voting)
  redis-sentinel-1:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-1
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26379:26379'

  redis-sentinel-2:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-2
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26380:26379'

  redis-sentinel-3:
    image: bitnami/redis-sentinel:7.2
    container_name: app-redis-sentinel-3
    environment:
      - REDIS_MASTER_HOST=redis-master
      - REDIS_MASTER_PORT_NUMBER=6379
      - REDIS_MASTER_SET=app-master-set
      - REDIS_MASTER_PASSWORD=${REDIS_PASSWORD}
      - REDIS_SENTINEL_QUORUM=2
    depends_on:
      - redis-master
    ports:
      - '26381:26379'

volumes:
  redis_master_data:
import { BullModule } from '@nestjs/bullmq';
import { Module } from '@nestjs/common';

@Module({
  imports: [
    BullModule.forRoot({
      connection: {
        // Connect via Sentinels — ioredis discovers the current Master automatically
        sentinels: [
          { host: process.env.REDIS_SENTINEL_1_HOST, port: 26379 },
          { host: process.env.REDIS_SENTINEL_2_HOST, port: 26380 },
          { host: process.env.REDIS_SENTINEL_3_HOST, port: 26381 },
        ],
        name: 'app-master-set', // Must match REDIS_MASTER_SET in Docker Compose
        password: process.env.REDIS_PASSWORD,
        // Retry strategy during failover — don't give up too quickly
        sentinelRetryStrategy: (times: number) => {
          return Math.min(times * 500, 2000); // 500ms, 1s, 1.5s, 2s (cap)
        },
      },
      // Default job options
      defaultJobOptions: {
        removeOnComplete: 1000, // Keep last 1,000 completed jobs for debugging
        removeOnFail: 5000, // Keep last 5,000 failed jobs for analysis
        attempts: 3, // Retry failed jobs up to 3 times
        backoff: {
          type: 'exponential', // 1s → 2s → 4s between retries
          delay: 1000,
        },
      },
    }),
  ],
})
export class AppModule {}

As the system grows — more records, more services, more concurrent users — scale methodically:

  1. Application Level — Scale Workers Independently

    The beauty of BullMQ is that workers are just NestJS processes. Scale them independently from your API servers:

    # Scale file-processing workers (heavy CPU) to 5 pods
    kubectl scale deployment file-worker --replicas=5
    # Keep API servers at 3 pods (handle HTTP traffic)
    kubectl scale deployment data-owner-api --replicas=3
    # Critical notification workers always at 3+ for redundancy
    kubectl scale deployment notification-worker --replicas=3

    Think about this: if your batch report generation is slow, don’t scale the entire API. Scale the report worker. BullMQ’s architecture naturally supports this because producers (API servers) and consumers (workers) are decoupled.

  2. Infrastructure Level — Memory Management

    Redis is your system’s nervous system — sessions, queues, Socket.io Pub/Sub all flow through it. Managing its memory is critical.

    No-Eviction Policy: In most Redis use cases, allkeys-lru (evict least recently used keys) is fine. For a critical queue? Absolutely not. A pending operation must never be evicted because Redis ran out of memory.

    # redis.conf — non-negotiable for critical queues
    # (comments must be on their own line — redis.conf rejects trailing comments)
    # Redis returns errors instead of deleting jobs
    maxmemory-policy noeviction
    # Set based on your server's available RAM
    maxmemory 4gb

    Job Archiving: Redis should not become a permanent audit log. Write a scheduled worker that moves completed/failed jobs to PostgreSQL and frees Redis memory:

    @Cron(CronExpression.EVERY_DAY_AT_2AM)
    async archiveCompletedJobs(): Promise<void> {
      const completedJobs = await this.queue.getCompleted(0, 10000);
      if (completedJobs.length === 0) return;

      // Archive to PostgreSQL for audit requirements
      await this.auditLogRepo.save(
        completedJobs.map(job => ({
          job_id: job.id,
          queue_name: job.queueName,
          data: job.data,
          result: job.returnvalue,
          completed_at: new Date(job.finishedOn),
        })),
      );

      // Remove from Redis to reclaim memory
      await Promise.all(completedJobs.map(job => job.remove()));
      this.logger.log(`Archived ${completedJobs.length} jobs from Redis to PostgreSQL`);
    }
  3. Infrastructure Level — When Sentinel Isn’t Enough: Redis Cluster

    Redis Sentinel provides high availability (automatic failover), but it doesn’t provide horizontal scaling. You still have a single Master handling all writes. When your throughput exceeds what one Redis Master can handle, you need Redis Cluster.

    | Aspect | Redis Sentinel | Redis Cluster |
    | --- | --- | --- |
    | Scaling model | Vertical (bigger Master) | Horizontal (more shards) |
    | Write throughput | Limited to 1 Master | Distributed across shards |
    | Failover | Automatic (Sentinel votes) | Automatic (built-in) |
    | Complexity | Moderate | High |
    | BullMQ compatibility | Native — just works | Requires Hash Tags (see below) |
    | When to choose | < 100K ops/sec, most systems | > 100K ops/sec, massive scale |
    // ✅ Hash Tags ensure all keys for this queue land on the same shard
    BullModule.registerQueue({
      name: '{app-main}:order-queue',
    });
    BullModule.registerQueue({
      name: '{app-main}:notification-queue',
    });

    // ❌ Without Hash Tags — keys scatter across shards, Lua scripts fail
    BullModule.registerQueue({ name: 'order-queue' });

    Our recommendation: Stay with Sentinel until you’ve exhausted vertical scaling options. Most enterprise systems won’t generate enough Redis traffic to warrant a Cluster. Sentinel gives you HA with much less operational complexity.

  4. Architecture Level — Beyond BullMQ: When to Evaluate Kafka or RabbitMQ

    BullMQ excels at task processing: “do this thing, retry if it fails, tell me when it’s done.” But as your system matures, you may need capabilities that stretch beyond BullMQ’s design:

    | Need | BullMQ | Kafka | RabbitMQ |
    | --- | --- | --- | --- |
    | “Do this task” (command) | Excellent | Overkill | Good |
    | “This happened” (event) | Acceptable | Excellent | Good |
    | Event replay / audit trail | Not designed for this | Built-in (log retention) | Not designed for this |
    | Fan-out to many consumers | One consumer per job | Native (consumer groups) | Native (exchanges) |
    | Complex routing rules | No | Limited | Excellent (topic/header exchanges) |
    | Ops complexity | Low (just Redis) | High (ZooKeeper/KRaft, brokers, topics) | Moderate (Erlang runtime) |
    | Throughput ceiling | ~50K jobs/sec | 1M+ events/sec | ~100K msgs/sec |
    | Learning curve | Low | High | Moderate |

    The key question: has your requirement shifted from “do tasks” to “broadcast events”? If one service needs to announce “resource updated” and 5 different services need to react independently — that’s an event streaming problem. BullMQ’s one-consumer-per-job model doesn’t fit. That’s when Kafka earns its complexity cost.

    Our recommendation: use a hybrid approach.

    • BullMQ for task processing (process orders, generate reports, send notifications)
    • Kafka (when you need it) for event streaming and audit logs
    • Socket.io for real-time frontend push (the final mile to the browser)

    Don’t adopt Kafka “just in case.” Its operational complexity is significant. Add it when you have a concrete event streaming requirement that BullMQ cannot serve.


When a new feature requires async processing, use this decision tree:

Does the work need to happen asynchronously?
├── NO → Execute synchronously in the service layer
└── YES → Is it a "task" (do this thing) or an "event" (this happened)?
    ├── TASK → Use BullMQ
    │   ├── Is it business-critical? (billing, inventory, audit)
    │   │   ├── YES → Separate critical queue + Priority 1 + Idempotency guard
    │   │   └── NO → General queue + default priority
    │   │
    │   ├── Multi-step workflow?
    │   │   ├── Independent steps → BullMQ Flows (Parent-Child)
    │   │   └── Ordered + reversible → Saga Pattern
    │   │
    │   └── Frontend needs to know?
    │       ├── YES → BullMQ Worker → Socket.io emit (thin payload)
    │       └── NO → Worker completes silently
    └── EVENT (multiple consumers need it) → Evaluate Kafka

| Scenario | Approach | Queue | Socket.io Target |
| --- | --- | --- | --- |
| Critical system alert | BullMQ critical-events + Priority 1 | Separate critical queue | department:{deptCode} |
| Process completion | BullMQ general-events | General queue | user:{userId} |
| Order processed | BullMQ critical-events + Idempotency | Separate critical queue | user:{operatorId} |
| Daily statistical report | BullMQ batch-reports | Batch queue | Email (no Socket.io) |
| File/image processing | BullMQ processing-queue | Dedicated queue, scale workers | user:{userId} |
| Real-time chat / presence | Socket.io directly | N/A — no BullMQ needed | user:{userId} |
| Resource creation broadcast | Evaluate Kafka (fan-out to many services) | N/A | Multiple rooms |