
Scalable Real-Time Architecture Setup

This guide covers the foundational architecture decisions for building a scalable real-time enterprise system capable of supporting 1000+ concurrent WebSocket connections. We address the fundamental challenges of distributed real-time systems, specifically focusing on Socket.io infrastructure, session management, and enterprise network constraints.

Key Topics:

  • Why standard load balancing fails with WebSocket connections
  • Session affinity (sticky sessions) concepts and implementation
  • PM2 deployment strategies for real-time systems
  • Kong API Gateway configuration for WebSocket routing
  • Enterprise network constraints and solutions
  • Redis Adapter for distributed Socket.io systems

This guide is organized into five main sections:

  1. Theoretical Foundations - Core computer science concepts underlying real-time distributed systems
  2. Problem Analysis - Deep dive into the specific challenges of WebSocket load balancing
  3. Solution Architecture - Step-by-step methodology for building the solution
  4. Implementation Details - Practical configuration and code examples
  5. Production Deployment - Scaling strategies and operational considerations

Before diving into the specific problems and solutions, it’s important to understand the theoretical foundations that govern distributed real-time systems.

The CAP theorem states that a distributed system can only guarantee two out of three properties:

  • Consistency (C): All nodes see the same data at the same time
  • Availability (A): Every request receives a response
  • Partition Tolerance (P): System continues to operate despite network partitions

Application to WebSocket Architecture:

Our real-time architecture makes specific CAP trade-offs:

WebSocket Sticky Session Architecture:
├─ Partition Tolerance (P): ✅ MUST HAVE
│ └─ Enterprise networks may have temporary partitions
├─ Availability (A): ✅ HIGH PRIORITY
│ └─ Critical notifications must be delivered
└─ Consistency (C): ⚠️ EVENTUAL
└─ Acceptable for notification delivery (can tolerate brief delays)

Design Decision: We prioritize AP (Availability + Partition Tolerance) over strict consistency, accepting eventual consistency for notification delivery. This is appropriate because:

  1. A delayed notification (1-2 seconds) is acceptable
  2. A completely dropped notification is NOT acceptable
  3. Operations can tolerate brief inconsistencies but not system unavailability
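The trade-off can be sketched in a few lines of code. The helper below is an illustrative assumption, not part of the production codebase: it retries delivery with backoff so a notification arrives late rather than being dropped when a node is briefly unreachable — exactly the AP preference described above.

```typescript
// Minimal sketch (hypothetical helper): prefer delayed delivery over a drop.
type Notice = { id: string; body: string };

async function deliverWithRetry(
  send: (n: Notice) => Promise<void>,
  notice: Notice,
  maxAttempts = 5,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await send(notice);
      return attempt; // delivered — possibly 1-2 seconds late (acceptable)
    } catch {
      // Transient failure (e.g. brief partition): back off and retry
      await new Promise((resolve) => setTimeout(resolve, 100 * attempt));
    }
  }
  throw new Error(`notification ${notice.id} dropped`); // the unacceptable outcome
}
```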

In distributed systems, there are two fundamental approaches to handling stateful connections:

Approach 1: Session Affinity (Our Choice)

Client → Always routes to same backend server
├─ Pros:
│ ├─ Simple implementation
│ ├─ No state synchronization overhead
│ └─ Lower latency (no remote state lookup)
└─ Cons:
├─ Uneven load distribution if hashing is poor
├─ Session lost if server fails
└─ Cannot easily migrate sessions between servers

Approach 2: Shared State (Not chosen, but considered)

Client → Any backend server → Lookup session in shared storage
├─ Pros:
│ ├─ Perfect load distribution
│ ├─ Sessions survive server failure
│ └─ Easy server migration
└─ Cons:
├─ Higher latency (remote state lookup on every message)
├─ Complex synchronization
├─ Single point of failure (shared storage)
└─ Higher infrastructure cost

Why Session Affinity Wins:

For real-time notifications:

  • Latency matters: Every millisecond counts in critical alerts
  • Connection lifetime: WebSocket connections are long-lived (hours), so initial routing overhead is amortized
  • Failure tolerance: We use Redis Adapter for message delivery, not session storage
  • Simplicity: Reduces moving parts and potential failure points

Understanding the capacity limits helps us design for scale.

Per-Server Connection Limit:

Max Connections per Server = min(
File Descriptors Limit,
Memory Limit / Memory per Connection,
CPU Limit / CPU per Connection
)

Real-World Calculations:

For a typical Node.js server (8 CPU, 16GB RAM):

File Descriptors:
├─ Linux default: 1024 (too low!)
├─ Recommended: ulimit -n 65536
└─ Practical limit: 50,000 concurrent connections
Memory:
├─ Available for connections: 12GB (after OS, Node.js heap)
├─ Memory per WebSocket: ~10KB (idle) to ~50KB (active)
├─ Conservative estimate: 30KB average
└─ Memory limit: 12GB / 30KB = 400,000 connections (theoretical)
CPU:
├─ Event loop blocks at ~5000 active messages/second
├─ Per connection: ~0.5ms CPU time per message
├─ With 1000 connections × 1 message/min = 16.7 msg/sec
└─ CPU is NOT the bottleneck for our use case
PRACTICAL LIMIT: 5,000 connections per server
(Conservative estimate accounting for message bursts)
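The arithmetic above can be reproduced as a quick script. All constants are the estimates quoted in the text, not measured values:

```typescript
// Back-of-envelope capacity bounds for one 8-CPU / 16GB server.
const FD_LIMIT = 65536; // after `ulimit -n 65536`
const AVAILABLE_MEMORY_BYTES = 12e9; // ~12GB left after OS + Node.js heap
const MEMORY_PER_CONNECTION_BYTES = 30e3; // conservative 30KB average

const memoryBound = Math.floor(AVAILABLE_MEMORY_BYTES / MEMORY_PER_CONNECTION_BYTES);
const theoreticalMax = Math.min(FD_LIMIT, memoryBound);

// CPU is not the bottleneck at ~1 message/min per connection:
const messagesPerSecond = (1000 * 1) / 60; // ≈ 16.7 msg/sec

console.log({ memoryBound, theoreticalMax, messagesPerSecond });
// memoryBound = 400000, so the file-descriptor limit (65536) binds first;
// a burst safety factor brings the planning number down to ~5000.
```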

Hash-Based Load Distribution:

With consistent hashing on x-device-id:

Expected connections per server = Total Connections / Number of Servers
Standard deviation (perfect hash) ≈ sqrt(N / k)
where:
N = total connections
k = number of servers
Example: 1200 connections, 3 servers
├─ Expected per server: 1200 / 3 = 400
├─ Std deviation: sqrt(1200 / 3) = 20
└─ Actual distribution: 360-440 per server (95% confidence, ±2σ)

Why This Matters:

With proper hashing, load distributes evenly. However, with IP-based hashing under NAT:

Hash(single_enterprise_IP) = always same value
├─ All 1200 connections → Server 1
├─ Server 2: Idle
└─ Server 3: Idle
Result: 100% load on one server, 0% on others ❌
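Both outcomes are easy to verify empirically. The sketch below uses a simplified hash-and-modulo router (not Kong's full consistent-hashing ring) to route 1200 synthetic device IDs to 3 servers, then shows what happens when the hash key collapses to a single NAT IP:

```typescript
import { createHash } from 'node:crypto';

const SERVERS = 3;
const DEVICES = 1200;

// Simplified stand-in for hash-based routing: hash the key, mod server count.
function routeTo(key: string): number {
  const digest = createHash('md5').update(key).digest();
  return digest.readUInt32BE(0) % SERVERS;
}

// Hashing on unique device IDs spreads load close to 400 per server.
const byDeviceId = new Array(SERVERS).fill(0);
for (let i = 0; i < DEVICES; i++) {
  byDeviceId[routeTo(`tablet-ios_${i.toString(16)}`)]++;
}
console.log('per-device-id:', byDeviceId);

// Hashing on the shared NAT IP sends every connection to one server.
const byNatIp = new Array(SERVERS).fill(0);
for (let i = 0; i < DEVICES; i++) {
  byNatIp[routeTo('203.0.113.50')]++;
}
console.log('per-NAT-IP:', byNatIp); // one entry is 1200, the others 0
```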

Session Continuity Probability:

With round-robin (no sticky sessions):

P(correct server) = 1 / n
where n = number of backend servers
Example: 4 servers
├─ P(hit correct server) = 25%
├─ P(session lost) = 75%
└─ User experience: 3 out of 4 requests fail ❌

With sticky sessions:

P(correct server) = 1 (100%, deterministic)
Unless:
├─ Server fails (rare: 99.9% uptime = 0.1% failure)
├─ Hash collision (negligible with good hash function)
└─ Session expires (controlled by application logic)
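A two-line calculation makes the contrast concrete (a toy model matching the formulas above):

```typescript
// P(follow-up request reaches the process that holds the session)
function hitRate(servers: number, sticky: boolean): number {
  return sticky ? 1 : 1 / servers;
}

for (const n of [2, 4, 8]) {
  console.log(
    `${n} servers: round-robin ${(hitRate(n, false) * 100).toFixed(1)}%, ` +
      `sticky ${(hitRate(n, true) * 100).toFixed(0)}%`,
  );
}
// 4 servers: round-robin 25.0%, sticky 100%
```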

Before implementing, we follow a structured problem-solving methodology:

Break down the monolithic “scalable WebSocket” problem:

Primary Problem: Support 1000+ concurrent WebSocket connections
├─ Sub-problem 1: Session persistence across requests
│ ├─ Root cause: Stateful WebSocket connections
│ └─ Solution space: Sticky sessions OR shared state
├─ Sub-problem 2: Load distribution despite NAT
│ ├─ Root cause: All clients appear as single IP
│ └─ Solution space: Alternative routing keys
├─ Sub-problem 3: Inter-server message broadcasting
│ ├─ Root cause: Message sent to Server A must reach client on Server B
│ └─ Solution space: Message bus (Redis Pub/Sub)
└─ Sub-problem 4: Horizontal scalability
├─ Root cause: Single server limited to ~5000 connections
└─ Solution space: Multi-server cluster + load balancer

Session Persistence Solutions:

| Solution             | Complexity | Latency | Scalability | Cost | Chosen? |
| -------------------- | ---------- | ------- | ----------- | ---- | ------- |
| Sticky Sessions      | Low        | Low     | High        | Low  | ✅      |
| Shared Session Store | High       | Medium  | Very High   | High |         |
| Session Replication  | Very High  | High    | Medium      | High |         |
| Serverless WebSocket | Low        | Low     | Very High   | High |         |

Routing Key Solutions (under NAT):

| Solution              | Reliability | Implementation | IT Burden       | Chosen?     |
| --------------------- | ----------- | -------------- | --------------- | ----------- |
| IP-based hashing      | ❌ Fails    | Easy           | None            |             |
| Cookie-based hashing  | ⚠️ Moderate | Easy           | May be stripped |             |
| Custom header hashing | ✅ High     | Medium         | Allow headers   | ✅ Primary  |
| Query param hashing   | ✅ High     | Easy           | None            | ✅ Fallback |
| Session ID hashing    | ⚠️ Moderate | Medium         | None            |             |

Decision Matrix:

Fork Mode vs Cluster Mode:
┌─────────────────────────────────────────────────────────────┐
│ Fork Mode (CHOSEN) │
├─────────────────────────────────────────────────────────────┤
│ ✅ Single process = simple session management │
│ ✅ Kong handles load balancing (battle-tested) │
│ ✅ Easier to debug and monitor │
│ ✅ Clearer scaling story (add more servers) │
│ ⚠️ Requires multiple server instances for redundancy │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Cluster Mode (REJECTED) │
├─────────────────────────────────────────────────────────────┤
│ ❌ Requires Redis Adapter for ALL session management │
│ ❌ Complex worker management │
│ ❌ Harder to debug (which worker handled what?) │
│ ⚠️ Better CPU utilization on single server │
│ ⚠️ Built-in redundancy (multiple workers) │
└─────────────────────────────────────────────────────────────┘
Decision: Fork Mode + Multiple Servers wins due to:
- Simplicity (fewer moving parts)
- Better failure isolation
- Clearer monitoring and debugging

When you run multiple Node.js processes (whether via PM2 Cluster Mode or Kubernetes replicas), each process maintains its own in-memory state. Socket.io stores connection state locally in each process.

The Challenge:

Client connects to Process A
├─ Socket.io generates unique SID (Session ID)
├─ Socket.io state stored in Process A's memory
└─ Client has SID cookie
Round-robin load balancer routes next request to Process B
├─ Process B has NO knowledge of this SID
├─ Socket.io cannot find the session
└─ Connection breaks or creates duplicate session

Problem Illustration:

sequenceDiagram
    participant Client as Client Device
    participant LB as Load Balancer<br/>(Round-Robin)
    participant PA as Process A<br/>Socket.io
    participant PB as Process B<br/>Socket.io

    Client->>LB: Connect WebSocket
    LB->>PA: Route to Process A
    PA->>Client: SID=abc123, stored in memory
    Client->>LB: Send message

    Note over LB: Round-robin<br/>routes to B
    LB->>PB: Route to Process B
    PB->>PB: SID=abc123 NOT found in memory
    PB-->>Client: ❌ Connection lost or error
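The failure mode in the diagram is easy to reproduce in miniature. The toy model below (plain Maps, not Socket.io itself) gives two "processes" isolated in-memory session stores behind a round-robin router:

```typescript
// Each "process" keeps its own in-memory session store, as Socket.io does.
const processA = new Map<string, { deviceId: string }>();
const processB = new Map<string, { deviceId: string }>();
const workers = [processA, processB];

let next = 0;
const route = () => workers[next++ % workers.length]; // round-robin

// The handshake lands on Process A, which stores SID=abc123 locally.
route().set('abc123', { deviceId: 'tablet-ios_a1b2c3d4' });

// The follow-up request is routed to Process B — which has never seen it.
const found = route().has('abc123');
console.log({ storedOnA: processA.has('abc123'), foundOnFollowUp: found });
// { storedOnA: true, foundOnFollowUp: false } — the session is "lost"
```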

PM2 Cluster Mode uses round-robin load balancing by default, cycling each incoming request to the next worker process in turn — with no regard for which worker holds a given session.

Without Socket.io Redis Adapter:

Request 1 → Process 1 ✅ Session found
Request 2 → Process 2 ❌ Session lost (round-robin)
Request 3 → Process 3 ❌ Session lost (round-robin)

The Statistics:

  • With 4 workers: Only 25% chance of hitting the same process
  • With 8 workers: Only 12.5% chance of hitting the same process
  • Probability decreases with each additional worker

Enterprise environments introduce unique networking challenges not present in typical web applications.

The NAT Problem:

graph TD
    subgraph "Enterprise LAN"
        Device1["Device 1<br/>192.168.1.100"]
        Device2["Device 2<br/>192.168.1.101"]
        PC["PC<br/>192.168.1.102"]
        TV["Display<br/>192.168.1.103"]
    end

    subgraph "Firewall/NAT"
        NAT["NAT Gateway<br/>Outbound IP:<br/>203.0.113.50"]
    end

    subgraph "Internet"
        WebServer["Web Server<br/>Sees all requests<br/>from 203.0.113.50"]
    end

    Device1 --> NAT
    Device2 --> NAT
    PC --> NAT
    TV --> NAT
    NAT --> WebServer

    style NAT fill:#fff4e1,stroke:#ff9800,stroke-width:2px
    style WebServer fill:#ffe0e0,stroke:#f44336,stroke-width:2px

The Problem:

All 1000+ enterprise devices appear as a single IP address (203.0.113.50) to the WebSocket server. Standard hash-on-IP load balancing becomes ineffective:

Hash(203.0.113.50) → Always Process A
├─ All 1000 connections sticky to one process
├─ Other processes sit idle
└─ Single point of failure

Real-World Example:

  • Enterprise has 1200 devices across 50 departments
  • All devices connect through central firewall/NAT gateway
  • Server sees all 1200 as coming from IP 203.0.113.50
  • Without proper sticky session handling, connections constantly drop

Root Cause Analysis:

Why NAT Breaks Standard Load Balancing:
1. Traditional Load Balancing Assumption:
└─ Each client has unique IP address
└─ Hash(unique_IP) → Distribute evenly
2. Enterprise Reality:
└─ All clients share one public IP
└─ Hash(same_IP) → All route to same server
Mathematical Impact:
Without NAT: per-server load deviation ≈ sqrt(N/k)
With NAT: one server receives all N connections; the others receive 0
where N = connections, k = servers

Solution Space Exploration:

| Solution                 | Reliability  | Implementation Complexity | IT Coordination | Chosen      |
| ------------------------ | ------------ | ------------------------- | --------------- | ----------- |
| Use custom header        | ✅ High      | Medium                    | Low             | ✅ Primary  |
| Use query parameter      | ✅ High      | Easy                      | None            | ✅ Fallback |
| Client-side cookie       | ⚠️ Moderate  | Easy                      | High            |             |
| mTLS client certificates | ✅ Very High | Very High                 | Very High       |             |
| VPN per device           | ✅ High      | Very High                 | Very High       |             |

Why Custom Header Wins:

Decision Factors:
1. Reliability: ✅
└─ Headers preserved by most proxies
└─ Fallback to query param if stripped
2. Implementation: ✅
└─ Simple client-side: io({ extraHeaders: {...} })
└─ Simple server-side: Kong hash_on_header config
3. IT Burden: ✅
└─ Only need to allow custom headers (standard practice)
└─ No proxy reconfiguration needed
4. Performance: ✅
└─ No additional overhead
└─ Header sent with initial WebSocket handshake

Load Distribution Formula:

With device-ID based hashing:

Expected distribution with proper hashing:
Server load variance = sqrt(N / k)
Example: 1200 devices, 3 servers
├─ Expected per server: 400 devices
├─ Standard deviation: sqrt(1200/3) = 20
├─ 95% confidence interval: 400 ± 40
└─ Actual range: 360-440 devices per server
Efficiency: within ~7% of a perfect split ✅
Compare to IP-based hashing under NAT:
├─ Server 1: 1200 devices (300%)
├─ Server 2: 0 devices (0%)
└─ Server 3: 0 devices (0%)
Efficiency: 0% (completely broken) ❌

The comprehensive solution combines three components:

  1. Sticky Sessions (Session Affinity) at Kong Gateway level
  2. Fork Mode (single process) or careful worker configuration for PM2
  3. Redis Adapter for inter-node communication

Architecture Diagram:

graph TB
    subgraph "Enterprise Network"
        Devices["1000+ Devices"]
    end

    subgraph "DMZ / Cloud Network"
        Kong["Kong API Gateway<br/>Sticky Session<br/>Hash on x-device-id"]
    end

    subgraph "Application Cluster"
        direction LR
        subgraph "Process 1"
            Socket1["Socket.io<br/>Instance"]
        end
        subgraph "Process 2"
            Socket2["Socket.io<br/>Instance"]
        end
        subgraph "Process 3"
            Socket3["Socket.io<br/>Instance"]
        end
    end

    subgraph "Shared State"
        Redis["Redis<br/>Pub/Sub Adapter"]
    end

    Devices -->|all from same IP| Kong
    Kong -->|x-device-id=device-001| Socket1
    Kong -->|x-device-id=device-002| Socket2
    Kong -->|x-device-id=device-042| Socket3

    Socket1 <-->|inter-node messages| Redis
    Socket2 <-->|inter-node messages| Redis
    Socket3 <-->|inter-node messages| Redis

    style Kong fill:#fff4e1,stroke:#ff9800,stroke-width:3px
    style Redis fill:#e1f5ff,stroke:#0288d1,stroke-width:3px

Component 1: Device Identification Strategy


Since all devices appear as one IP, we need another way to identify them uniquely.

Strategy: x-device-id Header

frontend/src/lib/socket-config.ts
import io from 'socket.io-client';
import { generateDeviceId, getDeviceType, getDepartmentId } from './device-id';

/**
 * Create the Socket.io client connection.
 * Device identity is sent three ways (header, query, auth payload) so that
 * at least one survives any intermediate proxy.
 */
export function createSocket(authToken: string) {
  const deviceId = generateDeviceId();

  return io(process.env.NEXT_PUBLIC_WS_URL, {
    // 1. Pass device ID as custom header
    extraHeaders: {
      'x-device-id': deviceId,
      'x-device-type': getDeviceType(), // 'tablet' | 'pc' | 'display'
      'x-department-id': getDepartmentId(), // From user's department assignment
    },
    // 2. Include in query params (fallback for some proxies)
    query: {
      deviceId: deviceId,
    },
    // 3. Include in initial auth
    auth: {
      token: authToken,
      deviceId: deviceId,
    },
    // Standard Socket.io options
    reconnection: true,
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000,
    reconnectionAttempts: 5,
  });
}

Device ID Generation:

frontend/src/lib/device-id.ts
import { v4 as uuidv4 } from 'uuid';

/**
 * Generate stable device ID
 * Persists in localStorage for device identification across sessions
 */
export function generateDeviceId(): string {
  const storageKey = 'app-device-id';
  let deviceId = localStorage.getItem(storageKey);

  if (!deviceId) {
    // Format: device-type_unique-identifier
    const deviceType = getDeviceType();
    const uniqueId = uuidv4().split('-')[0]; // Use first segment of UUID
    deviceId = `${deviceType}_${uniqueId}`;
    localStorage.setItem(storageKey, deviceId);
  }

  return deviceId;
}

export function getDeviceType(): string {
  const userAgent = navigator.userAgent.toLowerCase();
  if (userAgent.includes('ipad')) return 'tablet-ios';
  if (userAgent.includes('iphone')) return 'mobile-ios';
  if (userAgent.includes('android')) return 'android';
  if (userAgent.includes('windows')) return 'pc-win';
  if (userAgent.includes('macintosh')) return 'mac';
  if (userAgent.includes('linux')) return 'pc-linux';
  return 'unknown';
}

export function getDepartmentId(): string {
  // Retrieved from authenticated user's session
  return sessionStorage.getItem('user-department-id') || 'unknown';
}

Kong acts as the reverse proxy and load balancer with sticky session support.

Kong Configuration for WebSocket Sticky Sessions:

kong/config/kong.yml
_format_version: '3.0'
_transform: true

# 1. Define the WebSocket upstream with consistent hashing and its targets
upstreams:
  - name: ws-cluster
    algorithm: consistent-hashing # Instead of round-robin
    hash_on: header # Hash on custom header
    hash_on_header: x-device-id # Not IP!
    hash_fallback: none # Don't fall back to IP hashing
    healthchecks:
      active:
        type: http
        http_path: /health
        timeout: 3
        healthy:
          interval: 60
        unhealthy:
          interval: 60
          http_statuses:
            - 429
            - 500
            - 503
    # 2. Cluster targets (Node.js WebSocket servers)
    targets:
      - target: 'websocket-node-1:3000'
        weight: 100
      - target: 'websocket-node-2:3000'
        weight: 100
      - target: 'websocket-node-3:3000'
        weight: 100

# 3. Define the service (exposes WebSocket)
services:
  - name: websocket-service
    host: ws-cluster # Reference upstream
    port: 3000
    protocol: ws # WebSocket protocol
    path: /ws

# 4. Define route with WebSocket support
routes:
  - name: websocket-route
    service: websocket-service
    protocols:
      - ws
      - wss # Secure WebSocket
    paths:
      - /ws
    strip_path: false

# 5. Request transformer to ensure header preservation
plugins:
  - name: request-transformer
    service: websocket-service
    config:
      add:
        headers:
          - 'x-forwarded-for: $(client_ip)'
          - 'x-forwarded-proto: $(scheme)'

Key Configuration Details:

| Setting        | Value              | Reason                                            |
| -------------- | ------------------ | ------------------------------------------------- |
| algorithm      | consistent-hashing | Ensures same client always routes to same backend |
| hash_on        | header             | Use custom header for hashing, not IP             |
| hash_on_header | x-device-id        | Hash on device ID header from client              |
| hash_fallback  | none               | Don't fall back to IP hashing (fails under NAT)   |
| protocol       | ws                 | Use WebSocket protocol, not HTTP                  |

Component 3: PM2 Configuration for Fork Mode


Why Fork Mode over Cluster Mode:

Cluster Mode (Round-Robin)
├─ 8 workers × 4 cores = 32 processes
├─ Each handles random connections
├─ Session state fragmented across all
└─ Requires Redis Adapter for state sync
Fork Mode (Single Process)
├─ 1 worker per server instance
├─ Kong handles sticky routing
├─ All connections on same process = same in-memory state
├─ Horizontal scaling: Add more server instances
└─ Simpler architecture, easier to debug

PM2 Configuration for Fork Mode:

ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'websocket-node-1',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1, // Single instance (Fork Mode)
      exec_mode: 'fork', // NOT 'cluster'
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3000,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4000,
        // Redis config for Socket.io adapter
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },
      // Zero-downtime reload
      listen_timeout: 10000,
      kill_timeout: 5000,
      wait_ready: true,
      // Monitoring and restart
      max_memory_restart: '1G',
      max_restarts: 5,
      min_uptime: '10s',
      // Logging
      out_file: '/var/log/pm2/websocket-node-1-out.log',
      error_file: '/var/log/pm2/websocket-node-1-error.log',
      combine_logs: true,
    },
    // Identical config for other nodes
    {
      name: 'websocket-node-2',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1,
      exec_mode: 'fork',
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3001,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4001,
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },
    },
    {
      name: 'websocket-node-3',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1,
      exec_mode: 'fork',
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3002,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4002,
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },
    },
  ],
};

Deployment Command:

Terminal window
# Start all WebSocket nodes
pm2 start ecosystem.config.js
# Monitor
pm2 monit
# View logs
pm2 logs websocket-node-1
# Zero-downtime reload (for updates)
pm2 reload websocket-node-1

Component 4: NestJS Socket.io Server Configuration


Backend Socket.io Setup with Redis Adapter:

apps/data-owner-bc/src/main.ts
import { NestFactory } from '@nestjs/core';
import { IoAdapter } from '@nestjs/platform-socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';
import { bootstrapApplication } from '@lib/common/utils/bootstrap.util';
import { DataOwnerBCModule } from './data-owner-bc.module';

class RedisIoAdapter extends IoAdapter {
  private adapterConstructor: ReturnType<typeof createAdapter>;

  async connectToRedis(): Promise<void> {
    // node-redis v4 takes connection details under `socket`
    const pubClient = createClient({
      socket: {
        host: process.env.REDIS_HOST || 'localhost',
        port: parseInt(process.env.REDIS_PORT || '6379', 10),
      },
    });
    await pubClient.connect();

    const subClient = pubClient.duplicate();
    await subClient.connect();

    this.adapterConstructor = createAdapter(pubClient, subClient);
  }

  createIOServer(port: number, options?: any) {
    const server = super.createIOServer(port, {
      ...options,
      cors: {
        origin: process.env.ALLOWED_ORIGINS?.split(',') || ['http://localhost:3000'],
        methods: ['GET', 'POST'],
        credentials: true,
      },
      transports: ['websocket', 'polling'],
    });

    // Apply Redis adapter for inter-node communication
    server.adapter(this.adapterConstructor);
    return server;
  }
}

async function bootstrap(): Promise<void> {
  const app = await NestFactory.create(DataOwnerBCModule);

  // Set up Socket.io with Redis adapter
  if (process.env.NODE_ENV === 'production') {
    const redisAdapter = new RedisIoAdapter(app);
    await redisAdapter.connectToRedis();
    app.useWebSocketAdapter(redisAdapter);
  } else {
    // Development: use standard adapter (single process)
    app.useWebSocketAdapter(new IoAdapter(app));
  }

  // Standard bootstrap
  await bootstrapApplication({
    module: DataOwnerBCModule,
    globalPrefixNameEnv: 'DATA_OWNER_PREFIX_NAME',
    globalPrefixVersionEnv: 'DATA_OWNER_PREFIX_VERSION',
    defaultGlobalPrefixName: 'data-owner-bc',
    defaultGlobalPrefixVersion: 'v1',
    httpPortEnv: 'DATA_OWNER_BC_MODULE_HTTP_PORT',
    microservicePortEnv: 'DATA_OWNER_BC_MODULE_MICROSERVICE_PORT',
    swagger: {
      title: 'Data Owner API with Real-Time WebSocket',
      description: 'Data Owner BC with scalable real-time notifications',
      tag: 'Data Owner BC',
    },
  });
}

bootstrap();

Socket.io Gateway for Real-Time Events:

apps/data-owner-bc/src/modules/notification/gateways/notification.gateway.ts
import { Injectable, Logger } from '@nestjs/common';
import {
  ConnectedSocket,
  MessageBody,
  OnGatewayConnection,
  OnGatewayDisconnect,
  SubscribeMessage,
  WebSocketGateway,
  WebSocketServer,
} from '@nestjs/websockets';
import { Server, Socket } from 'socket.io';
import { NotificationService } from '../services/notification.service';

@Injectable()
@WebSocketGateway({
  cors: {
    origin: process.env.ALLOWED_ORIGINS?.split(',') || ['http://localhost:3000'],
    credentials: true,
  },
  transports: ['websocket', 'polling'],
})
export class NotificationGateway implements OnGatewayConnection, OnGatewayDisconnect {
  private readonly logger = new Logger(NotificationGateway.name);

  @WebSocketServer()
  server: Server;

  constructor(private readonly notificationService: NotificationService) {}

  /**
   * Handle client connection
   * Extract device ID from headers for tracking
   */
  handleConnection(socket: Socket) {
    const deviceId = socket.handshake.headers['x-device-id'] as string;
    const departmentId = socket.handshake.headers['x-department-id'] as string;
    const userId = socket.handshake.auth?.userId;

    socket.data.deviceId = deviceId;
    socket.data.departmentId = departmentId;
    socket.data.userId = userId;

    this.logger.log(
      `Device connected: ${deviceId} (Dept: ${departmentId}, User: ${userId}, SID: ${socket.id})`,
    );

    this.notificationService.trackConnection({
      socketId: socket.id,
      deviceId,
      departmentId,
      userId,
      connectedAt: new Date(),
    });
  }

  /**
   * Handle client disconnection
   */
  handleDisconnect(socket: Socket) {
    const { deviceId } = socket.data;
    this.logger.log(`Device disconnected: ${deviceId}`);
    this.notificationService.untrackConnection(socket.id);
  }

  /**
   * Subscribe client to department notifications
   */
  @SubscribeMessage('subscribe-department')
  async subscribeToDepartment(
    @ConnectedSocket() socket: Socket,
    @MessageBody() data: { departmentId: string },
  ) {
    const { departmentId } = data;
    const { deviceId } = socket.data;

    // Socket.io room = department ID
    socket.join(`department:${departmentId}`);
    this.logger.log(`Device ${deviceId} subscribed to department: ${departmentId}`);

    return {
      status: 'subscribed',
      room: `department:${departmentId}`,
    };
  }

  /**
   * Broadcast notification to specific department
   * Reaches ALL connected devices in that department, regardless of which Node process
   */
  async notifyDepartment(departmentId: string, notification: any) {
    this.server.to(`department:${departmentId}`).emit('notification', notification);
  }

  /**
   * Broadcast to all connected clients
   */
  async notifyAll(notification: any) {
    this.server.emit('notification:broadcast', notification);
  }

  /**
   * Send notification to specific device
   */
  async notifyDevice(deviceId: string, notification: any) {
    const sockets = await this.server.fetchSockets();
    sockets
      .filter((socket) => socket.data.deviceId === deviceId)
      .forEach((socket) => socket.emit('notification:direct', notification));
  }
}

Single Server Deployment (Development/Testing)


For 100-200 concurrent connections:

docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    ports:
      - '6379:6379'
    volumes:
      - redis-data:/data

  app:
    build: .
    ports:
      - '3000:3000'
    environment:
      NODE_ENV: development
      REDIS_HOST: redis
      DATA_OWNER_BC_MODULE_HTTP_PORT: 3000
    depends_on:
      - redis

volumes:
  redis-data:

Multi-Server Cluster Deployment (Production)


For 1000+ concurrent connections:

graph TB
    subgraph "Load Balancing Tier"
        Kong["Kong API Gateway<br/>Port: 443 (SSL)<br/>Sticky Session:<br/>Hash on x-device-id"]
    end

    subgraph "Application Tier"
        WS1["Node 1<br/>Port: 3000<br/>Fork Mode"]
        WS2["Node 2<br/>Port: 3000<br/>Fork Mode"]
        WS3["Node 3<br/>Port: 3000<br/>Fork Mode"]
    end

    subgraph "Shared State Tier"
        Redis["Redis Cluster<br/>3 Masters<br/>3 Replicas<br/>Port: 6379"]
    end

    subgraph "Data Tier"
        DB["PostgreSQL<br/>Primary + Replicas<br/>Port: 5432"]
    end

    Kong -->|Hash: device-id| WS1
    Kong -->|Hash: device-id| WS2
    Kong -->|Hash: device-id| WS3

    WS1 <-->|Pub/Sub| Redis
    WS2 <-->|Pub/Sub| Redis
    WS3 <-->|Pub/Sub| Redis

    WS1 --> DB
    WS2 --> DB
    WS3 --> DB

    style Kong fill:#fff4e1,stroke:#ff9800,stroke-width:3px
    style Redis fill:#e1f5ff,stroke:#0288d1,stroke-width:3px
    style DB fill:#e8f5e9,stroke:#4caf50,stroke-width:3px

Infrastructure Requirements:

| Component       | Specification                   | Reason                                            |
| --------------- | ------------------------------- | ------------------------------------------------- |
| Kong Gateway    | 4 CPU, 8GB RAM                  | Handle SSL termination, header-based routing      |
| WebSocket Nodes | 8 CPU, 16GB RAM each (×3 nodes) | Process 300-400 concurrent connections per node   |
| Redis Cluster   | 4 CPU, 8GB RAM each (×6 nodes)  | Pub/Sub for inter-node messaging, high throughput |
| PostgreSQL      | 16 CPU, 32GB RAM (Primary)      | Store notifications, device state, audit logs     |

Approximate Capacity:

  • 3 WebSocket nodes × 400 connections = 1200 concurrent connections (20% headroom over the 1000-connection target)
  • Each node can absorb far more during spikes or failover (practical limit ~5000 per server)
  • Each node: ~4GB RAM for Socket.io state

Health Check Endpoint:

apps/data-owner-bc/src/common/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  MicroserviceHealthIndicator,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';
import { Transport } from '@nestjs/microservices';
import { Public } from '@lib/common/decorators/public.decorator';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private microservice: MicroserviceHealthIndicator,
    private db: TypeOrmHealthIndicator,
  ) {}

  @Get()
  @Public()
  @HealthCheck()
  check() {
    return this.health.check([
      // HTTP pings cannot reach redis:// or postgresql:// endpoints,
      // so use the dedicated Terminus indicators instead.
      () =>
        this.microservice.pingCheck('redis', {
          transport: Transport.REDIS,
          options: {
            host: process.env.REDIS_HOST,
            port: parseInt(process.env.REDIS_PORT || '6379', 10),
          },
        }),
      () => this.db.pingCheck('database'),
    ]);
  }
}

Monitoring Metrics:

import * as promClient from 'prom-client';

const activeConnections = new promClient.Gauge({
  name: 'socket_io_active_connections',
  help: 'Number of active Socket.io connections',
  labelNames: ['node'],
});

const messagesSent = new promClient.Counter({
  name: 'socket_io_messages_sent_total',
  help: 'Total number of messages sent',
  labelNames: ['event_type'],
});

Enterprise IT Requirements:

  1. Outbound Rules:

    • Allow HTTPS (443) to WebSocket server
    • Allow WebSocket secure (WSS) connections
    • Allow WebSocket fallback (polling on HTTP)
  2. Sticky Session Requirements:

    • Preserve HTTP headers (especially x-device-id)
    • Don’t strip custom headers on inspection
    • Allow persistent connections (don’t timeout)

Firewall Configuration Example:

Outbound Rule: Allow WebSocket
├─ Protocol: TCP
├─ Destination Port: 443 (HTTPS/WSS)
├─ Allow Headers: x-device-id, x-device-type
├─ Keep-Alive Timeout: 30 minutes
└─ Rule Priority: High

Handling 1000+ Devices with Reliable Identification:

frontend/src/hooks/useDeviceRegistration.ts
import { useEffect } from 'react';
import { generateDeviceId, getDeviceType } from '../lib/device-id';
// getDeviceModel, getOS, getOSVersion and getUserDepartmentId are
// app-specific helpers assumed to exist elsewhere in the frontend.

export function useDeviceRegistration(authToken: string) {
  useEffect(() => {
    const registerDevice = async () => {
      const deviceId = generateDeviceId();
      const deviceInfo = {
        id: deviceId,
        type: getDeviceType(),
        model: getDeviceModel(),
        os: getOS(),
        osVersion: getOSVersion(),
        appVersion: process.env.NEXT_PUBLIC_APP_VERSION,
        departmentId: getUserDepartmentId(),
        lastRegistered: new Date(),
      };

      try {
        await fetch('/api/v1/devices/register', {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${authToken}`,
            'x-device-id': deviceId,
          },
          body: JSON.stringify(deviceInfo),
        });
      } catch (error) {
        console.error('Device registration failed:', error);
      }
    };
    registerDevice();
  }, [authToken]);
}

Backend Device Registry:

apps/data-owner-bc/src/modules/device/entities/device-registry.entity.ts
import {
  Column,
  CreateDateColumn,
  Entity,
  PrimaryColumn,
  UpdateDateColumn,
} from 'typeorm';
// AppDatabases and ITimestamp come from the app's shared library.

@Entity({ name: 'device_registries', database: AppDatabases.APP_CORE })
export class DeviceRegistry implements ITimestamp {
  @PrimaryColumn({ type: 'varchar', length: 50 })
  device_id: string; // e.g., 'tablet-ios_a1b2c3d4'

  @Column({ type: 'varchar', length: 20 })
  device_type: string; // 'tablet-ios' | 'pc' | 'display'

  @Column({ type: 'varchar', length: 100 })
  device_model: string;

  @Column({ type: 'varchar', length: 50 })
  os: string;

  @Column({ type: 'varchar', length: 20 })
  os_version: string;

  @Column({ type: 'uuid' })
  user_id: string; // Currently logged-in user

  @Column({ type: 'varchar', length: 20 })
  department_id: string;

  @Column({ type: 'varchar', length: 50 })
  app_version: string;

  @Column({ type: 'varchar', length: 39, nullable: true })
  last_ip_address: string; // For debugging NAT issues

  @UpdateDateColumn({ type: 'timestamptz' })
  last_seen: Date;

  @CreateDateColumn({ type: 'timestamptz' })
  registered_at: Date;

  @Column({ type: 'uuid', nullable: true })
  created_by: string | null;

  @Column({ type: 'uuid', nullable: true })
  updated_by: string | null;
}

Architecture Setup Components:

  • Client Device ID Generation: Stable, persistent device identification
  • Kong Gateway Configuration: Sticky sessions with x-device-id hashing
  • PM2 Fork Mode: Single-process per Node.js instance
  • Redis Adapter: Inter-node Socket.io communication
  • Health Checks: Monitoring endpoints for deployment health
  • Firewall Rules: Allow WebSocket with header preservation
  • Device Registry: Track all connected devices and their metadata
  • Monitoring & Logging: PM2 logs, Prometheus metrics, connection tracking

Part 2 covers real-time database architecture, including:

  • Notification system data model
  • Event-driven synchronization
  • Lookup table integration for real-time queries
  • Data lifecycle and archival strategies