Scalable Real-Time Architecture Setup
Overview
This guide covers the foundational architecture decisions for building a scalable real-time enterprise system capable of supporting 1000+ concurrent WebSocket connections. We address the fundamental challenges of distributed real-time systems, specifically focusing on Socket.io infrastructure, session management, and enterprise network constraints.
Key Topics:
- Why standard load balancing fails with WebSocket connections
- Session affinity (sticky sessions) concepts and implementation
- PM2 deployment strategies for real-time systems
- Kong API Gateway configuration for WebSocket routing
- Enterprise network constraints and solutions
- Redis Adapter for distributed Socket.io systems
Document Structure
This guide is organized into five main sections:
- Theoretical Foundations - Core computer science concepts underlying real-time distributed systems
- Problem Analysis - Deep dive into the specific challenges of WebSocket load balancing
- Solution Architecture - Step-by-step methodology for building the solution
- Implementation Details - Practical configuration and code examples
- Production Deployment - Scaling strategies and operational considerations
Theoretical Foundations
Distributed Systems Theory
Before diving into the specific problems and solutions, it's important to understand the theoretical foundations that govern distributed real-time systems.
CAP Theorem in Real-Time Systems
The CAP theorem states that a distributed system can only guarantee two out of three properties:
- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request receives a response
- Partition Tolerance (P): System continues to operate despite network partitions
Application to WebSocket Architecture:
Our real-time architecture makes specific CAP trade-offs:
```
WebSocket Sticky Session Architecture:
├─ Partition Tolerance (P): ✅ MUST HAVE
│   └─ Enterprise networks may have temporary partitions
│
├─ Availability (A): ✅ HIGH PRIORITY
│   └─ Critical notifications must be delivered
│
└─ Consistency (C): ⚠️ EVENTUAL
    └─ Acceptable for notification delivery (can tolerate brief delays)
```

Design Decision: We prioritize AP (Availability + Partition Tolerance) over strict consistency, accepting eventual consistency for notification delivery. This is appropriate because:
- A delayed notification (1-2 seconds) is acceptable
- A completely dropped notification is NOT acceptable
- Operations can tolerate brief inconsistencies but not system unavailability
Session Affinity vs State Sharing
In distributed systems, there are two fundamental approaches to handling stateful connections:
Approach 1: Session Affinity (Our Choice)
```
Client → Always routes to same backend server
├─ Pros:
│   ├─ Simple implementation
│   ├─ No state synchronization overhead
│   └─ Lower latency (no remote state lookup)
│
└─ Cons:
    ├─ Uneven load distribution if hashing is poor
    ├─ Session lost if server fails
    └─ Cannot easily migrate sessions between servers
```

Approach 2: Shared State (Not chosen, but considered)
```
Client → Any backend server → Lookup session in shared storage
├─ Pros:
│   ├─ Perfect load distribution
│   ├─ Sessions survive server failure
│   └─ Easy server migration
│
└─ Cons:
    ├─ Higher latency (remote state lookup on every message)
    ├─ Complex synchronization
    ├─ Single point of failure (shared storage)
    └─ Higher infrastructure cost
```

Why Session Affinity Wins:
For real-time notifications:
- Latency matters: Every millisecond counts in critical alerts
- Connection lifetime: WebSocket connections are long-lived (hours), so initial routing overhead is amortized
- Failure tolerance: We use Redis Adapter for message delivery, not session storage
- Simplicity: Reduces moving parts and potential failure points
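The routing property that session affinity relies on can be sketched in a few lines. This is an illustration only: the server names are hypothetical, and Kong's consistent-hashing implementation uses a hash ring rather than the simple FNV-1a-plus-modulo shown here, but the determinism is the same idea.

```typescript
// Sketch of session affinity: the same routing key always maps to the same
// backend. Kong's consistent hashing differs in detail (a hash ring
// minimizes remapping when targets change); this only shows determinism.
const SERVERS = ['node-1:3000', 'node-2:3000', 'node-3:3000'];

// FNV-1a: a small, well-known non-cryptographic hash.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function routeTo(deviceId: string): string {
  return SERVERS[fnv1a(deviceId) % SERVERS.length];
}

// Affinity: repeated requests from one device always hit the same backend.
console.log(routeTo('tablet-ios_a1b2c3d4') === routeTo('tablet-ios_a1b2c3d4')); // true
```

Because the key is hashed on every request rather than stored anywhere, no lookup table has to be synchronized between gateway instances.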
Mathematical Capacity Model
Understanding the capacity limits helps us design for scale.
Connection Capacity Formula
Per-Server Connection Limit:
```
Max Connections per Server = min(
    File Descriptor Limit,
    Memory Limit / Memory per Connection,
    CPU Limit / CPU per Connection
)
```

Real-World Calculations:
For a typical Node.js server (8 CPU, 16GB RAM):
```
File Descriptors:
├─ Linux default: 1024 (too low!)
├─ Recommended: ulimit -n 65536
└─ Practical limit: 50,000 concurrent connections

Memory:
├─ Available for connections: 12GB (after OS, Node.js heap)
├─ Memory per WebSocket: ~10KB (idle) to ~50KB (active)
├─ Conservative estimate: 30KB average
└─ Memory limit: 12GB / 30KB ≈ 400,000 connections (theoretical)

CPU:
├─ Event loop blocks at ~5000 active messages/second
├─ Per connection: ~0.5ms CPU time per message
├─ With 1000 connections × 1 message/min = 16.7 msg/sec
└─ CPU is NOT the bottleneck for our use case
```
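The limits above combine through the min() formula. A quick sketch with the numbers from the 8 CPU / 16 GB example (treat the 30 KB-per-connection figure as the document's conservative estimate, not a measured value):

```typescript
// Per-server capacity = min(fd limit, memory limit / memory per connection).
// CPU is omitted because, as noted above, it is not the bottleneck here.
const fileDescriptorLimit = 65536;           // after `ulimit -n 65536`
const memoryForConnections = 12 * 1024 ** 3; // 12 GB left for connections
const bytesPerConnection = 30 * 1024;        // ~30 KB conservative average

const theoreticalMax = Math.min(
  fileDescriptorLimit,
  Math.floor(memoryForConnections / bytesPerConnection),
);

console.log(theoreticalMax); // 65536 — file descriptors bind first
```

The theoretical ceiling sits far above the conservative per-server limit used for planning, which also budgets for message bursts.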
```
PRACTICAL LIMIT: 5,000 connections per server
(Conservative estimate accounting for message bursts)
```

Load Distribution Formula
Hash-Based Load Distribution:
With consistent hashing on x-device-id:
```
Expected connections per server = Total Connections / Number of Servers

Standard deviation (perfect hash) ≈ sqrt(N / k)

where:
    N = total connections
    k = number of servers
```
```
Example: 1200 connections, 3 servers
├─ Expected per server: 1200 / 3 = 400
├─ Std deviation: sqrt(1200 / 3) ≈ 20
└─ Actual distribution: ~360-440 per server (95% confidence, ±2σ)
```

Why This Matters:
With proper hashing, load distributes evenly. However, with IP-based hashing under NAT:
```
Hash(single_enterprise_IP) = always same value
├─ All 1200 connections → Server 1
├─ Server 2: Idle
└─ Server 3: Idle

Result: 100% load on one server, 0% on others ❌
```

Failure Probability Model
Session Continuity Probability:
With round-robin (no sticky sessions):
P(correct server) = 1 / n
where n = number of backend servers
```
Example: 4 servers
├─ P(hit correct server) = 25%
├─ P(session lost) = 75%
└─ User experience: 3 out of 4 requests fail ❌
```

With sticky sessions:
P(correct server) = 1 (100%, deterministic)
```
Unless:
├─ Server fails (rare: 99.9% uptime = 0.1% failure)
├─ Hash collision (negligible with good hash function)
└─ Session expires (controlled by application logic)
```

Solution Methodology Framework
Before implementing, we follow a structured problem-solving methodology:
Step 1: Problem Decomposition
Break down the monolithic "scalable WebSocket" problem:
```
Primary Problem: Support 1000+ concurrent WebSocket connections
├─ Sub-problem 1: Session persistence across requests
│   ├─ Root cause: Stateful WebSocket connections
│   └─ Solution space: Sticky sessions OR shared state
│
├─ Sub-problem 2: Load distribution despite NAT
│   ├─ Root cause: All clients appear as single IP
│   └─ Solution space: Alternative routing keys
│
├─ Sub-problem 3: Inter-server message broadcasting
│   ├─ Root cause: Message sent to Server A must reach client on Server B
│   └─ Solution space: Message bus (Redis Pub/Sub)
│
└─ Sub-problem 4: Horizontal scalability
    ├─ Root cause: Single server limited to ~5000 connections
    └─ Solution space: Multi-server cluster + load balancer
```

Step 2: Solution Space Exploration
Session Persistence Solutions:
| Solution | Complexity | Latency | Scalability | Cost | Chosen? |
|---|---|---|---|---|---|
| Sticky Sessions | Low | Low | High | Low | ✅ |
| Shared Session Store | High | Medium | Very High | High | ❌ |
| Session Replication | Very High | High | Medium | High | ❌ |
| Serverless WebSocket | Low | Low | Very High | High | ❌ |
Routing Key Solutions (under NAT):
| Solution | Reliability | Implementation | IT Burden | Chosen? |
|---|---|---|---|---|
| IP-based hashing | ❌ Fails | Easy | None | ❌ |
| Cookie-based hashing | ⚠️ Moderate | Easy | May be stripped | ❌ |
| Custom header hashing | ✅ High | Medium | Allow headers | ✅ |
| Query param hashing | ✅ High | Easy | None | ✅ Fallback |
| Session ID hashing | ⚠️ Moderate | Medium | None | ❌ |
Step 3: Architecture Trade-off Analysis
Decision Matrix:
Fork Mode vs Cluster Mode:
```
┌──────────────────────────────────────────────────────────┐
│ Fork Mode (CHOSEN)                                       │
├──────────────────────────────────────────────────────────┤
│ ✅ Single process = simple session management            │
│ ✅ Kong handles load balancing (battle-tested)           │
│ ✅ Easier to debug and monitor                           │
│ ✅ Clearer scaling story (add more servers)              │
│ ⚠️ Requires multiple server instances for redundancy     │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ Cluster Mode (REJECTED)                                  │
├──────────────────────────────────────────────────────────┤
│ ❌ Requires Redis Adapter for ALL session management     │
│ ❌ Complex worker management                             │
│ ❌ Harder to debug (which worker handled what?)          │
│ ⚠️ Better CPU utilization on single server               │
│ ⚠️ Built-in redundancy (multiple workers)                │
└──────────────────────────────────────────────────────────┘
```

Decision: Fork Mode + Multiple Servers wins due to:
- Simplicity (fewer moving parts)
- Better failure isolation
- Clearer monitoring and debugging

Part 1: Understanding the Problem
The Socket.io Session Problem
When you run multiple Node.js processes (whether via PM2 Cluster Mode or Kubernetes replicas), each process maintains its own in-memory state. Socket.io stores connection state locally in each process.
The Challenge:
```
Client connects to Process A
├─ Socket.io generates unique SID (Session ID)
├─ Socket.io state stored in Process A's memory
└─ Client has SID cookie

Round-robin load balancer routes next request to Process B
├─ Process B has NO knowledge of this SID
├─ Socket.io cannot find the session
└─ Connection breaks or creates duplicate session
```

Problem Illustration:
```mermaid
sequenceDiagram
    participant Client as Client Device
    participant LB as Load Balancer<br/>(Round-Robin)
    participant PA as Process A<br/>Socket.io
    participant PB as Process B<br/>Socket.io

    Client->>LB: Connect WebSocket
    LB->>PA: Route to Process A
    PA->>Client: SID=abc123, stored in memory
    Client->>LB: Send message
    Note over LB: Round-robin<br/>routes to B
    LB->>PB: Route to Process B
    PB->>PB: SID=abc123 NOT found in memory
    PB-->>Client: ❌ Connection lost or error
```
Why PM2 Cluster Mode Fails
PM2 Cluster Mode uses round-robin load balancing by default, dispatching each incoming request to the next worker process regardless of which worker holds an existing session.
Without Socket.io Redis Adapter:
```
Request 1 → Process 1 ✅ Session found
Request 2 → Process 2 ❌ Session lost (round-robin)
Request 3 → Process 3 ❌ Session lost (round-robin)
```

The Statistics:
- With 4 workers: Only 25% chance of hitting the same process
- With 8 workers: Only 12.5% chance of hitting the same process
- Probability decreases with each additional worker
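These odds can be checked with a tiny simulation. A hypothetical sketch: one session pinned to worker 0, a strict round-robin dispatcher, and a count of how often follow-up requests land on the right worker.

```typescript
// With round-robin and n workers, only 1/n of requests reach the worker
// that holds the session in memory.
function sessionHitRate(workers: number, requests: number): number {
  const sessionWorker = 0; // the worker that owns the session state
  let hits = 0;
  for (let i = 0; i < requests; i++) {
    const routedTo = i % workers; // strict round-robin dispatch
    if (routedTo === sessionWorker) hits++;
  }
  return hits / requests;
}

console.log(sessionHitRate(4, 1000)); // 0.25 — the 25% figure above
console.log(sessionHitRate(8, 1000)); // 0.125 — the 12.5% figure above
```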
Enterprise Network Constraints
Enterprise environments introduce unique networking challenges not present in typical web applications.
The NAT Problem:
```mermaid
graph TD
    subgraph "Enterprise LAN"
        Device1["Device 1<br/>192.168.1.100"]
        Device2["Device 2<br/>192.168.1.101"]
        PC["PC<br/>192.168.1.102"]
        TV["Display<br/>192.168.1.103"]
    end
    subgraph "Firewall/NAT"
        NAT["NAT Gateway<br/>Outbound IP:<br/>203.0.113.50"]
    end
    subgraph "Internet"
        WebServer["Web Server<br/>Sees all requests<br/>from 203.0.113.50"]
    end

    Device1 --> NAT
    Device2 --> NAT
    PC --> NAT
    TV --> NAT
    NAT --> WebServer

    style NAT fill:#fff4e1,stroke:#ff9800,stroke-width:2px
    style WebServer fill:#ffe0e0,stroke:#f44336,stroke-width:2px
```
The Problem:
All 1000+ enterprise devices appear as a single IP address (203.0.113.50) to the WebSocket server. Standard hash-on-IP load balancing becomes ineffective:
```
Hash(203.0.113.50) → Always Process A
├─ All 1000 connections sticky to one process
├─ Other processes sit idle
└─ Single point of failure
```

Real-World Example:
- Enterprise has 1200 devices across 50 departments
- All devices connect through central firewall/NAT gateway
- Server sees all 1200 as coming from IP 203.0.113.50
- Without proper sticky session handling, connections constantly drop
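The failure mode is easy to demonstrate. The sketch below hashes 1200 connections two ways: on the shared NAT IP and on a per-device ID. FNV-1a stands in for Kong's hash function; the exact algorithm doesn't matter for the effect.

```typescript
// Contrast hash-on-IP with hash-on-device-id under NAT. All devices share
// one public IP, so IP hashing collapses onto a single backend, while
// device-id hashing spreads the load.
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

const servers = 3;
const byIp = [0, 0, 0];
const byDevice = [0, 0, 0];

for (let d = 0; d < 1200; d++) {
  byIp[fnv1a('203.0.113.50') % servers]++;    // every device: same NAT IP
  byDevice[fnv1a(`device-${d}`) % servers]++; // every device: unique ID
}

console.log(byIp);     // one bucket takes all 1200 connections
console.log(byDevice); // roughly 400 per server
```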
The NAT Problem — Detailed Analysis
Root Cause Analysis:
Why NAT Breaks Standard Load Balancing:
```
1. Traditional Load Balancing Assumption:
   └─ Each client has a unique IP address
      └─ Hash(unique_IP) → Distribute evenly

2. Enterprise Reality:
   └─ All clients share one public IP
      └─ Hash(same_IP) → All route to same server

Mathematical Impact:
   Without NAT: per-server load ≈ N/k ± sqrt(N/k)
   With NAT:    all N connections land on a single server

   where N = connections, k = servers
```

Solution Space Exploration:
| Solution | Reliability | Implementation Complexity | IT Coordination | Chosen |
|---|---|---|---|---|
| Use custom header | ✅ High | Medium | Low | ✅ |
| Use query parameter | ✅ High | Easy | None | ✅ Fallback |
| Client-side cookie | ⚠️ Moderate | Easy | High | ❌ |
| mTLS client certificates | ✅ Very High | Very High | Very High | ❌ |
| VPN per device | ✅ High | Very High | Very High | ❌ |
Why Custom Header Wins:
Decision Factors:
```
1. Reliability: ✅
   ├─ Headers preserved by most proxies
   └─ Fallback to query param if stripped

2. Implementation: ✅
   ├─ Simple client-side: io({ extraHeaders: {...} })
   └─ Simple server-side: Kong hash_on_header config

3. IT Burden: ✅
   ├─ Only need to allow custom headers (standard practice)
   └─ No proxy reconfiguration needed

4. Performance: ✅
   ├─ No additional overhead
   └─ Header sent with initial WebSocket handshake
```

Load Distribution Formula:
With device-ID based hashing:
Expected distribution with proper hashing:
Server load standard deviation ≈ sqrt(N / k)
```
Example: 1200 devices, 3 servers
├─ Expected per server: 400 devices
├─ Standard deviation: sqrt(1200/3) ≈ 20
├─ 95% confidence interval: 400 ± 40
└─ Actual range: 360-440 devices per server
```
Efficiency: ~90-110% of perfect distribution ✅
```
Compare to IP-based hashing under NAT:
├─ Server 1: 1200 devices (300%)
├─ Server 2: 0 devices (0%)
└─ Server 3: 0 devices (0%)

Efficiency: 0% (completely broken) ❌
```

Part 2: The Solution Architecture
Solution Architecture Overview
The comprehensive solution combines three components:
- Sticky Sessions (Session Affinity) at Kong Gateway level
- Fork Mode (single process) or careful worker configuration for PM2
- Redis Adapter for inter-node communication
Architecture Diagram:
```mermaid
graph TB
    subgraph "Enterprise Network"
        Devices["1000+ Devices"]
    end
    subgraph "DMZ / Cloud Network"
        Kong["Kong API Gateway<br/>Sticky Session<br/>Hash on x-device-id"]
    end
    subgraph "Application Cluster"
        direction LR
        subgraph "Process 1"
            Socket1["Socket.io<br/>Instance"]
        end
        subgraph "Process 2"
            Socket2["Socket.io<br/>Instance"]
        end
        subgraph "Process 3"
            Socket3["Socket.io<br/>Instance"]
        end
    end
    subgraph "Shared State"
        Redis["Redis<br/>Pub/Sub Adapter"]
    end

    Devices -->|all from same IP| Kong
    Kong -->|x-device-id=device-001| Socket1
    Kong -->|x-device-id=device-002| Socket2
    Kong -->|x-device-id=device-042| Socket3
    Socket1 <-->|inter-node messages| Redis
    Socket2 <-->|inter-node messages| Redis
    Socket3 <-->|inter-node messages| Redis

    style Kong fill:#fff4e1,stroke:#ff9800,stroke-width:3px
    style Redis fill:#e1f5ff,stroke:#0288d1,stroke-width:3px
```
Component 1: Device Identification Strategy
Since all devices appear as one IP, we need another way to identify them uniquely.
Strategy: x-device-id Header
```typescript
import io from 'socket.io-client';

import { generateDeviceId } from './device-id';

/**
 * Create the Socket.io client connection.
 * Device identity is sent three ways so at least one survives proxies.
 */
export function createSocket() {
  const deviceId = generateDeviceId();

  return io(process.env.NEXT_PUBLIC_WS_URL, {
    // 1. Pass device ID as custom header
    extraHeaders: {
      'x-device-id': deviceId,
      'x-device-type': getDeviceType(), // 'tablet' | 'pc' | 'display'
      'x-department-id': getDepartmentId(), // From user's department assignment
    },

    // 2. Include in query params (fallback for some proxies)
    query: {
      deviceId: deviceId,
    },

    // 3. Include in initial auth
    auth: {
      token: authToken,
      deviceId: deviceId,
    },

    // Standard Socket.io options
    reconnection: true,
    reconnectionDelay: 1000,
    reconnectionDelayMax: 5000,
    reconnectionAttempts: 5,
  });
}
```

Device ID Generation:
```typescript
import { v4 as uuidv4 } from 'uuid';

/**
 * Generate stable device ID
 * Persists in localStorage for device identification across sessions
 */
export function generateDeviceId(): string {
  const storageKey = 'app-device-id';
  let deviceId = localStorage.getItem(storageKey);

  if (!deviceId) {
    // Format: device-type_unique-identifier
    const deviceType = getDeviceType();
    const uniqueId = uuidv4().split('-')[0]; // Use first segment of UUID
    deviceId = `${deviceType}_${uniqueId}`;

    localStorage.setItem(storageKey, deviceId);
  }

  return deviceId;
}

function getDeviceType(): string {
  const userAgent = navigator.userAgent.toLowerCase();

  if (userAgent.includes('ipad')) return 'tablet-ios';
  if (userAgent.includes('iphone')) return 'mobile-ios';
  if (userAgent.includes('android')) return 'android';
  if (userAgent.includes('windows')) return 'pc-win';
  if (userAgent.includes('macintosh')) return 'mac';
  if (userAgent.includes('linux')) return 'pc-linux';

  return 'unknown';
}

function getDepartmentId(): string {
  // Retrieved from authenticated user's session
  return sessionStorage.getItem('user-department-id') || 'unknown';
}
```

Component 2: Kong Gateway Configuration
Kong acts as the reverse proxy and load balancer with sticky session support.
Kong Configuration for WebSocket Sticky Sessions:
```yaml
_format_version: '3.0'
_transform: true

# 1. Define the WebSocket upstream
upstreams:
  - name: ws-cluster
    algorithm: 'consistent-hashing' # Instead of round-robin
    hash_on: 'header'               # Hash on custom header
    hash_on_header: 'x-device-id'   # Not IP!
    hash_fallback: 'none'           # Don't fall back to IP
    healthchecks:
      passive:
        healthy:
          interval: 0
        unhealthy:
          interval: 0
      active:
        type: http
        http_path: /health
        interval: 60
        timeout: 3
        unhealthy:
          http_statuses:
            - 429
            - 500
            - 503
    # 2. Cluster targets (Node.js WebSocket servers)
    targets:
      - target: 'websocket-node-1:3000'
        weight: 100
      - target: 'websocket-node-2:3000'
        weight: 100
      - target: 'websocket-node-3:3000'
        weight: 100

# 3. Define the service (exposes WebSocket)
services:
  - name: websocket-service
    host: ws-cluster # Reference upstream
    port: 3000
    protocol: 'ws' # WebSocket protocol
    path: /ws

# 4. Define route with WebSocket support
routes:
  - name: websocket-route
    service: websocket-service
    protocols:
      - ws
      - wss # Secure WebSocket
    paths:
      - /ws
    strip_path: false

# 5. Request transformer to ensure header preservation
plugins:
  - name: request-transformer
    service: websocket-service
    config:
      add:
        headers:
          - 'x-forwarded-for: $(client_ip)'
          - 'x-forwarded-proto: $(scheme)'
```

Key Configuration Details:
| Setting | Value | Reason |
|---|---|---|
| algorithm | consistent-hashing | Ensures same client always routes to same backend |
| hash_on | header | Use custom header for hashing, not IP |
| hash_on_header | x-device-id | Hash on device ID header from client |
| hash_fallback | none | Don't fall back to IP hashing (fails under NAT) |
| protocol | ws | Use WebSocket protocol, not HTTP |
Component 3: PM2 Configuration for Fork Mode
Why Fork Mode over Cluster Mode:
```
Cluster Mode (Round-Robin)
├─ 8 workers × 4 cores = 32 processes
├─ Each handles random connections
├─ Session state fragmented across all
└─ Requires Redis Adapter for state sync

Fork Mode (Single Process)
├─ 1 worker per server instance
├─ Kong handles sticky routing
├─ All connections on same process = same in-memory state
├─ Horizontal scaling: Add more server instances
└─ Simpler architecture, easier to debug
```

PM2 Configuration for Fork Mode:
```javascript
module.exports = {
  apps: [
    {
      name: 'websocket-node-1',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1, // Single instance (Fork Mode)
      exec_mode: 'fork', // NOT 'cluster'
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3000,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4000,
        // Redis config for Socket.io adapter
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },

      // Zero-downtime reload
      listen_timeout: 10000,
      kill_timeout: 5000,
      wait_ready: true,

      // Monitoring and restart
      max_memory_restart: '1G',
      max_restarts: 5,
      min_uptime: '10s',

      // Logging
      out_file: '/var/log/pm2/websocket-node-1-out.log',
      error_file: '/var/log/pm2/websocket-node-1-error.log',
      combine_logs: true,
    },

    // Identical config for other nodes
    {
      name: 'websocket-node-2',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1,
      exec_mode: 'fork',
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3001,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4001,
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },
    },

    {
      name: 'websocket-node-3',
      script: './dist/apps/data-owner-bc/main.js',
      instances: 1,
      exec_mode: 'fork',
      env: {
        NODE_ENV: 'production',
        DATA_OWNER_BC_MODULE_HTTP_PORT: 3002,
        DATA_OWNER_BC_MODULE_MICROSERVICE_PORT: 4002,
        REDIS_HOST: 'redis-cluster.internal',
        REDIS_PORT: 6379,
      },
    },
  ],
};
```

Deployment Command:
```shell
# Start all WebSocket nodes
pm2 start ecosystem.config.js

# Monitor
pm2 monit

# View logs
pm2 logs websocket-node-1

# Zero-downtime reload (for updates)
pm2 reload websocket-node-1
```

Component 4: NestJS Socket.io Server Configuration
Backend Socket.io Setup with Redis Adapter:
```typescript
import { NestFactory } from '@nestjs/core';
import { IoAdapter } from '@nestjs/platform-socket.io';

import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

import { bootstrapApplication } from '@lib/common/utils/bootstrap.util';

import { DataOwnerBCModule } from './data-owner-bc.module';

class RedisIoAdapter extends IoAdapter {
  private adapterConstructor: ReturnType<typeof createAdapter>;

  async connectToRedis(): Promise<void> {
    // node-redis v4: connection options live under `socket`
    const pubClient = createClient({
      socket: {
        host: process.env.REDIS_HOST || 'localhost',
        port: parseInt(process.env.REDIS_PORT || '6379', 10),
      },
    });

    await pubClient.connect();

    const subClient = pubClient.duplicate();
    await subClient.connect();

    this.adapterConstructor = createAdapter(pubClient, subClient);
  }

  createIOServer(port: number, options?: any) {
    const server = super.createIOServer(port, {
      ...options,
      cors: {
        origin: process.env.ALLOWED_ORIGINS?.split(',') || ['http://localhost:3000'],
        methods: ['GET', 'POST'],
        credentials: true,
      },
      transports: ['websocket', 'polling'],
    });

    // Apply Redis adapter for inter-node communication
    server.adapter(this.adapterConstructor);

    return server;
  }
}

async function bootstrap(): Promise<void> {
  const app = await NestFactory.create(DataOwnerBCModule);

  // Set up Socket.io with Redis adapter
  if (process.env.NODE_ENV === 'production') {
    const redisAdapter = new RedisIoAdapter(app);
    await redisAdapter.connectToRedis();
    app.useWebSocketAdapter(redisAdapter);
  } else {
    // Development: use standard adapter (single process)
    app.useWebSocketAdapter(new IoAdapter(app));
  }

  // Standard bootstrap
  await bootstrapApplication({
    module: DataOwnerBCModule,
    globalPrefixNameEnv: 'DATA_OWNER_PREFIX_NAME',
    globalPrefixVersionEnv: 'DATA_OWNER_PREFIX_VERSION',
    defaultGlobalPrefixName: 'data-owner-bc',
    defaultGlobalPrefixVersion: 'v1',
    httpPortEnv: 'DATA_OWNER_BC_MODULE_HTTP_PORT',
    microservicePortEnv: 'DATA_OWNER_BC_MODULE_MICROSERVICE_PORT',
    swagger: {
      title: 'Data Owner API with Real-Time WebSocket',
      description: 'Data Owner BC with scalable real-time notifications',
      tag: 'Data Owner BC',
    },
  });
}

bootstrap();
```

Socket.io Gateway for Real-Time Events:
```typescript
import { Injectable, Logger } from '@nestjs/common';
import {
  ConnectedSocket,
  MessageBody,
  OnGatewayConnection,
  OnGatewayDisconnect,
  SubscribeMessage,
  WebSocketGateway,
  WebSocketServer,
} from '@nestjs/websockets';

import { Server, Socket } from 'socket.io';

import { NotificationService } from '../services/notification.service';

@Injectable()
@WebSocketGateway({
  cors: {
    origin: process.env.ALLOWED_ORIGINS?.split(',') || ['http://localhost:3000'],
    credentials: true,
  },
  transports: ['websocket', 'polling'],
})
export class NotificationGateway implements OnGatewayConnection, OnGatewayDisconnect {
  private readonly logger = new Logger(NotificationGateway.name);

  @WebSocketServer()
  server: Server;

  constructor(private readonly notificationService: NotificationService) {}

  /**
   * Handle client connection
   * Extract device ID from headers for tracking
   */
  handleConnection(socket: Socket) {
    const deviceId = socket.handshake.headers['x-device-id'] as string;
    const departmentId = socket.handshake.headers['x-department-id'] as string;
    const userId = socket.handshake.auth?.userId;

    socket.data.deviceId = deviceId;
    socket.data.departmentId = departmentId;
    socket.data.userId = userId;

    this.logger.log(
      `Device connected: ${deviceId} (Dept: ${departmentId}, User: ${userId}, SID: ${socket.id})`,
    );

    this.notificationService.trackConnection({
      socketId: socket.id,
      deviceId,
      departmentId,
      userId,
      connectedAt: new Date(),
    });
  }

  /**
   * Handle client disconnection
   */
  handleDisconnect(socket: Socket) {
    const { deviceId } = socket.data;
    this.logger.log(`Device disconnected: ${deviceId}`);

    this.notificationService.untrackConnection(socket.id);
  }

  /**
   * Subscribe client to department notifications
   */
  @SubscribeMessage('subscribe-department')
  async subscribeToDepartment(
    @ConnectedSocket() socket: Socket,
    @MessageBody() data: { departmentId: string },
  ) {
    const { departmentId } = data;
    const { deviceId } = socket.data;

    // Socket.io room = department ID
    socket.join(`department:${departmentId}`);

    this.logger.log(`Device ${deviceId} subscribed to department: ${departmentId}`);

    return {
      status: 'subscribed',
      room: `department:${departmentId}`,
    };
  }

  /**
   * Broadcast notification to specific department
   * Reaches ALL connected devices in that department, regardless of which Node process
   */
  async notifyDepartment(departmentId: string, notification: any) {
    this.server.to(`department:${departmentId}`).emit('notification', notification);
  }

  /**
   * Broadcast to all connected clients
   */
  async notifyAll(notification: any) {
    this.server.emit('notification:broadcast', notification);
  }

  /**
   * Send notification to specific device
   */
  async notifyDevice(deviceId: string, notification: any) {
    const sockets = (await this.server.fetchSockets()).filter(
      (s) => s.data.deviceId === deviceId,
    );

    sockets.forEach((s) => {
      s.emit('notification:direct', notification);
    });
  }
}
```

Part 3: Deployment Architecture
Single Server Deployment (Development/Testing)
For 100-200 concurrent connections:
```yaml
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    ports:
      - '6379:6379'
    volumes:
      - redis-data:/data

  app:
    build: .
    ports:
      - '3000:3000'
    environment:
      NODE_ENV: development
      REDIS_HOST: redis
      DATA_OWNER_BC_MODULE_HTTP_PORT: 3000
    depends_on:
      - redis

volumes:
  redis-data:
```

Multi-Server Cluster Deployment (Production)
For 1000+ concurrent connections:
```mermaid
graph TB
    subgraph "Load Balancing Tier"
        Kong["Kong API Gateway<br/>Port: 443 (SSL)<br/>Sticky Session:<br/>Hash on x-device-id"]
    end
    subgraph "Application Tier"
        WS1["Node 1<br/>Port: 3000<br/>Fork Mode"]
        WS2["Node 2<br/>Port: 3000<br/>Fork Mode"]
        WS3["Node 3<br/>Port: 3000<br/>Fork Mode"]
    end
    subgraph "Shared State Tier"
        Redis["Redis Cluster<br/>3 Masters<br/>3 Replicas<br/>Port: 6379"]
    end
    subgraph "Data Tier"
        DB["PostgreSQL<br/>Primary + Replicas<br/>Port: 5432"]
    end

    Kong -->|Hash: device-id| WS1
    Kong -->|Hash: device-id| WS2
    Kong -->|Hash: device-id| WS3
    WS1 <-->|Pub/Sub| Redis
    WS2 <-->|Pub/Sub| Redis
    WS3 <-->|Pub/Sub| Redis
    WS1 --> DB
    WS2 --> DB
    WS3 --> DB

    style Kong fill:#fff4e1,stroke:#ff9800,stroke-width:3px
    style Redis fill:#e1f5ff,stroke:#0288d1,stroke-width:3px
    style DB fill:#e8f5e9,stroke:#4caf50,stroke-width:3px
```
Infrastructure Requirements:
| Component | Specification | Reason |
|---|---|---|
| Kong Gateway | 4 CPU, 8GB RAM | Handle SSL termination, header-based routing |
| WebSocket Nodes | 8 CPU, 16GB RAM each (×3 nodes) | Process 300-400 concurrent connections per node |
| Redis Cluster | 4 CPU, 8GB RAM each (×6 nodes) | Pub/Sub for inter-node messaging, high throughput |
| PostgreSQL | 16 CPU, 32GB RAM (Primary) | Store notifications, device state, audit logs |
Approximate Capacity:
- 3 WebSocket nodes × 400 connections = 1200 concurrent connections
- 50% reserve capacity for spikes
- Each node: ~4GB RAM for Socket.io state
Health Checks and Monitoring
Health Check Endpoint:
```typescript
import { Controller, Get } from '@nestjs/common';
import { HealthCheck, HealthCheckService, HttpHealthIndicator } from '@nestjs/terminus';

import { Public } from '@lib/common/decorators/public.decorator';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private http: HttpHealthIndicator,
  ) {}

  @Get()
  @Public()
  @HealthCheck()
  check() {
    return this.health.check([
      () => this.http.pingCheck('redis', `redis://${process.env.REDIS_HOST}:6379`),
      () => this.http.pingCheck('database', `postgresql://${process.env.DB_HOST}:5432`),
    ]);
  }
}
```

Monitoring Metrics:
```typescript
import * as promClient from 'prom-client';

const activeConnections = new promClient.Gauge({
  name: 'socket_io_active_connections',
  help: 'Number of active Socket.io connections',
  labelNames: ['node'],
});

const messagesSent = new promClient.Counter({
  name: 'socket_io_messages_sent_total',
  help: 'Total number of messages sent',
  labelNames: ['event_type'],
});
```

Part 4: Enterprise Network Considerations
Firewall Configuration
Enterprise IT Requirements:
1. Outbound Rules:
   - Allow HTTPS (443) to WebSocket server
   - Allow WebSocket secure (WSS) connections
   - Allow WebSocket fallback (polling on HTTP)

2. Sticky Session Requirements:
   - Preserve HTTP headers (especially x-device-id)
   - Don't strip custom headers on inspection
   - Allow persistent connections (don't timeout)
Firewall Configuration Example:
```
Outbound Rule: Allow WebSocket
├─ Protocol: TCP
├─ Destination Port: 443 (HTTPS/WSS)
├─ Allow Headers: x-device-id, x-device-type
├─ Keep-Alive Timeout: 30 minutes
└─ Rule Priority: High
```

Device Identification at Scale
Handling 1000+ Devices with Reliable Identification:
```typescript
import { useEffect } from 'react';

export function useDeviceRegistration() {
  useEffect(() => {
    const registerDevice = async () => {
      const deviceId = generateDeviceId();
      const deviceInfo = {
        id: deviceId,
        type: getDeviceType(),
        model: getDeviceModel(),
        os: getOS(),
        osVersion: getOSVersion(),
        appVersion: process.env.NEXT_PUBLIC_APP_VERSION,
        departmentId: getUserDepartmentId(),
        lastRegistered: new Date(),
      };

      try {
        await fetch('/api/v1/devices/register', {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${authToken}`,
            'x-device-id': deviceId,
          },
          body: JSON.stringify(deviceInfo),
        });
      } catch (error) {
        console.error('Device registration failed:', error);
      }
    };

    registerDevice();
  }, []);
}
```

Backend Device Registry:
```typescript
import {
  Column,
  CreateDateColumn,
  Entity,
  PrimaryColumn,
  UpdateDateColumn,
} from 'typeorm';

@Entity({ name: 'device_registries', database: AppDatabases.APP_CORE })
export class DeviceRegistry implements ITimestamp {
  @PrimaryColumn({ type: 'varchar', length: 50 })
  device_id: string; // e.g., 'tablet-ios_a1b2c3d4'

  @Column({ type: 'varchar', length: 20 })
  device_type: string; // 'tablet-ios' | 'pc' | 'display'

  @Column({ type: 'varchar', length: 100 })
  device_model: string;

  @Column({ type: 'varchar', length: 50 })
  os: string;

  @Column({ type: 'varchar', length: 20 })
  os_version: string;

  @Column({ type: 'uuid' })
  user_id: string; // Currently logged-in user

  @Column({ type: 'varchar', length: 20 })
  department_id: string;

  @Column({ type: 'varchar', length: 50 })
  app_version: string;

  @Column({ type: 'varchar', length: 39, nullable: true })
  last_ip_address: string; // For debugging NAT issues

  @UpdateDateColumn({ type: 'timestamptz' })
  last_seen: Date;

  @CreateDateColumn({ type: 'timestamptz' })
  registered_at: Date;

  @Column({ type: 'uuid', nullable: true })
  created_by: string | null;

  @Column({ type: 'uuid', nullable: true })
  updated_by: string | null;
}
```

Summary Checklist
Architecture Setup Components:
- Client Device ID Generation: Stable, persistent device identification
- Kong Gateway Configuration: Sticky sessions with x-device-id hashing
- PM2 Fork Mode: Single-process per Node.js instance
- Redis Adapter: Inter-node Socket.io communication
- Health Checks: Monitoring endpoints for deployment health
- Firewall Rules: Allow WebSocket with header preservation
- Device Registry: Track all connected devices and their metadata
- Monitoring & Logging: PM2 logs, Prometheus metrics, connection tracking
Next Steps
Part 2 covers real-time database architecture, including:
- Notification system data model
- Event-driven synchronization
- Lookup table integration for real-time queries
- Data lifecycle and archival strategies