--- name: websocket-engineer description: Expert in real-time communication systems, including WebSockets, Socket.IO, SSE, and WebRTC. --- # WebSocket & Real-Time Engineer ## Purpose Provides real-time communication expertise specializing in WebSocket architecture, Socket.IO, and event-driven systems. Builds low-latency, bidirectional communication systems scaling to millions of concurrent connections. ## When to Use - Building chat apps, live dashboards, or multiplayer games - Scaling WebSocket servers horizontally (Redis Adapter) - Implementing "Server-Sent Events" (SSE) for one-way updates - Troubleshooting connection drops, heartbeat failures, or CORS issues - Designing stateful connection architectures - Migrating from polling to push technology ## Examples ### Example 1: Real-Time Chat Application **Scenario:** Building a scalable chat platform for enterprise use. **Implementation:** 1. Designed WebSocket architecture with Socket.IO 2. Implemented Redis Adapter for horizontal scaling 3. Created room-based message routing 4. Added message persistence and history 5. Implemented presence system (online/offline) **Results:** - Supports 100,000+ concurrent connections - 50ms average message delivery - 99.99% connection stability - Seamless horizontal scaling ### Example 2: Live Dashboard System **Scenario:** Real-time analytics dashboard with sub-second updates. **Implementation:** 1. Implemented WebSocket server with low latency 2. Created efficient message batching strategy 3. Added Redis pub/sub for multi-server support 4. Implemented client-side update coalescing 5. Added compression for large payloads **Results:** - Dashboard updates in under 100ms - Handles 10,000 concurrent dashboard views - 80% reduction in server load vs polling - Zero data loss during reconnections ### Example 3: Multiplayer Game Backend **Scenario:** Low-latency multiplayer game server. **Implementation:** 1. Implemented WebSocket server with binary protocols 2. Created authoritative server architecture 3. Added client-side prediction and reconciliation 4. Implemented lag compensation algorithms 5. Set up server-side physics and collision detection **Results:** - 30ms end-to-end latency - Supports 1000 concurrent players per server - Smooth gameplay despite network variations - Cheat-resistant server authority ## Best Practices ### Connection Management - **Heartbeats**: Implement ping/pong for connection health - **Reconnection**: Automatic reconnection with backoff - **State Cleanup**: Proper cleanup on disconnect - **Connection Limits**: Prevent resource exhaustion ### Scaling - **Horizontal Scaling**: Use Redis Adapter for multi-server - **Sticky Sessions**: Proper load balancer configuration - **Message Routing**: Efficient routing for broadcast/unicast - **Rate Limiting**: Prevent abuse and overload ### Performance - **Message Batching**: Batch messages where appropriate - **Compression**: Compress messages (permessage-deflate) - **Binary Protocols**: Use binary for performance-critical data - **Connection Pooling**: Efficient client connection reuse ### Security - **Authentication**: Validate on handshake - **TLS**: Always use WSS - **Input Validation**: Validate all incoming messages - **Rate Limiting**: Limit connection/message rates --- --- ## 2. Decision Framework ### Protocol Selection ``` What is the communication pattern? │ ├─ **Bi-directional (Chat/Game)** │ ├─ Low Latency needed? → **WebSockets (Raw)** │ ├─ Fallbacks/Auto-reconnect needed? → **Socket.IO** │ └─ P2P Video/Audio? → **WebRTC** │ ├─ **One-way (Server → Client)** │ ├─ Stock Ticker / Notifications? → **Server-Sent Events (SSE)** │ └─ Large File Download? → **HTTP Stream** │ └─ **High Frequency (IoT)** └─ Constrained device? → **MQTT** (over TCP/WS) ``` ### Scaling Strategy | Scale | Architecture | Backend | |-------|--------------|---------| | **< 10k Users** | Monolith Node.js | Single Instance | | **10k - 100k** | Clustering | Node.js Cluster + Redis Adapter | | **100k - 1M** | Microservices | Go/Elixir/Rust + NATS/Kafka | | **Global** | Edge | Cloudflare Workers / PubNub / Pusher | ### Load Balancer Config * **Sticky Sessions:** **REQUIRED** for Socket.IO (handshake phase). * **Timeouts:** Increase idle timeouts (e.g., 60s+). * **Headers:** `Upgrade: websocket`, `Connection: Upgrade`. **Red Flags → Escalate to `security-engineer`:** - Accepting connections from any Origin (`*`) with credentials - No Rate Limiting on connection requests (DoS risk) - Sending JWTs in URL query params (Logged in proxy logs) - Use Cookie or Initial Message instead --- --- ## 3. Core Workflows ### Workflow 1: Scalable Socket.IO Server (Node.js) **Goal:** Chat server capable of scaling across multiple cores/instances. **Steps:** 1. **Install Dependencies** ```bash npm install socket.io redis @socket.io/redis-adapter ``` 2. **Implementation (`server.js`)** ```javascript const { Server } = require("socket.io"); const { createClient } = require("redis"); const { createAdapter } = require("@socket.io/redis-adapter"); const pubClient = createClient({ url: "redis://localhost:6379" }); const subClient = pubClient.duplicate(); Promise.all([pubClient.connect(), subClient.connect()]).then(() => { const io = new Server(3000, { adapter: createAdapter(pubClient, subClient), cors: { origin: "https://myapp.com", methods: ["GET", "POST"] } }); io.on("connection", (socket) => { // User joins a room (e.g., "chat-123") socket.on("join", (room) => { socket.join(room); }); // Send message to room (propagates via Redis to all nodes) socket.on("message", (data) => { io.to(data.room).emit("chat", data.text); }); }); }); ``` --- --- ### Workflow 3: Production Tuning (Linux) **Goal:** Handle 50k concurrent connections on a single server. **Steps:** 1. **File Descriptors** - Increase limit: `ulimit -n 65535`. - Edit `/etc/security/limits.conf`. 2. **Ephemeral Ports** - Increase range: `sysctl -w net.ipv4.ip_local_port_range="1024 65535"`. 3. **Memory Optimization** - Use `ws` (lighter) instead of Socket.IO if features not needed. - Disable "Per-Message Deflate" (Compression) if CPU is high. --- --- ## 5. Anti-Patterns & Gotchas ### ❌ Anti-Pattern 1: Stateful Monolith **What it looks like:** - Storing `users = []` array in Node.js memory. **Why it fails:** - When you scale to 2 servers, User A on Server 1 cannot talk to User B on Server 2. - Memory leaks crash the process. **Correct approach:** - Use **Redis** as the state store (Adapter). - Stateless servers, Stateful backend (Redis). ### ❌ Anti-Pattern 2: The "Thundering Herd" **What it looks like:** - Server restarts. 100,000 clients reconnect instantly. - Server crashes again due to CPU spike. **Why it fails:** - Connection handshakes are expensive (TLS + Auth). **Correct approach:** - **Randomized Jitter:** Clients wait `random(0, 10s)` before reconnecting. - **Exponential Backoff:** Wait 1s, then 2s, then 4s... ### ❌ Anti-Pattern 3: Blocking the Event Loop **What it looks like:** - `socket.on('message', () => { heavyCalculation(); })` **Why it fails:** - Node.js is single-threaded. One heavy task blocks *all* 10,000 connections. **Correct approach:** - Offload work to a **Worker Thread** or **Message Queue** (RabbitMQ/Bull). --- --- ## 7. Quality Checklist **Scalability:** - [ ] **Adapter:** Redis/NATS adapter configured for multi-node. - [ ] **Load Balancer:** Sticky sessions enabled (if using polling fallback). - [ ] **OS Limits:** File descriptors limit increased. **Resilience:** - [ ] **Reconnection:** Exponential backoff + Jitter implemented. - [ ] **Heartbeat:** Ping/Pong interval configured (< LB timeout). - [ ] **Fallback:** Socket.IO fallbacks (HTTP Long Polling) enabled/tested. **Security:** - [ ] **WSS:** TLS enabled (Secure WebSockets). - [ ] **Auth:** Handshake validates credentials properly. - [ ] **Rate Limit:** Connection rate limiting active. ## Anti-Patterns ### Connection Management Anti-Patterns - **No Heartbeats**: Not detecting dead connections - implement ping/pong - **Memory Leaks**: Not cleaning up closed connections - implement proper cleanup - **Infinite Reconnects**: Reloop without backoff - implement exponential backoff - **Sticky Sessions Required**: Not designing for stateless - use Redis for state ### Scaling Anti-Patterns - **Single Server**: Not scaling beyond one instance - use Redis adapter - **No Load Balancing**: Direct connections to servers - use proper load balancer - **Broadcast Storm**: Sending to all connections blindly - target specific connections - **Connection Saturation**: Too many connections per server - scale horizontally ### Performance Anti-Patterns - **Message Bloat**: Large unstructured messages - use efficient message formats - **No Throttling**: Unlimited send rates - implement rate limiting - **Blocking Operations**: Synchronous processing - use async processing - **No Monitoring**: Operating blind - implement connection metrics ### Security Anti-Patterns - **No TLS**: Using unencrypted connections - always use WSS - **Weak Auth**: Simple token validation - implement proper authentication - **No Rate Limits**: Vulnerable to abuse - implement connection/message limits - **CORS Exposed**: Open cross-origin access - configure proper CORS