---
name: resilient-ws-ui
description: Guidelines for building applications that maintain reliable WebSocket connections to a backend and surface connection health clearly to the user. Invoke ONLY when the user explicitly references this skill by name (e.g. "/resilient-ws-ui", "use the resilient-ws-ui skill"). Do NOT auto-invoke based on topic match.
---

# Resilient WebSocket connections + visible connection health

Agent guidelines for two coupled problems:

1. **Reliability** — keep a logical channel to the backend open across page freezes, IP changes, NAT/proxy timeouts, browser bugs, and flaky networks, *without* relying on the transport's own `close` event.
2. **Visibility** — give the user a compact, always-on, truthful indication of connection health, plus a drill-down when they want detail.

The guidance is stack-agnostic. Concrete examples are drawn from the reference implementation in this repository (`src/client/`, `src/server.ts`) — TypeScript + browser `WebSocket` + Node `ws`, but the patterns transfer to other runtimes (native apps, mobile, SSE, long-polling, gRPC streaming).

---

## Part 1 — Reliability

### R1. Do not trust transport-level liveness

`WebSocket.readyState === OPEN` does **not** mean the peer is reachable. Common failure modes where the browser never fires `close`:

- Firefox stale WebSocket bug ([bugzilla 920074](https://bugzilla.mozilla.org/show_bug.cgi?id=920074)) — no TCP keepalives on WS; idle NAT drops are silent.
- Mobile page freeze — the tab is frozen, the TCP connection is reaped server-side, and `readyState` stays `OPEN` until you try to use it.
- IP change (Wi-Fi ↔ cellular, VPN toggle) — the old 4-tuple is dead; no close event on the client.
- Intermediate proxy idle timeout (Nginx 60s, Cloudflare 100s, cellular NAT 30s) — silent drop.

**Rule:** Liveness is established by application-level heartbeat, not by the transport. Everything else in this skill follows from that.

### R2. State machine, not a boolean

Model each connection with at least four states. A binary "connected/disconnected" is too coarse to drive either recovery logic or UI.

```
NEW ──→ ALIVE ──→ STALE ──→ DEAD
 │        ↑         │
 │        └─────────┘  (recovery: matching pong arrives in grace)
 └──────────────────────→ DEAD
```

| State   | Meaning                                                   |
|---------|-----------------------------------------------------------|
| `NEW`   | Handshake in progress.                                    |
| `ALIVE` | Open + recent matching pong.                              |
| `STALE` | A heartbeat timed out. May recover within a grace period. |
| `DEAD`  | Terminal. Kept visible briefly for the UI, then evicted.  |

Reference: `src/client/connection.ts` — `ConnectionState`, `transitionTo`. `STALE` is a first-class state with its own grace period (not just a transient moment inside `ALIVE → DEAD`), which is what enables overlapping-connection failover (R6).

### R3. Per-request heartbeat with nonces

Do not heartbeat by checking "has there been any message in the last N seconds?" That pattern misses the case where the server is alive but the *specific exchange* you care about is stuck.

- Each heartbeat carries a random nonce (UUID v4 or 8 random bytes) and a client timestamp.
- The peer echoes both and adds its own timestamp.
- A per-nonce timeout fires after `pongTimeoutMs`; if the matching pong is not back, mark **that connection** `STALE`.
- RTT = `now - clientTs` at the moment the matching pong arrives.

Reference: `src/client/connection.ts:148` (`sendPing`) and `src/client/connection.ts:224` (`handleMessage`) — pending pings are kept in a `Map` and resolved individually.
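A minimal sketch of the client half of this pattern. The hooks `markStale` and `recordRtt` are hypothetical, not the reference API; the reference implementation differs in detail:

```ts
// Sketch only; `markStale` and `recordRtt` are hypothetical hooks.
declare function markStale(reason: string): void;
declare function recordRtt(rttMs: number): void;

interface PendingPing {
  clientTs: number;
  timeoutId: ReturnType<typeof setTimeout>;
}

const pending = new Map<string, PendingPing>(); // nonce → in-flight ping

function sendPing(ws: WebSocket, pongTimeoutMs: number): void {
  const nonce = crypto.randomUUID();
  const clientTs = Date.now();
  ws.send(JSON.stringify({ type: "ping", nonce, clientTs }));
  const timeoutId = setTimeout(() => {
    // This specific exchange is stuck: the connection is STALE even if
    // other traffic is still flowing.
    pending.delete(nonce);
    markStale(`pong ${nonce} missing after ${pongTimeoutMs}ms`);
  }, pongTimeoutMs);
  pending.set(nonce, { clientTs, timeoutId });
}

function handlePong(msg: { nonce: string; clientTs: number }): void {
  const entry = pending.get(msg.nonce);
  if (!entry) return; // unsolicited or already timed out; ignore
  clearTimeout(entry.timeoutId);
  pending.delete(msg.nonce);
  recordRtt(Date.now() - entry.clientTs); // per-ping RTT sample
}
```

The key property is that every ping is individually accounted for: a pong can only settle the ping that carries its nonce.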
On the server side, do the same thing with protocol-level pings — see R11.

### R4. Connect timeout

A hanging TCP/TLS handshake (captive portal, SYN blackhole, upstream LB overload) will leave the socket in `CONNECTING` **indefinitely**. The platform will not time it out for you on a useful timescale.

Arm a timer at construction. When it fires, check *both* the wrapper state and the native `readyState` before aborting — you can race the platform's `open` event and abort a socket that just came up.

Reference: `src/client/connection.ts:103`:

```ts
this.connectTimeoutId = setTimeout(() => {
  if (
    this.state === ConnectionState.NEW &&
    this.ws.readyState === WebSocket.CONNECTING
  ) {
    this.close(`connect timeout after ${config.connectTimeoutMs}ms`);
  }
}, config.connectTimeoutMs);
```

### R5. Exponential backoff with jitter, cap, and ceiling

Naïve reconnect loops cause thundering-herd spikes after a server restart and are also a good way to get your IP rate-limited. Always:

- Base delay (e.g. 1s), doubling per attempt.
- Cap the per-attempt delay (e.g. 30s).
- **Full jitter**: `delay = min(cap, base × 2^attempt) × random(0.5, 1.0)` (not just ±20%).
- Maximum total attempts before giving up (e.g. 15), then stop and expose a manual "try again" affordance. Infinite retry loops burn battery and trick users into thinking "it's trying" when the backend is gone.
- Reset the counter to 0 the moment any connection reaches `ALIVE`.

Reference: `src/client/manager.ts:265` (`getBackoffDelay`).
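A minimal sketch of that computation (the constants are illustrative, not the reference values):

```ts
// Illustrative constants; the reference config may differ.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;
const MAX_ATTEMPTS = 15;

// Delay before reconnect attempt `attempt` (0-based), or null once the
// retry budget is exhausted and the manager should go terminal.
function getBackoffDelay(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null;
  const capped = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  const jitter = 0.5 + Math.random() * 0.5; // random(0.5, 1.0)
  return Math.round(capped * jitter);
}
```

Jittering the whole capped value, rather than adding a small ± wobble, spreads a restarted fleet's reconnects across half the backoff window instead of bunching them.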
### R6. Overlapping connections during `STALE` grace

When the active connection goes `STALE`, do **not** wait for it to resolve before starting a replacement. Instead:

1. Create a replacement immediately.
2. Let the old connection keep running through its grace period (`staleGracePeriodMs`).
3. If a late pong arrives on the old connection → promote/retain one, close the other as "superseded."
4. If the old one hits `DEAD` first → the new one is already establishing.

This gives you zero-gap failover and is the single biggest perceived-reliability win. Cap the total number of live connections (e.g. 3) so bug loops can't fork forever.

Reference: `src/client/manager.ts` — `handleStale`, `ensureReplacement`, `handleAlive` (promotion logic), `MAX_LIVE_CONNECTIONS`.

### R7. Close-code classification

Not every close is a reason to retry.

| Retriable (reconnect) | Permanent (stop) |
|---|---|
| 1001 Going Away, 1005 No Status, 1006 Abnormal, 1011 Internal, 1012 Service Restart, 1013 Try Again Later, 1014 Bad Gateway | 1002 Protocol Error, 1003 Unsupported Data, 1007 Invalid Payload, 1009 Message Too Big, 1010 Mandatory Extension, 1015 TLS Failure |

Reconnecting on a 1007 (invalid payload) or 1002 (protocol error) means you will hit the same bug on the next socket forever. Stop, surface the error, and let the user decide.

Reference: `src/client/manager.ts:50` (`NON_RETRIABLE_CODES`) and `handleDead`.

### R8. Detect event-loop pauses

Any of these will suspend timers *and* the network stack for seconds at a time: mobile tab freeze, OS sleep, debugger breakpoint, long GC, VM suspend. The Page Lifecycle API catches *some* of these (Chrome-only), but not all.

Install a 1-second interval timer that compares `Date.now()` to the expected tick. If the real elapsed time exceeds (interval + threshold), you know the event loop was paused.

```ts
this.tickIntervalId = setInterval(() => {
  const now = Date.now();
  const elapsed = now - this.lastTickAt;
  this.lastTickAt = now;
  if (elapsed > TICK_MS + JUMP_THRESHOLD_MS) {
    this.handleResume(elapsed);
  }
}, TICK_MS);
```

On detection:

- **Short gap** (< `pongTimeoutMs`): just ping existing connections to verify.
- **Long gap** (≥ `pongTimeoutMs`): open a replacement **proactively, in parallel**. NAT tables are gone, TCP state is gone, and Firefox won't fire `close`. Don't waste the pong timeout confirming what's already known.

Reference: `src/client/manager.ts:472` (`startTimeJumpDetector`), `handleResume`.

### R9. Page Lifecycle integration

On the browser, wire all of these. Each plugs a real hole:

| Event | Action |
|---|---|
| `visibilitychange` → visible | Check all connections; run deferred reconnect if pending. |
| `freeze` (Chrome) | Log only — `resume` does the work. |
| `resume` (Chrome) | Treat as a long freeze — proactive reconnect. |
| `pagehide` (persisted) | Close all sockets to allow BFCache. |
| `pageshow` (persisted) | Reconnect after BFCache restore. |
| `online` | Check all connections. |
| `offline` | Log. |
| Network Information API `change` | Check all connections. |

BFCache is worth calling out: **an open WebSocket disqualifies the page from BFCache in all browsers**. If BFCache matters for perceived navigation speed, close sockets on `pagehide` (when `event.persisted`) and reopen on `pageshow`. Do not use `beforeunload`/`unload` — those events *themselves* disqualify BFCache.

Reference: `src/client/manager.ts:407` (`setupLifecycleListeners`).

### R10. Defer reconnection while hidden (Phoenix pattern)

If a reconnect would fire while `document.visibilityState === "hidden"`, **defer it** until the tab becomes visible. Don't burn backoff attempts, battery, and rate-limit budget on a user who isn't there. The moment the tab is visible again, run the deferred attempt immediately (no extra wait).

Reference: `src/client/manager.ts:240` — reconnect is skipped and `pendingReconnectOnVisible` is set; the `visibilitychange` handler clears it and re-runs `scheduleReconnect()`. Pattern documented in [Phoenix PR #6534](https://github.com/phoenixframework/phoenix/pull/6534).

### R11. Server-side heartbeat: event-loop ordering matters

On Node.js (and anything with a phased event loop), the naïve server heartbeat pattern has a race:

```
timer phase:     "did we get a pong since last tick? no? terminate."
                         ↓
I/O poll phase:  buffered pong for the previous tick arrives here
```

After a brief server stall (GC, debugger, CPU spike), the timer runs first and kills clients whose pongs are waiting one phase later. False positives — and they mass-disconnect all your clients at once.

Two mitigations, used together:

1. **Defer termination to `setImmediate`** (check phase, *after* I/O poll). Buffered pongs get to clear the pending flag first.
2. **Nonce correlation with a one-tick lookback.** Each ping carries a random 8-byte nonce. The pong handler only clears `pending` if the echoed nonce matches the current *or previous* nonce. This (a) rejects unsolicited pongs (legal per RFC 6455 § 5.5.3) that would otherwise keep dead clients alive, and (b) covers the gap where a stall caused the nonce to rotate before the old pong landed.

Reference: `src/server.ts:68` — read the full comment block; it explains why no drift-detection threshold is needed once both pieces are in place.
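A minimal sketch of both mitigations together using Node's `ws` library. Structure and names are illustrative; the reference `src/server.ts` differs in detail:

```ts
import { randomBytes } from "node:crypto";
import { WebSocketServer, WebSocket } from "ws";

const HEARTBEAT_MS = 10_000;

interface HeartbeatState {
  pending: boolean;  // still waiting for a matching pong
  nonce: Buffer;     // nonce of the current tick's ping
  prevNonce: Buffer; // one-tick lookback for late pongs
}

const wss = new WebSocketServer({ port: 8080 });
const hb = new WeakMap<WebSocket, HeartbeatState>();

wss.on("connection", (ws) => {
  const state: HeartbeatState = {
    pending: false,
    nonce: Buffer.alloc(0),
    prevNonce: Buffer.alloc(0),
  };
  hb.set(ws, state);

  // Only a pong echoing the current or previous nonce clears `pending`;
  // unsolicited pongs (legal per RFC 6455) are ignored.
  ws.on("pong", (data) => {
    if (data.equals(state.nonce) || data.equals(state.prevNonce)) {
      state.pending = false;
    }
  });
});

setInterval(() => {
  for (const ws of wss.clients) {
    const state = hb.get(ws);
    if (!state) continue;

    if (state.pending) {
      // No matching pong in a full interval. Don't kill in the timers
      // phase: the pong may be sitting in the socket buffer, waiting for
      // this turn's I/O poll. Re-check in setImmediate (check phase),
      // which runs after poll in the same loop turn.
      setImmediate(() => {
        if (state.pending) ws.terminate();
      });
    }

    // Rotate the nonce and ping again. A buffered pong for the previous
    // nonce still clears `pending` thanks to the one-tick lookback.
    state.prevNonce = state.nonce;
    state.nonce = randomBytes(8);
    state.pending = true;
    ws.ping(state.nonce);
  }
}, HEARTBEAT_MS);
```

Note the ordering inside the tick: the kill is scheduled before the re-ping, but the decision is re-read in the check phase, after this turn's poll phase has had a chance to deliver a buffered pong.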
### R12. Destroyed flag (re-entry guard)

When tearing down a manager, calling `conn.close()` synchronously drives the connection into `DEAD`, which calls your `handleDead` handler, which may schedule a reconnect timer. You just armed a new reconnect during destroy.

Set a `destroyed` flag first, and short-circuit on it in every state handler, factory method, and scheduler. Do the final `clearTimeout` after closing all connections, as belt and suspenders.

Reference: `src/client/manager.ts:177` (`destroy`).

### R13. Avoid silent graceful degradation

When the library can't reconnect (max retries hit, non-retriable close code), **stop** and surface a terminal state. Don't just keep trying quietly. Users need to know the app is degraded so they can refresh, switch network, or give up — hiding it is worse than showing "disconnected."

Reference: the `isTerminal` flag surfaced in `ManagerStats` and rendered in the indicator tooltip as "STOPPED."

### R14. Known gaps to call out in designs

Be honest about what a heartbeat *cannot* fix in a pure main-thread browser implementation:

- Chrome throttles main-thread timers to once per minute after 5 minutes hidden + 30s of silence. Time-jump detection catches the resulting gap on resume, but if you need heartbeats to keep running while hidden, use a **dedicated Web Worker** for the heartbeat timer.
- No session resumption. A reconnect = a fresh logical session. If you need at-least-once delivery across reconnects, you need sequence numbers + replay on top.
- The freeze simulation in this demo is imperfect: real freezes suspend ALL timers simultaneously; a flag-based simulation only suspends handlers. The time-jump detector triggers the same recovery path, so the code under test is still exercised.

---

## Part 2 — Visual indication

Rules for surfacing the above to the user without being annoying or misleading.

### V1. One compact indicator in a stable location

There should be exactly one "how's the connection?" widget, and it should live in the same pixel on every screen of the app. The reference implementation uses a 32×32 circle in the status bar (`#ws-indicator`) with:

- Colored dot (state channel).
- Ring around the dot (countdown channel).
- A `data-state` attribute driving CSS variants.
- Tooltip on hover with full detail.
- `aria-label` for accessibility.

Reference: `src/client/ui.ts:67`–`75` (markup), `ws-indicator` CSS in `public/index.html`.

### V2. Derive the widget state; never store it

The widget has its own derived state (`alive / stale / connecting / dead / terminal / frozen`) computed from manager stats. Do not store this separately — it will drift.

```ts
function deriveWidgetState(stats: ManagerStats): WidgetState {
  if (stats.frozen) return "frozen";
  const seen = new Set(stats.connections.map(c => c.state));
  if (seen.has("ALIVE")) return "alive";
  if (seen.has("STALE")) return "stale";
  if (seen.has("NEW")) return "connecting";
  if (stats.isTerminal) return "terminal";
  if (stats.reconnectScheduledAt !== null || stats.reconnectDeferredUntilVisible) {
    return "connecting";
  }
  return "dead";
}
```

Reference: `src/client/ui.ts:306`.

### V3. Non-color channels are mandatory

Color alone fails colorblind users, printouts, and screenshots. Encode state in at least two of: color, motion (pulse/spin), shape (solid/ring/×), text label. The reference widget uses color + ring fill + a stateful label in the tooltip, and the document title mirrors the state in text: `(1/1) WS Reconnect` vs `(0/1) WS Reconnect [FROZEN]`.
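One way to guarantee the channels agree is to write all of them from the same derived state in a single place. A hypothetical sketch (the element wiring is illustrative, not the reference markup; `WidgetState` spells out the union named in V2):

```ts
type WidgetState = "alive" | "stale" | "connecting" | "dead" | "terminal" | "frozen";

// Write every channel (CSS hook, accessible name, tooltip text) from one
// derived value so the channels cannot drift apart.
function renderIndicator(el: HTMLElement, state: WidgetState): void {
  el.dataset.state = state; // CSS variants select on [data-state="stale"] etc.
  el.setAttribute("aria-label", `Connection: ${state}`); // screen readers
  el.title = `Connection: ${state}`; // plain-text channel on hover
}
```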
### V4. Visualize waiting phases as depleting budgets

When the connection is in a state that has a deadline (`ALIVE` awaiting a pong, `STALE` in its grace period, `NEW` within the connect timeout, a scheduled reconnect counting down), show a ring/bar that depletes from full to empty over the remaining time. This turns "is it about to fail?" into something the user can see at a glance.

Reference: `src/client/ui.ts:273` (`computeRingRemaining`). One function, four cases, driven by the derived state.

### V5. Rendering loop: rAF with throttle + immediate refresh on events

Two signals trigger a render:

1. A `requestAnimationFrame` loop that throttles itself to ~10 Hz (every 100ms) — enough to animate countdowns smoothly.
2. An `onUpdate` callback from the manager on every state change, for sub-100ms responsiveness to real events.

Do not poll at 60 Hz. Do not poll at 1 Hz and miss transitions. Throttled rAF + event push is the right combination. (A sketch of this loop follows at the end of Part 2.)

Reference: `src/client/ui.ts:113` (`scheduleUpdate`) and `src/client/manager.ts:59` (`onUpdate`).

### V6. Tooltip/expanded view is where the engineering detail lives

The compact widget says "something is wrong." The expanded view explains what:

- Pool summary (how many connections, in what state).
- Active connection id, uptime.
- In-flight pings.
- Packet-loss percentage.
- RTT windows: 30s / 1m / 5m — min / median / max / count.
- Backoff state: attempt N/max, time until next try, or "deferred (tab hidden)", or "stopped."
- Last close reason + code.

Reference: `src/client/ui.ts:354` (`renderTooltipHtml`).

### V7. Mirror the state in `document.title`

Hidden tabs do not show your compact indicator. They *do* show the title. Encode connection state there so it's visible at tab level:

```ts
document.title = `(${alive}/${total}) WS Reconnect${frozen ? " [FROZEN]" : ""}`;
```

Reference: `src/client/ui.ts:243`.

### V8. Event log for diagnosis (user-facing)

A scrollback of timestamped state transitions, lifecycle events, and reconnect attempts lets both developers and savvy users diagnose flakes without a browser devtools session. Keep it bounded (e.g. 500 entries, display the first 100).

Reference: `src/client/manager.ts:536` (`log`), `src/client/ui.ts:227` (`renderLog`).

### V9. Use active-vs-background states in card styling

When you render a pool, mark the **active** connection distinctly (border, badge). Non-active connections in the pool during a failover are informational noise; the user cares about "which one am I talking through right now."

Reference: `card-active` class, `activeConnectionId` in `ManagerStats`.

### V10. Never lie

If reconnection has been stopped (terminal), show "stopped" — not "connecting" with an animation that will never resolve. If a reconnect is deferred because the tab is hidden, say "deferred" — not "retrying." Users adapt their behavior to what the indicator says; making it optimistic is a UX bug that trains users to ignore it.
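Tying V2, V4, and V5 together, a sketch of the render loop promised above. The `manager` surface, `indicatorEl`, and the `render` internals are assumptions for illustration (reusing `deriveWidgetState` from V2 and `renderIndicator` from the V3 sketch), not the reference API:

```ts
declare const manager: {
  getStats(): ManagerStats;       // assumed accessor
  onUpdate(cb: () => void): void; // assumed state-change callback
};
declare const indicatorEl: HTMLElement;

const RENDER_INTERVAL_MS = 100; // ~10 Hz is enough for countdown rings
let lastRenderAt = 0;

// Recompute derived state and repaint every channel in one place.
function render(): void {
  const stats = manager.getStats();
  const state = deriveWidgetState(stats); // from V2
  renderIndicator(indicatorEl, state);    // from the V3 sketch
  // ...also update the ring from the remaining budget (V4), tooltip, title...
}

// Signal 1: rAF loop, self-throttled to ~10 Hz for smooth countdowns.
function frame(now: number): void {
  if (now - lastRenderAt >= RENDER_INTERVAL_MS) {
    lastRenderAt = now;
    render();
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);

// Signal 2: immediate refresh on every real state change.
manager.onUpdate(render);
```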
---

## Part 3 — Bringup checklist

When building a new client (or reviewing one), tick through these in order. A missing item at the top of the list invalidates the ones below it.

**Reliability layer:**

- [ ] Application-level heartbeat with per-ping nonces and per-ping timeouts.
- [ ] Four-state machine (`NEW`/`ALIVE`/`STALE`/`DEAD`) with a grace period on `STALE`.
- [ ] Connect timeout that also checks native `readyState`.
- [ ] Exponential backoff with full jitter, cap, max retries, and a terminal state after the cap.
- [ ] Overlapping-connection failover with a pool-size cap.
- [ ] Close-code classification (retriable vs permanent).
- [ ] Time-jump detector driving proactive reconnect on long gaps.
- [ ] Page Lifecycle wiring: `visibilitychange`, `freeze`/`resume`, `pagehide`/`pageshow` for BFCache, `online`/`offline`, Network Information API.
- [ ] Defer-while-hidden reconnect (Phoenix pattern).
- [ ] Server heartbeat with `setImmediate` deferral and nonce correlation (current + previous).
- [ ] `destroyed` flag guarding all state handlers and schedulers.

**Indicator layer:**

- [ ] Single compact widget in a stable location, with `aria-label`.
- [ ] Widget state *derived* from manager stats, never stored.
- [ ] Two independent state channels (color + shape/motion/text).
- [ ] Countdown ring for every waiting phase.
- [ ] Render loop = throttled rAF (≈10 Hz) + event-driven immediate refresh.
- [ ] Expanded/tooltip view with pool, RTT windows, loss %, backoff, last close.
- [ ] `document.title` reflects state so hidden tabs surface it.
- [ ] Bounded, timestamped event log.
- [ ] Terminal/deferred states are truthfully labeled.

If you can't tick an item, write a one-line note on why and what mitigates it — don't silently ship a gap.

---

## References

- This repo: `src/client/connection.ts`, `src/client/manager.ts`, `src/client/ui.ts`, `src/server.ts`, `README.md`.
- [Chrome Page Lifecycle API](https://developer.chrome.com/docs/web-platform/page-lifecycle-api)
- [web.dev BFCache guide](https://web.dev/articles/bfcache)
- [Phoenix visibility-aware reconnection (PR #6534)](https://github.com/phoenixframework/phoenix/pull/6534)
- Firefox bugs: [920074](https://bugzilla.mozilla.org/show_bug.cgi?id=920074) (no TCP keepalive), [1360753](https://bugzilla.mozilla.org/show_bug.cgi?id=1360753) (connection throttling), [1921382](https://bugzilla.mozilla.org/show_bug.cgi?id=1921382).
- [ws library heartbeat pattern](https://github.com/websockets/ws#how-to-detect-and-close-broken-connections)
- [RFC 6455 § 5.5 — Control frames / Ping/Pong](https://www.rfc-editor.org/rfc/rfc6455#section-5.5)
- [WebSocket close codes reference](https://websocket.org/reference/close-codes/)