# Performance: measured, 2026-07-02

Load-test results for the Tikron room engine, measured with [`tools/loadtest`](../tools/loadtest) (N simulated players, 20 Hz inputs, 30 s runs). Commit under test: P1/P1b hardened core (`34829a0`).

> These are raw measurements. For what they mean for production decisions (the scale envelope, room placement, and transport trade-offs), see AGENTS.md → "Limits & roadmap".

Two environments:

- **Local**: `wrangler dev` (workerd) on the dev machine. Useful for *relative* comparisons and server-CPU cost (network RTT ≈ 0). **Not representative for multi-room load**: local workerd runs every DO in one process, so parallel rooms contend for one CPU (see "Horizontal scaling" below).
- **Deployed**: real Cloudflare Workers + Durable Objects (`tikron-gateway.workers.dev`), measured from a residential connection in Korea. Absolute numbers include ~80 ms network RTT to the assigned edge/DO location; server-side cost is the local column. (These numbers were measured before the rename, when the worker was named `playedge-gateway`; the methodology and figures are unchanged.)

Metrics: **ack RTT** = client input → server `s:ack` round trip (the "responsiveness" number a player feels on top of their base ping). **Jitter** = |state-frame gap − expected 50 ms|. **Downlink** = per-client received bytes/s.
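
The three metrics defined above can be computed from raw per-client samples roughly as follows. This is a minimal sketch, not the `tools/loadtest` implementation: the `RunSamples` shape, the function names, and the nearest-rank percentile method are all assumptions made for illustration.

```typescript
// Hypothetical sample shape; the real loadtest report format is not shown in this document.
interface RunSamples {
  ackRttMs: number[];       // one entry per acked input
  frameArrivalMs: number[]; // arrival timestamps of state frames, ascending
  bytesReceived: number;    // total bytes received by this client
  durationS: number;        // run length in seconds
}

// Nearest-rank percentile (p in 0..100), computed on a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Jitter per frame: |gap between consecutive frames - expected interval|.
function jitterSamples(arrivals: number[], expectedMs = 50): number[] {
  const out: number[] = [];
  for (let i = 1; i < arrivals.length; i++) {
    out.push(Math.abs(arrivals[i] - arrivals[i - 1] - expectedMs));
  }
  return out;
}

function summarize(run: RunSamples) {
  return {
    ackP50: percentile(run.ackRttMs, 50),
    ackP95: percentile(run.ackRttMs, 95),
    ackP99: percentile(run.ackRttMs, 99),
    jitterP50: percentile(jitterSamples(run.frameArrivalMs), 50),
    downlinkKiBps: run.bytesReceived / run.durationS / 1024,
  };
}
```

With only a few hundred samples per run, the p99 rests on a handful of observations, which is one reason the tail columns move more between runs than the p50.
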

## Scenario: `agar`, the realistic case (binary delta + AOI + acks, 20 Hz sim)

| players/room | env | ack RTT p50 | p95 (ms) | p99 (ms) | downlink/client | errors |
|---|---|---|---|---|---|---|
| 2 | local | 1.1 ms | 2.7 | 5.4 | 1.4 KiB/s | 0 |
| 8 | local | 2.5 ms | 5.6 | 21.7 | 2.8 KiB/s | 0 |
| 16 | local | 5.6 ms | 9.9 | 23.9 | 4.2 KiB/s | 0 |
| 20 (cap) | local | 8.6 ms | 16.4 | 320 | 5.2 KiB/s | 0 |
| 2 | deployed | 79.9 ms | 86.0 | 103.8 | 1.3 KiB/s | 0 |
| 8 | deployed | 80.8 ms | 102.1 | 112.0 | 2.0 KiB/s | 0 |
| 16 | deployed | 59.4 ms | 71.6 | 83.4 | 3.2 KiB/s | 0 |
| 20 (cap) | deployed | 84.9 ms | 97.6 | 114.5 | 3.8 KiB/s | 0 |

Takeaways:

- **Server processing is cheap.** Local ack RTT (pure server cost) stays under ~9 ms p50 up to the 20-player cap; deployed RTT is dominated by network distance (~80 ms from this client), i.e. the engine adds single-digit ms.
- **A full 20-player AOI room is comfortable** on real infrastructure: p99 ≈ 115 ms including the network, zero errors, tick jitter pinned at the 50 ms cadence.
- **AOI keeps bandwidth flat-ish**: 1.3 → 3.8 KiB/s per client from 2 → 20 players (each client only receives its view, not the whole room).

## Horizontal scaling: 8 rooms × 16 players (128 concurrent)

| env | ack RTT p50 | p95 (ms) | p99 (ms) | connects | errors |
|---|---|---|---|---|---|
| local | **645 ms** | 3586 | 4602 | 128/128 | 0 |
| deployed | **81.9 ms** | 96.5 | 134.0 | 128/128 | 1 transient close |

The single most important measurement of the sweep. Locally, 8 simultaneous simulations share one workerd process and collapse (p50 645 ms). On real Cloudflare, each room is its own Durable Object placed independently, so 128 concurrent players across 8 rooms behave like one 16-player room (p50 82 ms vs 59–85 ms).
**Rooms scale horizontally on the edge; never benchmark multi-room capacity against `wrangler dev`.**

## Capacity enforcement under load

Driving 32/64/128 connections at a 20-cap agar room rejects exactly 12/44/108 of them with `room_full` + close 4002 while the 20 seated players keep playing cleanly: the P1 room-side cap holds under connection storms, not just in unit tests.

## Scenario: `movement`, no AOI (before the flush throttle: the failure case)

| players/room | env | ack RTT p50 | p95 (ms) | downlink/client | unexpected closes |
|---|---|---|---|---|---|
| 32 | local | 11.5 ms | 3492 | 20.5 KiB/s | 0 |
| 64 | local | 85.8 ms | 160 | 34.6 KiB/s | 0 |
| 32 | deployed | 61.2 ms | 80.7 | 17.2 KiB/s | 0 |
| 64 | deployed | 82.2 ms | 594 | **0.6 KiB/s** | **58 / 64** |

Before the sync throttle, `markStateChanged()` coalesced flushes only per microtask, so the broadcast rate tracked the *input* rate: 64 players × 20 Hz inputs → hundreds of full-room delta broadcasts/sec × 64 recipients. On real Cloudflare the Durable Object's output path saturated and **58 of 64 sockets were force-closed mid-run**. This motivated the tick-aligned flush throttle (`syncIntervalMs`, default 50 ms) in the room core.

### After the flush throttle (`syncIntervalMs = 50`, same commit)

| players/room | env | ack RTT p50 | p95 (ms) | downlink/client | frames/s/client | unexpected closes |
|---|---|---|---|---|---|---|
| 64 | local | **6.0 ms** (was 85.8) | 11.2 | 26.0 KiB/s | **16.9** (was 741) | 0 |
| 32 | deployed | 80.2 ms | 93.3 | 14.6 KiB/s | ~21 | 0 |
| 64 | deployed | **83.3 ms** | 102.1 | 30.0 KiB/s | **~21** (was ~741) | **0 / 64** (was 58/64) |

The mechanism is the *event rate*, not raw bytes: coalescing flushes to a 50 ms boundary cut broadcasts per client ~44× while each delta carries the same information (bandwidth drops a more modest ~15–35%, since delta payloads grow to cover the window).
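
The coalescing effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not the room core: only `markStateChanged` and `syncIntervalMs` are names from this document, while `ThrottledFlusher`, `maybeFlush`, and `simulate` are hypothetical, and the model is idealized (one broadcast per input before the throttle, a 1 ms timer granularity, evenly staggered players).

```typescript
// Hypothetical throttle: at most one flush per syncIntervalMs, and only when dirty.
class ThrottledFlusher {
  private dirty = false;
  private lastFlushAt = -Infinity;
  private syncIntervalMs: number;
  flushes = 0;

  constructor(syncIntervalMs: number) {
    this.syncIntervalMs = syncIntervalMs;
  }

  // Called on every state mutation, for example once per processed input.
  markStateChanged(): void {
    this.dirty = true;
  }

  // Called from the room's timer; returns true when a delta would be broadcast.
  maybeFlush(nowMs: number): boolean {
    if (!this.dirty || nowMs - this.lastFlushAt < this.syncIntervalMs) return false;
    this.dirty = false;
    this.lastFlushAt = nowMs;
    this.flushes++;
    return true;
  }
}

// Steps a virtual clock in 1 ms increments and counts broadcasts under both models.
function simulate(players: number, inputHz: number, durationMs: number, syncIntervalMs: number) {
  const flusher = new ThrottledFlusher(syncIntervalMs);
  const inputPeriodMs = Math.round(1000 / inputHz);
  let perInputBroadcasts = 0;
  for (let t = 0; t < durationMs; t++) {
    for (let p = 0; p < players; p++) {
      // Stagger players across the input period so arrivals are spread out.
      if (t % inputPeriodMs === p % inputPeriodMs) {
        flusher.markStateChanged();
        perInputBroadcasts++; // unthrottled model: every input triggers a broadcast
      }
    }
    flusher.maybeFlush(t);
  }
  return { perInputBroadcasts, throttledFlushes: flusher.flushes };
}

const result = simulate(64, 20, 1000, 50);
console.log(result.perInputBroadcasts, result.throttledFlushes);
```

In this model the throttled count is bounded by `durationMs / syncIntervalMs` regardless of player count, which is the property that matters; the measured reduction in the table (~44×) is smaller than the idealized one because the pre-throttle path already coalesced per microtask.
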
With the DO no longer drowning in per-input broadcast work, local ack RTT at 64 players fell from 86 ms to 6 ms and the deployed 64-player room went from dropping 91% of its sockets to fully stable.

Regression check (agar, AOI): local 16-player p50 2.0 ms / 2.7 KiB/s (was 5.6 ms / 4.2 KiB/s); deployed 20-player p50 79.9 ms / 3.7 KiB/s and deployed 8×16 (128 CCU) p50 72.8 ms, p99 105.7, zero closes. That is strictly better across the board. State cadence now locks to the 50 ms boundary (raw inter-arrival p50 49.8 ms, jitter p50 ~1 ms).

### Protocol v2 / netcode-hardening addendum (2026-07-02, later same day)

Two semantics changed after the tables above were measured; the bandwidth and stability numbers remain representative, but interpret ack RTT with this in mind:

- **Tick-aligned input queue** (IoArenaRoom): inputs now wait for the next simulation tick before being processed and acked, so input→ack includes up to one tick (50 ms) of queue wait *by design*. Measured local agar 16-player ack p50 is ~64 ms (was 5.6 ms with immediate dispatch). Client prediction masks this; the win is a consistent world per tick. Wall-clock responsiveness for players is governed by prediction, not ack RTT.
- **AOI is now grid-indexed** (uniform spatial hash, ~O(viewers + entities) per flush instead of O(viewers × entities)), with property tests asserting exact parity with the naive filter. Same-room numbers at ≤20 players are unchanged; the benefit grows with entity count.

## 100 players, one room: F1 hot-path pass (2026-07-02, local)

Target: FPS-grade server processing (<20 ms per tick+flush) at 100 CCU in a single room. The F1 pass comprises a shared TextEncoder/Decoder, a global AOI change-guard, one-pass `encodeDeltaOrNull`, codec-based baseline snapshots instead of `structuredClone`, integer grid keys, and an incremental orb grid, plus a new in-room `tk:stats` probe that reports real tick/flush durations from inside the DO.
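
To make the grid-indexed AOI and the "integer grid keys" item concrete, here is a minimal uniform spatial hash with a naive filter to check parity against. It is a sketch, not the engine's code: `SpatialGrid`, `cellKey`, `naiveQuery`, and the `Entity` shape are hypothetical names, and the key packing assumes cell coordinates stay within a 16-bit signed range.

```typescript
interface Entity { id: number; x: number; y: number; }

// Pack two signed cell coordinates into one integer Map key (no string allocation).
// Assumption for this sketch: cell coordinates stay within [-32768, 32767].
function cellKey(cx: number, cy: number): number {
  return (cx + 32768) * 65536 + (cy + 32768);
}

class SpatialGrid {
  private cells = new Map<number, Entity[]>();
  private cellSize: number;

  constructor(cellSize: number) {
    this.cellSize = cellSize;
  }

  insert(e: Entity): void {
    const key = cellKey(Math.floor(e.x / this.cellSize), Math.floor(e.y / this.cellSize));
    const bucket = this.cells.get(key);
    if (bucket) bucket.push(e);
    else this.cells.set(key, [e]);
  }

  // Visits only the cells overlapping the query square, then applies the exact distance test.
  query(x: number, y: number, radius: number): Entity[] {
    const out: Entity[] = [];
    const minCx = Math.floor((x - radius) / this.cellSize);
    const maxCx = Math.floor((x + radius) / this.cellSize);
    const minCy = Math.floor((y - radius) / this.cellSize);
    const maxCy = Math.floor((y + radius) / this.cellSize);
    for (let cx = minCx; cx <= maxCx; cx++) {
      for (let cy = minCy; cy <= maxCy; cy++) {
        const bucket = this.cells.get(cellKey(cx, cy));
        if (!bucket) continue;
        for (const e of bucket) {
          const dx = e.x - x;
          const dy = e.y - y;
          if (dx * dx + dy * dy <= radius * radius) out.push(e);
        }
      }
    }
    return out;
  }
}

// Reference implementation: check every entity against every viewer.
function naiveQuery(entities: Entity[], x: number, y: number, radius: number): Entity[] {
  return entities.filter((e) => {
    const dx = e.x - x;
    const dy = e.y - y;
    return dx * dx + dy * dy <= radius * radius;
  });
}
```

A parity check in the spirit of the property tests mentioned above would insert random entities, query from random viewer positions, and assert that the sorted id sets from `query` and `naiveQuery` are identical.
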

| 100 players/room (local) | before F1 | after F1 |
|---|---|---|
| server tick processing p50 / max | (tick budget overrun) | **0 ms / 2 ms** |
| server flush processing p50 / max | n/a | **3 ms / 11 ms** |
| ack RTT p50 | 76.6 ms | 74.1 ms |
| ack RTT p95 / p99 / max | 271 / 893 / 1187 ms | **120 / 171 / 232 ms** |
| unexpected closes | 0 | 0 |

Readings:

- **Server processing is now 3–11 ms at 100 CCU**: the <20 ms FPS budget is met with 2× headroom before any structural (cell-sharing / priority) work.
- ack p50 barely moves because it is dominated by the *by-design* tick-queue wait (up to one 50 ms tick), not CPU; the tail (p95/p99) collapsing 2–5× is the CPU win.
- The ~62 ms local state cadence (vs the 50 ms `syncIntervalMs`) is a workerd timer artifact: it is identical at 20/50/100 players while flush cost is 1–3 ms, and the deployed cadence measured 49.8 ms (above). Validate on real Cloudflare when the deployed 100-CCU gate runs.
- Sweep hygiene: back-to-back runs against one local workerd leave the previous room alive (reconnection windows + alarms), which fabricates stalls and mass closes in later runs. Trust solo runs (or long cooldowns) locally; deployed rooms are isolated DOs.

### Deployed 100 CCU, one room (same day, real Cloudflare DO)

Staging worker (`wrangler.staging.jsonc`: workers.dev, DEV_MODE, no D1), measured from the same Korean residential connection:

| 100 players/room (deployed) | 3 s ramp | 10 s ramp |
|---|---|---|
| connect success / unexpected closes | 80/100 · 21 | **100/100 · 0** |
| ack RTT p50 / p99 / max | 90 / 137 / 296 ms | 97 / 144 / 319 ms |
| state cadence (raw gap p50) / jitter p50 | 48.8 ms / 3.5 ms | **48.4 ms / 6.0 ms** |
| server tick+flush (tk:stats) | 0 ms (all buckets) | 0 ms (all buckets) |
| ack spikes >1 s | 0 | 0 |

- **A single Durable Object holds 100 concurrent players cleanly**: 2,000 incoming input msg/s + 2,000 outgoing state frames/s, sustained, with zero drops and the cadence locked at 50 ms.
  Production DO hardware is far faster than local workerd (tick+flush under the 1 ms measurement resolution).
- The 62 ms local cadence is confirmed as a workerd artifact (deployed locks at ~48.5 ms).
- The one real limit found: **connection admission**. Sending 100 upgrades to one DO inside 3 s fails ~20% of connects and destabilizes early sockets; pacing joins over 10 s is fully clean. Mitigation (join pacing / client backoff) is an F3 work item; steady-state operation is not affected.

## 100 players, one room: F3 FPS stack (2026-07-02, deployed)

`fps` loadtest scenario against the deployed ShooterRoom (hitscan shooter: 20 Hz moves + 1 Hz subtick-timestamped shots resolved via lag-comp rewinds, quant-coded state, AOI 600 with priority tiers `[{300,1},{600,4}]`, AOI-filtered shot tracers). 100 players, one room, 45 s, 10 s ramp:

| metric | value |
|---|---|
| connect success / unexpected closes / spikes | **100/100 · 0 · 0** |
| server tick+flush (tk:stats) | 0 ms (all buckets) |
| ack RTT p50 / p99 / max | 136 / 213 / 483 ms |
| shot events delivered | ~3,200/s total (129,794 over the run) |
| downlink/client | 12.8 KiB/s (state + ~32 tracer events/s) |
| state frame gap p50 | 62 ms |

- **The full FPS feature stack holds at 100 CCU on one DO**: state fan-out + tiered AOI + per-field map deltas + subtick shot resolution, zero drops.
- The 62 ms frame gap (vs agar's 48.4 ms on the same infra) is the priority tiers working: viewers whose visible entities are all far-tier get their flush suppressed entirely that round (null delta → no send), so gaps mix 50/100 ms. Near-tier updates stay at cadence.
- ack p50 136 ms (vs ~90 ms agar) includes shot/tick processing semantics of the shooter room; the tail (p99 213 ms, no spikes) is what matters for prediction-masked play.
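
The flush suppression described above can be sketched as follows. The reading of `[{300,1},{600,4}]` used here is an assumption, not something this document confirms: each tier is taken to mean "entities within `radius` are refreshed every `every` flushes". The names `Tier`, `TrackedEntity`, and `dueEntities` are hypothetical, and change detection is left out for brevity.

```typescript
// Assumed tier semantics: entities within `radius` are refreshed every `every` flushes.
interface Tier { radius: number; every: number; }
interface Point { x: number; y: number; }
interface TrackedEntity extends Point { id: number; }

// Assumed interpretation of the `[{300,1},{600,4}]` shorthand used in the text.
const TIERS: Tier[] = [
  { radius: 300, every: 1 },
  { radius: 600, every: 4 },
];

// Returns the entity ids due for this viewer on the given flush,
// or null when nothing is due, in which case the caller skips the send entirely.
function dueEntities(
  viewer: Point,
  entities: TrackedEntity[],
  flushIndex: number,
  tiers: Tier[] = TIERS,
): number[] | null {
  const due: number[] = [];
  for (const e of entities) {
    const distance = Math.hypot(e.x - viewer.x, e.y - viewer.y);
    // Tiers are ordered nearest first, so the first match is the tightest tier.
    const tier = tiers.find((t) => distance <= t.radius);
    if (!tier) continue; // outside the AOI radius
    if (flushIndex % tier.every === 0) due.push(e.id);
  }
  return due.length > 0 ? due : null;
}

const viewer = { x: 0, y: 0 };
const farOnly: TrackedEntity[] = [{ id: 2, x: 500, y: 0 }];
console.log(dueEntities(viewer, farOnly, 1)); // nothing due on this flush: send suppressed
console.log(dueEntities(viewer, farOnly, 4)); // far tier is due on this flush
```

Under this reading, a viewer with any near-tier entity in view receives a frame every flush, while a viewer seeing only far-tier entities skips flushes, which is the kind of behavior that widens the measured median frame gap.
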

## 100 players, one room: track-A netcode pass (2026-07-03, deployed)

Same `fps` scenario after the latency/feel pass (PLAN-LATENCY-UDP track A): **30 Hz room loop** (input drain + lag snapshots + flush; `tickMs = 33`; the IoArena preset ties all three to the tick, so `syncIntervalMs` alone cannot raise the send rate), aggregate `MoveBudget` token bucket, rewind depth capped at 200 ms, `receivedAt`-based move timing (drain-time `Date.now()` quantized real arrival spacing onto the tick grid and mis-rejected legal moves; fixed in the core by stamping receipt time on every input). 100 players, one room, 30 s, 10 s ramp (`results/trackA-100p-fps-30hz-v2.json`):

| metric | 20 Hz loop (before) | 30 Hz loop (after) |
|---|---|---|
| connect success / closes / spikes | 100/100 · 0 · 0 | **100/100 · 0 · 0** |
| server tick+flush (tk:stats) | 0 ms | **0 ms** (all buckets) |
| raw state-frame gap p50 / p95 | 49.6 / 62.2 ms | **35.4 / 72.4 ms** |
| ack RTT p50 / p95 / p99 | 112.2 / 132.4 / 148.6 ms | **70.6 / 95.2 / 112.0 ms** |
| downlink/client | 7.85 KiB/s | 8.32 KiB/s |

- **Ack p50 −37%** (112 → 71 ms): inputs drain every 33 ms instead of 50 (mean queue wait ~25 → ~17 ms) and the acked flush leaves sooner.
- **+20% state frames for +6% bandwidth**: deltas per frame shrink when frames come more often, so the rate hike is nearly free on the wire.
- Server processing stays at 0 ms with 1.5× flush/lag-snapshot cadence; the <20 ms budget has enormous headroom at 100 CCU.
- Client side (shipped with the same pass): adaptive interpolation delay (`SnapshotBuffer` starvation-feedback controller, 60–200 ms band), ~50 ms capped velocity extrapolation across a late/lost frame, min-RTT clock-sync offset filter, and instant local fire feedback (muzzle/tracer/sound at mousedown; the server's rewound hitscan stays authoritative).
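
The `receivedAt` fix in the track-A notes above is easiest to see with a token bucket that refills by elapsed time. The sketch below is illustrative only: `MoveBudgetSketch` and `replay` are hypothetical, the refill rule and the numbers (1 unit/ms, 50-unit burst) are assumptions, and the real `MoveBudget` may be structured differently.

```typescript
// Hypothetical token bucket: tokens are distance units the player may still move.
class MoveBudgetSketch {
  private tokens: number;
  private lastAtMs: number | null = null;
  private maxSpeedPerMs: number; // refill rate in units per millisecond
  private burst: number;         // bucket capacity in units

  constructor(maxSpeedPerMs: number, burst: number) {
    this.maxSpeedPerMs = maxSpeedPerMs;
    this.burst = burst;
    this.tokens = burst;
  }

  // `atMs` is the timestamp used for refill; returns true when the move is accepted.
  tryMove(distance: number, atMs: number): boolean {
    if (this.lastAtMs !== null) {
      const elapsed = Math.max(0, atMs - this.lastAtMs);
      this.tokens = Math.min(this.burst, this.tokens + elapsed * this.maxSpeedPerMs);
    }
    this.lastAtMs = atMs;
    if (distance > this.tokens) return false;
    this.tokens -= distance;
    return true;
  }
}

// Replays a sequence of equal-sized moves stamped with the given timestamps.
function replay(stampsMs: number[], distance: number): boolean[] {
  const budget = new MoveBudgetSketch(1, 50);
  return stampsMs.map((at) => budget.tryMove(distance, at));
}

// Three legal moves that really arrived 50 ms apart.
console.log(replay([0, 50, 100], 50));  // stamped at receipt: all accepted
// The same moves, but the last two were drained in one tick and stamped at drain time.
console.log(replay([0, 100, 100], 50)); // third move sees zero elapsed time and is rejected
```

Stamping at receipt preserves the real spacing between inputs, so a client that sent legal moves is not penalized when the server happens to drain two of them in the same tick.
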

## 100 players, one room: LAT-2 pass, 60 Hz loop (2026-07-03, deployed)

Second latency pass: room loop 30 → **60 Hz** (`tickMs = 16`), client sends 20 → 30 Hz (`stepMs = 33`), adaptive interpolation floor 60 → 30 ms. The one NEGATIVE result matters most: `queueInputs = false` (per-input immediate dispatch) was tried first and REGRESSED ack p50 70.6 → 90.7 ms with 152 >1 s stalls, because every input's ack became its own write on top of the 60 Hz flush fan-out (~8k sends/s at 100p). With the queue kept ON, drain-batched acks coalesce those writes AND the 60 Hz drain only costs a mean 8 ms of wait (`results/lat2-100p-fps.json` vs `results/lat2b-100p-fps.json`):

| metric | 30 Hz loop | 60 Hz + immediate acks | **60 Hz + queued (shipped)** |
|---|---|---|---|
| ack RTT p50 / p95 | 70.6 / 95.2 ms | 90.7 / 115.0 ❌ | **61.3 / 97.1 ms** |
| ack spikes >1 s | 0 | 152 ❌ | **0** |
| raw state-frame gap p50 | 35.4 ms | 25.2 ms | **21.5 ms** |
| downlink/client | 16.2 KiB/s | 16.4 KiB/s | 19.5 KiB/s |
| server tick+flush / closes | 0 ms · 0 | 0 ms · 0 | **0 ms · 0** |

- The remaining ack latency is mostly physics: the measuring connection's Cloudflare edge is **HKG** (TCP connect ~48 ms from Seoul; the ICN PoP mostly serves Enterprise plans), so ~50+ ms of the 61.3 is the Seoul↔Hong Kong round trip. Server-side waiting is now ~8 ms mean and CPU is 0 ms.
- Perceived remote latency (one-way + gap/2 + adaptive interp) lands around ~75–85 ms, down ~30% from the 30 Hz pass (~110 ms).

## Baselines

- `ttt-json` (turn-based, JSON sync, no tick): 77 B/s per idle client; a turn-based room's cost is effectively the WebSocket keepalive.
- Load generator ceiling: single Node process drove 128 concurrent clients with ≤16 ms max event-loop lag (metrics not generator-bound).

## Method notes / caveats

- One client machine (Windows 11, residential fiber, Korea), 30 s runs, one run per configuration; treat small deltas (<20%) as noise.
  The 16-player deployed run (p50 59 ms) vs its neighbors (80–85 ms) illustrates run-to-run routing variance.
- Deployed RTTs include the client's distance to the edge and the DO's placement; players closer to their room's DO will see proportionally less.
- Raw JSON reports for every run live in `tools/loadtest/results/` (gitignored); re-run the sweep with the commands in `tools/loadtest/README.md`.
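
The "<20% is noise" rule from the first caveat can be applied mechanically when comparing two runs. This is a small worked sketch; `relativeDelta` and `exceedsNoise` are hypothetical helpers, and the inputs are figures quoted in the tables above.

```typescript
// Relative change from `before` to `after`, for example -0.37 for a 37% drop.
function relativeDelta(before: number, after: number): number {
  return (after - before) / before;
}

// True when the change is at least as large as the noise threshold (20% by default).
function exceedsNoise(before: number, after: number, noise = 0.2): boolean {
  return Math.abs(relativeDelta(before, after)) >= noise;
}

// Track-A ack p50, 20 Hz vs 30 Hz loop: about -37%, above the threshold.
console.log(relativeDelta(112.2, 70.6).toFixed(3), exceedsNoise(112.2, 70.6));
// F1 ack p50, before vs after: about -3%, inside the noise band.
console.log(relativeDelta(76.6, 74.1).toFixed(3), exceedsNoise(76.6, 74.1));
```

A single-run threshold like this is a coarse filter; repeating a configuration several times would give a better estimate of the run-to-run spread than any fixed percentage.
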