# Performance Notes

This document records the profiling loop used to optimize `@stateforward/yamux.ts` against `yamux-js@0.2.1`.

## Method

The benchmark uses client-opened echo streams over an in-memory transport. It counts payload bytes in both directions and validates echoed bytes.

```sh
npm run benchmark
```

For isolated profiling of this implementation:

```sh
npm run build
YAMUX_BENCH_IMPLEMENTATION='@stateforward/yamux.ts' \
YAMUX_BENCH_WARMUPS=2 \
YAMUX_BENCH_RUNS=12 \
YAMUX_BENCH_SCENARIO='1 KiB sequential:2500:1024:1' \
node --cpu-prof --cpu-prof-dir=/tmp/yamux-ts-profiles benchmarks/yamux-js.mjs
```

## Optimization Log

All numbers below were collected on Node `v24.11.1`. They are local-machine benchmarks, so use them for direction and regression detection rather than as portable absolute performance claims.

| Iteration | Isolated 1 KiB sequential result | Change | Profile evidence |
| --- | ---: | --- | --- |
| Baseline | 1200 streams in 92.52 ms, 12,971 streams/s | Starting point | Hot frames included per-frame no-op session hsm dispatch, byte copying in `ByteReader`, stream hsm startup, and protocol frame allocation. |
| Remove no-op session dispatches | 73.10 ms, 16,415 streams/s | Kept real session state transitions, removed hot-path no-op frame/open/accept dispatches. | hsm cost dropped but stream startup and byte copying remained hot. |
| Fast contiguous `ByteReader` reads | 67.28 ms, 17,835 streams/s | Returned existing contiguous chunks/subarrays instead of copying every header/payload. | `ByteReader` self-time dropped; stream startup became the largest owned cost. |
| Remove per-stream hsm startup | 38.51 ms, 31,164 streams/s | Kept session/client/server as hsm state machines; made `Stream` direct state because it already owned the authoritative flags. | hsm fell to about 1 ms in the next profile, proving it was no longer a material bottleneck. |
| Write protocol headers directly | 34.20 ms, 35,093 streams/s | Wrote headers into the final frame buffer and avoided DataView/header-copy overhead. | Protocol encoding self-time dropped but remained visible. |
| Avoid async promises for sync frame paths | 32.04 ms, 37,447 streams/s | Made data/window frame handling synchronous unless it actually has to write/reset/open asynchronously. | Remaining hot spots were benchmark byte validation, Web Stream construction, `readLoop`, `ByteReader`, and frame encoding. |
| Tried `Stream.write` single-frame fast path | 59.49 ms for 2500-stream long run versus 59.60 ms before | Reverted. The result was neutral and not worth extra branch complexity. | Improvement was micro-level noise. |

## Current Comparison

Current `npm run benchmark` result:

| Scenario | `@stateforward/yamux.ts` | `yamux-js@0.2.1` | Relative |
| --- | ---: | ---: | ---: |
| 1 KiB sequential | 36,256 streams/s, 70.8 MiB/s | 129,791 streams/s, 253.5 MiB/s | 0.28x |
| 1 KiB concurrent | 38,312 streams/s, 74.8 MiB/s | 155,958 streams/s, 304.6 MiB/s | 0.25x |
| 64 KiB concurrent | 10,616 streams/s, 1327.0 MiB/s | 9,576 streams/s, 1197.1 MiB/s | 1.11x |

## Current Bottlenecks

The final profile no longer supports blaming `@stateforward/hsm.ts`: hsm frames were about 1 ms in a roughly 927 ms profiling run. Remaining costs are dominated by:

- Web Stream construction per logical stream.
- Web Stream read/pull/write scheduling.
- Benchmark-side byte validation in `expectWebReadable`.
- Residual frame encoding and `ByteReader` work.

Further small-stream gains likely require changing the public stream representation or adding a lower-level non-Web-Streams fast path. That would be a larger API/product decision rather than another local micro-optimization.