# Token Waste Index — Muster benchmark > Deterministic measurement. No LLM is called: this is pure token accounting over > multi-turn, multi-tool task transcripts. Reproduce with `node benchmark/run.mjs` > (after `pnpm build`) or `muster benchmark`. ## Headline Across 5 realistic agent tasks (170 total turns), a naive replay-everything harness sends **876k tokens**. Muster sends **355k** — a **59.4% reduction**, achieved by the immutable-transcript renderer (older tool results become stubs) and the never-wedge compactor. None of it requires a model call. ``` scenario turns naive muster reduction replay-overhead -------------------------------------------------------------------------------------- codebase-refactor-20 21 84.6k 42.7k 49.6% 90.5% incident-triage-30 31 144.9k 60.5k 58.2% 93.6% erp-data-audit-40 41 205.5k 79.5k 61.3% 95.1% research-synthesis-25 26 160.0k 67.7k 57.7% 92.3% long-support-thread-50 51 280.8k 104.9k 62.7% 96.1% -------------------------------------------------------------------------------------- AGGREGATE 170 875.8k 355.2k 59.4% 94.2% ``` ## What the columns mean - **naive** — tokens a harness pays if it re-sends the full transcript every turn (what stateless chat APIs charge for, and what replay-heavy harnesses incur; cf. Hermes [#5563](https://github.com/NousResearch/hermes-agent/issues/5563), where ~89% of a session's tokens were replay). - **muster** — tokens Muster sends after the renderer stubs old tool results and the compactor fits the per-turn budget (8000 tokens here). - **reduction** — `(naive − muster) / naive`. This is what Muster actually saves. - **replay-overhead** — `(naive − necessary) / naive`, where *necessary* counts each unique message once. This is the inherent cost of stateless re-sending; it is the ceiling on what any harness could save by not re-sending. ## Method (honest about scope) This benchmark isolates **context-management waste** — the single largest, best-documented source of agent token burn. It does not measure model quality, tool-selection efficiency, or reasoning tokens. Tool outputs are sized to realistic file-read / log-pull / API-response lengths (700–1800 chars). The optimized path uses Muster's shipped `renderContext` + `compact` with a per-turn budget; the naive path is the same transcript re-sent in full each turn. Both are computed by the same token estimator, so the comparison is apples-to-apples. Generated by `benchmark/run.mjs` from `benchmark/scenarios.mjs`.