# Token Waste Index — Muster benchmark

> Deterministic measurement. No LLM is called: this is pure token accounting over
> multi-turn, multi-tool task transcripts. Reproduce with `node benchmark/run.mjs`
> (after `pnpm build`) or `muster benchmark`.

## Headline

Across 5 realistic agent tasks (170 total turns), a naive
replay-everything harness sends **876k tokens**.
Muster sends **355k** — a **59.4% reduction**,
achieved by the immutable-transcript renderer (older tool results become stubs)
and the never-wedge compactor. None of it requires a model call.

```
scenario                          turns  naive    muster   reduction  replay-overhead
--------------------------------------------------------------------------------------
codebase-refactor-20              21     84.6k    42.7k    49.6%      90.5%
incident-triage-30                31     144.9k   60.5k    58.2%      93.6%
erp-data-audit-40                 41     205.5k   79.5k    61.3%      95.1%
research-synthesis-25             26     160.0k   67.7k    57.7%      92.3%
long-support-thread-50            51     280.8k   104.9k   62.7%      96.1%
--------------------------------------------------------------------------------------
AGGREGATE                         170    875.8k   355.2k   59.4%      94.2%
```

## What the columns mean

- **naive** — tokens a harness pays if it re-sends the full transcript every turn
  (what stateless chat APIs charge for, and what replay-heavy harnesses incur;
  cf. Hermes [#5563](https://github.com/NousResearch/hermes-agent/issues/5563),
  where ~89% of a session's tokens were replay).
- **muster** — tokens Muster sends after the renderer stubs old tool results and
  the compactor fits the per-turn budget (8000 tokens here).
- **reduction** — `(naive − muster) / naive`. This is what Muster actually saves.
- **replay-overhead** — `(naive − necessary) / naive`, where *necessary* counts
  each unique message once. This is the inherent cost of stateless re-sending;
  it is the ceiling on what any harness could save by not re-sending.

## Method (honest about scope)

This benchmark isolates **context-management waste** — the single largest,
best-documented source of agent token burn. It does not measure model quality,
tool-selection efficiency, or reasoning tokens. Tool outputs are sized to realistic
file-read / log-pull / API-response lengths (700–1800 chars). The optimized path
uses Muster's shipped `renderContext` + `compact` with a per-turn budget; the
naive path is the same transcript re-sent in full each turn. Both are computed by
the same token estimator, so the comparison is apples-to-apples.

Generated by `benchmark/run.mjs` from `benchmark/scenarios.mjs`.