# CLAUDE.md automated benchmark - three-model results

Harness: `benchmark/run.py`. Engine: `claude -p --output-format json` over OAuth.
Each prompt runs in a fresh `/tmp` dir (baseline = empty dir, treatment = repo `CLAUDE.md` copied in).
`--setting-sources project` blocks global settings/hooks. N=5 per cell.
Metric: mean output tokens per prompt (word count in parens).

Raw per-run data: `benchmark/raw-{haiku,sonnet,opus}.jsonl`. Per-model tables: `benchmark/report-*.md`.

## Total output tokens, baseline to CLAUDE.md (current minimal profile)

| Model  | All 5 prompts | T1-T3,T5 only (excl. T4 format test) | Cost (base to file) |
|--------|---------------|--------------------------------------|---------------------|
| haiku  | 2266 to 2254 **-1%**  | 1742 to 1706 **-2%**  | $0.0585 to $0.0596 |
| sonnet | 1856 to 1524 **-18%** | 1350 to 1200 **-11%** | $0.1609 to $0.1599 |
| opus   | 1888 to 1799 **-5%**  | 1187 to 1102 **-7%**  | $0.3850 to $0.3570 |

T4 ("Generate a prompt...") is a format test, not a length test. The rules push it toward structured output which can add tokens. Excluding it is the fairer length comparison and matches BENCHMARK.md's 63% average.

## Aggressive variant: `profiles/CLAUDE.compressed.md`

Measured separately with the same harness, N=5 per cell:

| Model  | All 5 prompts | Cost (base to file) |
|--------|---------------|---------------------|
| haiku  | 2212 to 1726 **-22%** | $0.0677 to $0.0589 |
| sonnet | 1745 to 1182 **-32%** | $0.1646 to $0.1605 |
| opus   | 1936 to 727 **-62%**  | $0.3997 to $0.3451 |

The compressed profile drops fabrication and re-read guards in exchange for shorter output. Use when token cost dominates and the workload is low-risk.

## Findings (directional, single-session means, output varies run-to-run)

1. **The published 63% reduction does not reproduce on the current minimal CLAUDE.md.**
   Real effect ranges from ~-2% (haiku) to -11% (opus, excl. T4). It does reproduce with `profiles/CLAUDE.compressed.md` on opus (-62%).

2. **Effect scales with the model's default verbosity.** Opus, the chattiest baseline, has the most to trim. Haiku barely moves on the minimal profile.

3. **Words drop more consistently than tokens.** Markdown and structure tokens partly offset word savings, so word-count benchmarks overstate the token (cost) win.

4. **On short prompts the minimal profile is roughly cost-neutral.** Per-turn input overhead cancels output saving. Net savings need high output volume to offset the persistent input cost.

## Caveats

- N=5, single session, no statistical controls. Directional signal only.
- Measures one-shot Q&A prompts, not agentic/coding loops (see Issue #1 for that).
- Reproduce: `python3 benchmark/run.py -n 5 --model {haiku,sonnet,opus}`.
- This file covers tokens. For behavior and correctness see `SEMANTIC.md`.