# Benchmarks

This file records deterministic context-quality benchmark reports generated by
anamnesis.

Run:

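```bash
# Deterministic report; --append runs also write an evidence record
anamnesis benchmark report --append

# Before/after adoption evidence: capture a baseline, make the change under
# evaluation, capture again, then compare raw scorecard deltas
anamnesis benchmark report --json > before.json
anamnesis benchmark report --json > after.json
anamnesis benchmark compare --baseline before.json --after after.json --append

anamnesis benchmark gallery --write
anamnesis benchmark gallery --validate
anamnesis benchmark trace --append

# Gate check before adding Codex UserPromptSubmit context delta injection
anamnesis benchmark prompt-gate

# v1.5 startup-context budget check
anamnesis benchmark session-context --write

anamnesis benchmark task-compare --template
```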

Runs with `--append` also write a machine-readable evidence record to
`.anamnesis/evidence/events.jsonl`.
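
The `.jsonl` extension suggests one JSON value per line. The record schema is not documented in this file, so the following is only a minimal sketch for inspecting the log, assuming standard JSON Lines; `read_evidence_events` is a hypothetical helper, not part of anamnesis.

```python
import json
from pathlib import Path


def read_evidence_events(path):
    """Return one parsed JSON value per non-empty line of a JSONL file."""
    events = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:  # tolerate blank lines between records
            events.append(json.loads(line))
    return events


if __name__ == "__main__":
    log = Path(".anamnesis/evidence/events.jsonl")
    if log.exists():
        events = read_evidence_events(log)
        print(f"{len(events)} evidence record(s)")
        if events and isinstance(events[-1], dict):
            # Field names are an unknown here, so only list the top-level keys
            # of the most recent record instead of assuming a schema.
            print(sorted(events[-1]))
```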

The report measures concrete surfaces on disk, not model intelligence:

- static ontology slices
- Layer A `.bootstrap.yaml` facts
- Layer B `.enriched.yaml` semantics
- continuity readiness
- adapter surface readiness
- scorecard v2 raw metrics: ready layers, continuity checks, ontology gaps,
  doctor issues, Codex hook warnings, adapter surfaces, and runtime evidence
  freshness

Public README claims must be based only on self-checks or explicitly sanitized
fixtures that expose no proprietary source snippets, project names, credentials,
hostnames, temp paths, or internal business logic.

Model-dependent task outcomes are recorded separately in
[`docs/AGENT-TASK-BENCHMARKS.md`](AGENT-TASK-BENCHMARKS.md) by
`anamnesis benchmark task`. Do not merge those scores into deterministic
scorecards or generated benchmark-gallery README claims.

Use `benchmark prompt-gate` before adding any Codex `UserPromptSubmit`
context delta injection. The gate reads deterministic scorecard evidence,
session-context benchmark JSON, and model-dependent task evidence, including
optional compact/full retrieval metrics and paired task-compare evidence. It
estimates duplicate ontology/handoff prompt overhead, checks duplicate-context
risk, and returns one of three decisions:

- `defer`: keep prompt-time context delta disabled.
- `collect-more-evidence`: a gap exists, but repeated-gap evidence or
  token/noise controls are not sufficient.
- `prototype`: a bounded non-default experiment is justified; default shipping
  still needs dedupe and smoke evidence.
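
One way a caller might act on the three decisions is sketched below. How `prompt-gate` actually emits its decision is not specified here, so the sketch takes the decision as a plain string; `GateDecision` and `may_enable_context_delta` are hypothetical names used for illustration only.

```python
from enum import Enum


class GateDecision(Enum):
    """The three decisions listed above."""

    DEFER = "defer"
    COLLECT_MORE_EVIDENCE = "collect-more-evidence"
    PROTOTYPE = "prototype"


def may_enable_context_delta(decision: str, *, experimental: bool = False) -> bool:
    """Return True only for `prototype`, and only in an opt-in experiment.

    `defer` and `collect-more-evidence` both keep prompt-time context delta
    off; `prototype` justifies a bounded, non-default experiment, never a
    default-on rollout.
    """
    parsed = GateDecision(decision)  # raises ValueError on an unknown decision
    return parsed is GateDecision.PROTOTYPE and experimental
```

Requiring the explicit `experimental` flag mirrors the rule that even a `prototype` decision does not justify shipping by default.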

Use `benchmark compare` for before/after adoption evidence. It reads two
`benchmark report --json` files and reports raw scorecard deltas rather than
collapsing the result into a single opaque score.
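
To illustrate what "raw deltas" means as opposed to a single score, here is a rough sketch that diffs two flat metric mappings key by key. The JSON schema of `benchmark report --json` is not shown in this file, so the flat layout and the metric names in the usage note are assumptions, and `scorecard_deltas` is a hypothetical helper rather than the tool's implementation.

```python
from numbers import Number


def scorecard_deltas(baseline: dict, after: dict) -> dict:
    """Return `after - baseline` for each numeric metric present in both."""
    deltas = {}
    for key in sorted(baseline.keys() & after.keys()):
        before_value, after_value = baseline[key], after[key]
        # bool is a subclass of int in Python; skip flags so they are not
        # treated as counts.
        if isinstance(before_value, bool) or isinstance(after_value, bool):
            continue
        if isinstance(before_value, Number) and isinstance(after_value, Number):
            deltas[key] = after_value - before_value
    return deltas
```

For example, with the assumed keys `doctor_errors` and `continuity_checks`, a baseline of `{"doctor_errors": 2, "continuity_checks": 4}` and an after-state of `{"doctor_errors": 0, "continuity_checks": 6}` give one delta per metric (`-2` and `2`), so each change stays individually visible.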

Use `benchmark session-context --write` for the v1.5 startup-context budget
check. It compares full SessionStart file injection with compact invariant
digest/source-pointer context across public-safe fixtures, then writes JSON,
markdown, and dependency-free SVG charts under
[`docs/benchmark-evidence/session-context/`](benchmark-evidence/session-context/).
`benchmark prompt-gate` reads the generated `session-context.json` artifact
from this directory automatically.

Public-safe summaries and claim boundaries live in
[`docs/BENCHMARK-GALLERY.md`](BENCHMARK-GALLERY.md). Sanitized public-shape
machine evidence is stored in
[`docs/benchmark-evidence/public-shapes.jsonl`](benchmark-evidence/public-shapes.jsonl).

## Current Public Evidence

Current public evidence is intentionally conservative:

- self-dogfood for this repository
- sanitized fixture shapes only
- no private project names or organization repository identifiers
- no hostnames, credentials, local source paths, or proprietary domain details

## Benchmark Report — self repository

- Project: anamnesis
- Tools: claude-code, codex, cursor
- Fragments: base

| Dimension | Value |
|---|---:|
| Continuity checks | 6/6 |
| Doctor errors | 0 |
| Codex hook warnings | 0 |
| Adapter surfaces | ready |

Boundary:

- This is deterministic self-check evidence for the repository itself.
- It is not a claim that every framework or private project shape is fully
  covered.
- Private validation may inform internal prioritization, but it must not be
  copied into public docs or package artifacts.