# Benchmarks

This file records deterministic context-quality benchmark reports generated by anamnesis.

Run:

```bash
anamnesis benchmark report --append
anamnesis benchmark report --json > before.json
anamnesis benchmark report --json > after.json
anamnesis benchmark compare --baseline before.json --after after.json --append
anamnesis benchmark gallery --write
anamnesis benchmark gallery --validate
anamnesis benchmark trace --append
anamnesis benchmark prompt-gate
anamnesis benchmark session-context --write
anamnesis benchmark task-compare --template
```

Append runs also write a machine-readable evidence record to `.anamnesis/evidence/events.jsonl`.

The report measures concrete surfaces on disk, not model intelligence:

- static ontology slices
- Layer A `.bootstrap.yaml` facts
- Layer B `.enriched.yaml` semantics
- continuity readiness
- adapter surface readiness
- scorecard v2 raw metrics: ready layers, continuity checks, ontology gaps, doctor issues, Codex hook warnings, adapter surfaces, and runtime evidence freshness

Public README claims must be based only on self-checks or explicitly sanitized fixtures that expose no proprietary source snippets, project names, credentials, hostnames, temp paths, or internal business logic.

Model-dependent task outcomes live in [`docs/AGENT-TASK-BENCHMARKS.md`](AGENT-TASK-BENCHMARKS.md) via `anamnesis benchmark task`. Do not merge those scores into deterministic scorecards or generated benchmark-gallery README claims.

Use `benchmark prompt-gate` before adding any Codex `UserPromptSubmit` context delta injection. The gate reads deterministic scorecard evidence, session-context benchmark JSON, and model-dependent task evidence, including optional compact/full retrieval metrics and paired task-compare evidence. It estimates duplicate ontology/handoff prompt overhead, checks duplicate-context risk, and returns one of three decisions:

- `defer`: keep prompt-time context delta disabled.
- `collect-more-evidence`: a gap exists, but repeated-gap evidence or token/noise controls are not sufficient.
- `prototype`: a bounded non-default experiment is justified; default shipping still needs dedupe and smoke evidence.

Use `benchmark compare` for before/after adoption evidence. It reads two `benchmark report --json` files and reports raw scorecard deltas rather than collapsing the result into a single opaque score.

Use `benchmark session-context --write` for the v1.5 startup-context budget check. It compares full SessionStart file injection with compact invariant digest/source-pointer context across public-safe fixtures, then writes JSON, markdown, and dependency-free SVG charts under [`docs/benchmark-evidence/session-context/`](benchmark-evidence/session-context/). `benchmark prompt-gate` reads the generated `session-context.json` artifact from this directory automatically.

Public-safe summaries and claim boundaries live in [`docs/BENCHMARK-GALLERY.md`](BENCHMARK-GALLERY.md). Sanitized public-shape machine evidence is stored in [`docs/benchmark-evidence/public-shapes.jsonl`](benchmark-evidence/public-shapes.jsonl).

## Current Public Evidence

Current public evidence is intentionally conservative:

- self-dogfood for this repository
- sanitized fixture shapes only
- no private project names or organization repository identifiers
- no hostnames, credentials, local source paths, or proprietary domain details

## Benchmark Report — self repository

Project: anamnesis

Tools: claude-code, codex, cursor

Fragments: base

| Dimension | Value |
|---|---:|
| Continuity checks | 6/6 |
| Doctor errors | 0 |
| Codex hook warnings | 0 |
| Adapter surfaces | ready |

Boundary:

- This is deterministic self-check evidence for the repository itself.
- It is not a claim that every framework or private project shape is fully covered.
- Private validation may inform internal prioritization, but it must not be copied into public docs or package artifacts.
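
As a rough illustration of the raw scorecard deltas that `benchmark compare` is described as reporting, the Python sketch below subtracts baseline values from after values one metric at a time instead of folding them into a single score. The function name `scorecard_deltas`, the metric keys, and the sample numbers are hypothetical and chosen only for illustration; they are not the tool's actual JSON schema, output, or implementation.

```python
from typing import Dict


def scorecard_deltas(baseline: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return per-metric raw deltas (after minus baseline).

    Only metrics present in both inputs are compared, so a metric that
    exists in one report but not the other is left out rather than guessed.
    """
    shared = sorted(baseline.keys() & after.keys())
    return {name: after[name] - baseline[name] for name in shared}


# Hypothetical metric names and values, for illustration only.
baseline = {"continuity_checks": 4, "doctor_errors": 2, "codex_hook_warnings": 1}
after = {"continuity_checks": 6, "doctor_errors": 0, "codex_hook_warnings": 0}

deltas = scorecard_deltas(baseline, after)
for name, delta in deltas.items():
    # Print each metric separately; no aggregate score is computed.
    print(f"{name}: {delta:+d}")
```

Keeping one delta per metric means a regression in one dimension cannot be hidden by an improvement in another, which is the reason given above for avoiding a single opaque score.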