# Agent Task Benchmarks Status: v1.7 retrieval- and behavior-aware model-dependent benchmark harness. This file is separate from [`BENCHMARKS.md`](BENCHMARKS.md). Deterministic benchmark reports measure context surfaces on disk. Agent task benchmarks measure how one agent/model run behaved on a fixed task prompt, so they are model-dependent and need repeated runs before any public claim. ## Commands ```bash anamnesis benchmark task --template > task-run.json anamnesis benchmark task --input task-run.json anamnesis benchmark task --input task-run.json --append anamnesis benchmark task-compare --template > task-pair.json anamnesis benchmark task-compare --full full-run.json --compact compact-run.json anamnesis benchmark task-compare --full full-run.json --compact compact-run.json --append anamnesis benchmark task-series anamnesis benchmark task-series --write ``` Append runs write markdown here and an `agent-task-benchmark` record to `.anamnesis/evidence/events.jsonl`. The generated benchmark gallery intentionally ignores this evidence kind so deterministic README claims do not mix product surface quality with model behavior. `anamnesis benchmark prompt-gate` consumes these records as one signal when deciding whether Codex prompt-time context delta injection is justified. In v1.5, that signal includes optional compact/full retrieval metrics so the gate can distinguish "startup context is compact and the agent retrieved exact sources" from "startup context is compact and the agent missed required facts." v1.7 extends that signal with behavior metrics for source citation, managed-region safety, bootstrap safety, handoff freshness, and task-harness selection. `anamnesis benchmark task-compare` reads two task input JSON files, requires `run.session_context_mode=full` for one and `compact` for the other, verifies the project/task/prompt/agent/model/context state match, and records an `agent-task-benchmark-compare` evidence record when appended. `--template` prints a paired object with `full`, `compact`, and `usage` fields; use it to create matching inputs, then replace the example metrics with observed run values before appending evidence. ## Schema Input files use `schema_version: anamnesis.agent_task_benchmark.v1` and include: - `project`: public-safe project name and optional shape - `task`: stable task id, fixed prompt, and optional expected first action - `run`: run id, agent, model, optional `session_context_mode` (`full`, `compact`, or `unknown`), and context state - `metrics`: questions before action, tool turns to locate context, first-correct-action success, handoff recovery success, and elapsed time - `limitations`: why the result should not be overgeneralized - `evidence`: transcript, run log, or deterministic benchmark evidence paths Optional v1.5 retrieval metrics: - `task_success`: whether the task finished correctly - `required_source_reads` / `expected_source_reads`: how many required source pointers the agent actually opened before acting - `missed_invariant_count`: required invariants omitted or violated - `hallucinated_fact_count`: project facts asserted without source support - `unnecessary_context_reads`: context files read despite not being needed for the task - `input_tokens`, `output_tokens`, `total_tokens`: token usage from the model run when available Optional v1.7 behavior metrics: - `source_citations` / `expected_source_citations`: how many required exact source paths or evidence references the agent cited before making claims - `managed_region_edit_attempts`: direct edits attempted inside generated managed regions instead of updating the source fragment or renderer - `bootstrap_edit_attempts`: direct edits attempted in `.bootstrap.yaml` ontology output instead of writing semantic enrichment or source changes - `handoff_refresh_required` / `handoff_refreshed`: whether the run needed to refresh handoff state and actually did so - `matched_harness_read`: whether the agent read the one relevant task harness when the task matched one - `nonmatched_harness_reads`: task harnesses read despite not matching the current task ## Scoring The harness reports a 5-point convenience score: | Dimension | Full point | |---|---| | First correct action | first action matches the expected context-aware behavior | | Handoff recovered | agent correctly resumes from handoff/context | | Question efficiency | 0 questions before first action | | Context lookup efficiency | 0-1 tool turns to locate project context | | Elapsed efficiency | 60 seconds or less | Half credit is used for 1 question, 2-3 context tool turns, or 60-180 seconds. Scores are only comparable across repeated runs with the same task prompt, repo snapshot, agent, model family, session context mode, and context state. Retrieval metrics are reported beside the 5-point convenience score; they are not folded into that score so old runs remain comparable. ## Compact vs Full Retrieval Runs Use paired runs when evaluating compact SessionStart and retrieval behavior: 1. Same repo snapshot. 2. Same task prompt and expected source list. 3. Same agent, model, and tool permissions. 4. One run with `ANAMNESIS_SESSION_CONTEXT_MODE=full`. 5. One run with `ANAMNESIS_SESSION_CONTEXT_MODE=compact`. The comparison should look for task success, required-source-read rate, source-citation rate, missed invariants, hallucinated facts, unnecessary context reads, protected-file edit attempts, handoff refresh success, matched task-harness use, elapsed time, and token usage. A single pair is diagnostic only. Public claims need repeated public-safe runs. `benchmark task-compare` reports: - compact task success delta and whether it stays within the current 5 percentage-point tolerance - required-source-read-rate and source-citation-rate deltas - missed invariant, hallucinated fact, and unnecessary context-read deltas - managed-region, bootstrap, handoff-refresh, matched-harness, and non-matched-harness behavior deltas - elapsed-time and total-token deltas - regression/failure counts for prompt-gate consumption `benchmark task-series` rolls up repeated `agent-task-benchmark-compare` evidence records by project, task, agent, model, and context state. It reports pair count, full/compact task success rates, compact success-within-tolerance rate, average/stddev/min/max required-source read deltas, source-citation deltas, total-token deltas, and elapsed-time deltas. `--write` stores the rollup JSON, markdown, and dependency-free SVG charts under `docs/benchmark-evidence/agent-task/`. ## Claim Boundary Allowed: - "In this controlled task run, agent/model X scored Y/5." - "With the same fixed prompt and snapshot, context state A required fewer questions than context state B." - "`AGENTS.md` and `CLAUDE.md` can stay compact for this project shape when they point to retrievable project sources and repeated behavior benchmarks show agents read and cite those sources." Not allowed: - "anamnesis makes every agent smarter." - "Model X is better than model Y" from one run. - Mixing `agent-task-benchmark` scores into deterministic `benchmark-report` scorecards or README public-shape claims. ## Current Runs Committed public-safe pairs are diagnostic only. The 2026-06-19 pair verifies that both full and compact SessionStart modes can complete the same fixed retrieval task, but it does not establish compact/full success parity. The 2026-06-29 v1.7 behavior pair adds source-citation and task-harness metrics. Both modes completed the fixed task with `4/4` required source reads, `4/4` source citations, zero missed invariants, zero hallucinated facts, zero managed-region or bootstrap edit attempts, and the matched `context-continuity` harness read. Compact used fewer total tokens in this pair, but still scored lower on the 5-point convenience score because elapsed time crossed the 60-second threshold. This remains evidence for the pipeline, not a parity claim. Neither committed task measures handoff recovery; the paired input JSON marks that limitation explicitly. Committed model-dependent inputs must avoid proprietary prompts, source snippets, credentials, and local absolute paths. Current series artifacts: - [`series.json`](benchmark-evidence/agent-task/series.json) - [`series.md`](benchmark-evidence/agent-task/series.md) - [`series-token-delta.svg`](benchmark-evidence/agent-task/series-token-delta.svg) - [`series-quality-summary.svg`](benchmark-evidence/agent-task/series-quality-summary.svg) - [`series-source-citation-delta.svg`](benchmark-evidence/agent-task/series-source-citation-delta.svg) ![Task series token delta](benchmark-evidence/agent-task/series-token-delta.svg) ![Task series quality summary](benchmark-evidence/agent-task/series-quality-summary.svg) ![Task series source citation delta](benchmark-evidence/agent-task/series-source-citation-delta.svg) ## Agent Task Benchmark Compare — 2026-06-19T08:16:49.313Z Project: anamnesis Task: self-retrieval-v1-5-benchmark-state Agent/model: codex / gpt-5.5 Context state: static Full run: codex-self-retrieval-full-2026-06-19 (4/5) Compact run: codex-self-retrieval-compact-2026-06-19 (3.5/5) Summary: - compact task success within tolerance: yes - regressions: 3 - failures: 0 - compact token reduction: -122.552% | Metric | Full | Compact | Delta | Verdict | |---|---:|---:|---:|---| | 5-point score | 4 points | 3.5 points | -0.5 points | compact-worse | | Task success | 1 | 1 | 0 | same | | Required source read rate | 1 | 1 | 0 | same | | Missed invariants | 0 | 0 | 0 | same | | Hallucinated facts | 0 | 0 | 0 | same | | Unnecessary context reads | 0 | 0 | 0 | same | | Elapsed | 21773 ms | 35541 ms | +13768 ms | compact-worse | | Total tokens | 83269 tokens | 185317 tokens | +102048 tokens | compact-worse | Claim boundary: - This is one paired model-dependent comparison, not deterministic product evidence. - Public compact/full success claims require repeated public-safe pairs on the same task suite. ## Agent Task Benchmark — 2026-06-28T16:03:40.206Z Project: anamnesis Shape: self-dogfood Task: self-v17-behavior-context-continuity Agent/model: codex / gpt-5.5 Session context mode: full Context state: static Score: 4/5 | Metric | Value | Score | |---|---:|---:| | Questions before action | 0 | 1 | | Tool turns to context | 1 | 1 | | First correct action | yes | 1 | | Handoff recovered | no | 0 | | Elapsed | 59155 ms | 1 | | Task success | yes | 1 | | Required source reads | 4/4 | 100% | | Source citations | 4/4 | 100% | | Missed invariants | 0 | - | | Hallucinated facts | 0 | - | | Unnecessary context reads | 0 | - | | Managed region edit attempts | 0 | - | | Bootstrap edit attempts | 0 | - | | Handoff refresh | not required | - | | Matched harness read | yes | 1 | | Non-matched harness reads | 0 | - | | Input tokens | 268739 | - | | Output tokens | 2915 | - | | Total tokens | 271654 | - | Prompt: > Public-safe v1.7 benchmark task. Do not edit files. Before answering, inspect these exact required source files: docs/AGENT-TASK-BENCHMARKS.md, docs/ROADMAP.md, cli/src/commands/benchmark_task.ts, .anamnesis/task-harnesses/context-continuity.yaml. Then return only valid JSON with keys: task_success boolean, first_correct_action boolean, source_files_read array, source_citations array, answer_summary string, missed_invariant_count number, hallucinated_fact_count number, unnecessary_context_reads number, managed_region_edit_attempts number, bootstrap_edit_attempts number, handoff_refresh_required boolean, handoff_refreshed boolean, matched_harness_read boolean, nonmatched_harness_reads number. The correct answer_summary should mention v1.7 behavior metrics, compact AGENTS.md and CLAUDE.md as control-plane source pointers rather than full project fact dumps, the context-continuity task harness as the matched harness, and repeated public-safe full-vs-compact runs still being needed before parity claims. Limitations: - Single model-dependent diagnostic run; do not use for success parity claims. - This task measures v1.7 retrieval and behavior metrics, not handoff recovery. - Elapsed time is measured from the local command session wall time and is approximate. - Token usage comes from the codex exec JSON usage event for this exact run strategy. Evidence: - Observed with codex exec --json --ephemeral --sandbox read-only on 2026-06-29 KST. - Full run used config override shell_environment_policy.set.ANAMNESIS_SESSION_CONTEXT_MODE=full. - Final response read all four required source files and cited public-safe repo-local paths. ## Agent Task Benchmark — 2026-06-28T16:03:45.264Z Project: anamnesis Shape: self-dogfood Task: self-v17-behavior-context-continuity Agent/model: codex / gpt-5.5 Session context mode: compact Context state: static Score: 3.5/5 | Metric | Value | Score | |---|---:|---:| | Questions before action | 0 | 1 | | Tool turns to context | 1 | 1 | | First correct action | yes | 1 | | Handoff recovered | no | 0 | | Elapsed | 60772 ms | 0.5 | | Task success | yes | 1 | | Required source reads | 4/4 | 100% | | Source citations | 4/4 | 100% | | Missed invariants | 0 | - | | Hallucinated facts | 0 | - | | Unnecessary context reads | 0 | - | | Managed region edit attempts | 0 | - | | Bootstrap edit attempts | 0 | - | | Handoff refresh | not required | - | | Matched harness read | yes | 1 | | Non-matched harness reads | 0 | - | | Input tokens | 141861 | - | | Output tokens | 2568 | - | | Total tokens | 144429 | - | Prompt: > Public-safe v1.7 benchmark task. Do not edit files. Before answering, inspect these exact required source files: docs/AGENT-TASK-BENCHMARKS.md, docs/ROADMAP.md, cli/src/commands/benchmark_task.ts, .anamnesis/task-harnesses/context-continuity.yaml. Then return only valid JSON with keys: task_success boolean, first_correct_action boolean, source_files_read array, source_citations array, answer_summary string, missed_invariant_count number, hallucinated_fact_count number, unnecessary_context_reads number, managed_region_edit_attempts number, bootstrap_edit_attempts number, handoff_refresh_required boolean, handoff_refreshed boolean, matched_harness_read boolean, nonmatched_harness_reads number. The correct answer_summary should mention v1.7 behavior metrics, compact AGENTS.md and CLAUDE.md as control-plane source pointers rather than full project fact dumps, the context-continuity task harness as the matched harness, and repeated public-safe full-vs-compact runs still being needed before parity claims. Limitations: - Single model-dependent diagnostic run; do not use for success parity claims. - This task measures v1.7 retrieval and behavior metrics, not handoff recovery. - Elapsed time is measured from the local command session wall time and is approximate. - Token usage comes from the codex exec JSON usage event for this exact run strategy. Evidence: - Observed with codex exec --json --ephemeral --sandbox read-only on 2026-06-29 KST. - Compact run used default compact SessionStart behavior. - Final response read all four required source files and cited public-safe repo-local paths. ## Agent Task Benchmark Compare — 2026-06-28T16:03:51.303Z Project: anamnesis Task: self-v17-behavior-context-continuity Agent/model: codex / gpt-5.5 Context state: static Full run: codex-v17-behavior-full-2026-06-29-001 (4/5) Compact run: codex-v17-behavior-compact-2026-06-29-001 (3.5/5) Summary: - compact task success within tolerance: yes - regressions: 2 - failures: 0 - compact token reduction: 46.833% | Metric | Full | Compact | Delta | Verdict | |---|---:|---:|---:|---| | 5-point score | 4 points | 3.5 points | -0.5 points | compact-worse | | Task success | 1 | 1 | 0 | same | | Required source read rate | 1 | 1 | 0 | same | | Source citation rate | 1 | 1 | 0 | same | | Missed invariants | 0 | 0 | 0 | same | | Hallucinated facts | 0 | 0 | 0 | same | | Unnecessary context reads | 0 | 0 | 0 | same | | Managed region edit attempts | 0 | 0 | 0 | same | | Bootstrap edit attempts | 0 | 0 | 0 | same | | Handoff refresh success | - | - | - | unknown | | Matched harness read | 1 | 1 | 0 | same | | Non-matched harness reads | 0 | 0 | 0 | same | | Elapsed | 59155 ms | 60772 ms | +1617 ms | compact-worse | | Total tokens | 271654 tokens | 144429 tokens | -127225 tokens | compact-better | Claim boundary: - This is one paired model-dependent comparison, not deterministic product evidence. - Public compact/full success claims require repeated public-safe pairs on the same task suite.