---
name: audit-review
description: Use when producing a read-only CoordExp audit of code, configs, specs, artifacts, docs, progress notes, or OpenSpec changes for correctness, reproducibility, pipeline, and eval-validity risks.
---

# Audit Review

## Overview

Produce an audit report that helps an implementer safely change/refine CoordExp without guessing. Optimize for correctness, reproducibility, and contract/pipeline integrity rather than style refactors.

Use the repo's source-of-truth order: `openspec/specs/` -> `docs/` -> `openspec/changes//` -> `progress/`. Use `progress/` for evidence, diagnostics, benchmark scope, and history; do not answer current-behavior questions from it when docs/specs cover the contract.

## Output Contract (What You Deliver)

- Severity-ranked findings (`P0`/`P1`/`P2`) with concrete evidence handles (`path:line`, config keys, exact commands, or tool output).
- “Confirmed OK / ruled out” notes to prevent backtracking.
- Verification steps: exact commands/tests to reproduce or validate each claim.
- Open questions: smallest set of clarifications required to remove ambiguity.
- Suggested next actions for an implementer (do not implement changes yourself).

Use `references/report-template.md` if you want a ready-made skeleton.

## Guardrails (Read-Only Audit)

- Do not modify production code/configs/specs. No `apply_patch` against `src/`, `configs/`, `openspec/`, etc.
- Prefer read-only exploration: `rg`, `find`, `git diff`, `sed`, `python -m pytest`, `python -m py_compile`.
- If you must write a temporary test or probe:
  - Default: write under `/tmp/` so the repo stays clean.
  - If you need it under `temp/` for sharing, ask the user first and keep artifacts minimal.
- Always check and report `git status --porcelain` at the start; if the worktree is dirty and it matters, ask before proceeding.
- For Python code exploration: Serena MCP is mandatory (symbol-aware navigation; provide `relative_path` constraints).
- Never invent results. If a claim cannot be verified, label it as a hypothesis and keep it out of severity-ranked findings.
- Do not conflate benchmark scopes. Report `val200`, `limit=200`, first-200, proxy view, full-val, raw-text vs coord-token, checkpoint ids, and launch shape when they affect the claim.

## Workflow (Breadth Pass -> Depth Pass -> Report)

### Step 0: Clarify The Ask (Smallest Unblocking Questions)

- If scope is ambiguous, ask 1–3 questions max:
  - What exact artifact(s) are we auditing: path(s) or concept?
  - Is the goal “spec/design review only” or “implementation vs spec audit”?
  - Any constraints: time budget, no-network, specific configs/datasets, must-pass tests?

Assume the deliverable is a report for a separate implementer unless the user explicitly asks you to change code.

### Step 1: Snapshot + Map The Surface Area (Breadth Pass)

- Safety snapshot:
  - Run `git status --porcelain` and note any dirty files.
  - If auditing a change/PR, capture `git diff --name-only` (or the change directory's file listing) to bound the search.
- Identify entrypoints and contracts:
  - Docs/specs: `docs/AGENT_INDEX.md`, `docs/catalog.yaml`, `docs/PROJECT_CONTEXT.md`, `docs/SYSTEM_OVERVIEW.md`, `docs/IMPLEMENTATION_MAP.md`, relevant `openspec/specs/`, relevant domain docs.
  - Progress: `progress/index.yaml`, `progress/README.md`, and the matching category router when you need empirical evidence.
  - Code: likely entrypoints (`src/bootstrap/`, `src/config/loader.py`, `src/datasets/geometry.py`, `src/trainers/`, `src/infer/`, `src/eval/`, `public_data/`).
  - Tests: locate tests adjacent to the target area and any policy scans.
- Grep for relevant context (fast, wide net; see the sketch after this list):
  - Use `rg` to find: config keys, CLI flags, spec terms, artifact filenames, error messages.
  - Use `references/grep-seeds.md` when you need good starting patterns.
- Build a short “context index”:
  - Key files with a 1-line reason each.
  - Key symbols to inspect (class/function names) with file paths.
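As a concrete starting point, a minimal breadth-pass sketch. The commands are the read-only ones listed above; the `rg` patterns are illustrative, and `some_config_key` is a hypothetical placeholder, not a real CoordExp key:

```bash
# Breadth pass: snapshot the worktree, bound the search, then cast a wide net.

git status --porcelain           # report dirty files before anything else
git diff --name-only             # bound the audit surface when reviewing a change

# Wide-net greps: artifact filenames, config keys, silent-failure smells.
rg -n 'resolved_config\.json|summary\.json' src/ docs/
rg -n 'some_config_key' configs/ src/                 # hypothetical key under audit
rg -n -i 'deprecated|fallback|best[- ]effort' src/    # candidate silent-failure spots
```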
### Step 2: Inspect The Highest-Risk Flows (Depth Pass)

Pick 3–5 top risk areas based on impact and likelihood, then deep-dive with evidence:

- Pipeline and process flow:
  - Trace data flow: input -> transforms -> packing -> training/infer/eval -> artifacts.
  - Verify invariant-sensitive steps (geometry, ordering, normalization).
  - Route geometry checks through `src/datasets/geometry.py`, not ad hoc bbox math.
- Configuration and contracts:
  - Check strict parsing / unknown-key behavior (fail-fast vs silently ignored).
  - Check backward-compat surfaces (stable CLI contracts, deprecated-keys policy).
  - Check that stable workflows stay YAML-first instead of adding CLI flags.
- Artifacts and eval validity (see the verification sketch at the end of this document):
  - Verify training manifests: `resolved_config.json`, `runtime_env.json`, `effective_runtime.json`, `pipeline_manifest.json`, `experiment_manifest.json`, `run_metadata.json`.
  - Verify infer/eval artifacts: `summary.json`, `resolved_config.json`, `resolved_config.path`, `gt_vs_pred.jsonl`, `gt_vs_pred_scored.jsonl`, `metrics.json`, and guarded companions when enabled.
  - For current infer behavior inspect `src/infer/pipeline.py::run_pipeline`; for current eval behavior inspect `src/eval/detection.py::evaluate_and_save`.
- Determinism and reproducibility:
  - Look for ordering-dependent behavior, random seeds, multiprocess I/O, filesystem-dependent nondeterminism.
- Silent-failure policy:
  - Ensure unexpected exceptions are not swallowed in core paths; best-effort behavior should be narrow and justified.

### Step 3: Validate Or Falsify With Targeted Tests (Optional, But High Value)

- Prefer running existing targeted tests first.
- If a hypothesis needs a minimal repro, write a temporary test (see the probe sketch at the end of this document):
  - Put it in `/tmp/` and run it with `PYTHONPATH=.` so the repo stays unchanged.
  - Keep it tiny and single-purpose; delete it afterwards (or ask before deleting if the user wants to keep it).
- When tests are too expensive to run, provide a verification plan with expected artifacts and failure signals.

### Step 4: Write The Audit Report

- Lead with findings (ranked). Each finding must include:
  - Evidence handle (`path:line`, config key, or command output summary).
  - Why it matters (correctness/repro/eval validity/maintainability).
  - Suggested fix direction (for the implementer) and how to verify.
- Add “confirmed OK / ruled out” checks that reduce backtracking.
- End with open questions (only what’s truly needed).

## Resources (optional)

Open these only when helpful (progressive disclosure):

- `references/report-template.md`: audit report skeleton (P0/P1/P2 + evidence + verification).
- `references/grep-seeds.md`: high-signal `rg` starting points for broad context discovery.
- `references/pipeline-checklist.md`: checklist for pipeline/process correctness and reproducibility risks.
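For the Step 2 artifact checks, a read-only presence sweep along these lines can anchor a finding. This is a sketch: `RUN_DIR` and its layout are assumptions, not a documented CoordExp contract, so substitute the actual run directory under audit:

```bash
# Presence check for infer/eval artifacts (Step 2). Reads only; writes nothing.
RUN_DIR=output/infer/example_run   # assumption: adjust to the real run directory

for f in summary.json resolved_config.json resolved_config.path \
         gt_vs_pred.jsonl gt_vs_pred_scored.jsonl metrics.json; do
  [ -f "$RUN_DIR/$f" ] || echo "MISSING: $RUN_DIR/$f"
done

# Sanity-parse the JSON summary without modifying anything.
python -m json.tool "$RUN_DIR/summary.json" > /dev/null \
  && echo "summary.json parses" || echo "summary.json is malformed"
```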
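For Step 3, one way to keep a repro probe out of the repo, following the `/tmp/` + `PYTHONPATH=.` guardrail above. The test body and the module it would import are placeholders for whatever behavior is under audit:

```bash
# Minimal temporary probe (Step 3): lives in /tmp, runs against the repo via
# PYTHONPATH=., and leaves the worktree untouched.
cat > /tmp/test_probe_audit.py <<'EOF'
def test_probe_placeholder():
    # Assumption: replace with the real import and invariant being checked,
    # e.g. a geometry helper from src/datasets/geometry.py.
    assert True
EOF

PYTHONPATH=. python -m pytest /tmp/test_probe_audit.py -q
rm /tmp/test_probe_audit.py   # keep /tmp tidy once the result is recorded
```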