---
name: quality-playbook
description: "Run a complete quality engineering audit on any codebase. Derives behavioral requirements from the code, generates spec-traced functional tests, runs a three-pass code review with regression tests, executes a multi-model spec audit (Council of Three), and produces a consolidated bug report with TDD-verified patches. Finds the 35% of real defects that structural code review alone cannot catch. Works with any language. Trigger on 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', or 'coverage theater'."
license: Complete terms in LICENSE.txt
metadata:
  version: 1.5.2
  # NOTE: Inline occurrences of the skill version exist throughout this file (frontmatter,
  # banner, version stamp template, sidecar JSON examples, run metadata, recheck template).
  # When bumping the version, update ALL occurrences — search for the old version string
  # globally. One historical reference to v1.4.6 edgequake benchmarking is intentionally
  # preserved in the challenge-gate section and must NOT be bumped.
  author: Andrew Stellman
  github: https://github.com/andrewstellman/quality-playbook
---

# Quality Playbook Generator

## Plan Overview — read this first, then explain it to the user

Before reading any other section of this skill, understand the plan and its dependencies. Each phase produces artifacts that the next phase depends on. Skipping or rushing a phase means every downstream phase works from incomplete information.

**Phase 0 (Prior Run Analysis):** If previous quality runs exist, load their findings as seed data. This is automatic and only applies to re-runs.

**Phase 1 (Explore):** Run the v1.5.2 documentation intake first (`python -m bin.reference_docs_ingest <target-repo>` to walk `reference_docs/` — `cite/` files produce `quality/formal_docs_manifest.json` records; top-level files are loaded as Tier 4 context via `reference_docs_ingest.load_tier4_context()`). Then explore the codebase in three stages: open exploration driven by domain knowledge, domain-knowledge risk analysis, and selected structured exploration patterns. Write all findings to `quality/EXPLORATION.md`. This file is the foundation — Phase 2 reads it as its primary input.

**Phase 2 (Generate):** Read EXPLORATION.md and produce the quality artifacts: requirements, constitution, functional tests, code review protocol, integration tests, spec audit protocol, TDD protocol, AGENTS.md.

**Phase 3 (Code Review):** Run the three-pass code review against HEAD. Write regression tests for every confirmed bug. Generate patches.

**Phase 4 (Spec Audit):** Three independent AI auditors review the code against requirements. Triage with verification probes. After triage, the same Council runs the v1.5.2 Layer-2 semantic citation check — one prompt per reviewer, structured per-REQ verdicts for every Tier 1/2 citation, output to `quality/citation_semantic_check.json`. Write regression tests for net-new findings.

**Phase 5 (Reconciliation):** Close the loop — every bug from code review and spec audit is tracked and either regression-tested or explicitly exempted. Run the TDD red-green cycle. Finalize the completeness report.

**Phase 6 (Verify):** Run self-check benchmarks against all generated artifacts. Check for internal consistency, version stamp correctness, and convergence.

**Phase 7 (Present, Explore, Improve):** Present results to the user with a scannable summary table, offer drill-down on any artifact, and provide a menu of improvement paths (iteration strategies, requirement refinement, integration test tuning).
This is the interactive phase where the user takes ownership of the quality system. Every bug found traces back to a requirement, and every requirement traces back to an exploration finding. **The critical dependency chain:** Exploration findings → EXPLORATION.md → Requirements → Code review + Spec audit → Bug discovery. A shallow exploration produces abstract requirements. Abstract requirements miss bugs. The exploration phase is where bugs are won or lost. **MANDATORY FIRST ACTION:** After reading and understanding the plan above, print the following message to the user, then explain the plan in your own words — what you'll do, what each phase produces, and why the exploration phase matters most. Emphasize that exploration starts with open-ended domain-driven investigation, followed by domain-knowledge risk analysis that reasons about what goes wrong in systems like this, then supplemented by selected structured patterns. Do not copy the plan verbatim; paraphrase it to demonstrate understanding. > Quality Playbook v1.5.2 — by Andrew Stellman > https://github.com/andrewstellman/quality-playbook Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds. ### Locating reference files This skill references files in a `references/` directory (e.g., `references/iteration.md`, `references/review_protocols.md`). The location depends on how the skill was installed. When a reference file is mentioned, resolve it by checking these paths in order and using the first one that exists: 1. `references/` (relative to SKILL.md — works when running from the skill directory) 2. `.claude/skills/quality-playbook/references/` (Claude Code installation) 3. `.github/skills/references/` (GitHub Copilot flat installation) 4. `.github/skills/quality-playbook/references/` (alternate Copilot installation) All reference file mentions in this skill use the short form `references/filename.md`. If the relative path doesn't resolve, walk the fallback list above. ## Why This Exists Most software projects have tests, but few have a quality *system*. Tests check whether code works. A quality system answers harder questions: what does "working correctly" mean for this specific project? What are the ways it could fail that wouldn't be caught by tests? What should every developer (human or AI) know before touching this code? Without a quality playbook, every new contributor (and every new AI session) starts from scratch — guessing at what matters, writing tests that look good but don't catch real bugs, and rediscovering failure modes that were already found and fixed months ago. A quality playbook makes the bar explicit, persistent, and inherited. ## What This Skill Produces Nine files that together form a repeatable quality system: | File | Purpose | Why It Matters | Executes Code? | |------|---------|----------------|----------------| | `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. 
| No | | `quality/REQUIREMENTS.md` | Testable requirements with project overview, use cases, and narrative — generated by a five-phase pipeline (contract extraction → derivation → verification → completeness → narrative) | The foundation for Passes 2 and 3 of the code review. Without requirements, review is limited to structural anomalies (~65% ceiling). With them, the review can catch intent violations — absence bugs, cross-file contradictions, and design gaps that are invisible to code reading alone. | No | | `quality/test_functional.*` | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. Use the project's language: `test_functional.py` (Python), `FunctionalSpec.scala` (Scala), `functional.test.ts` (TypeScript), `FunctionalTest.java` (Java), etc. | **Yes** | | `quality/RUN_CODE_REVIEW.md` | Three-pass code review protocol: structural review, requirement verification, cross-requirement consistency | Structural review alone misses ~35% of real defects. The three-pass pipeline adds requirement verification and consistency checking — backed by experiment evidence showing it finds bugs invisible to all structural review conditions. | No | | `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | **Yes** | | `quality/BUGS.md` | Consolidated bug report with patches | Every confirmed bug in one place with reproduction details, spec basis, severity, and patch references. The single source of truth for what's broken and how to verify it. | No | | `quality/RUN_TDD_TESTS.md` | TDD red-green verification protocol | Proves each bug is real (test fails on unpatched code) and each fix works (test passes after patch). Stronger evidence than a bug report alone — maintainers trust FAIL→PASS demonstrations. | **Yes** | | `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No | | `AGENTS.md` | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No | Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`, `quality/history/`. The pipeline also generates supporting artifacts: `quality/PROGRESS.md` (phase-by-phase checkpoint log with cumulative BUG tracker), `quality/CONTRACTS.md` (behavioral contracts), `quality/COVERAGE_MATRIX.md` (traceability), `quality/COMPLETENESS_REPORT.md` (final gate), and `quality/VERSION_HISTORY.md` (review log). Phase 7 can additionally generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol) for iterative improvement. The two critical deliverables are the requirements file and the functional test file. The requirements file (`quality/REQUIREMENTS.md`) feeds the code review protocol's verification and consistency passes — it's what makes the code review catch more than structural anomalies. The functional test file (named for the project's language and test framework conventions) is the automated safety net. The Markdown protocols are documentation for humans and AI agents. 
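To make the spec-traced style concrete, here is a minimal sketch of what a functional test in `quality/test_functional.py` could look like for a Python project. The module, function, requirement IDs, and spec references are hypothetical placeholders, not part of this skill; the point is that each test names the requirement it verifies, not just the code it calls.

```python
# Hypothetical sketch of spec-traced functional tests (pytest).
# `myproject.pipeline.process_batch`, REQ-007, and REQ-012 are illustrative
# names; substitute the real module, functions, and requirement IDs.
import pytest

from myproject.pipeline import process_batch


def test_empty_batch_returns_empty_result():
    """[Req: formal — README §3] REQ-007: an empty input batch yields an empty
    result rather than raising or emitting placeholder records."""
    assert process_batch([]) == []


def test_oversized_batch_is_rejected():
    """[Req: inferred — from validate_batch() guard] REQ-012: batches above the
    documented size limit are rejected before any work is submitted."""
    with pytest.raises(ValueError):
        process_batch([{"id": n} for n in range(10_001)])
```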
### Complete Artifact Contract The quality gate (`quality_gate.py`) validates these artifacts. If the gate checks for it, this skill must instruct its creation. This is the canonical list — any artifact not listed here should not be gate-enforced, and any gate check should trace to an artifact listed here. | Artifact | Location | Required? | Created In | |----------|----------|-----------|------------| | Formal docs manifest (v1.5.2) | `quality/formal_docs_manifest.json` | Yes | Phase 1 (`bin/reference_docs_ingest.py`) | | Requirements manifest (v1.5.2) | `quality/requirements_manifest.json` | Yes | Phase 2 | | Use cases manifest (v1.5.2) | `quality/use_cases_manifest.json` | Yes | Phase 2 | | Bugs manifest (v1.5.2) | `quality/bugs_manifest.json` | If bugs found | Phase 3/4/5 | | Citation semantic check (v1.5.2) | `quality/citation_semantic_check.json` | Yes | Phase 4 (Layer 2 Council) | | Exploration findings | `quality/EXPLORATION.md` | Yes | Phase 1 | | Quality constitution | `quality/QUALITY.md` | Yes | Phase 2 | | Requirements (UC identifiers) | `quality/REQUIREMENTS.md` | Yes | Phase 2 | | Behavioral contracts | `quality/CONTRACTS.md` | Yes | Phase 2 | | Functional tests | `quality/test_functional.*` | Yes | Phase 2 | | Regression tests | `quality/test_regression.*` | If bugs found | Phase 3 | | Code review protocol | `quality/RUN_CODE_REVIEW.md` | Yes | Phase 2 | | Integration test protocol | `quality/RUN_INTEGRATION_TESTS.md` | Yes | Phase 2 | | Spec audit protocol | `quality/RUN_SPEC_AUDIT.md` | Yes | Phase 2 | | TDD verification protocol | `quality/RUN_TDD_TESTS.md` | Yes | Phase 2 | | Bug tracker | `quality/BUGS.md` | Yes | Phase 3 | | Coverage matrix | `quality/COVERAGE_MATRIX.md` | Yes | Phase 2 | | Completeness report | `quality/COMPLETENESS_REPORT.md` | Yes | Phase 2 (baseline), Phase 5 (final verdict) | | Progress tracker | `quality/PROGRESS.md` | Yes | Throughout | | AI bootstrap | `AGENTS.md` | Yes | Phase 2 | | Bug writeups | `quality/writeups/BUG-NNN.md` | If bugs found | Phase 5 | | Regression patches | `quality/patches/BUG-NNN-regression-test.patch` | If bugs found | Phase 3 | | Fix patches | `quality/patches/BUG-NNN-fix.patch` | Optional | Phase 3 | | TDD traceability | `quality/TDD_TRACEABILITY.md` | If bugs have red-phase results | Phase 5 | | TDD sidecar | `quality/results/tdd-results.json` | If bugs found | Phase 5 | | TDD red-phase logs | `quality/results/BUG-NNN.red.log` | If bugs found | Phase 5 | | TDD green-phase logs | `quality/results/BUG-NNN.green.log` | If fix patch exists | Phase 5 | | Integration sidecar | `quality/results/integration-results.json` | When integration tests run | Phase 5 | | Mechanical verify script | `quality/mechanical/verify.sh` | Yes (benchmark) | Phase 2 | | Verify receipt | `quality/results/mechanical-verify.log` + `.exit` | Yes (benchmark) | Phase 5 | | Triage probes | `quality/spec_audits/triage_probes.sh` | When triage runs | Phase 4 | | Code review reports | `quality/code_reviews/*.md` | Yes | Phase 3 | | Spec audit reports | `quality/spec_audits/*auditor*.md` + `*triage*` | Yes | Phase 4 | | Recheck results (JSON) | `quality/results/recheck-results.json` | When recheck runs | Recheck | | Recheck summary (MD) | `quality/results/recheck-summary.md` | When recheck runs | Recheck | | Seed checks | `quality/SEED_CHECKS.md` | If Phase 0b ran | Phase 0b | | Run metadata | `quality/results/run-YYYY-MM-DDTHH-MM-SS.json` | Yes | Phase 1 (created), Throughout (updated) | **Sidecar JSON lifecycle:** Write all bug writeups *before* 
finalizing `tdd-results.json` — the sidecar's `writeup_path` field must point to an existing file, not a placeholder. Similarly, run integration tests and collect results before writing `integration-results.json`. ### Sidecar JSON Canonical Examples **`quality/results/tdd-results.json`** — the gate validates field names, not just presence: ```json { "schema_version": "1.1", "skill_version": "1.5.2", "date": "2026-04-12", "project": "repo-name", "bugs": [ { "id": "BUG-001", "requirement": "REQ-003", "red_phase": "fail", "green_phase": "pass", "verdict": "TDD verified", "fix_patch_present": true, "writeup_path": "quality/writeups/BUG-001.md" } ], "summary": { "total": 3, "confirmed_open": 1, "red_failed": 0, "green_failed": 0, "verified": 2 } } ``` `verdict` must be one of: `"TDD verified"`, `"red failed"`, `"green failed"`, `"confirmed open"`, `"deferred"`. `date` must be ISO 8601 (YYYY-MM-DD), not a placeholder, not in the future. **`quality/results/integration-results.json`:** ```json { "schema_version": "1.1", "skill_version": "1.5.2", "date": "2026-04-12", "project": "repo-name", "recommendation": "SHIP", "groups": [{ "group": 1, "name": "Group 1", "use_cases": ["UC-01"], "result": "pass", "tests_passed": 3, "tests_failed": 0, "notes": "" }], "summary": { "total_groups": 12, "passed": 11, "failed": 1, "skipped": 0 }, "uc_coverage": { "UC-01": "covered_pass", "UC-02": "not_mapped" } } ``` `recommendation` must be one of: `"SHIP"`, `"FIX BEFORE MERGE"`, `"BLOCK"`. `uc_coverage` maps UC identifiers from REQUIREMENTS.md to coverage status. ### Run Metadata Every playbook run creates a timestamped metadata file at `quality/results/run-YYYY-MM-DDTHH-MM-SS.json`. This enables multi-model comparison and run history tracking. **Lifecycle:** Create this file at the start of Phase 1. Update `phases_completed`, `bug_count`, and `end_time` as each phase finishes. The final update happens after the terminal gate. ```json { "schema_version": "1.0", "skill_version": "1.5.2", "project": "repo-name", "model": "claude-sonnet-4-6", "model_provider": "anthropic", "runner": "claude-code", "start_time": "2026-04-16T10:30:00Z", "end_time": "2026-04-16T11:45:00Z", "duration_minutes": 75, "phases_completed": ["Phase 0b", "Phase 1", "Phase 2", "Phase 3", "Phase 4", "Phase 5"], "iterations_completed": ["gap", "unfiltered", "parity", "adversarial"], "bug_count": 12, "bug_severity": { "HIGH": 2, "MEDIUM": 5, "LOW": 5 }, "gate_result": "PASS", "gate_fail_count": 0, "gate_warn_count": 2, "notes": "" } ``` **Required fields:** `schema_version`, `skill_version`, `project`, `model`, `start_time`. All other fields are populated as the run progresses. `model` should be the exact model string (e.g., `"claude-sonnet-4-6"`, `"gpt-4.1"`, `"claude-opus-4-6"`). `runner` identifies the tool used to execute the playbook (e.g., `"claude-code"`, `"copilot-cli"`, `"cursor"`, `"cowork"`). `duration_minutes` is computed from `end_time - start_time`. If the model or runner cannot be determined, use `"unknown"`. ## How to Use **The playbook is designed to run one phase at a time.** Each phase runs in its own session with a clean context window, producing files on disk that the next phase reads. This gives much better results than running all phases at once — each phase gets the full context window for deep analysis instead of competing for space with other phases. **Default behavior: run Phase 1 only.** When someone says "run the quality playbook" or "execute the quality playbook," run Phase 1 (Explore) and stop. 
After Phase 1 completes, tell the user what happened and what to say next. The user drives each phase forward explicitly. ### Interactive protocol — how to guide the user **After every phase and every iteration, STOP and print guidance.** Use a `#` header so it's prominent in the chat. The guidance must include: what just happened (one line), what the key outputs are, and the exact prompt to continue. See the end-of-phase messages defined after each phase section below. **If the user says "keep going", "continue", "next phase", "next", or anything similar**, run the next phase in sequence. If all phases are complete, suggest the first iteration strategy (gap). If an iteration just finished, suggest the next strategy in the recommended cycle. **If the user says "run all phases", "run everything", or "run the full pipeline"**, run all phases sequentially in a single session. This uses more context but some users prefer it. **If the user asks "help", "how does this work", "what is this", or any similar phrasing**, respond with this explanation (adapt the wording naturally, don't copy verbatim): > The Quality Playbook finds bugs that structural code review alone can't catch — the 35% of real defects that require understanding what the code is *supposed* to do. It works phase by phase: > > - **Phase 1 (Explore):** Understand the codebase — architecture, risks, failure modes, specifications > - **Phase 2 (Generate):** Produce quality artifacts — requirements, tests, review protocols > - **Phase 3 (Code Review):** Three-pass review with regression tests for every confirmed bug > - **Phase 4 (Spec Audit):** Three independent AI auditors check the code against requirements > - **Phase 5 (Reconciliation):** Close the loop — TDD red-green verification for every bug > - **Phase 6 (Verify):** Self-check benchmarks validate all generated artifacts > > After the numbered phases complete, you can run iteration strategies (gap, unfiltered, parity, adversarial) to find additional bugs — iterations typically add 40-60% more confirmed bugs on top of the baseline. > > The playbook works best when you provide documentation alongside the code — specs, API docs, design documents, community documentation. It also gets significantly better results when you run each phase separately rather than all at once. > > To get started, say: **"Run the quality playbook on this project."** **If the user asks "what happened", "what's going on", "where are we", or "what should I do next"**, read `quality/PROGRESS.md` and give them a concise status update: which phases are complete, how many bugs found so far, and what the next step is. ### Documentation warning **At the start of Phase 1, before exploring any code, check for documentation.** Look for directories named `docs/`, `docs_gathered/`, `doc/`, `documentation/`, or any gathered documentation files. Also check if the user mentioned documentation in their prompt. **If no documentation is found, print this warning immediately (before proceeding):** > **Important: No project documentation found.** The quality playbook works without documentation, but it finds significantly more bugs — and higher-confidence bugs — when you provide specs, API docs, design documents, or community documentation. In controlled experiments, documentation-enriched runs found different and better bugs than code-only baselines. > > If you have documentation available, you can add it to a `docs_gathered/` directory and re-run Phase 1. Otherwise, I'll proceed with code-only analysis. 
Then proceed with Phase 1 — don't block on this, just make sure the user sees the warning. ### Running a specific phase The user can request any individual phase: ``` Run quality playbook phase 1. Run quality playbook phase 3 — code review. Run phase 5 reconciliation. ``` When running a specific phase, check that its prerequisites exist (e.g., Phase 3 requires Phase 2 artifacts). If prerequisites are missing, tell the user which phases need to run first. ### Iteration mode — improve on a previous run Use this when a previous playbook run exists and you want to find additional bugs. Iteration mode replaces Phase 1's from-scratch exploration with a targeted exploration using one of five strategies, then merges findings with the previous run and re-runs Phases 2–6 against the combined results. **When to use iteration mode:** After a complete playbook run, when you believe the codebase has more bugs than the first run found. This is especially effective for large codebases where a single run can only cover 3–5 subsystems, and for library/framework codebases where different exploration paths find different bug classes. **Read `references/iteration.md` for detailed strategy instructions.** That file contains the full operational detail for each strategy, shared rules, merge steps, and the completion gate. The summary below describes when to use each strategy. **TDD applies to iteration runs.** Every newly confirmed bug in an iteration run must go through the full TDD red-green cycle and produce `quality/results/BUG-NNN.red.log` (and `.green.log` if a fix patch exists). The quality gate enforces this — missing logs cause FAIL. See `references/iteration.md` shared rule 5 and the TDD Log Closure Gate in Phase 5. **Iteration strategies.** The user selects a strategy by naming it in the prompt. If no strategy is named, default to `gap`. ``` Run the next iteration of the quality playbook. # default: gap strategy Run the next iteration of the quality playbook using the gap strategy. Run the next iteration using the unfiltered strategy. Run the next iteration using the parity strategy. Run an iteration using the adversarial strategy. ``` **Recommended cycle:** gap → unfiltered → parity → adversarial. Each strategy finds different bug classes: - **`gap`** (default) — Scan previous coverage, explore uncovered subsystems and thin sections. Best when the first run was structurally sound but only covered a subset of the codebase. - **`unfiltered`** — Pure domain-driven exploration with no structural constraints. No pattern templates, no applicability matrices, no section format requirements. Recovers bugs that structured exploration suppresses. - **`parity`** — Systematically enumerate parallel implementations of the same contract (transport variants, fallback chains, setup-vs-reset paths) and diff them for inconsistencies. Finds bugs that only emerge from cross-path comparison. - **`adversarial`** — Re-investigate dismissed/demoted triage findings and challenge thin SATISFIED verdicts. Recovers Type II errors from conservative triage. - **`all`** — Runner-level convenience: executes gap → unfiltered → parity → adversarial in sequence, each as a separate agent session. Stops early if a strategy finds zero new bugs. ### Phase-by-phase execution Each phase produces files on disk that the next phase reads. This is how context transfers between phases — through files, not through conversation history. The key handoff files are: - **`quality/EXPLORATION.md`** — Phase 1 writes this, Phase 2 reads it. 
Contains everything Phase 2 needs to generate artifacts without re-exploring the codebase.
- **`quality/PROGRESS.md`** — Updated after every phase. Cumulative BUG tracker ensures no finding is lost.
- **Generated artifacts** (REQUIREMENTS.md, CONTRACTS.md, etc.) — Phase 2 writes these, Phases 3–5 read them to run reviews, audits, and reconciliation.

The pattern for each phase boundary: finish the current phase, write everything to disk, then print the end-of-phase message and stop. When the user starts the next phase, read back the files you need before proceeding. This "write then read" cycle is the phase boundary — it lets you drop exploration context from working memory before loading review context, for example.

Write your Phase 1 exploration findings to `quality/EXPLORATION.md` before proceeding. This file is mandatory in all modes. Make it thorough: domain identification, architecture map, existing tests, specification summary, quality risks, skeleton/dispatch analysis, derived requirements (REQ-NNN), and derived use cases (UC-NN). Everything Phase 2 needs to generate artifacts must be in this file.

The discipline of writing exploration findings to disk is what forces thorough analysis. Without it, the model keeps vague impressions in working memory and produces broad, abstract requirements that miss function-level defects. Writing forces specificity: file paths, line numbers, exact function names, concrete behavioral rules. That specificity is what makes requirements precise enough to catch bugs during code review.

---

## Phase 0: Prior Run Analysis (Automatic)

**This phase runs only if `quality/runs/` exists and contains prior quality artifacts.** If `quality/runs/` does not exist, skip Phase 0a; Phase 0b (sibling-run seed discovery) below may still apply. If `quality/runs/` exists but is empty or contains no conformant quality artifacts (no subdirectories with `quality/BUGS.md` under them), skip Phase 0a and fall through to Phase 0b.

When prior runs exist, the playbook enters **continuation mode**. This enables iterative bug discovery: each run inherits confirmed findings from prior runs, verifies them mechanically, and explores for additional bugs. The iteration converges when a run finds zero net-new bugs.

**Step 0a: Build the seed list.** Read `quality/runs/*/quality/BUGS.md` from all prior runs. For each confirmed bug, extract: bug ID, file:line, summary, and the regression test assertion. Deduplicate by file:line (the same bug found in multiple runs counts once). Write the merged seed list to `quality/SEED_CHECKS.md` with this format:

```markdown
## Seed Checks (from N prior runs)

| Seed | Origin Run | File:Line | Summary | Assertion |
|------|-----------|-----------|---------|-----------|
| SEED-001 | run-1 | virtio_ring.c:3509-3529 | RING_RESET dropped | `"case VIRTIO_F_RING_RESET:" in func` |
```

**Step 0b: Execute seed checks mechanically.** For each seed, run the assertion against the current source tree. Record PASS (bug was fixed since last run) or FAIL (bug still present). A failing seed is a confirmed carry-forward bug — it must appear in this run's BUGS.md regardless of whether any auditor independently finds it. A passing seed means the bug was fixed — note it in PROGRESS.md as "SEED-NNN: resolved since prior run."

**Step 0c: Identify prior-run scope.** Read `quality/runs/*/quality/PROGRESS.md` for scope declarations. Note which subsystems were covered in prior runs. During Phase 1 exploration, prioritize areas NOT covered by prior runs to maximize the chance of finding new bugs.
If all subsystems were covered in prior runs, explore the same scope but with different emphasis (e.g., different scrutiny areas, different entry points).

**Step 0d: Inject seeds into downstream phases.** The seed list becomes input to:
- **Phase 3 (code review):** Add to the code review prompt: "Prior runs confirmed these bugs — verify they are still present and look for additional findings in the same subsystems."
- **Phase 4 (spec audit):** Add to `RUN_SPEC_AUDIT.md`: "Known open issues from prior runs: [seed list]. Expect auditors to find these. If an auditor does NOT flag a known seed bug, that is a coverage gap in their review, not evidence the bug was fixed."

**Why this exists:** Non-deterministic scope exploration means different runs notice different bugs. In cross-version testing, 4/8 repos had bugs found in some versions but not others — not because the bugs were fixed, but because the model explored different parts of the codebase. Iterating with seed injection solves this: confirmed bugs carry forward mechanically (no re-discovery needed), and each new run can focus exploration on uncovered territory.

### Phase 0b: Sibling-Run Seed Discovery (Automatic)

**This step runs only if `quality/runs/` does not exist OR `quality/runs/` exists but contains no conformant quality artifacts** (i.e., Phase 0a has nothing to work with) **and** the project directory is versioned (e.g., `httpx-1.3.23/` sits alongside `httpx-1.3.21/`). If `quality/runs/` exists with conformant artifacts, Phase 0a already handles seed injection — skip this step.

**If `quality/runs/` exists but is empty or contains only non-conformant subdirectories**, emit a warning: "Phase 0b: `quality/runs/` exists but contains no conformant artifacts — consulting sibling versioned directories for seeds." Then proceed with the sibling discovery below.

When no `quality/runs/` directory exists but sibling versioned directories do, look for prior quality artifacts in those siblings:

1. **Discover siblings.** List sibling directories matching the versioned naming pattern `<project>-<version>/` (relative to the parent directory) that contain a `quality/BUGS.md`. Exclude the current directory. Sort by version descending (most recent first).
2. **Import confirmed bugs as seeds.** For each sibling with a `quality/BUGS.md`, extract confirmed bugs using the same format as Step 0a. Write them to `quality/SEED_CHECKS.md` with origin noted as the sibling directory name.
3. **Execute seed checks mechanically** (same as Step 0b in Phase 0a). For each imported seed, run the assertion against the current source tree and record PASS/FAIL.
4. **Inject into downstream phases** (same as Step 0d in Phase 0a).

**Why this exists:** In v1.3.23 benchmarking, httpx produced a zero-bug result despite httpx-1.3.21 having found the `Headers.__setitem__` non-ASCII encoding bug. The model simply explored different code paths and never examined the Headers area. Sibling-run seeding ensures that bugs confirmed in prior versioned runs carry forward even without an explicit `quality/runs/` archive. This is a different failure class than mechanical tampering — it addresses **exploration non-determinism**, not evidence corruption.
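Seed checks are meant to stay cheap and mechanical. As an illustration only (the skill does not ship a seed runner), a substring assertion like the SEED-001 example can be evaluated against the cited region of the current source tree in a few lines of Python. The `run_seed_check` helper and its PASS/FAIL mapping are assumptions for this sketch, not a prescribed interface.

```python
# Illustrative sketch of a mechanical seed check; not part of the skill.
# Assumption: the assertion mirrors the regression test, so it holds once the
# bug is fixed (PASS) and fails while the bug is still present (FAIL).
from pathlib import Path


def run_seed_check(location: str, needle: str, repo: Path = Path(".")) -> str:
    """Check whether a substring assertion holds in the cited file:line region."""
    path, _, line_range = location.partition(":")
    start, _, end = line_range.partition("-")
    lines = (repo / path).read_text(errors="replace").splitlines()
    lo = max(int(start) - 1, 0)
    hi = int(end) if end else lo + 1
    region = "\n".join(lines[lo:hi])
    return "PASS (fixed since prior run)" if needle in region else "FAIL (carry-forward bug)"


# Mirrors the SEED-001 row in the table above:
print("SEED-001:", run_seed_check("virtio_ring.c:3509-3529",
                                  "case VIRTIO_F_RING_RESET:"))
```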
---

## Phase 1: Explore the Codebase (Write As You Go)

> **Required references for this phase** — read these before proceeding:
> - `references/exploration_patterns.md` — six bug-finding patterns to apply after open exploration

**First action: create run metadata.** Before any exploration, create the run metadata file:

```bash
mkdir -p quality/results
cat > "quality/results/run-$(date -u +%Y-%m-%dT%H-%M-%S).json" <<'METADATA'
{
  "schema_version": "1.0",
  "skill_version": "1.5.2",
  "project": "",
  "model": "",
  "model_provider": "",
  "runner": "",
  "start_time": "",
  "end_time": null,
  "duration_minutes": null,
  "phases_completed": [],
  "iterations_completed": [],
  "bug_count": 0,
  "bug_severity": { "HIGH": 0, "MEDIUM": 0, "LOW": 0 },
  "gate_result": null,
  "gate_fail_count": null,
  "gate_warn_count": null,
  "notes": ""
}
METADATA
```

Fill in `project`, `model` (exact model string, e.g., `"claude-sonnet-4-6"`), `model_provider` (e.g., `"anthropic"`, `"openai"`, `"cursor"`), `runner` (e.g., `"claude-code"`, `"copilot-cli"`, `"cursor"`), and `start_time` (UTC ISO 8601). Update this file at the end of each phase — append the completed phase to `phases_completed` and update `bug_count`/`bug_severity` as bugs are confirmed. The final update after the terminal gate fills in `end_time`, `duration_minutes`, and `gate_result`.

**Second action: run the v1.5.2 document ingest (before exploring any code).** A single stdlib-only module in `bin/` produces the authoritative documentation record that Phase 1 requirement derivation depends on:

1. **`python -m bin.reference_docs_ingest <target-repo>`** — walks `reference_docs/` in the target repo once. Files under `reference_docs/cite/` are hashed and written to `quality/formal_docs_manifest.json` per `schemas.md` §4 and the §1.6 manifest wrapper. Files at the top level of `reference_docs/` are not written to the manifest but are available as Tier 4 context via `bin.reference_docs_ingest.load_tier4_context()`, which returns a sorted list of `(path, text)` tuples. If the ingest command fails (unsupported extension, non-UTF-8 bytes), stop the run and surface the stderr output to the user verbatim — ingest errors are actionable and must be fixed before exploration continues.

**No sidecar needed.** Folder placement is the flag: top-level `reference_docs/` files are Tier 4 context; files under `reference_docs/cite/` are citable sources. Tier 1 is the default for `cite/` contents; a file may override to Tier 2 with an optional in-file marker on the first non-blank line: `<!-- qpb-tier: 2 -->` (Markdown) or `# qpb-tier: 2` (plaintext). `README.md` under either folder is skipped.

**When `reference_docs/` is missing or empty**, Phase 1 MUST print this actionable message and proceed:

> Phase 1 found no documentation in reference_docs/. The playbook will proceed
> using only Tier 3 evidence (the source tree itself). For better results, drop
> plaintext documentation into:
>   reference_docs/       ← AI chats, design notes, retrospectives (Tier 4 context)
>   reference_docs/cite/  ← project specs, RFCs, API contracts (citable, byte-verified)
> See README.md "Step 1: Provide documentation" for details.

**Plaintext only — conversion happens outside the playbook.** Reference docs are `.txt` or `.md` only (schemas.md §2). PDFs, DOCX, HTML, etc. are rejected with an actionable conversion hint (`pdftotext`, `pandoc -t plain`, `lynx -dump`). Do NOT attempt to parse binary or formatted documents inside the skill — run the conversion outside and commit the plaintext.

Spend the first phase understanding the project.
The quality playbook must be grounded in this specific codebase — not generic advice. **Why explore first?** The most common failure in AI-generated quality playbooks is producing generic content — coverage targets that could apply to any project, scenarios that describe theoretical failures, tests that exercise language builtins instead of project code. Exploration prevents this by forcing every output to reference something real: a specific function, a specific schema, a specific defensive code pattern. If you can't point to where something lives in the code, you're guessing — and guesses produce quality playbooks nobody trusts. **Scaling for large codebases:** For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations). Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything. **Depth over breadth (critical).** A narrow scope with function-level detail finds more bugs than a broad scope with subsystem-level summaries. For each core module you explore, identify the specific functions that implement critical behavior and document them by name, file path, and line number. Requirements derived from "the reset subsystem should handle errors" will not catch bugs. Requirements derived from "`vm_reset()` at `virtio_mmio.c:256` must poll the status register after writing zero" will. The difference between a useful exploration and a useless one is specificity — file paths, function names, line numbers, exact behavioral rules. **Three-stage exploration: open first, then domain risks, then selected patterns.** Exploration has three stages, and the order matters: 1. **Open exploration (domain-driven).** Before applying any structured pattern, explore the codebase the way an experienced developer would: read the code, understand the architecture, identify risks based on your domain knowledge of what goes wrong in systems like this one. Ask yourself: "What would an expert in [this domain] check first?" For an HTTP library, that means redirect handling, header encoding, connection lifecycle. For a CLI framework, that means flag parsing, help generation, completion/validation consistency. For a serialization library, that means type coverage, round-trip fidelity, edge-case handling. Write concrete findings with file paths and line numbers. This stage must produce at least 8 concrete bug hypotheses or suspicious findings — not architectural observations, but specific "this code at file:line might be wrong because [reason]" findings. At least 4 must reference different modules or subsystems. 2. **Domain-knowledge risk analysis.** After open exploration, step back from the code and reason about what you know from training about systems like this one. This is the primary bug-hunting pass for library and framework codebases. Complete the Step 6 questions below using two sources — the code you just explored AND your domain knowledge of similar systems. Generate at least 5 ranked failure scenarios, each naming a specific function, file, and line, and explaining why a domain-specific edge case produces wrong behavior. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write the results to the `## Quality Risks` section of EXPLORATION.md before proceeding to patterns. 
**What this stage must NOT produce:** A section that lists defensive patterns the code already has (things the code does RIGHT) is not a risk analysis. A section that lists risky modules without specific failure scenarios is not a risk analysis. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what to check? If not, the scenario is too abstract. 3. **Pattern-driven exploration (selected, not exhaustive).** After open exploration and domain-risk analysis are written to disk, evaluate all six analysis patterns from `exploration_patterns.md` using a pattern applicability matrix. For each pattern, assess whether it applies to this codebase and what it would target. Then select 3 to 4 patterns for deep-dive treatment — the highest-yield patterns for this specific codebase. The remaining patterns get a brief "not applicable" or "deferred" note with codebase-specific rationale. Do not produce deep sections for all six patterns — depth on 3–4 beats shallow coverage of 6. Select 4 when a fourth pattern has clear applicability and would cover code areas not reached by the other three; default to 3 when in doubt. For each selected pattern deep dive, use the output format from the reference file and trace code paths across 2+ functions. The deep dives should pressure-test, refine, or extend the findings from the open exploration and risk analysis — not repeat them. The Phase 1 completion gate checks for all three stages. The open exploration section, the quality risks section, the pattern applicability matrix, and the pattern deep-dive sections must all be present. **Write incrementally — do not hold findings in memory.** This is the single most important execution rule in Phase 1. After you explore each subsystem or apply each pattern, **immediately append your findings to `quality/EXPLORATION.md` on disk before moving to the next subsystem or pattern.** Do not try to hold findings in working memory across multiple subsystems. The write-as-you-go discipline serves two purposes: 1. **Depth recovery.** If you explore the PCI interrupt routing subsystem and find suspicious code at `vp_find_vqs_intx()`, write that finding to EXPLORATION.md immediately. Then when you move to the admin queue subsystem, your working memory is free to go deep there. Without incremental writes, findings from the first subsystem compete with findings from the second, and both end up shallow. 2. **Nothing gets lost.** In v1.3.41 benchmarking, the model explored 8 pattern sections but wrote only 5–7 lines per section — perfectly uniform, perfectly shallow. Every section passed the gate but none went deep enough to find bugs that require tracing code paths across multiple functions. The model was trying to compose the entire EXPLORATION.md at the end, after reading everything, and could only recall the surface-level findings. Incremental writes prevent this. **The rhythm is: read a subsystem → write findings to disk → read the next subsystem → append findings → repeat.** Each append should include specific function names, file paths, line numbers, and concrete bug hypotheses. A 5-line section that says "checked cross-implementation consistency, found one gap" is a gate-passing placeholder, not an exploration finding. 
A useful section traces a code path: "function A at file:line calls function B at file:line, which does X but not Y; compare with function C at file:line which does both X and Y." **Mandatory consolidation step.** After all three stages (open exploration, quality risks, and selected pattern deep dives) are explored and written to EXPLORATION.md, add a final section: `## Candidate Bugs for Phase 2`. This section consolidates the strongest bug hypotheses from all earlier sections into a prioritized handoff list. For each candidate, include: the hypothesis, the specific file:line references, which stage surfaced it (open exploration, quality risks, or pattern), and what the code review should look for. This section is the bridge between exploration and artifact generation — it tells Phase 3 exactly where to focus. Minimum: 4 candidate bugs with file:line references — at least 2 from open exploration or quality risks, and at least 1 from a pattern deep dive. There is no maximum. **Pre-flight: Scope declaration for large repositories** Before exploring any source code, estimate scale: approximate source-file count (excluding tests, docs, and generated files), major subsystem count, and documentation volume. Note the count in PROGRESS.md. - **Fewer than 200 source files:** Proceed with full exploration. The depth-vs-breadth guidance above still applies. - **200–500 source files:** Declare your intended scope before exploring. Write a `## Scope declaration` section to PROGRESS.md naming the 3–5 subsystems you will cover, the expected file count for each, and which subsystems you are deferring with rationale. Then proceed with exploration of the declared scope only. - **More than 500 source files:** Stop and write a mandatory scope declaration to PROGRESS.md before reading any source files. The scope declaration must include: (a) the subsystems covered in this run, (b) the subsystems explicitly deferred, (c) the exclusion rationale for each deferred subsystem, and (d) recommended subsystem scope for follow-on runs. Do not begin exploration until this is written. A scope declaration that covers "everything" is not valid for repositories above this threshold. **Resuming a previous session:** If PROGRESS.md already exists and shows phases marked complete, read it first. Do not redo phases already marked complete — resume from the first phase marked incomplete. If a scope declaration is already written, honor it exactly. If the previous session's scope declaration deferred subsystems, do not expand scope to cover them unless this run is explicitly a follow-on for the deferred areas. **Specification-primary repositories:** Some repositories ship a specification, configuration, or protocol document as their primary product, with executable code as supporting infrastructure. Examples: a skill definition with benchmark tooling, a schema registry with validation scripts, a pipeline config with orchestration helpers. When the primary product is a specification rather than executable code, derive requirements from the specification's internal consistency, completeness, and correctness — not just from the executable code paths. The specification is the thing users depend on; the tooling is secondary. If you find yourself writing 80%+ of requirements about helper scripts and <20% about the primary specification, you have the focus inverted. 
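For the pre-flight scale estimate above, a rough source-file count is enough to pick the right scope-declaration tier. Here is a minimal sketch; the extension list and excluded directories are assumptions to adjust to the project's layout, not a canonical definition of "source file."

```python
# Rough pre-flight scale estimate for the scope-declaration thresholds above.
# SOURCE_EXTS and EXCLUDE_DIRS are illustrative defaults; tune them per project.
from pathlib import Path

SOURCE_EXTS = {".py", ".ts", ".js", ".go", ".rs", ".java", ".scala", ".c", ".h"}
EXCLUDE_DIRS = {"test", "tests", "docs", "doc", "examples", "vendor",
                "node_modules", "build", "dist", ".git", "generated"}


def estimate_source_files(root: str = ".") -> int:
    count = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in SOURCE_EXTS:
            continue
        if any(part.lower() in EXCLUDE_DIRS for part in path.parts):
            continue
        count += 1
    return count


n = estimate_source_files()
print(f"~{n} source files ->",
      "full exploration" if n < 200 else
      "scope declaration in PROGRESS.md" if n <= 500 else
      "mandatory scope declaration before reading any source")
```

Record the resulting count in PROGRESS.md along with the tier it implies, as described above.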
### Step 0: Ask About Development History Before exploring code, ask the user one question: > "Do you have exported AI chat history from developing this project — Claude exports, Gemini takeouts, ChatGPT exports, Claude Code transcripts, or similar? If so, point me to the folder. The design discussions, incident reports, and quality decisions in those chats will make the generated quality playbook significantly better." If the user provides a chat history folder: 1. **Scan for an index file first.** Look for files named `INDEX*`, `CONTEXT.md`, `README.md`, or similar navigation aids. If one exists, read it — it will tell you what's there and how to find things. 2. **Search for quality-relevant conversations.** Look for messages mentioning: quality, testing, coverage, bugs, failures, incidents, crashes, validation, retry, recovery, spec, fitness, audit, review. Also search for the project name. 3. **Extract design decisions and incident history.** The most valuable content is: (a) incident reports — what went wrong, how many records affected, how it was detected, (b) design discussions — why a particular approach was chosen, what alternatives were rejected, (c) quality framework discussions — coverage targets, testing philosophy, model review experiences, (d) cross-model feedback — where different AI models disagreed about the code. 4. **Don't try to read everything.** Chat histories can be enormous. Use the index to find the most relevant conversations, then search within those for quality-related content. 10 minutes of targeted searching beats 2 hours of exhaustive reading. This context is gold. A chat history where the developer discussed "why we chose this concurrency model" or "the time we lost 1,693 records in production" transforms generic scenarios into authoritative ones. If the user doesn't have chat history, proceed normally — the skill works without it, just with less context. **Autonomous fallback:** When running in benchmark mode, via `bin/run_playbook.py` (benchmark runner, not shipped with the skill), or without user interaction (e.g., `--single-pass`), skip Step 0's question and proceed directly to Step 1. If chat history folders are visible in the project tree (e.g., `AI Chat History/`, `.chat_exports/`), scan them without asking. If no chat history is found, proceed — do not block waiting for a response that won't come. ### Step 1: Identify Domain, Stack, and Specifications Read the README, existing documentation, and build config (`pyproject.toml` / `package.json` / `Cargo.toml`). Answer: - What does this project do? (One sentence.) - What language and key dependencies? - What external systems does it talk to? - What is the primary output? **Find the specifications.** Specs are the source of truth for functional tests. Search in order: `AGENTS.md`/`CLAUDE.md` in root, `specs/`, `docs/`, `spec/`, `design/`, `architecture/`, `adr/`, then `.md` files in root. Record the paths. **If no formal spec documents exist**, the skill still works — but you need to assemble requirements from other sources. In order of preference: 1. **Ask the user** — they often know the requirements even if they're not written down. 2. **README and inline documentation** — many projects embed requirements in their README, API docs, or code comments. 3. **Existing test suite** — tests are implicit specifications. If a test asserts `process(x) == y`, that's a requirement. 4. **Type signatures and validation rules** — schemas, type annotations, and validators define what the system accepts and rejects. 
5. **Infer from code behavior** — as a last resort, read the code and infer what it's supposed to do. Mark these as *inferred requirements* in QUALITY.md and flag them for user confirmation. When working from non-formal requirements, label each scenario and test with a **requirement tag** that includes a confidence tier and source: - `[Req: formal — README §3]` — written by humans in a spec document. Authoritative. - `[Req: user-confirmed — "must handle empty input"]` — stated by the user but not in a formal doc. Treat as authoritative. - `[Req: inferred — from validate_input() behavior]` — deduced from code. Flag for user review. Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation. ### Step 1b: Evaluate Documentation Depth If `docs_gathered/` exists, read every file in it before deciding which subsystems to focus on. For each document, classify its depth: - **Deep** — contains internal contracts, safety invariants, concurrency models, defensive patterns, error handling details, or line-number-level source references. Suitable for deriving requirements. - **Moderate** — covers architecture and API surface with some implementation detail. Useful for orientation but insufficient alone for requirement derivation. - **Shallow** — API catalog, feature overview, or marketing-level summary. Lists what exists but not how it works, how it fails, or what contracts it enforces. **Not sufficient for scoping decisions.** **The scoping rule:** Do not narrow the audit scope to only the subsystems that have deep documentation. If the most complex or most failure-prone module has only shallow documentation, that is a **documentation gap to flag in PROGRESS.md**, not a reason to skip the module. The highest-risk code with the thinnest documentation is where bugs hide — auditing only well-documented areas produces a safe-looking report that misses real defects. When documentation is shallow for a high-risk area: 1. Note the gap explicitly in PROGRESS.md under a `## Documentation depth assessment` section. 2. Derive requirements from source code directly (doc comments, safety annotations, defensive patterns, existing tests) and tag them as `[Req: inferred — from source]`. 3. Flag the area for deeper documentation gathering in the completeness report. Record the depth classification for each `docs_gathered/` file in PROGRESS.md so reviewers can assess whether the documentation influenced the scope appropriately. **Coverage commitment table:** After classifying all `docs_gathered/` documents, produce this table in PROGRESS.md under the `## Documentation depth assessment` section: | Document | Depth | Subsystem | Requirements commitment | If excluded: justification | |----------|-------|-----------|------------------------|---------------------------| For every **deep** document, map it to the subsystem it covers, then either commit to deriving requirements from it ("will cover in Phase 2") or provide a specific justification that names the tradeoff. A sentence like "out of scope for this run" is not sufficient — the justification must say *why*, e.g., "interpreter JIT is excluded because this run focuses on the parser/compiler/GC pipeline; separate run recommended." **Gate:** A high-risk subsystem documented deeply in `docs_gathered/` must not silently disappear from the requirements set. 
If a deep document has a "will cover" commitment but produces zero requirements by the end of Step 7, the requirements pipeline is incomplete — go back and derive requirements for the gap before proceeding to Phase 2 artifact generation. ### Step 2: Map the Architecture List source directories and their purposes. Read the main entry point, trace execution flow. Identify: - The 3–5 major subsystems - The data flow (Input → Processing → Output) - The most complex module - The most fragile module ### Step 3: Read Existing Tests Read the existing test files — all of them for small/medium projects, or a representative sample from each subsystem for large ones. Identify: test count, coverage patterns, gaps, and any coverage theater (tests that look good but don't catch real bugs). **Critical: Record the import pattern.** How do existing tests import project modules? Every language has its own conventions (Python `sys.path` manipulation, Java/Scala package imports, TypeScript relative paths or aliases, Go package/module paths, Rust `use crate::` or `use myproject::`). You must use the exact same pattern in your functional tests — getting this wrong means every test fails with import/resolution errors. See `references/functional_tests.md` § "Import Pattern" for the full six-language matrix. **Identify integration test runners.** Look for scripts or test files that exercise the system end-to-end against real external services (APIs, databases, etc.). Note their patterns — you'll need them for `RUN_INTEGRATION_TESTS.md`. ### Step 4: Read the Specifications Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close. If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 7. ### Step 4b: Read Function Signatures and Real Data Before writing any test, you must know exactly how each function is called. For every module you identified in Step 2: 1. **Read the actual function signatures** — parameter names, types, defaults. Don't guess from usage context — read the function definition and any documentation (Python docstrings, Java/Scala Javadoc/ScalaDoc, TypeScript type annotations, Go godoc comments, Rust doc comments and type signatures). 2. **Read real data files** — If the project has items files, fixture files, config files, or sample data (in `pipelines/`, `fixtures/`, `test_data/`, `examples/`), read them. Your test fixtures must match the real data shape exactly. 3. **Read existing test fixtures** — How do existing tests create test data? Copy their patterns. If they build config dicts with specific keys, use those exact keys. 4. **Check library versions** — Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`, `pom.xml`/`build.gradle`, `go.mod`, `Cargo.toml`) to see what's actually available. Don't write tests that depend on library features that aren't installed. If a dependency might be missing, use the test framework's skip mechanism — see `references/functional_tests.md` § "Library version awareness" for framework-specific examples. Record a **function call map**: for each function you plan to test, write down its name, module, parameters, and what it returns. 
This map prevents the most common test failure: calling functions with wrong arguments. ### Step 5: Find the Skeletons This is the most important step. Search for defensive code patterns — each one is evidence of a past failure or known risk. **Why this matters:** Developers don't write `try/except` blocks, null checks, or retry logic for fun. Every piece of defensive code exists because someone got burned. A `try/except` around a JSON parse means malformed JSON happened in production. A null check on a field means that field was missing when it shouldn't have been. These patterns are the codebase whispering its history of failures. Each one becomes a fitness-to-purpose scenario and a boundary test. **Read `references/defensive_patterns.md`** for the systematic search approach, grep patterns, and how to convert findings into fitness-to-purpose scenarios and boundary tests. Minimum bar: at least 2–3 defensive patterns per core source file. If you find fewer, you're skimming — read function bodies, not just signatures. ### Step 5a: Trace State Machines If the project has any kind of state management — status fields, lifecycle phases, workflow stages, mode flags — trace the state machine completely. This catches a category of bugs that defensive pattern analysis alone misses: states that exist but aren't handled. **How to find state machines:** Search for status/state fields in models, enums, or constants (e.g., `status`, `state`, `phase`, `mode`). Search for guards that check status before allowing actions (e.g., `if status == "running"`, `match self.state`). Search for state transitions (assignments to status fields). **For each state machine you find:** 1. **Enumerate all possible states.** Read the enum, the constants, or grep for every value the field is assigned. List them all. 2. **For each consumer of state** (UI handlers, API endpoints, control flow guards), check: does it handle every possible state? A `switch`/`match` without a meaningful default, or an `if/elif` chain that doesn't cover all states, is a gap. 3. **For each state transition**, check: can you reach every state? Are there states you can enter but never leave? Are there states that block operations that should be available? 4. **Record gaps as findings.** A status guard that allows action X for "running" but not for "stuck" is a real bug if the user needs to perform action X on stuck processes. A process that enters a terminal state but never triggers cleanup is a real bug. **Why this matters:** State machine gaps produce bugs that are invisible during normal operation but surface under stress or edge conditions — exactly when you need the system to work. A batch processor that can't be killed when it's in "stuck" status, or a watcher that never self-terminates after all work completes, or a UI that refuses to resume a "pending" run, are all symptoms of incomplete state handling. These bugs don't show up in defensive pattern analysis because the code isn't defending against them — it's simply not handling them at all. ### Step 5b: Map Schema Types If the project has a validation layer (Pydantic models in Python, JSON Schema, TypeScript interfaces/Zod schemas, Java Bean Validation annotations, Scala case class codecs), read the schema definitions now. For every field you found a defensive pattern for, record what the schema accepts vs. rejects. **Read `references/schema_mapping.md`** for the mapping format and why this matters for writing valid boundary tests. 
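Steps 5 and 5a both begin with a mechanical sweep before the real reading starts. As a rough sketch (the pattern lists are assumptions to tune per language and project), a first pass can simply surface candidate lines for manual inspection:

```python
# Illustrative first-pass sweep for Step 5 (defensive patterns) and Step 5a
# (state fields). The regexes are placeholder heuristics; the real work is
# reading the surrounding function bodies, not the grep itself.
import re
from pathlib import Path

DEFENSIVE = [r"\btry\b", r"\bexcept\b", r"\bcatch\b", r"\bretry\b",
             r"is None", r"!= null", r"== null", r"\bfallback\b"]
STATE = [r"\bstatus\b", r"\bstate\b", r"\bphase\b", r"\bmode\b"]


def scan(root: str = ".", exts: tuple[str, ...] = (".py",)) -> None:
    for path in Path(root).rglob("*"):
        if path.suffix not in exts or "test" in path.parts:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            for label, patterns in (("defensive", DEFENSIVE), ("state", STATE)):
                if any(re.search(p, line) for p in patterns):
                    print(f"{label:9} {path}:{lineno}: {line.strip()[:80]}")
                    break


scan()
```

The output is only a worklist: each hit still needs the surrounding function read to determine what past failure the guard implies, or which states the code never handles.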
### Step 6: Domain-Knowledge Risk Analysis (Code + Domain Knowledge) **This is the primary bug-hunting pass for library and framework codebases.** Complete it before selecting any structured patterns. Write the results to the `## Quality Risks` section of EXPLORATION.md immediately — do not hold them in memory. Every project has a different failure profile. This step uses **two sources** — not just code exploration, but your training knowledge of what goes wrong in similar systems. **From code exploration**, ask: - What does "silently wrong" look like for this project? - What external dependencies can change without warning? - What looks simple but is actually complex? - Where do cross-cutting concerns hide? **From domain knowledge**, ask: - "What goes wrong in systems like this?" — If it's an HTTP router, think about header parsing edge cases (quality values, token lists, case sensitivity), middleware ordering dependencies, and path normalization. If it's an HTTP client, think about redirect credential stripping, encoding detection, and connection state leaking. If it's a serialization library, think about null handling asymmetry, API surface consistency between direct methods and view wrappers, lazy evaluation caching bugs, and round-trip fidelity. If it's a web framework, think about response helper edge cases, configuration compilation chains, and middleware state isolation. If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it handles randomness or statistics, think about seeding, correlation, distribution bias. - "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted. A response with a `200 OK` but the wrong `Content-Type`. A redirect that succeeds but leaks credentials. A deserialized object that has silently truncated values. - "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure. - "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission. - "Where do two surfaces that should behave the same drift on edge inputs?" — Overloads, aliases, sync/async APIs, builder vs direct APIs, direct mutators vs live views/wrappers, stdlib-compatible wrappers vs framework-native surfaces. For Java/Kotlin: `add(null)` vs `asList().add(null)`, `put(key,null)` vs `asMap().put(key,null)`. For Python: constructor encoding vs mutator encoding, sync vs async client behavior. - "What emits plausible output with subtly wrong metadata?" — Content type, charset, route pattern, ETag strength, byte count, auth/header/cookie propagation, status code, cache validators. - "What standard grammar or list syntax is being parsed with ad hoc string logic?" — Quality values (`q=0`), comma-separated headers, digest challenges, MIME types with parameters, query strings, enum/keyword sets, cookie merging. - "What edge-case inputs would a domain expert reach for?" — For HTTP code: `Accept-Encoding: gzip;q=0`, `Connection: keep-alive, Upgrade`, `Content-Type: application/problem+json`. For serialization code: `null` through different API surfaces, values at `Integer.MAX_VALUE + 1`, round-tripping through encode-then-decode. For routing code: overlapping patterns, mounted prefix propagation, same path with different methods. - "What information does the user need before committing to an irreversible or expensive operation?" 
— Pre-run cost estimates, confirmation of scope (especially when fan-out or expansion will multiply the work), resource warnings. If the system can silently commit the user to hours of processing or significant cost without showing them what they're about to do, that's a missing safeguard. Search for operations that start long-running processes, submit batch jobs, or trigger expansion/fan-out — and check whether the user sees a preview, estimate, or confirmation with real numbers before the point of no return. - "What happens when a long-running process finishes — does it actually stop?" — Polling loops, watchers, background threads, and daemon processes that run until completion should have explicit termination conditions. If the loop checks "is there more work?" but never checks "is all work done?", it will run forever after completion. This is especially common in batch processors and queue consumers. Generate at least 5 ranked failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as **specific bug hypotheses with file-path and line-number citations**, ranked by priority. Frame each as: "Because [code at file:line] does [X], a [domain-specific edge case] will produce [wrong behavior] instead of [correct behavior]." Then ground them in the actual code you explored: "Read persistence.py line ~340 (save_state): verify temp file + rename pattern." **Anti-patterns that fail the gate:** A Quality Risks section that lists defensive patterns the code already has (things the code does right) is not a risk analysis — it is a reassurance exercise and will not find bugs. A section that lists risky modules without specific failure scenarios is not actionable. A section that concludes "this is a mature, well-tested library so basic bugs are unlikely" is actively harmful — mature libraries have the most subtle API-contract and edge-case bugs, precisely because the obvious ones were found years ago. The test: could a code reviewer read each scenario and immediately know what function to open and what input to test? If not, the scenario is too abstract. ### Step 7: Derive Testable Requirements **Read `references/requirements_pipeline.md`** for the complete five-phase pipeline, domain checklist, and versioning protocol. This is the most important step for the code review protocol. Everything found during exploration — specs, ChangeLog entries, config structs, source comments, chat history — gets distilled into a set of testable requirements that the code review will verify. The pipeline separates contract discovery from requirement derivation, uses file-based external memory, and includes mechanical verification with a completeness gate. **Why this matters:** Structural code review catches about 65% of real defects. The remaining 35% are intent violations — absence bugs, cross-file contradictions, and design gaps. These are invisible to code reading because the code that IS there is correct. You need to know what the code is supposed to do, then check whether it does it. That's what testable requirements provide. **The five-phase pipeline:** 1. **Phase A — Contract extraction.** Read all source files, list every behavioral contract. Write to `quality/CONTRACTS.md`. This is discovery — list everything, even if it seems obvious. 2. **Phase B — Requirement derivation.** Read CONTRACTS.md and documentation. Group related contracts, enrich with user intent, write formal requirements. 
Write REQ records to `quality/requirements_manifest.json` (source of truth) and render to `quality/REQUIREMENTS.md`. For each requirement, record the `tier` (1–5 per schemas.md §3.1) and — when `tier ∈ {1, 2}` — the `citation` block produced by `bin/reference_docs_ingest` invoking `bin/citation_verifier` per schemas.md §5.4 / §5.5. The LLM does not shell out to `citation_verifier` directly; the excerpt is a product of the ingest pipeline and is re-verified by `quality_gate.py` at gate time. For Tier 3 REQs (code-is-the-spec), cite the source `file:line` in the `description`; citations are for FORMAL_DOC references only and must not appear on Tier 3/4/5 REQs. The tier + citation pair creates the forward link in the traceability chain: reference_docs/cite → requirements → bugs → tests. See the tier/citation framing block later in this step for the full field list and the Tier-1-wins-over-Tier-2 rule. **Optional `Pattern:` field on REQs.** A requirement that needs a Phase 3 compensation grid should declare its pattern class: - `Pattern: whitelist` — authoritative list of items, every site must handle each one. - `Pattern: parity` — symmetric operations that must match (encode↔decode, setup↔teardown). - `Pattern: compensation` — sites that must compensate for a shared gap. Missing the field means no grid. Setting an invalid value fails `quality_gate.py`. **Preservation rule (Phase 2).** While `Pattern:` is optional in the design sense (some REQs are single-site and need no grid), it is REQUIRED to preserve when the Phase-1 hypothesis already carried it. Phase 2 must transcribe `Pattern:` from EXPLORATION.md to `quality/REQUIREMENTS.md` and `quality/requirements_manifest.json` whenever present. Silent omission is a documented v1.4.5-regression vector — the Phase 5 cardinality gate cannot enforce coverage on a REQ it doesn't know is pattern-tagged. The gate's structural backstop (C13.7/Fix 2) cross-checks REQs that carry per-site UC references (`UC-N.a`/`UC-N.b` form emitted by Phase 1's Cartesian UC rule) and fails the gate if Pattern is missing on such a REQ. **Primary-source extraction rule for code-presence claims.** When writing a requirement that asserts specific constants, values, or labels are handled by a specific function (e.g., "the whitelist must preserve X, Y, and Z"), the requirement must distinguish between what the **spec says should be there** and what the **code actually contains**. Extract the actual contents from the code (case labels, map keys, if-else branches) and compare to the spec's list. If a constant appears in the spec but NOT in the code, write the requirement as "must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() at file.c:NN-NN." Do not write "must preserve X" without verifying X is actually preserved. This prevents a contamination chain where a requirement asserts code presence, the code review copies the assertion, the spec audit inherits it, and the triage accepts it — all without anyone reading the actual code. This exact chain was observed in v1.3.17 virtio testing: REQUIREMENTS.md asserted RING_RESET was preserved in a switch, the code review copied the list, three spec auditors inherited the claim, and the bug went undetected. 
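For example, the spec-vs-code comparison can itself be done mechanically. A minimal sketch, reusing the virtio example from above (the header path and constant prefix are placeholders; substitute the project's own spec source and naming):

```bash
# Illustrative only: constants the spec/header says must be handled...
grep -oE 'VIRTIO_F_[A-Z_0-9]+' include/uapi/linux/virtio_config.h | sort -u > /tmp/spec_constants.txt
# ...versus constants the dispatch function actually handles:
awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c \
  | grep -oE 'VIRTIO_F_[A-Z_0-9]+' | sort -u > /tmp/code_constants.txt
# Anything in the first list but not the second is written up as "[NOT IN CODE]":
comm -23 /tmp/spec_constants.txt /tmp/code_constants.txt
```

The next rule formalizes this kind of extraction and makes its output the artifact of record.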
**Mechanical verification artifact for dispatch functions (mandatory).** When a contract asserts that a function handles, preserves, or dispatches a set of named constants (feature bits, enum values, opcode tables, event types, handler registries), you must generate and execute a shell command or script that mechanically extracts the actual case labels/branches from the function body **before writing the contract line**. Save the raw output to `quality/mechanical/_cases.txt`. The command must be a non-interactive pipeline (e.g., `awk` + `grep`) that cannot hallucinate — it reads file bytes and prints matches. Example: ```bash awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c \ | grep -E '^\s*case\s+' > quality/mechanical/vring_transport_features_cases.txt ``` After execution, read the output file and use it as the sole source of truth for what the function handles. A contract line asserting "function preserves constant X" is **forbidden** unless `quality/mechanical/_cases.txt` contains a matching `case X:` line. If a constant appears in a spec or header but NOT in the mechanical output, the contract must record it as absent: `"must handle X — **[NOT IN CODE]**: defined in header.h:NN but absent from function() per mechanical check."` Downstream artifacts (`REQUIREMENTS.md`, `RUN_SPEC_AUDIT.md`, code review) must cite the mechanical file path when referencing dispatch-function coverage — they may not replace the mechanical output with a hand-written list. **Mechanical artifact integrity check (mandatory).** For each mechanical extraction command, also append it to `quality/mechanical/verify.sh` as a verification step. The script must re-run the same extraction pipeline and diff the result against the saved file. Generate `verify.sh` with this structure: ```bash #!/bin/bash # Auto-generated: re-run mechanical extraction commands and verify saved artifacts set -euo pipefail FAIL=0 # Verify ACTUAL=$(awk '/void vring_transport_features/,/^}$/' drivers/virtio/virtio_ring.c | grep -nE '^\s*case\s+') SAVED=$(cat quality/mechanical/vring_transport_features_cases.txt) if [ "$ACTUAL" != "$SAVED" ]; then echo "MISMATCH: vring_transport_features_cases.txt" diff <(echo "$ACTUAL") <(echo "$SAVED") || true FAIL=1 else echo "OK: vring_transport_features_cases.txt" fi exit $FAIL ``` **Phase 6 must execute `bash quality/mechanical/verify.sh`** and the benchmark fails if any artifact mismatches. This catches a failure mode observed in v1.3.19: the model executed the extraction command but wrote its own expected output to the file instead of letting the shell redirect capture it, inserting a hallucinated `case VIRTIO_F_RING_RESET:` line that the real command does not produce. Re-running the same command in a separate step and diffing against the file detects this tampering. **Immediate integrity gate (mandatory, Phase 2a).** Run `bash quality/mechanical/verify.sh` **immediately** after writing each `*_cases.txt` file and **before** writing any contract, requirement, or prose artifact that cites the extraction. If exit code ≠ 0: stop, delete the failed `*_cases.txt`, re-run the extraction command with a fresh shell redirect (do not hand-edit the output), and re-verify. Do not advance to Phase 3/2c until verify.sh exits 0. Save verify.sh stdout and exit code to `quality/results/mechanical-verify.log` and `quality/results/mechanical-verify.exit` as durable receipt files. 
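A minimal sketch of the receipt capture (run it without `set -e` so a failing check is recorded rather than aborting the shell):

```bash
mkdir -p quality/results
bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1
rc=$?
echo "$rc" > quality/results/mechanical-verify.exit
# rc must be 0 before any contract, requirement, or prose artifact cites the extraction
```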
This gate exists because v1.3.23 showed that deferring verification to Phase 6 allows downstream artifacts (CONTRACTS.md, REQUIREMENTS.md, triage probes) to build on a forged extraction — the model reconciles a discrepancy between requirements and the artifact by corrupting the artifact instead of correcting the requirement. **Mechanical artifacts are immutable after extraction.** Once a `*_cases.txt` file has been written by the shell redirect and verified by `verify.sh`, it must not be modified, overwritten, or regenerated for the remainder of the run. If a downstream step discovers a discrepancy between the mechanical artifact and a requirement or contract, the requirement or contract is wrong — not the artifact. Fix the prose, not the extraction. This rule prevents the v1.3.23 failure mode where the model overwrote a correct extraction with fabricated content to match its own narrative. **Forbidden probe pattern (triage and verification).** Triage probes, verification probes, and audit assertions must not use `open('quality/mechanical/...')` or `cat quality/mechanical/...` as sole evidence for what a source file contains at a given line. To verify that function F handles constant C at line N, the probe must either: (a) read the source file directly (`open('drivers/virtio/virtio_ring.c')` with line-anchored assertions), or (b) re-execute the same extraction pipeline used by `verify.sh` and check its output. Reading the saved artifact proves only what the artifact says, not what the code says — this is circular verification. In v1.3.23, Probe C validated the forged artifact instead of the source code, passing with fabricated data. **Do not create an empty mechanical/ directory.** Only create `quality/mechanical/` if the project's contracts include dispatch functions, registries, or enumeration checks that require mechanical extraction. If no such contracts exist, skip the directory entirely and record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` Creating an empty mechanical/ directory (or one without verify.sh) is non-conformant — it signals that extraction was attempted and abandoned. Decide before creating the directory: does this project have dispatch-function contracts? If no, don't `mkdir`. If yes, populate it fully. **Normative vs. descriptive split.** Requirements and contracts must use normative language ("must preserve," "should handle") for expected behavior. They may only use descriptive language ("preserves," "handles") when the mechanical verification artifact confirms the claim. A requirement that says "the implementation preserves VIRTIO_F_RING_RESET" without a confirming mechanical artifact is non-conformant — write "the implementation **must** preserve VIRTIO_F_RING_RESET" and cite the mechanical check result showing whether the constant is currently present or absent. 3. **Phase C — Coverage verification.** Cross-reference every contract against every requirement. Fix gaps. Loop up to 3 times until coverage reaches 100%. Write to `quality/COVERAGE_MATRIX.md`. The matrix must have **one row per requirement** (REQ-001, REQ-002, etc.) — not grouped ranges like "C-001 to C-007 | REQ-001, REQ-003". Grouped ranges make machine verification impossible and hide gaps. 4. **Phase D — Completeness check + self-refinement loop.** Apply the domain checklist, testability audit, and cross-requirement consistency check. 
Also verify that every deep document with a "will cover" commitment in the coverage commitment table has at least one requirement traced to it — if not, add requirements for the gap before continuing. Write to `quality/COMPLETENESS_REPORT.md` as a **baseline** completeness report (without a `## Verdict` section — the verdict is deferred to Phase 5 post-reconciliation, which produces the only verdict that counts for closure). Then run up to 3 self-refinement iterations: read the report, fix gaps, re-check. Short-circuit when fewer than 3 changes per iteration. 5. **Phase E — Narrative pass.** Add project overview (with overview validation gate), then derive use cases (with use case derivation gate). Both gates must pass before proceeding to category narratives, cross-cutting concerns, and final reordering. This sequencing prevents multi-pass loops where a failed late gate forces re-derivation. Reorder for top-down flow. Renumber sequentially. **REQUIREMENTS.md must begin with a human-readable overview** that answers: What is this project? What does it do? Who are the actors (users, systems, hardware, protocols)? What are the highest-risk areas? This overview should be useful to someone who has never seen the project before. If the project is a library or driver where all actors are systems, describe the system actors (kernel maintainers, protocol peers, integrators, end-user developers) and their interactions. Do not start with raw scope metadata or HTML comments — lead with a plain-language description. **Overview validation gate (mandatory).** After writing the overview, perform this self-check before proceeding to use case derivation: > Does this overview describe the project the way its actual users would recognize it? Specifically: > - Does it name the project's ecosystem role and real-world significance? > - Does it identify who depends on it and for what? > - Would a developer who uses this project daily say "yes, that's what it is and why it matters"? > - For well-known projects, does it reflect publicly known adoption (e.g., Cobra → kubectl/Hugo/GitHub CLI; Express → millions of Node.js API servers; Zod → form validation/tRPC; Serde → the default Rust serialization layer)? If the overview reads like it was written by someone who only read the source code and never used the software, revise it before proceeding. The overview sets the frame for everything downstream — feature-oriented use cases and internally focused requirements are symptoms of an overview that only describes the code, not the project. **Use case derivation (mandatory, runs after overview gate).** Derive 5–7 use cases from the validated overview and gathered documentation, then validate them against the code. Each use case must: - Describe a **real user outcome**, not a code feature. "Developer builds a CLI tool with nested subcommands, persistent flags, and shell completion" — not "Framework supports command trees." - Name a **concrete actor** and what they are trying to accomplish. Actors include end-user developers, system administrators, kernel maintainers, protocol peers, integrators, and automated consumers. - Be **recognizable to an actual user** of the software. For well-known projects, validate use cases against the model's own knowledge of the project, community docs, tutorials, and real-world adoption patterns. - Connect to at least one requirement through testable conditions of satisfaction. 
The pipeline should explicitly ask: "Based on this project's overview, gathered documentation, and known user base, what are the 5–7 most important things real users do with this software?" Derive use cases from that question — not from scanning the code and grouping features into categories. **Use case validation against code:** After deriving use cases from the overview and docs, verify each one against the codebase. If a use case describes something the code doesn't actually support, revise or remove it. If the code supports an important user outcome that no use case covers, add one. The goal is use cases that are both user-recognizable AND code-grounded. **Acceptance criteria span check (mandatory, runs after use case derivation).** After use cases are finalized and validated against code, check whether the conditions of satisfaction across all requirements collectively span the project's main behaviors: > Do these acceptance criteria, taken together, cover the project? Is there a major user-facing behavior described in the overview or use cases that no requirement's conditions of satisfaction would catch if it broke? For each use case, at least one requirement's conditions of satisfaction must be traceable to it, and at least one linked requirement must be `specific` (not `architectural-guidance`). Use cases with no linked specific requirements indicate a gap. When gaps are found, either: (a) add new requirements or sharpen existing conditions to cover the gap, or (b) revise the use case if it doesn't reflect what the requirements actually protect. Record the results of this check in the completeness report. Follow the use cases with the individual requirements. **v1.5.2 tier and citation scheme (schemas.md §3.1, §5).** Every REQ carries a `tier` integer 1–5 per `schemas.md` §3.1: - **Tier 1** — project's own formal spec (a `FORMAL_DOC` record with `tier=1`; highest authority). - **Tier 2** — external formal standard (RFC, W3C, ISO, published API contract — a `FORMAL_DOC` record with `tier=2`). - **Tier 3** — source-of-truth code when no formal spec exists; the code IS the spec. - **Tier 4** — informal documentation loaded by `bin/reference_docs_ingest.load_tier4_context` from top-level `reference_docs/` (AI chats, design notes, retrospectives). - **Tier 5** — inferred from code behavior with no documentation backing. For `tier ∈ {1, 2}`, the REQ also carries a `citation` block per `schemas.md` §5 with `document`, `document_sha256`, at least one of `section`/`line`, and a mechanically-extracted `citation_excerpt`. Do NOT write the excerpt by hand. The excerpt is produced at ingest time by `bin/reference_docs_ingest` invoking `bin/citation_verifier` per the deterministic algorithm in `schemas.md` §5.4 (with section resolution per §5.5) — the LLM consumes the excerpt from `formal_docs_manifest.json`; it never shells out to the verifier directly. Ingest-time extraction is how Layer 1 of the hallucination gate works. If you cannot cite a document in `quality/formal_docs_manifest.json` (with hash and locator), the REQ is at most Tier 3. `page`-only locators are diagnostic-only and are never sufficient. **Tier-1-wins-over-Tier-2 rule.** When a project's own spec (Tier 1) and an external standard (Tier 2) contradict each other, record the REQ at Tier 1 citing the project's position. A project's documented deviation from an external standard is authoritative intent, not a defect — the `upstream-spec-issue` disposition applies only when the project's spec is silent on the conflict. 
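As a cheap pre-gate spot check, a jq query can flag Tier 1/2 REQs whose citation block is incomplete. This is a sketch only; the authoritative field names live in schemas.md §5/§6 and the authoritative validation is `quality_gate.py`:

```bash
# Sketch: list REQ ids at tier 1 or 2 that lack a document, hash, or section/line locator.
# Field names (id, tier, citation.*) are assumed from schemas.md; adjust if they differ.
jq -r '.records[]
       | select(.tier == 1 or .tier == 2)
       | select(((.citation.document // "") == "")
                or ((.citation.document_sha256 // "") == "")
                or (((.citation.section // "") == "") and ((.citation.line // null) == null)))
       | .id' quality/requirements_manifest.json
```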
**Spec-Gap degradation (valid output state).** If `formal_docs_manifest.json` contains zero `FORMAL_DOC` records covering the project's own behavior, every REQ ends up at Tier 3/4/5 and the run degrades gracefully into a Spec Gap Analyzer. Report the meta-finding "0 Tier 1/2 requirements" in the completeness report as a metric, not a failure. Do NOT fabricate citations to make the tier distribution look richer — `quality_gate.py` re-invokes `bin/citation_verifier` (via `extract_excerpt`) per §5.4 at verification time and rejects any Tier 1/2 REQ whose `citation_excerpt` does not byte-equal the fresh extraction (schemas.md §10 invariant #11). **`functional_section` is a required field.** Every REQ carries a short `functional_section` string (e.g., `"Authentication"`, `"Bus enumeration"`) that groups related REQs. This is LLM-derived from the code and documentation; there is no predefined ontology. Phase 2's rendering groups REQs under these sections (with a short intro paragraph per section) and the Phase 4 Council reviews the grouping for coherence. See `schemas.md` §6.1. **Traceability is one-way: REQ → UC.** The REQ carries a `use_cases[]` list of UC-NN IDs. The UC record does NOT carry a `requirements[]` back-link — the reverse direction is derived at render time by querying REQ records for matching entries (schemas.md §7). Do not populate a `requirements[]` field on UC records. **For each requirement, provide all of these fields:** - **ID**: `REQ-NNN` (zero-padded three-digit sequence). - **Title**: Short, one-line statement. - **Tier**: Integer 1–5 per schemas.md §3.1. - **Functional section**: Short LLM-derived string (see above). - **Citation** (required when `tier ∈ {1, 2}`): produced by `bin/reference_docs_ingest` invoking `bin/citation_verifier`; never hand-authored and never invoked directly by the LLM. Shape per schemas.md §5.1. - **Summary / Description**: State the requirement as a testable assertion: "X must satisfy Y" or "When A, the system must B". - **User story**: Frame it from the caller's perspective: "As a [role] doing [action], I expect [behavior] **so that** [outcome]." The "so that" clause is mandatory — it forces you to articulate the intent behind the requirement. - **Implementation note**: How the code achieves this requirement — the mechanism, the relevant code paths, the design choice. - **Conditions of satisfaction**: Specific, testable scenarios that prove this requirement is met. Include the happy path, edge cases, and failure modes. Each individual contract from Phase A that was grouped into this requirement becomes a condition of satisfaction. - **Alternative paths**: Multiple code paths, modes, or entry points that must all satisfy the requirement. Alternative paths are where bugs hide. - **Use cases**: `use_cases[]` — list of `UC-NN` IDs this REQ participates in. One-way forward link. - **References**: Cite the source — spec section, ChangeLog entry, config field definition, source comment, issue number, or domain knowledge. For Tier 1/2 REQs the `citation` block carries the authoritative locator; free-form references are supplementary. 
- **Specificity**: **specific** (testable — must have conditions of satisfaction that a code reviewer can check against a specific code location or behavior; this is the default and counts toward coverage metrics) or **architectural-guidance** (not testable against individual code paths — covers cross-cutting properties like "remain lightweight and stdlib-compatible" or "no_std support"; informs the quality constitution but is not counted in coverage metrics; most projects should have 0–3 architectural-guidance requirements — more than 3 triggers the mandatory self-check below). The category "directional" is retired. Any requirement that would have been "directional" must either be made specific (with testable conditions) or explicitly classified as architectural-guidance. **Architectural-guidance self-check (mandatory, runs after requirement derivation).** Count the requirements tagged `architectural-guidance`. Apply both bounds: - **Maximum bound (>3):** If the count exceeds 3, stop and re-examine each one. For each, ask: "Can I add a testable condition of satisfaction that a code reviewer could verify against a specific code location?" If yes, reclassify it as `specific` and add the condition. Only requirements that genuinely cannot be verified against any specific code path should remain `architectural-guidance`. A final count above 3 requires an explicit justification per excess requirement explaining why it cannot be made specific. - **Minimum bound (0 on 15+ requirements):** If the total requirement count is 15 or more and the `architectural-guidance` count is 0, re-examine the requirements for cross-cutting design invariants. Libraries that span protocol layers, manage resource lifecycles, enforce ordering guarantees, or maintain compatibility contracts (e.g., "remain stdlib-compatible," "preserve no_std support," "maintain wire-format backward compatibility") typically have 1–3 architectural-guidance requirements. Write one sentence in the completeness report explaining why no requirement qualified as architectural-guidance, or reclassify the appropriate requirements. Record the count and any reclassifications in the completeness report. **Do not cap the requirement count.** Derive as many as the project warrants. A small utility might have 20. A mature library might have 100+. The goal is completeness. **Step 7a: Documentation-to-requirement reconciliation** Re-read the coverage commitment table from PROGRESS.md. For each deep document you committed to covering ("will cover in Phase 2"), verify that at least one requirement traces to the subsystem it documents. If your requirements cover only some committed subsystems, add requirements for the gaps before completing Step 7. For each subsystem, record one of the following in PROGRESS.md: - the requirement IDs that cover it, or - an explicit exclusion with rationale, risk acknowledgment, and recommended follow-up A deep-documented subsystem with a "will cover" commitment and zero mapped requirements is a process failure, not a legitimate scope choice. Do not proceed to artifact generation until every commitment is satisfied or explicitly converted to a justified exclusion. **Step 7b: Code-path → REQ reverse traceability audit (mandatory)** **Timing: Execute Step 7a and 7b after Phase E completes** (i.e., after the overview validation gate, use case derivation, and acceptance criteria span check have all run). The audit depends on finalized requirements AND finalized use cases. 
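For the architectural-guidance self-check earlier in this step, the counts are easy to pull mechanically. A sketch, assuming the manifest records carry the classification under a `specificity` field (confirm the actual field name against schemas.md §6 before relying on it):

```bash
# Sketch: count architectural-guidance requirements vs. the total.
jq -r '[.records[] | select(.specificity == "architectural-guidance")] | length' \
  quality/requirements_manifest.json
jq -r '.records | length' quality/requirements_manifest.json
# More than 3, or 0 when the total is 15 or more, triggers the re-examination above.
```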
After requirements derivation is complete, run a reverse traceability audit. Forward traceability (gathered docs → requirements → bugs → tests) is already built into the pipeline. This step checks the reverse direction at code-path granularity: do significant code paths map back to requirement conditions? This is an audit activity — NOT a structural bidirectional link. (Structural traceability in v1.5.2 is one-way REQ → UC per `schemas.md` §7 and is enforced by schema; this audit checks code coverage against REQs, which is a separate concern.) This operates at **path/branch/helper granularity**, not file level. File-level coverage was 100% in v1.3.13 and still missed two real bugs. The question is not "does this file map to some requirement?" but "does this significant branch map to a requirement clause that states what must be preserved here?" **Scoped to four categories** (not an open-ended branch audit): 1. **Alternative paths already named in requirements.** If a requirement mentions fallback or alternative paths (e.g., "primary vs. degraded mode," "negotiated vs. default configuration," "sync vs. async"), each alternative must have an explicit **symmetry condition** — a statement of what invariant must hold across both paths. A requirement that says "the system handles both X and Y" without specifying what "handles" means for each is incomplete. 2. **Helpers that translate public constants into runtime behavior.** If a helper function whitelists, filters, or translates between defined constants and runtime behavior (e.g., feature flag gates, codec registry lookups, capability whitelist helpers), it must have a helper-specific requirement enumerating the expected preserved/translated values. 3. **Capability-negotiation and fallback logic.** Code paths where the system negotiates capabilities with an external peer (protocol version negotiation, feature detection, graceful degradation) must have requirements covering both the negotiated-up and negotiated-down paths. 4. **Functions named in prior BUGS.md, VERSION_HISTORY.md, or spec audit outputs.** If a previous run found a bug in a specific function, future runs must show explicit re-check evidence for that function ("known bug class sentinels"). This prevents the "lost requirement" regression class. If prior spec audit outputs exist in `quality/spec_audits/`, read them before running the sentinel check — cross-model findings from council reviews are a high-value source of known bug surfaces. For each category, check whether the requirements contain specific conditions covering the identified paths. Orphaned paths — significant code paths without requirement coverage — trigger a "coverage gap" marker in the completeness report. These gaps must be resolved (by adding requirement conditions or by providing explicit justification) before the completeness report can declare requirements sufficient. **Carry-forward rule:** When a prior run's REQUIREMENTS.md exists in the quality directory, the pipeline must read it and check whether any conditions from the prior version were dropped. If conditions were dropped, the pipeline must either: (a) re-derive them with updated justification, or (b) document why the condition is no longer relevant. Silent drops are not permitted — they are a direct cause of regressions where previously learned requirements are lost between runs. **After the pipeline:** Phase 7 can generate `quality/REVIEW_REQUIREMENTS.md` (interactive review protocol) and `quality/REFINE_REQUIREMENTS.md` (refinement pass protocol). 
These are not Phase 2 artifacts — they support the Phase 7 interactive improvement paths. The user can review requirements interactively, run refinement passes with different models, and keep versioned backups of each iteration. See `references/requirements_pipeline.md` for the full versioning protocol and backup structure. Record all requirements in a structured format. These feed directly into the code review protocol's verification and consistency passes. ### Checkpoint: Initialize PROGRESS.md After completing Phase 1 exploration, create `quality/PROGRESS.md`. This file is the skill's external memory — it persists state across phases so that context is never lost, even if the session is interrupted and resumed. It also serves as an audit trail for debugging and improvement. **Why this exists:** In single-session runs, the agent holds context in memory. But context degrades over long sessions — findings from Phase 1 are forgotten by Phase 6, BUG counts drift, spec-audit bugs get orphaned because the closure check never saw them. PROGRESS.md solves this by making every phase write its state to disk. The agent reads it back before each phase, so it always has an accurate picture of what happened so far. As a side benefit, it makes the skill work correctly even if the run is split across multiple sessions. **Checkpoint discipline for long runs:** After each requirements-pipeline phase (Contracts, Requirements, Coverage Matrix, Completeness, Narrative), update `quality/PROGRESS.md` with: completed phase, artifact paths, current scoped subsystems, remaining work, and exact resume point. This ensures a resumed session can continue from the last completed checkpoint without redoing work. **Timestamp discipline:** Write each phase completion entry to PROGRESS.md immediately when you finish that phase, before starting the next phase. Do not batch-write or back-fill timestamps after the fact. The timestamps are an audit trail — if Phase 2 shows a completion time earlier than Phase 1, a reviewer cannot verify that phases ran in the correct sequence. If you realize you forgot to write a checkpoint, write it now with an honest timestamp and a note explaining the gap. 
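For example (illustrative only), capture the completion time the moment the phase ends and write the entry in the same step, rather than reconstructing it later:

```bash
# Append an honest, at-completion checkpoint note to PROGRESS.md.
printf -- '- Checkpoint: Phase 2 artifact generation completed %s\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> quality/PROGRESS.md
```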
Write the initial PROGRESS.md: ```markdown # Quality Playbook Progress ## Run metadata Started: [date/time] Project: [project name] Skill version: [read from SKILL.md metadata using the reference file resolution order — must match exactly] With docs: [yes/no] ## Phase completion - [x] Phase 1: Exploration — completed [date/time] - [ ] Phase 2: Artifact generation (QUALITY.md, REQUIREMENTS.md, tests, protocols, RUN_TDD_TESTS.md, AGENTS.md) - [ ] Phase 3: Code review + regression tests - [ ] Phase 4: Spec audit + triage - [ ] Phase 5: Post-review reconciliation + closure verification - [ ] TDD logs: red-phase log for every confirmed bug, green-phase log for every bug with fix patch - [ ] Phase 6: Verification benchmarks - [ ] Phase 7: Present, Explore, Improve (interactive) ## Artifact inventory | Artifact | Status | Path | Notes | |----------|--------|------|-------| | QUALITY.md | pending | | | | REQUIREMENTS.md | pending | | | | CONTRACTS.md | pending | | | | COVERAGE_MATRIX.md | pending | | | | COMPLETENESS_REPORT.md | pending | | | | Functional tests | pending | | | | RUN_CODE_REVIEW.md | pending | | | | RUN_INTEGRATION_TESTS.md | pending | | | | BUGS.md | pending | | | | RUN_TDD_TESTS.md | pending | | | | RUN_SPEC_AUDIT.md | pending | | | | AGENTS.md | pending | | | | tdd-results.json | pending | quality/results/ | Structured TDD output | | integration-results.json | pending | quality/results/ | Structured integration output | | Bug writeups | pending | quality/writeups/ | One per TDD-verified bug | ## Cumulative BUG tracker | # | Source | File:Line | Description | Severity | Closure Status | Test/Exemption | |---|--------|-----------|-------------|----------|----------------|----------------| ## Terminal Gate Verification ## Exploration summary [Brief notes on architecture, key modules, spec sources, defensive patterns found] ``` Update this file after every phase. The cumulative BUG tracker is the most important section — it ensures no finding is orphaned regardless of which phase produced it. ### Write exploration findings to disk After initializing PROGRESS.md, write your full exploration findings to `quality/EXPLORATION.md`. This file captures everything you learned in Phase 1 so it can survive a context boundary (session break, multi-pass handoff, or long-run memory degradation). Structure it as: ```markdown # Exploration Findings ## Domain and Stack [Language, framework, build system, deployment target] ## Architecture [Key modules with file paths, entry points, data flow, layering] ## Existing Tests [Test framework, test count, coverage areas, gaps] ## Specifications [What docs_gathered/ contains, key spec sections, behavioral rules] ## Open Exploration Findings [At least 8 concrete findings from domain-driven investigation. Each must have a file path, line number, and specific bug hypothesis. At least 4 must reference different modules or subsystems. At least 3 must trace a behavior across 2+ functions.] ## Quality Risks [At least 5 domain-driven failure scenarios ranked by priority. Each must name a specific function, file, and line and explain the failure mechanism using domain knowledge of what goes wrong in systems like this. These are hypotheses, not confirmed bugs — they tell Phase 2 where to look. Frame each as: "Because [code at file:line] does [X], a [domain-specific edge case] will produce [wrong behavior] instead of [correct behavior]." A section that lists defensive patterns the code already has does NOT belong here.] 
## Skeletons and Dispatch [State machines, dispatch tables, feature registries — with file:line citations] ## Pattern Applicability Matrix | Pattern | Decision (`FULL` / `SKIP`) | Target modules | Why | |---|---|---|---| | Fallback and Degradation Path Parity | | | | | Dispatcher Return-Value Correctness | | | | | Cross-Implementation Consistency | | | | | Enumeration and Representation Completeness | | | | | API Surface Consistency | | | | | Spec-Structured Parsing Fidelity | | | | [3 to 4 patterns must be marked FULL. The rest are SKIP with codebase-specific rationale. Select 4 when a fourth pattern clearly applies and covers different code areas.] ## Pattern Deep Dive — [Pattern Name] [Use the output format from `exploration_patterns.md`. Trace the relevant code path across 2+ functions, implementations, or API surfaces. Each deep dive should pressure-test, refine, or extend findings from the open exploration and quality risks stages.] ## Pattern Deep Dive — [Pattern Name] [Repeat for each selected FULL pattern. 3 to 4 deep-dive sections total.] ## Pattern Deep Dive — [Pattern Name] [Third and final deep dive.] ## Candidate Bugs for Phase 2 [Consolidated from ALL earlier sections — open exploration, quality risks, AND patterns. Minimum 4 candidates with file:line references. At least 2 from open exploration or quality risks, at least 1 from a pattern deep dive. For each candidate include the source stage and what the Phase 2 code review should inspect.] ## Derived Requirements [REQ-001 through REQ-NNN, each with spec basis and tier] ## Derived Use Cases [UC-01 through UC-NN, each with actor, trigger, expected outcome] ## Notes for Artifact Generation [Anything the next phase needs to know — naming conventions, test patterns, framework quirks] ## Gate Self-Check [Written by the Phase 1 gate. Each check 1–12 with PASS/FAIL and one-line evidence. This section proves the gate was executed. Do not write this section until you have actually verified each check against the file contents.] ``` **Minimum depth expectation:** EXPLORATION.md must contain at least 120 lines of substantive content — not padding or boilerplate headers, but actual findings (file paths, behavioral rules, derived requirements, architecture observations). A skeleton that lists section headers with one-line placeholders is not a valid handoff artifact. If the file is thinner than this, go back and add the detail Phase 2 will need. **Re-read after writing (mandatory).** After writing EXPLORATION.md, explicitly read the file back from disk before proceeding to Phase 2. This serves two purposes: (1) it confirms the file was written correctly, and (2) it loads the structured findings into working memory for artifact generation. Do not skip this step and rely on what you remember writing — the "write then read" cycle is the context bridge. This file is essential in all modes. In single-pass mode it forces the model to articulate specific findings (file paths, function names, line numbers) before generating artifacts. In multi-pass mode it is also the handoff artifact between passes. Either way, the write-then-read cycle is the quality gate for exploration depth. **Phase 1 completion gate (mandatory — STOP HERE before Phase 2).** You MUST execute this gate before proceeding to Phase 2. This is not optional. Re-read `quality/EXPLORATION.md` from disk and run every check below. 
After checking, append a `## Gate Self-Check` section to the bottom of EXPLORATION.md that lists each check number (1–12) with PASS or FAIL and a one-line evidence note. If any check fails, fix EXPLORATION.md and re-run the gate. Do not proceed to Phase 2 until all checks pass AND the Gate Self-Check section is written to disk. **Common gate-bypass failure mode:** In v1.3.43 benchmarking, two repos (chi, zod) produced EXPLORATION.md files with completely wrong section structure — sections like "Architecture summary", "Behavioral contracts", "Repository and architecture map" instead of the required sections. The model never ran the gate checks and proceeded directly to Phase 2, producing zero bugs. If your EXPLORATION.md does not contain sections with the EXACT titles listed below, it is non-conformant and must be rewritten before proceeding. 1. The file exists on disk and contains at least 120 lines of substantive content. 2. `quality/PROGRESS.md` exists and marks Phase 1 complete. 3. The Derived Requirements section contains at least one REQ-NNN with specific file paths and function names — not abstract subsystem descriptions. 4. A section titled **exactly** `## Open Exploration Findings` exists and contains at least 8 concrete bug hypotheses or suspicious findings, each with a file path and line number. These must come from domain-driven investigation, not just from applying patterns. At least 4 must reference different modules or subsystems. 5. **Open-exploration depth check:** At least 3 findings in `## Open Exploration Findings` must trace a behavior across 2 or more functions or 2 concrete code locations. A list of isolated single-function suspicions is not sufficient depth. 6. A section titled **exactly** `## Quality Risks` exists and contains at least 5 domain-driven failure scenarios ranked by priority. Each scenario must: (a) name a specific function, file, and line, (b) describe a domain-specific edge case or failure mode, and (c) explain why the code produces wrong behavior. These must come from domain knowledge about what goes wrong in systems like this one — not from structural analysis of the code alone. A section that lists defensive patterns the code already has (things the code does RIGHT) does not satisfy this gate. A section that lists risky modules without specific failure scenarios does not satisfy this gate. A section that concludes the library is mature and unlikely to have basic bugs does not satisfy this gate. 7. A section titled **exactly** `## Pattern Applicability Matrix` exists and evaluates all six patterns from `exploration_patterns.md`, marking each as `FULL` or `SKIP` with target modules and codebase-specific rationale. 8. Between 3 and 4 patterns (inclusive) are marked `FULL` in the applicability matrix. 9. There are between 3 and 4 sections (inclusive) whose titles begin with `## Pattern Deep Dive — `. Each must contain concrete file:line evidence, not just pattern-name placeholders. The count must match the number of `FULL` patterns in the matrix. 10. **Pattern depth check:** At least 2 of the pattern deep-dive sections must trace a code path across 2 or more functions. A section that says "function X at file:line has a gap" is a surface finding. A section that says "function X at file:line calls function Y at file:line, which does A but not B; compare with function Z which does both" is a depth finding. 11. 
A section titled **exactly** `## Candidate Bugs for Phase 2` exists and contains at least 4 prioritized bug hypotheses with file:line references, the stage that surfaced each one (open exploration, quality risks, or pattern), and what the code review should look for. 12. **Ensemble balance check:** At least 2 candidate bugs must originate from open exploration or quality risks, and at least 1 must originate from or be materially strengthened by a pattern deep dive. This ensures both domain-knowledge and structural-analysis findings flow into Phase 2. Do not begin Phase 2 until all twelve checks pass AND the `## Gate Self-Check` section is written to EXPLORATION.md on disk. Phase 1 is your only chance to understand the codebase deeply. Every requirement you miss here is a bug you will not find in Phase 3. Invest the time. **If you find yourself about to start Phase 2 without having written a Gate Self-Check section, STOP.** Go back and run the gate. This instruction exists because models reliably skip the gate when they feel confident about their exploration — and that confidence is precisely when bugs are missed. **End-of-phase message (mandatory — print this after Phase 1 completes, then STOP):** ``` # Phase 1 Complete — Exploration I've finished exploring the codebase and written my findings to `quality/EXPLORATION.md`. [Summarize: how many candidate bugs, which subsystems explored, key risks identified.] To continue to Phase 2 (Generate quality artifacts), say: Run quality playbook phase 2. Or say "keep going" to continue automatically. ``` **After printing this message, STOP. Do not proceed to Phase 2 unless the user explicitly asks.** --- ## Phase 2: Generate the Quality Playbook > **Required references for this phase** — read these before proceeding: > - `quality/EXPLORATION.md` — your Phase 1 findings (architecture, requirements, use cases, pattern analysis) > - `references/requirements_pipeline.md` — five-phase pipeline for requirement derivation > - `references/defensive_patterns.md` — grep patterns for finding defensive code > - `references/schema_mapping.md` — field mapping format for schema-aware tests > - `references/constitution.md` — QUALITY.md template > - `references/functional_tests.md` — test structure and anti-patterns > - `references/review_protocols.md` — code review and integration test templates **Phase 2 entry gate (mandatory — HARD STOP).** Before generating any artifacts, read `quality/EXPLORATION.md` from disk and verify ALL of the following exact section titles exist (grep or search — do not rely on memory): 1. `quality/EXPLORATION.md` must have at least 120 lines — a shorter file indicates incomplete exploration 2. `## Open Exploration Findings` — must exist verbatim 3. `## Quality Risks` — must exist verbatim 4. `## Pattern Applicability Matrix` — must exist verbatim 5. At least 3 sections starting with `## Pattern Deep Dive — ` — must exist verbatim 6. `## Candidate Bugs for Phase 2` — must exist verbatim 7. `## Gate Self-Check` — must exist (proves the Phase 1 gate was run) 8. `quality/PROGRESS.md` exists and its Phase 1 line is marked `[x]` — proves Phase 1 was completed, not just started 9. The `## Open Exploration Findings` section contains at least 8 concrete bug hypotheses — count lines with file:line citations 10. At least 3 findings in `## Open Exploration Findings` trace behavior across 2+ functions — look for multi-location traces 11. Between 3 and 4 patterns are marked `FULL` in `## Pattern Applicability Matrix` — count FULL entries 12. 
At least 2 pattern deep-dive sections trace code paths across 2+ functions — look for multi-function traces 13. `## Candidate Bugs for Phase 2` has ≥2 bugs from open exploration/risks AND ≥1 from a pattern deep dive — check source stage labels If the file does not exist, has fewer than 120 lines, or is **missing ANY of these exact section titles**, STOP and go back to Phase 1. Do not attempt to proceed with "equivalent" sections under different names — the exact titles above are required. Write EXPLORATION.md now, starting with domain-driven open exploration, then domain-knowledge risk analysis, then selecting 3–4 patterns from `references/exploration_patterns.md` for deep dives. Do not proceed with Phase 2 until EXPLORATION.md passes the Phase 1 completion gate. This check exists because single-pass execution can skip the Phase 1 gate — this is the backstop. In v1.3.43, two repos bypassed both gates and produced zero bugs. Use `quality/EXPLORATION.md` as your primary source for this phase — do not re-explore the codebase from scratch. The exploration findings contain the architecture map, derived requirements, use cases, and risk analysis that drive every artifact below. If you find yourself reading source files to figure out what the project does, go back to EXPLORATION.md instead. Re-exploration wastes context and produces inconsistencies between what Phase 1 found and what Phase 2 generates. Now write the Phase 2 artifacts. The requirements pipeline above produced REQUIREMENTS.md, CONTRACTS.md, COVERAGE_MATRIX.md, and COMPLETENESS_REPORT.md. The seven files below complete the set. For each one, follow the structure below and consult the relevant reference file for detailed guidance. **Version stamp (mandatory on every generated file).** Every Markdown file the playbook generates must begin with the following attribution line immediately after the file's title heading: ``` > Generated by [Quality Playbook](https://github.com/andrewstellman/quality-playbook) v1.5.2 — Andrew Stellman > Date: YYYY-MM-DD · Project: ``` Every generated code file (test files, scripts) must begin with a comment header: ``` # Generated by Quality Playbook v1.5.2 — https://github.com/andrewstellman/quality-playbook # Author: Andrew Stellman · Date: YYYY-MM-DD · Project: ``` Use the comment syntax appropriate for the language (`#`, `//`, `/* */`, etc.). The version in the stamp must match the `metadata.version` in this skill's frontmatter. This stamp makes every generated artifact traceable back to the tool, version, and run that created it — essential when files are emailed, attached to tickets, or reviewed outside the repository context. Use the date the playbook generation started, not the date each individual file was written. **Stamp placement and exemptions:** - For Python files with an encoding pragma (`# -*- coding: utf-8 -*-`) or a shebang (`#!/usr/bin/env python`), place the stamp comment *after* the pragma/shebang, not before — pushing it past line 2 causes `SyntaxWarning`. - For sidecar JSON files (`tdd-results.json`, `integration-results.json`), the `skill_version` field already serves as the version stamp. JSON does not support comments — do not inject one. - For JUnit XML files, no stamp is needed — these are framework-generated. - For `.patch` files, do not inject a stamp into the diff body — it would break `git apply`. Rely on the surrounding artifact metadata (BUGS.md, tdd-results.json) for provenance. 
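A stamp consistency spot check is straightforward. A sketch, assuming SKILL.md is reachable at the install path you resolved earlier and that its frontmatter `version:` line is the first match:

```bash
# Sketch: confirm every generated Markdown artifact carries the current skill version.
VERSION=$(grep -m1 -E '^[[:space:]]*version:' SKILL.md | awk '{print $2}')
for f in quality/*.md; do
  grep -q "Quality Playbook.*v${VERSION}" "$f" || echo "missing or stale stamp: $f"
done
```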
**Artifact dependency rules:** - `quality/RUN_CODE_REVIEW.md` Pass 2 depends on a stable `quality/REQUIREMENTS.md` — thin requirements produce thin Pass 2 review. If the requirements count seems low for the code surface (fewer than ~3–4 requirements per core module), note this at the start of the Pass 2 report. - Functional tests depend on `quality/REQUIREMENTS.md` and `quality/QUALITY.md` — after any requirements refinement, re-verify that `test_functional.*` still covers every requirement. - `quality/RUN_SPEC_AUDIT.md` depends on requirements, quality scenarios, and docs validation. - `quality/COMPLETENESS_REPORT.md` has two stages: baseline (pre-review, no verdict section) and final (post-reconciliation in Phase 5, with the authoritative verdict). - `quality/PROGRESS.md` is the authoritative state file and must be updated before each downstream artifact begins. **Why nine files instead of just tests?** Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three. ### v1.5.2 JSON manifest discipline (read before writing any artifact) Phase 2 writes two parallel renderings of every derived record: a **JSON manifest** (machine-readable, gate-validated — the source of truth) and a **Markdown artifact** (human-readable, rendered from the manifest). A phase script that updates one without updating the other is a bug. Manifests live at: - `quality/formal_docs_manifest.json` — written by `bin/reference_docs_ingest.py` in Phase 1. Do not rewrite. - `quality/requirements_manifest.json` — authoritative REQ records per schemas.md §6. - `quality/use_cases_manifest.json` — authoritative UC records per schemas.md §7. - `quality/bugs_manifest.json` — authoritative BUG records per schemas.md §8. Written after Phase 3/4/5 confirm bugs. - `quality/citation_semantic_check.json` — Phase 4 Council Layer 2 verdicts (see Phase 4 below). Every manifest follows the §1.6 wrapper with `schema_version`, `generated_at`, and a top-level records array. The four record-shaped manifests (`formal_docs_manifest.json`, `requirements_manifest.json`, `use_cases_manifest.json`, `bugs_manifest.json`) use `records` as the array key: ```json { "schema_version": "", "generated_at": "", "records": [ /* per-schema records, per schemas.md §4–§8 */ ] } ``` **Exception — `citation_semantic_check.json` uses `reviews` instead of `records`** per `schemas.md` §9.1. Same wrapper shape, different array key; the records inside are Council review entries, not REQ/UC/BUG records. If you find yourself writing `records` into `citation_semantic_check.json`, stop and re-read schemas.md §9 — the gate rejects this file as a manifest-wrapper violation (schemas.md §10 invariant #13) when the key is wrong. `schema_version` MUST equal this skill's `metadata.version` at generation time — read it from SKILL.md, do not hardcode. `generated_at` uses `datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")`. 
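A wrapper-shape spot check before handing off to `quality_gate.py` catches the wrong-array-key mistake early. A sketch (the gate remains the authoritative validator):

```bash
# Sketch: verify wrapper keys on the record-shaped manifests and the one exception.
for f in formal_docs_manifest requirements_manifest use_cases_manifest bugs_manifest; do
  [ -f "quality/${f}.json" ] || continue
  jq -e 'has("schema_version") and has("generated_at") and has("records")' \
    "quality/${f}.json" > /dev/null || echo "wrapper violation: ${f}.json"
done
if [ -f quality/citation_semantic_check.json ]; then
  jq -e 'has("reviews") and (has("records") | not)' quality/citation_semantic_check.json \
    > /dev/null || echo "wrapper violation: citation_semantic_check.json must use \"reviews\""
fi
```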
Record shapes and invariants are defined in `schemas.md` — do not redefine them in this skill. `quality_gate.py` (Phase 5/6) validates manifests field-by-field against those schemas. **REQUIREMENTS.md rendering convention.** REQUIREMENTS.md is organized by `functional_section`. Each section opens with a brief LLM-written intro paragraph describing what that functional area does, then lists the REQs under it (ordered by REQ id). Use cases render their `formal_doc_refs` but DO NOT list `requirements[]` — traceability is one-way REQ → UC, and the reverse direction is derived at render time by querying REQ records. ### File 1: `quality/QUALITY.md` — Quality Constitution **Read `references/constitution.md`** for the full template and examples. The constitution has six sections: 1. **Purpose** — What quality means for this project, grounded in Deming (built in, not inspected), Juran (fitness for use), Crosby (quality is free). Apply these specifically: what does "fitness for use" mean for *this system*? Not "tests pass" but the actual operational requirement. 2. **Coverage Targets** — Table mapping each subsystem to a target with rationale referencing real risks. Every target must have a "why" grounded in a specific scenario — without it, a future AI session will argue the target down. 3. **Coverage Theater Prevention** — Project-specific examples of fake tests, derived from what you saw during exploration. (Why: AI-generated tests often pad coverage numbers without catching real bugs — asserting that imports worked, that dicts have keys, or that mocks return what they were configured to return. Calling this out explicitly stops the pattern.) 4. **Fitness-to-Purpose Scenarios** — The heart of it. Each scenario documents a realistic failure mode with code references and verification method. Aim for 2+ scenarios per core module — typically 8–10 total for a medium project, fewer for small projects, more for complex ones. Quality matters more than count: a scenario that precisely captures a real architectural vulnerability is worth more than three generic ones. (Why: Coverage percentages tell you how much code ran, not whether it ran correctly. A system can have 95% coverage and still lose records silently. Fitness scenarios define what "working correctly" actually means in concrete terms that no one can argue down.) 5. **AI Session Quality Discipline** — Rules every AI session must follow 6. **The Human Gate** — Things requiring human judgment **Scenario voice is critical.** Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this. Every scenario's "How to verify" must map to at least one test in the functional test file. ### File 2: Functional Tests **This is the most important deliverable.** Read `references/functional_tests.md` for the complete guide. 
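Before the detailed rules below, a minimal sketch of the shape these tests take; the module, function, requirement ID, and spec citation are hypothetical stand-ins rather than output from a real exploration:

```python
import json

from myproject.batch import save_state  # hypothetical: mirror the project's real import pattern


class TestSpecRequirements:
    def test_req_007_save_state_survives_reload(self, tmp_path):
        """REQ-007: save_state() must produce a file the next run can load.
        Spec basis (hypothetical): docs_gathered/state-format.md §3."""
        target = tmp_path / "state.json"
        save_state({"records": [1, 2, 3]}, target)
        # Assert the outcome the spec describes, not the implementation mechanism.
        assert json.loads(target.read_text())["records"] == [1, 2, 3]
```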
Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses): - **Spec requirements** — One test per testable spec section. Each test's documentation cites the spec requirement it verifies. - **Fitness scenarios** — One test per QUALITY.md scenario. 1:1 mapping, named to match. - **Boundaries and edge cases** — One test per defensive pattern from Step 5. Key rules: - **Match the existing import pattern exactly.** Read how existing tests import project modules and do the same thing. Getting this wrong means every test fails. - **Read every function's signature before calling it.** Read the actual `def` line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures. - **No placeholder tests.** Every test must import and call actual project code. If the body is `pass` or the assertion is trivial (`assert isinstance(x, list)`), delete it. A test that doesn't exercise project code inflates the count and creates false confidence. - **Test count heuristic** = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns). For a medium project (5–15 source files), this typically yields 35–50 tests. Significantly fewer suggests missed requirements or shallow exploration. Significantly more is fine if every test is meaningful — don't pad to hit a number. - **Cross-variant heuristic: ~30%** — If the project handles multiple input types, aim for roughly 30% of tests parametrized across all variants. The exact percentage matters less than ensuring every cross-cutting property is tested across all variants. - **Test outcomes, not mechanisms** — Assert what the spec says should happen, not how the code implements it. - **Use schema-valid mutations** — Boundary tests must use values the schema accepts (from Step 5b), not values it rejects. ### File 3: `quality/RUN_CODE_REVIEW.md` **Read `references/review_protocols.md`** for the template. The code review protocol has three passes. Each pass runs independently — a fresh session with no shared context except the requirements document. This clean separation prevents cross-contamination between structural review and requirement-based review. **Pass 1 — Structural Review.** Read the code and spot anomalies. This is what every AI code review tool already does well. No requirements, no focus areas — just the model's own knowledge of code correctness. Keep these mandatory guardrails: - Line numbers are mandatory — no line number, no finding - Read function bodies, not just signatures - If unsure: flag as QUESTION, not BUG - Grep before claiming missing - Do NOT suggest style changes — only flag things that are incorrect **Minimum required Pass 1 scrutiny areas (address each explicitly):** 1. **Input validation and boundary handling** — check every trust boundary where external or caller-supplied data enters the code. Every string parser, enum lookup, and binary-format parser must reject input that shares a valid prefix with a valid token but contains additional characters. 2. **Resource lifecycle** — allocation, refcount management, error-path cleanup, lock release on failure, file descriptor/handle lifetime. Every function that acquires a reference or resource must release it on every early-exit path, or must complete all validation before acquiring the resource. 3. 
**Concurrency and state management** — lock ordering, atomic operation correctness (every atomic modification of a shared state word must use read-modify-write semantics and preserve bits outside the intended modification), state machine completeness (all states handled at all consumers). 4. **Unit and encoding correctness** — every field read from hardware, protocol structures, or user input that has defined units must be converted correctly before use in calculations or comparisons. 5. **Enumeration and whitelist completeness** — when a function uses a `switch`/`case`, `match`, if-else chain, or any branching construct to handle a set of named constants (feature bits, enum values, event types, command codes, permission flags), perform a **mechanical enumeration check**: (a) **List A (code extraction):** If a `quality/mechanical/_cases.txt` artifact exists for this function, use it as the authoritative code-side list — do not re-extract manually. If no mechanical artifact exists, extract every branch/case label actually present in the code. List each with its exact line number: "line 3511: `case VIRTIO_RING_F_INDIRECT_DESC`", "line 3513: `case VIRTIO_RING_F_EVENT_IDX`", etc. **Extract this list from the code only — do not copy from REQUIREMENTS.md, CONTRACTS.md, or any other generated artifact.** If you cannot cite a line number for a case label, it is not present. (b) **List B (spec extraction):** List every constant defined in the relevant header, enum, or spec that *should* be handled. (c) **Diff:** Compare the two lists. For each constant in List B, mark it as "FOUND (line NNN)" or "NOT IN CODE." Report any constants that are defined but not handled. **Do not assert that a whitelist "covers all values" or "preserves supported bits" without performing this two-list comparison.** AI models reliably hallucinate completeness for switch/case constructs — the model sees the function, sees the constants defined elsewhere, and assumes coverage without checking each case label. The most dangerous form of this hallucination is copying from an upstream artifact (like REQUIREMENTS.md) that asserts a constant is present, rather than extracting from the code. In v1.3.17, the code review's "case labels present" list was word-for-word identical to the requirements list — proving it was copied rather than extracted. The mechanical check with per-label line numbers is the fix. These five areas must appear as labeled subsections in the Pass 1 report. If a project has no meaningful concurrency, say so explicitly and document why rather than omitting the section. Add project-specific scrutiny areas beyond these five as warranted. Pass 1 catches ~65% of real defects: race conditions, null pointer hazards, resource leaks, off-by-one errors, type mismatches — structural problems visible in the code. **Pass 2 — Requirement Verification.** For each testable requirement derived in Step 7 of Phase 1, check whether the code satisfies it. For each requirement, either show the code that satisfies it or explain specifically why it doesn't. This is a pure verification pass — the reviewer's only job is "does the code satisfy this requirement?" Not a general review. Not looking for other bugs. Just verification. **Minimum evidence rule:** Pass 2 must cite at least one code location (file:line or file:function) **per requirement**. Blanket satisfaction claims like "REQ-003 through REQ-012 — satisfied by the client paths reviewed during the pass" without per-requirement code citations do not satisfy Pass 2. 
If two or three requirements are satisfied by the same function, cite the function once and list those specific requirements — but each requirement must appear individually with its own SATISFIED/VIOLATED verdict, not as part of an unverified range. A group of more than three requirements under a single citation is a sign that the verification was superficial. The point is traceability — a reviewer reading Pass 2 should be able to follow the evidence chain from any single requirement to the code that satisfies it without re-reading the entire codebase. **Enumeration completeness claims require mechanical proof.** When evaluating a requirement that involves a whitelist, lookup table, feature-bit set, handler registry, or any claim of the form "all X are covered by Y," the reviewer must perform the two-list enumeration check from Pass 1 scrutiny area 5: extract every item from the code (with line numbers), extract every item from the spec, and diff. **The code-side list must be extracted fresh from the source — do not reuse any list from REQUIREMENTS.md, CONTRACTS.md, the code review prompt, or any other generated artifact.** If the code-side list matches the requirements list word-for-word, that is a sign the list was copied rather than extracted, and the check must be redone. Do not mark such a requirement SATISFIED based on reading the function and believing it handles everything — that is the specific hallucination pattern this rule prevents. Example: a requirement says "the transport feature whitelist must preserve all supported ring features." The reviewer reads `vring_transport_features()` and sees it has a switch/case. The correct check: extract each case label with its line number (`line 3511: INDIRECT_DESC`, `line 3513: EVENT_IDX`, ..., `line 3527: default`), then list the header constants, then diff. The hallucination: "the whitelist preserves supported bits including VIRTIO_F_RING_RESET" without checking that RING_RESET actually appears as a case label. This exact failure mode has been observed in practice across multiple versions — the model asserted coverage of a constant that was absent from the switch, and in v1.3.17, the code review's "case labels present" list was copied from the requirements rather than extracted from the code, causing three independent spec auditors to inherit the false claim. Pass 2 catches violations of individual requirements — cases where the code doesn't do what the specification says it should. This finds bugs that structural review misses because the code that IS there is correct; the bug is what's missing or what doesn't match the spec. **Pass 3 — Cross-Requirement Consistency.** Compare pairs of requirements that reference the same field, constant, range, or security policy. For each pair, verify that their constraints are mutually consistent. Do numeric ranges match bit widths? Do security policies propagate to all connection types? Do validation bounds in one file agree with encoding limits in another? Pass 3 catches contradictions where two individually-correct pieces of code disagree about a shared constraint. These bugs are invisible to both structural review and per-requirement verification because each requirement IS satisfied individually — the bug only appears when you compare them. This is the pass that catches cross-file arithmetic bugs and design gaps where a security configuration doesn't propagate to all connection paths. **Source code boundary rule:** The playbook must never modify files outside the `quality/` directory. 
All source-tree changes — bug fixes, test additions to the project's own test suite — must be expressed as `git diff`-format patch files saved under `quality/patches/`. This ensures the original source tree remains untouched, patches are reviewable and reversible, and the playbook's findings are cleanly separable from the code it audited. **BUGS.md:** After all review and audit phases, generate `quality/BUGS.md` — a consolidated bug report with full reproduction details for each confirmed bug. For each bug, include: bug ID, source (code review or spec audit), file:line, description, severity, minimal reproduction scenario (what input or sequence triggers the bug), expected vs actual behavior, references to the regression test and any proposed fix patch, and **spec basis**. **BUGS.md — v1.5.2 BUG record fields (schemas.md §8).** Every BUGS.md entry (and every `quality/writeups/BUG-NNN.md`) corresponds to a BUG record written to `quality/bugs_manifest.json`. In addition to the narrative conventions above, every BUG carries: - `id` — `BUG-NNN` zero-padded three-digit sequence. - `divergence_description` — one-paragraph summary of the divergence between documented intent and code behavior. - `documented_intent` — direct quote or close paraphrase of the REQ / spec language. - `code_behavior` — what the code actually does, with `file:line` references. - `disposition` — enum from schemas.md §3.2: `code-fix` | `spec-fix` | `upstream-spec-issue` | `mis-read` | `deferred`. Required, non-null. Do NOT invent new values. - `disposition_rationale` — one paragraph explaining why THIS disposition and not an adjacent one (e.g., why `code-fix` and not `upstream-spec-issue`, or why `spec-fix` and not `mis-read`). Formulaic rationales ("code is wrong because spec says so") fail the Council review. - `req_id` — singular. The primary REQ that revealed the divergence. If a bug appears to touch multiple REQs, split into one bug per REQ sharing a root cause and cross-link in `disposition_rationale` (schemas.md §8.1). Do NOT smuggle multiple REQ IDs into one entry. - `proposed_fix` — required unless `disposition == "mis-read"`. Patch-shaped when `fix_type` ∈ {`code`, `both`}; textual redline when `fix_type == "spec"`. For `mis-read` records the field is optional and, when present, documents the re-read (what the playbook misread and how the correct reading was established), not a shipped change. - `fix_type` — enum from schemas.md §3.4: `code` | `spec` | `both`. The combination of `disposition × fix_type` is constrained by the legal-combinations matrix in §3.4 (enforced by §10 invariant #12). Illegal pairings: `code-fix` × `spec`, `spec-fix` × `code`, `upstream-spec-issue` × `code`, `mis-read` × `both`. Consult the matrix before authoring the record — the gate rejects illegal combinations. **Divergence framing (writeup voice, schemas.md v1.5.2).** A defect is a divergence between documented intent and code implementation — not a judgment about whether the code is "good." The writeup's opening sections (Summary, Spec reference, The code) are the human rendering of `divergence_description`, `documented_intent`, and `code_behavior`. Write them to read as a **side-by-side diff**, not a narrative. The reader should be able to scan the REQ/spec language next to the code behavior and see the gap immediately. An adversarial tone ("the code is sloppy") or value judgment in the title ("Sloppy trailer handling") fails this framing — a bug is an observation of a divergence, not an accusation. 
Upstream maintainers engage with disposition (code-fix vs spec-fix vs upstream-spec-issue) rather than defending the code against a critique. **What counts as sufficient evidence to confirm a bug (critical).** A code-path trace that demonstrates a specific behavioral violation IS sufficient evidence to confirm a bug. You do NOT need executed request-level evidence, a running test, or an integration-level reproduction to promote a finding from candidate to confirmed. Specifically: - A code-path trace showing function A calls function B which does X but should do Y, with file:line references — **sufficient to confirm**. - A missing case/branch identified by enumeration comparison (spec says X should be handled, code has no handler for X) — **sufficient to confirm**. - A requirement violation identified in Pass 2 where the code demonstrably does not implement the specified behavior — **sufficient to confirm**. - A domain-knowledge finding where you can trace from input through specific code to wrong output — **sufficient to confirm**. Do NOT demand "executed request-level evidence" or defer findings because "they require runtime testing to distinguish implementation choice from spec gap." If the spec or documentation says the behavior should be X, and the code demonstrably produces Y (traceable through the code path), that is a confirmed bug — not a candidate awaiting runtime validation. The regression test and TDD protocol exist to provide runtime evidence AFTER confirmation, not as a prerequisite FOR confirmation. **Why this rule exists:** In v1.3.43 javalin benchmarking, the code review and triage both identified 4 legitimate candidate bugs with code-path traces and requirement violations, then demoted all of them because "the highest-confidence items still require executed request-level evidence." This produced zero confirmed bugs from a codebase where previous versions found 5. The evidentiary bar was set at runtime-proof-before-confirmation, which is backwards — the playbook's design is confirm-then-prove-with-TDD. **Severity calibration:** Credential leakage, authentication bypass, and injection-class bugs are always high severity regardless of assessed likelihood. Authorization header exposure across trust boundaries (e.g., cross-domain redirects) is credential leakage. When in doubt about security-relevant severity, default to high. **Development-scaffolding exclusion (early filter before confirming any security-class finding).** The severity calibration rule above auto-escalates credential leakage and authentication bypass findings. Before confirming any such finding as a bug, apply this mechanical test: > **Does the value under scrutiny contain a self-documenting marker — words like `change-me`, `replace-me`, `placeholder`, `example`, `default`, `dummy`, `TODO`, `FIXME`, `your-secret-here`, `insert-`, or `set-this-` — that explicitly tells the deployer to replace it?** If yes: **this is not a bug.** Do not log it. A self-documenting development placeholder is scaffolding, not a defect — it exists to make the project build and run locally without configuration. The same logic applies to test fixtures, example configs, seed data, and any value whose name or surrounding comments declare it non-production. This is an early filter that catches the most obvious false positives at confirmation time. 
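A minimal sketch of that mechanical test (the helper name and marker tuple are illustrative; the playbook does not ship this function):

```python
SCAFFOLDING_MARKERS = (
    "change-me", "replace-me", "placeholder", "example", "default", "dummy",
    "todo", "fixme", "your-secret-here", "insert-", "set-this-",
)


def is_development_scaffolding(value: str) -> bool:
    """True if the value self-documents as something the deployer is told to replace."""
    lowered = value.lower()
    return any(marker in lowered for marker in SCAFFOLDING_MARKERS)


# A secret that names itself as a placeholder is scaffolding, not a finding.
assert is_development_scaffolding("change-me-before-deploying")
assert not is_development_scaffolding("prod-api-key-7f3a")
```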
The challenge gate in Phase 5 (see `references/challenge_gate.md`) is the broader mechanism that catches subtler cases — documented feature gaps, design decisions with WHY annotations, and findings where the "expected behavior" was invented by the auditor. Any security-class finding that passes the scaffolding exclusion will still be challenge-gated in Phase 5 if it matches an auto-trigger pattern. **Spec basis (mandatory field per bug):** Cite the specific documentation passage that establishes the expected behavior — the gathered doc filename, section/page, and the behavioral contract it defines. If no gathered doc covers the behavior, check whether the project's own comments, README, or API docs define it. If no documentation exists for the expected behavior, classify the bug as a "code inconsistency" rather than a "spec violation" and note this in the severity assessment. A spec violation is a stronger finding than a code inconsistency — it means the code contradicts an authoritative source, not just that the code looks wrong. This distinction matters when reporting upstream: maintainers respond to "your code violates section X.Y of your own spec" differently than "this looks like it might be a bug." **Patch files (mandatory for every confirmed bug).** For each confirmed bug, generate: - `quality/patches/BUG-NNN-regression-test.patch` — a `git diff` that adds a test demonstrating the bug. **This patch is mandatory, not optional.** It is the strongest evidence a bug exists — independent of any opinion about the fix. A confirmed bug without a regression-test patch is incomplete and will cause `quality_gate.py` to FAIL. Generate this patch immediately after confirming the bug, before moving to the next bug. - `quality/patches/BUG-NNN-fix.patch` (optional but strongly encouraged) — a `git diff` with the proposed fix. For bugs where the fix is a single-line or few-line change (e.g., adding a case label, fixing an argument), generate the fix patch — these are low-effort and high-value. **How to generate patch files.** Use `git diff` format. The simplest approach: write the patch content directly as a unified diff. Example for a regression test patch: ``` --- /dev/null +++ b/quality/test_regression_virtio.c @@ -0,0 +1,15 @@ +// Generated by Quality Playbook v1.5.2 +// Regression test for BUG-004: VIRTIO_F_RING_RESET missing from vring_transport_features() +#include +#include +... ``` For fix patches that modify existing source files, use the `--- a/path` / `+++ b/path` format with correct line offsets. If you cannot determine exact line offsets, generate the patch content and note "offsets approximate" — an approximate patch is more valuable than no patch. Patches must apply cleanly against the original source tree with `git apply`. Do not modify the source tree directly. **Patch validation gate (mandatory).** Before declaring any bug as confirmed with a fix patch, run this gate: 1. **Apply test:** `git apply --check quality/patches/BUG-NNN-regression-test.patch` — must exit 0. 2. **Apply test + fix:** `git apply --check quality/patches/BUG-NNN-fix.patch` — must exit 0 (test against clean tree, not against regression-test-applied tree, unless the fix patch depends on the regression test). 3. **Compile check:** After applying both patches, run the project's build/compile command (e.g., `go build ./...`, `mvn compile`, `cargo check`, `tsc --noEmit`). Must succeed. **Temporary worktree for step 3.** Steps 1–2 use `--check` (non-destructive). 
Step 3 requires actually applying patches and compiling, which modifies the source tree. To comply with the source code boundary rule ("never modify files outside `quality/`"), run step 3 in a disposable worktree:

```bash
git worktree add /tmp/qpb-patch-check HEAD --quiet
cd /tmp/qpb-patch-check
git apply quality/patches/BUG-NNN-regression-test.patch quality/patches/BUG-NNN-fix.patch
# run the project's build/compile command here (see the per-ecosystem list below)
cd -
git worktree remove /tmp/qpb-patch-check --force
```

If `git worktree` is unavailable (shallow clone, detached HEAD), use `git stash && git apply ... && git checkout . && git stash pop` as a fallback, or accept `--check`-only validation and note the limitation.

**Compile check per language.** The compile command varies by ecosystem:
- **Go:** `go build ./...`
- **Rust:** `cargo check`
- **Java/Kotlin (Maven):** `mvn compile -q`
- **Java/Kotlin (Gradle):** `./gradlew compileJava compileTestJava -q`
- **TypeScript:** `tsc --noEmit`
- **Python:** `python -m py_compile ` for syntax, then `pytest --collect-only -q` for import/discovery validation
- **JavaScript (Node.js):** `node --check ` for syntax; if the project uses ESLint, `npx eslint ` for structural issues
- **JavaScript (Mocha/Jest):** Run the specific test in discovery-only mode (`mocha --dry-run` or `jest --listTests`) to verify it loads without errors

If no compile/syntax check is feasible for the project's language, document this in the patch entry and rely on the TDD red phase to catch syntax errors.

If any step fails, fix the patch before recording the bug as confirmed. A bug with a corrupt patch that won't apply is not a confirmed bug — it's a hypothesis with broken evidence. The TDD red-green cycle cannot run on patches that don't apply, and reporting a bug with a patch that fails to apply undermines credibility with upstream maintainers. Common patch failures: truncated hunks (missing closing braces), wrong line offsets (patch generated against modified tree instead of clean tree), and syntax errors in generated test code.

**Fix patch requirement.** Every confirmed bug must have either:
- A `quality/patches/BUG-NNN-fix.patch` that passes the validation gate above, OR
- An explicit justification in BUGS.md explaining why no fix patch is provided (e.g., "fix requires architectural change beyond patch scope," "multiple valid fix strategies — deferring to maintainer judgment," "bug is in upstream dependency").

A bug with a regression test but no fix patch and no justification is incomplete. The regression test proves the bug exists; the fix patch (or justification for its absence) completes the evidence chain. Bugs without fix patches cannot achieve "TDD verified (FAIL→PASS)" status — they remain at "confirmed open (xfail)" until a fix is provided.

**TDD verification cycle:** Each confirmed bug with a fix patch should go through the red-green TDD cycle (test fails on unpatched code, passes after fix). This is executed via the `quality/RUN_TDD_TESTS.md` protocol (File 7), not inline during the code review. The protocol generates spec-grounded tests where every assertion message, variable name, and comment traces back to gathered documentation.

**After all three passes:** Combine findings. Write regression tests in `quality/test_regression.*` that reproduce each confirmed bug. Use the same test framework as `test_functional.*` — if functional tests use pytest, regression tests use pytest (with `@pytest.mark.xfail(strict=True)`); if functional tests use unittest, regression tests use unittest (with `@unittest.expectedFailure`).
Report results as a confirmation table (BUG CONFIRMED / FALSE POSITIVE / NEEDS INVESTIGATION). See `references/review_protocols.md` for the full three-pass template and regression test protocol. **Regression test skip guards (mandatory).** Every regression test in `quality/test_regression.*` must include a skip/xfail guard so that running the full test suite on unpatched code does not produce unexpected failures. The guard must be the **earliest syntactic guard for the framework** — a decorator or annotation where idiomatic, otherwise the first executable line in the test body. Use the language-appropriate mechanism: - **Python (pytest):** `@pytest.mark.xfail(strict=True, reason="BUG-NNN: [description]")` — placed as a **decorator above** `def test_...():`, not inside the function body. When the bug is present, the test fails → XFAIL (expected). When the bug is fixed but the marker isn't removed, the test passes → XPASS → strict mode makes this a failure, signaling the guard should be removed. - **Python (unittest):** `@unittest.expectedFailure` — decorator above the test method. - **Go:** `t.Skip("BUG-NNN: [description] — unskip after applying quality/patches/BUG-NNN-fix.patch")` — first line inside the test function. Note: Go's `t.Skip` hides the test entirely (reports SKIP, not FAIL), which is weaker evidence than Python's xfail. This is a known limitation of Go's test primitives. - **Java (JUnit 5):** `@Disabled("BUG-NNN: [description]")` — annotation above the test method. - **Rust:** `#[ignore]` attribute on the test function (the standard "don't run in default suite" mechanism). Use `#[should_panic]` only for bugs that manifest as panics; use `compile_fail` doctest annotation only for compile-time bugs. - **TypeScript/JavaScript (Jest):** `test.failing("BUG-NNN: [description]", () => { ... })` - **TypeScript/JavaScript (Vitest):** `test.fails("BUG-NNN: [description]", () => { ... })` - **JavaScript (Mocha):** `it.skip("BUG-NNN: [description]", () => { ... })` or `this.skip()` inside the test body for conditional skipping. When a bug is fixed (fix patch applied permanently), remove the skip guard and update the BUG tracker closure status from "confirmed open" to "fixed (test passes)". The skip guard message must reference the bug ID and the fix patch path so that someone encountering a skipped test knows exactly how to resolve it. **Source-inspection tests must execute (no `run=False`).** Regression tests that verify source-file structure — string presence in function bodies, case label existence, enum extraction, generated-code shape checks — are safe, deterministic, and fast. They read repository files and perform string matches. For these tests, use `@pytest.mark.xfail(strict=True)` with execution enabled. **Do not use `run=False`** unless the test would mutate external state, hang, or require unavailable infrastructure. A source-inspection test with `run=False` is the worst possible state: the correct check exists but is inert. In v1.3.18, the regression test for BUG-004 (`test_bug_004_transport_feature_whitelist_keeps_ring_reset`) contained the correct assertion `assert "case VIRTIO_F_RING_RESET:" in func` but was marked `run=False` — so the test never executed, the assertion never fired, and the bug remained undetected despite the test suite "passing." When an `xfail(strict=True)` test actually executes and fails, the test suite reports it as XFAIL (expected failure) — this is correct behavior, not a suite failure. 
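A sketch of such a source-inspection test in pytest, using the VIRTIO example from earlier; the source path and extraction helper are illustrative assumptions, not artifacts from a real run:

```python
from pathlib import Path

import pytest

SOURCE = Path("drivers/virtio/virtio_ring.c")  # hypothetical location of the function under inspection


def source_of(function_name: str) -> str:
    """Crude extraction: text from the function name's first occurrence to end of file.
    A generated test would use a tighter extraction (a sed range or a parser)."""
    text = SOURCE.read_text()
    return text[text.index(function_name):]


@pytest.mark.xfail(strict=True, reason="BUG-004: VIRTIO_F_RING_RESET missing from vring_transport_features()")
def test_bug_004_transport_feature_whitelist_keeps_ring_reset():
    # Executes on every run: XFAIL while the bug is present, strict XPASS once it is fixed.
    assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features")
```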
**TDD red/green interaction with skip guards.** During the TDD verification cycle, the red and green phases must temporarily bypass the skip guard to actually execute the test. The protocol should instruct the agent to: - **Red phase (NEVER SKIPPED):** Remove or disable the skip/xfail guard, then run the test against unpatched code. It must fail. Re-enable the guard after recording the result. **The red phase is mandatory for every confirmed bug, even when no fix patch exists.** A bug without red-phase evidence is unverified — do not record `verdict: "skipped"` without a failing red run. If the red phase cannot execute for a documented reason (compilation failure, environment unavailable), record `red_phase: "error"` with an explanation in `notes`. - **Green phase:** Remove or disable the guard, apply the fix patch, run the test. It must pass. If the fix will be reverted, re-enable the guard. **If no fix patch exists, record `green_phase: "skipped"` — but the red phase must still have run.** - **After TDD cycle:** The guard remains in the committed regression test file. It is only permanently removed when the fix is merged into the source tree. **TDD execution enforcement (mandatory).** Regression tests must be actually executed during the TDD verification cycle, not just generated as patch files. For every confirmed bug, the red-phase test run must produce a log file at `quality/results/BUG-NNN.red.log` capturing the test output. The green-phase (if a fix patch exists) must produce `quality/results/BUG-NNN.green.log`. Each log file's first line must be a status tag: `RED` (test failed as expected), `GREEN` (test passed after fix), `NOT_RUN` (test could not be executed — with explanation), or `ERROR` (test infrastructure failed — with explanation). **Language-aware test execution commands.** Use the project's native test runner to execute regression tests. Detect the project language and use the appropriate command: - **Go:** `go test -v -run TestBugNNN ./path/to/package` - **Python (pytest):** `python -m pytest -xvs quality/test_regression.py::test_bug_nnn` - **Python (unittest):** `python -m unittest quality.test_regression.TestRegression.test_bug_nnn` - **Java (Maven + JUnit):** `mvn test -pl module -Dtest=RegressionTest#testBugNnn` - **Java (Gradle + JUnit):** `./gradlew test --tests RegressionTest.testBugNnn` - **Rust:** `cargo test bug_nnn -- --nocapture` - **TypeScript/JavaScript (Jest):** `npx jest --verbose --testNamePattern="BUG-NNN"` - **TypeScript/JavaScript (Vitest):** `npx vitest run --reporter=verbose --testNamePattern="BUG-NNN"` - **C (kernel/make-based):** Source-inspection tests via shell script (grep/awk on source files) — log the script output. If the project uses a language or test framework not listed above, use whatever test runner the project already uses (check for `Makefile`, `package.json`, `build.gradle`, `Cargo.toml`, `go.mod`, `setup.py`, `pyproject.toml`, etc.) and adapt the pattern. If no test runner is available or the language runtime is not installed, record `NOT_RUN` with an explanation — do not skip the log file entirely. **Log capture format.** Each `BUG-NNN.red.log` and `BUG-NNN.green.log` must follow this format: ``` RED --- Test output for BUG-NNN red phase --- Command: [exact command run] Exit code: [exit code] [full stdout/stderr from test execution] ``` The status tag (`RED`, `GREEN`, `NOT_RUN`, `ERROR`) on the first line is machine-readable — `quality_gate.py` will check for its presence. 
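For illustration, the first-line contract can be checked with something as small as this; how `quality_gate.py` actually implements the check is not specified here:

```python
from pathlib import Path

VALID_TAGS = {"RED", "GREEN", "NOT_RUN", "ERROR"}


def log_status(log_path: str) -> str:
    """Return the machine-readable status tag from a BUG-NNN.{red,green}.log file."""
    first_line = Path(log_path).read_text().splitlines()[0].strip()
    if first_line not in VALID_TAGS:
        raise ValueError(f"{log_path}: first line {first_line!r} is not one of {sorted(VALID_TAGS)}")
    return first_line
```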
The `NOT_RUN` status is acceptable when the test runner is unavailable (e.g., a C project where the kernel build environment is not present), but the log file must still exist with an explanation of why the test could not be executed. **Ready-to-run TDD log template.** For each confirmed BUG-NNN, execute this sequence (adapt the test command for the project's language per the table above): ```bash # ── Red phase: revert fix, run test, expect FAIL ── git apply -R quality/patches/BUG-NNN-fix.patch 2>/dev/null # revert fix if applied TEST_CMD="python -m pytest -xvs quality/test_regression.py::test_bug_nnn" # adapt per language OUTPUT=$($TEST_CMD 2>&1); EXIT=$? printf 'RED\n--- Test output for BUG-NNN red phase ---\nCommand: %s\nExit code: %d\n%s\n' \ "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.red.log # ── Green phase: apply fix, run test, expect PASS ── git apply quality/patches/BUG-NNN-fix.patch OUTPUT=$($TEST_CMD 2>&1); EXIT=$? printf 'GREEN\n--- Test output for BUG-NNN green phase ---\nCommand: %s\nExit code: %d\n%s\n' \ "$TEST_CMD" "$EXIT" "$OUTPUT" > quality/results/BUG-NNN.green.log ``` Run this for every confirmed bug. If the test runner is not available, create the log file with `NOT_RUN` on the first line and an explanation. Do not skip this step — the TDD Log Closure Gate in Phase 5 will block completion if logs are missing. **TDD execution gate.** Before the terminal gate in Phase 5, verify that for every confirmed bug in `quality/BUGS.md`, a corresponding `quality/results/BUG-NNN.red.log` exists. Bugs without red-phase logs are incomplete — the regression test patch exists but was never proven to detect the bug. This gate exists because v1.3.45 benchmarking showed that most repos generate regression test patches but never execute them, leaving the TDD verdict unverified. ### File 4: `quality/RUN_INTEGRATION_TESTS.md` **Read `references/review_protocols.md`** for the template. Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries. **Use-case traceability (mandatory).** The test matrix must include a **use-case traceability column**. Each integration test group must either: 1. **Map to a use case** — Name the use case (e.g., UC-03) it validates and describe how the test exercises the user outcome from that use case. These are primary integration tests — they verify that the end-to-end behavior described in the use case actually works. 2. **Be labeled as infrastructure** — Tests that don't map to a use case (build validation, race detection, compatibility checks, existing test suite regression guards) are explicitly labeled `[Infrastructure]` in the traceability column. They have value but don't count toward use-case coverage. After generating the test matrix, check: does every use case in REQUIREMENTS.md have at least one integration test mapped to it? If not, flag the uncovered use case as a gap. Integration tests mapped to use cases should test the **end-to-end behavior** described in the use case — not just run existing unit tests that happen to touch the same code paths. For example, if a use case says "Developer authenticates and follows redirects without leaking secrets," the integration test should perform a redirect across domains with auth headers and verify they're stripped — not just run `pytest -k auth`. 
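A sketch of what that integration test could look like; `requests` stands in for the project's own HTTP client and two local servers stand in for two domains, so a real protocol would drive the code under audit instead:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests  # assumption: stand-in for the project's HTTP client

captured_headers = {}


class RedirectingHandler(BaseHTTPRequestHandler):
    """Server A: redirects every GET to server B, which plays the role of a second domain."""

    target = ""  # absolute URL of server B, filled in by the test

    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", self.target)
        self.end_headers()

    def log_message(self, *args):  # keep test output quiet
        pass


class CapturingHandler(BaseHTTPRequestHandler):
    """Server B: records every header it receives."""

    def do_GET(self):
        captured_headers.update(self.headers.items())
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass


def test_auth_header_not_forwarded_across_domains():
    server_b = HTTPServer(("127.0.0.1", 0), CapturingHandler)
    server_a = HTTPServer(("localhost", 0), RedirectingHandler)
    RedirectingHandler.target = f"http://127.0.0.1:{server_b.server_port}/"
    for srv in (server_a, server_b):
        threading.Thread(target=srv.serve_forever, daemon=True).start()
    try:
        requests.get(
            f"http://localhost:{server_a.server_port}/",
            headers={"Authorization": "Bearer not-a-real-token"},
            allow_redirects=True,
            timeout=5,
        )
        # The user outcome from the use case: the secret never reaches the second domain.
        assert "Authorization" not in captured_headers
    finally:
        server_a.shutdown()
        server_b.shutdown()
```

The assertion targets the user-visible outcome (the secret never reaches the second host), not the client's internal redirect handling.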
**Per-UC group splitting (mandatory).** Each integration test group must map to at most **2 use cases**. A group that maps to 3+ UCs is too coarse — it can't distinguish which use case failed when a test breaks. If a single test command (e.g., `mvn test`, `go test ./...`) would exercise multiple use cases, split it into separate groups with targeted test selectors (`-Dtest=`, `-run`, `-k`, `--tests`, `-- test_name`, etc.) so each group isolates 1–2 UCs. Groups covering all UCs in one undifferentiated command are explicitly prohibited — they provide no diagnostic value when a failure occurs. **No-selector fallback.** If the project's test framework cannot select tests at the granularity needed for splitting (e.g., a monolithic test suite with no tag/filter support), document the limitation in the integration protocol and use the narrowest feasible command. Record which UCs the group covers and why further splitting is not possible. **A single-command project must still use the grouped JSON schema** — wrap the command in one group with a `use_cases` list covering all UCs that command exercises. A flat list of commands is never a valid substitute for the `groups[]` structure. **Pre-flight command validation (mandatory).** Before finalizing `RUN_INTEGRATION_TESTS.md`, verify that each group's test command actually discovers and runs tests. Use the framework's dry-run or list mode to confirm: - **Python:** `pytest --collect-only -q ` — must list at least one test - **Go:** `go test -list "." ` — must list at least one test name - **Java/Kotlin:** `mvn -Dtest= test -pl --batch-mode -DfailIfNoTests=true` - **TypeScript (Vitest):** `vitest list --config ` — must list at least one test - **TypeScript (Jest):** `jest --listTests ` — must list at least one file - **Rust:** `cargo test -- --list` — must list at least one test - **JavaScript (Mocha):** `mocha --dry-run ` — must list at least one test If the dry-run exits with "no tests found," "No test files found," or a zero-test count, fix the selector before recording the group. Common fixes: add `--config` or `--root` flags, use file paths instead of `-t` name patterns, anchor regex patterns to the right package. Do not record a group whose command fails discovery — it will produce a `covered_fail` result that masks a selector bug as a code bug. If the dry-run fails with a **build error** (compilation failure, import error, missing dependency, test setup exception) rather than "no tests found," record the failure in the group's `notes` field as `"pre_flight_error": "environment"` and do not attempt to fix the selector. Environment errors during pre-flight require environment setup, not selector changes. **Infrastructure group definition.** A single `[Infrastructure]` group may cover build validation, race detection, static analysis, and platform compatibility checks without UC mapping. Infrastructure tests verify build toolchain and platform support, not user-observable behavior. 
Infrastructure groups:
- Do **not** count toward use-case coverage (the UC coverage check ignores them)
- Must include a one-line rationale explaining what they validate
- May **not** be used to relabel broad user-workflow commands to avoid splitting — if the tests exercise user-facing behavior described in a use case, they must be mapped to that UC regardless of how the test is organized

**All commands must use relative paths.** The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc.

**Include an Execution UX section.** When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (`✓`/`✗`/`⧗`), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful.

**Structured output (mandatory).** The protocol must instruct the agent to produce machine-readable results alongside the Markdown report, using **JUnit XML** for test execution and a **sidecar JSON** for QPB-specific metadata.

**JUnit XML output:** Each test group should run with the framework's native JUnit XML reporter:
- Python: `pytest --junitxml=quality/results/integration-group-N.xml`
- Go: `gotestsum --junitfile quality/results/integration-group-N.xml -- -run "TestPattern"`
- Java/Kotlin: Copy Surefire XML reports to `quality/results/`
- TypeScript: `jest --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/`
- Rust: `cargo test -- -Z unstable-options --format json --report-time | cargo2junit > quality/results/integration-group-N.xml` (requires cargo2junit and a nightly toolchain; otherwise skip XML for Rust)

If the JUnit XML reporter is unavailable, skip XML and note `"junit_available": false` in the sidecar JSON.

**Sidecar JSON:** Generate `quality/results/integration-results.json` by copying the template below verbatim and filling in only the values. Do not invent fields, rename keys, or restructure the schema. A flat list of commands without the `groups` array is **invalid** — even if the project runs all tests through a single command, wrap it in one group.

```json
{
  "schema_version": "1.1",
  "skill_version": "",
  "date": "YYYY-MM-DD",
  "project": "",
  "recommendation": "SHIP",
  "groups": [
    {
      "group": 1,
      "name": "Core routing dispatch",
      "use_cases": ["UC-01", "UC-02"],
      "result": "pass",
      "tests_passed": 5,
      "tests_failed": 0,
      "junit_file": "integration-group-1.xml",
      "junit_available": true,
      "notes": ""
    }
  ],
  "summary": { "total_groups": 9, "passed": 8, "failed": 1, "skipped": 0 },
  "uc_coverage": { "UC-01": "covered_pass", "UC-02": "covered_pass", "UC-03": "not_mapped" }
}
```

**Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. If any of these fields are missing from your output, the result is non-conformant.

**Invalid examples (do not emit these):**
- A flat `"results": [{"command": "go test ./...", "result": "pass"}]` — this is not the grouped schema.
- A schema with `"commands_run"` instead of `"groups"` — wrong key name.
- A schema missing `"uc_coverage"` — every use case from REQUIREMENTS.md must appear. - A schema with `"use_case_traceability"` instead of `"use_cases"` — wrong field name. Valid `result` values: `"pass"`, `"fail"`, `"skipped"`, `"error"`. Valid `recommendation` values: `"SHIP"` (all groups pass), `"FIX BEFORE MERGE"` (failures in non-blocking groups), `"BLOCK"` (failures in critical groups). The `uc_coverage` section maps every use case from REQUIREMENTS.md to one of: `"covered_pass"` (at least one mapped group passed), `"covered_fail"` (groups mapped but all failed), or `"not_mapped"` (no integration test group maps to this use case). The distinction between `"covered_fail"` and `"not_mapped"` matters: the first means the test exists but the code is broken; the second means the test is missing. Runner scripts and CI tools should read the sidecar JSON for results rather than grepping the Markdown report. This eliminates the class of bugs where grep-based counting produces wrong numbers from matching words in prose. **Post-write validation (mandatory).** After writing `integration-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every `groups[]` entry has `group`, `name`, `use_cases`, `result`, and `notes`, (3) all `result` and `recommendation` values use only the allowed enum values listed above, (4) `uc_coverage` maps every use case from REQUIREMENTS.md, (5) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. **This protocol must exercise real external dependencies.** If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them. **Derive quality gates from the code, not generic checks.** Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness. **Script parallelism, don't just describe it.** Group runs so independent executions (different providers) run concurrently. Include actual bash commands with `&` and `wait`. One run per provider at a time to avoid rate limits. **Calibrate unit counts to the project.** Read `chunk_size` or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing. **Deep post-run verification.** Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run. **Find and use existing verification tools.** Search for existing scripts that verify output quality (e.g., `integration_checks.py`, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., `--dump` flags) in the post-run checklist. **Build a Field Reference Table before writing quality gates.** This is the most important step for protocol accuracy. 
AI models confidently write wrong field names even after reading schemas — `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The fix is procedural: **re-read each schema file IMMEDIATELY before writing each table row.** Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See `references/review_protocols.md` section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy. ### File 5: `quality/RUN_SPEC_AUDIT.md` — Council of Three **Read `references/spec_audit.md`** for the full protocol. Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses. The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts). **Secondary emphasis lenses:** Optionally assign each audit model a secondary emphasis — for example, one starts with input validation, one with resource lifecycle, one with concurrency. Each model still performs a full independent audit; the emphasis biases attention without restricting coverage. Do not split models into disjoint ownership by bug class. **Minority finding rule:** During triage, any finding where only one of three auditors flags it (a minority finding) requires a re-investigation — read the specific code location and make an explicit CONFIRMED/FALSE-POSITIVE determination rather than discarding by default. Minority findings are disproportionately likely to be real bugs that two models missed. **Triage must not raise the evidentiary bar above code-path analysis.** The triage step confirms or rejects findings — it does not defer them pending runtime evidence. If a finding includes a code-path trace showing a behavioral violation (function calls, missing branches, wrong return values with file:line references), the triage should confirm it. Do not demote code-path-traced findings to "candidate" or "needs runtime verification." The TDD protocol (Phase 5) provides runtime evidence AFTER confirmation. See "What counts as sufficient evidence to confirm a bug" in the BUGS.md section for the full evidentiary standard. **Code review vs spec audit conflicts:** If the code review and spec audit disagree on the same finding, the spec audit finding is not automatically correct. Deploy a verification probe — read the specific code location and determine which assessment is accurate. Record the resolution in the BUG tracker. A code review BUG not flagged by any spec auditor is still confirmed but should be verified with a targeted probe before closure. **Verification probes must produce executable evidence.** When the triage step confirms OR rejects a finding via verification probe, prose reasoning alone is not sufficient. The probe must produce a test assertion that mechanically proves the determination: - **For rejections** (finding is false positive): Write an assertion that PASSES, proving the finding is wrong. 
Example: if rejecting "function X is missing null check," write `assert "if (ptr == NULL)" in source_of("X"), "X has null check at line NNN"`. If you cannot write a passing assertion that proves your rejection, **do not reject the finding** — escalate it to confirmed or flag it for manual review. - **For confirmations** (finding is a real bug): Write an assertion that FAILS (expected-failure), proving the bug exists. Example: if confirming "RING_RESET missing from switch," write `assert "case VIRTIO_F_RING_RESET:" in source_of("vring_transport_features"), "RING_RESET should be in the switch but is not"`. - **Every assertion must cite an exact line number** for the evidence it references. Not "lines 3527-3528" but "line 3527: `default:`" — showing what the line actually contains. Assertions without line-number citations are insufficient. **Why this rule exists:** In v1.3.16 virtio testing, the triage correctly received a minority finding that `VIRTIO_F_RING_RESET` was missing from a switch/case whitelist. The triage performed a "verification probe" that claimed lines 3527-3528 "explicitly preserve VIRTIO_F_RING_RESET" — but those lines actually contained the `default:` branch. The triage hallucinated compliance with the code. Had it been required to write `assert "case VIRTIO_F_RING_RESET:" in source`, the assertion would have failed, exposing the hallucination. Requiring executable evidence for rejections makes hallucinated rejections self-defeating: the model cannot write a passing assertion for something that isn't in the code. **Triage evidence must be written to disk.** Verification probe assertions must appear in a file on disk — either appended to `quality/mechanical/verify.sh` or written to a dedicated `quality/spec_audits/triage_probes.sh`. Assertions described in the triage report prose but never written to an executable file are not executable evidence. The gate checks for the existence of probe assertions in the triage output; a triage report that says "verification probe confirms..." without a corresponding assertion in an executable file is non-conformant. This prevents the failure mode where the model narrates what a probe *would* show without actually running it. ### File 6: `AGENTS.md` If `AGENTS.md` already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files. If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers. ### File 7: `quality/RUN_TDD_TESTS.md` — TDD Verification Protocol This protocol is executed after the code review and spec audit have confirmed bugs and generated fix patches. It runs the red-green TDD cycle for each confirmed bug: test fails on unpatched code, apply fix, test passes. **Why a separate protocol?** The code review finds bugs and writes regression tests with `xfail` markers. The TDD protocol takes those tests and proves they actually detect the bug — and that the fix actually fixes it. This is a stronger claim than "we found a bug and wrote a test." It's "here's a test that fails without the patch and passes with it." The distinction matters when reporting bugs upstream: maintainers trust a FAIL→PASS demonstration more than a bug description. The generated protocol must include: 1. 
**Spec-grounded test requirements.** For each bug in `quality/BUGS.md`, the protocol instructs the agent to: - Read the bug's **spec basis** field to identify the documentation passage that defines the expected behavior - Read the gathered doc (from `docs_gathered/` or the project's own docs) at the cited section - Write test assertions using **language from the spec** — variable names, constants, function names, and assertion messages should echo the spec's terminology, not the code's internal naming - Include a comment block in each test citing: the requirement ID (from REQUIREMENTS.md), the bug ID (from BUGS.md), and the spec passage (doc name, section, and a ≤15-word quote of the behavioral contract) 2. **Red-green execution steps.** For each bug with a fix patch: - **Red:** Run the regression test against unpatched source. It must fail. If it passes, the test doesn't detect the bug — rewrite it using the spec basis to understand what behavior to assert. - **Green:** Apply the fix patch (`git apply quality/patches/BUG-NNN-fix.patch`), run the same test. It must pass. - **Record:** Log both results in the BUG tracker with closure status "TDD verified (FAIL→PASS)". 3. **Framework adaptation.** The protocol must detect the project's test framework and generate idiomatic tests: - **Projects with test infrastructure** (pytest, JUnit, Go testing, Jest, cargo test, etc.): Write tests in the project's own framework, following existing test conventions discovered during exploration. - **Projects without test infrastructure** (e.g., Linux kernel, embedded C): Extract the target function with `sed`, write a self-contained C test file with minimal type shims, compile and run directly. Include the extraction command in the test file's header comment so it's self-documenting. 4. **Upstream reporting format.** For each TDD-verified bug, generate a ready-to-send report block containing: - One-sentence description citing the spec section violated - The FAIL→PASS output (copy-pasteable terminal session) - The test file (as an attachment or inline) - The fix patch (as an attachment or inline) 5. **Traceability table.** The protocol produces a `quality/TDD_TRACEABILITY.md` file mapping: | Bug ID | Requirement ID | Spec Doc | Spec Section | Behavioral Contract | Test File:Function | Red Result | Green Result | |--------|---------------|----------|-------------|--------------------|--------------------|------------|--------------| Every row must be fully populated. A bug without a spec doc entry is a code inconsistency, not a spec violation — note this in the table and adjust the upstream reporting language accordingly. 6. **Structured output (mandatory).** The protocol must produce machine-readable results alongside the Markdown report. Use **JUnit XML** for test execution results and a **sidecar JSON** file for QPB-specific metadata that JUnit XML cannot represent. 
**JUnit XML output:** For each red-green phase, run the test with the framework's native JUnit XML output flag:

- Python: `pytest --junitxml=quality/results/tdd-red-BUG-NNN.xml`
- Go: `gotestsum --junitfile quality/results/tdd-red-BUG-NNN.xml -- -run TestRegression_BUG_NNN`
- Java/Kotlin: Maven Surefire reports are generated automatically in `target/surefire-reports/`; copy relevant XML to `quality/results/`
- Rust: `cargo test --test regression -- -Z unstable-options --format json | cargo2junit > quality/results/tdd-red-BUG-NNN.xml` (needs libtest's unstable JSON output plus cargo2junit; otherwise skip XML for Rust)
- TypeScript: `jest --reporters=default --reporters=jest-junit` with `JEST_JUNIT_OUTPUT_DIR=quality/results/`

If the framework's JUnit XML reporter is not available or requires a missing dependency, skip the XML output for that language and note it in the sidecar JSON (`"junit_available": false`). Do not fail the TDD run over missing XML tooling.

**Sidecar JSON (strict schema enforcement):** Generate `quality/results/tdd-results.json` by copying the template below **verbatim** and filling in only the values. Do not invent fields, rename keys, or restructure the schema. The template is the schema — any deviation (extra keys, missing keys, renamed keys, restructured nesting) makes the output non-conformant. Copy-paste the template into your editor first, then fill in the values. Do not write the JSON from memory.

```json
{
  "schema_version": "1.1",
  "skill_version": "",
  "date": "YYYY-MM-DD",
  "project": "",
  "bugs": [
    {
      "id": "BUG-001",
      "requirement": "REQ-003",
      "red_phase": "fail",
      "green_phase": "pass",
      "verdict": "TDD verified",
      "regression_patch": "quality/patches/BUG-001-regression-test.patch",
      "fix_patch": "quality/patches/BUG-001-fix.patch",
      "fix_patch_present": true,
      "patch_gate_passed": true,
      "writeup_path": "quality/writeups/BUG-001.md",
      "junit_red": "tdd-red-BUG-001.xml",
      "junit_green": "tdd-green-BUG-001.xml",
      "junit_available": true,
      "notes": ""
    }
  ],
  "summary": { "total": 6, "verified": 4, "confirmed_open": 1, "red_failed": 1, "green_failed": 0 }
}
```

**Required top-level fields:** `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. **Required per-bug fields:** `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. If any required field is missing, the result is non-conformant. **Optional per-bug fields** (shown in the template above but not gate-checked): `regression_patch`, `fix_patch`, `patch_gate_passed`, `junit_red`, `junit_green`, `junit_available`, `notes`. Include these when the data is available; omit them without penalty. **Required summary sub-keys:** The `summary` object must contain exactly these keys: `total`, `verified`, `confirmed_open`, `red_failed`, `green_failed`. All five are required — omitting any of them (especially `red_failed` or `green_failed`) makes the summary non-conformant. **Canonical patch file names:** Regression test patches must be named `BUG-NNN-regression-test.patch`. Fix patches must be named `BUG-NNN-fix.patch`. The gate script globs for these exact patterns — creative variants like `BUG-001-regression.patch` or `BUG-001-test.patch` will not be counted. **Date field:** Use the actual date of this session (e.g., `"2026-04-12"`), not the template placeholder `"YYYY-MM-DD"`. The gate validates that the date is a real ISO 8601 date and rejects placeholder strings and future dates.
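To make the required-key rules concrete, here is a minimal post-write check sketch (illustrative only — the authoritative validation lives in `quality_gate.py`; the key names are exactly those from the template above):

```python
import json
from pathlib import Path

# Key sets copied from the template above — not an alternate schema.
REQUIRED_ROOT = {"schema_version", "skill_version", "date", "project", "bugs", "summary"}
REQUIRED_BUG = {"id", "requirement", "red_phase", "green_phase", "verdict",
                "fix_patch_present", "writeup_path"}
REQUIRED_SUMMARY = {"total", "verified", "confirmed_open", "red_failed", "green_failed"}

def sidecar_problems(path="quality/results/tdd-results.json"):
    """Return missing-key problems; an empty list means the sidecar is key-conformant."""
    data = json.loads(Path(path).read_text())
    problems = [f"missing root key: {key}" for key in REQUIRED_ROOT - data.keys()]
    for bug in data.get("bugs", []):
        bug_id = bug.get("id", "<no id>")
        problems += [f"{bug_id}: missing field {key}" for key in REQUIRED_BUG - bug.keys()]
    problems += [f"summary missing key: {key}"
                 for key in REQUIRED_SUMMARY - data.get("summary", {}).keys()]
    return problems
```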
**Invalid examples (do not emit these):** - `"runs": [{"phase": "red", "command": "...", "result": "4 xfailed"}]` — this is a flat runs array, not the bug-indexed `"bugs"` schema. - A schema with ad-hoc root keys like `"generated"`, `"scope"`, `"status"`, `"testsRun"` — these are not the standard schema fields. - `"verdict": "skipped"` — this value is deprecated; use `"confirmed open"` with `red_phase: "fail"` and `green_phase: "skipped"`. - Missing `"schema_version"` at the root — every tdd-results.json must include this field. Valid `verdict` values: `"TDD verified"` (FAIL→PASS), `"red failed"` (test passed on unpatched code — test doesn't detect the bug), `"green failed"` (test still fails after fix — fix is incomplete or patch is corrupt), `"confirmed open"` (red phase ran and confirmed the bug, no fix patch available), `"deferred"` (TDD cannot execute in this environment — use with `notes` explaining why). **Do not use `"skipped"` as a verdict** — every confirmed bug must have a red-phase result. A bug with `verdict: "confirmed open"` must have `red_phase: "fail"` (red ran and confirmed the bug) and `green_phase: "skipped"` (no fix to apply). Valid `red_phase`/`green_phase` values: `"fail"`, `"pass"`, `"error"` (compile/apply failure), `"skipped"` (green only — red is never skipped). The `patch_gate_passed` field records whether the patch validation gate (apply-check + compile) succeeded — `false` if the gate failed and the patch was repaired, `null` if no fix patch exists. The `writeup_path` field points to the per-bug writeup file (see "Bug writeup generation" below) — `null` if no writeup was generated for this bug. Runner scripts and CI tools should read the sidecar JSON for pass/fail counts rather than grepping the Markdown report. **Post-write validation (mandatory).** After writing `tdd-results.json`, reopen the file and verify: (1) every required top-level field is present, (2) every required per-bug field is present in each `bugs[]` entry, (3) all `verdict`, `red_phase`, and `green_phase` values use only the allowed enum values listed above, (4) no extra undocumented root keys exist. If any check fails, fix the file before proceeding. This step catches the most common failure mode: the agent paraphrases the schema from memory instead of copying the template, producing plausible but non-conformant output. **TDD artifact closure gate (mandatory).** If `quality/BUGS.md` contains any confirmed bugs, `quality/results/tdd-results.json` is mandatory — not optional. If any bug has a red-phase result (whether TDD-verified or confirmed-open), `quality/TDD_TRACEABILITY.md` is also mandatory. Zero-bug repos may omit both files. A run that confirms bugs but produces no tdd-results.json is incomplete — the phase cannot close. For repos where TDD cannot execute (environment blocked, no test infrastructure), generate tdd-results.json with `verdict: "deferred"` and a `notes` field explaining why (e.g., `"environment_blocked: missing workspace Cargo.toml"`, `"no_test_infrastructure: kernel C code without userspace harness"`). The deferred verdict makes the gap visible instead of silently omitting the file. **Execution UX:** Same three-phase pattern as the integration tests — (1) show the plan as a numbered table of bugs to verify, (2) report one-line progress as each red-green cycle runs (`FAIL ✓ → PASS ✓` or `FAIL ✗ — test passes on unpatched code, rewriting`), (3) show a summary table with verified/failed/rewritten counts. 7. 
**Bug writeup generation (for all confirmed bugs).** After a successful red→green cycle (`verdict: "TDD verified"`) or confirmation without a fix (`verdict: "confirmed open"`), generate a self-contained writeup at `quality/writeups/BUG-NNN.md`. This file is designed to be emailed to a maintainer, attached to a Jira ticket, or reviewed outside the repository — it must stand alone without requiring the reader to navigate the rest of the quality artifacts. **Template (sections 1–4, 6, 7 are required in every writeup; add 5 when the depth judgment fires; add 8 when related bugs exist):** 1. **Summary** — One paragraph: what's wrong, where (file:line), what breaks in practice. 2. **Spec reference** — The specific spec section violated, with URL if available. Quote the behavioral contract (≤15 words) that the code fails to satisfy. 3. **The code** — The buggy code with file:line citation. Explain why it's wrong in terms of the spec, not just "it looks weird." 4. **Observable consequence** — What actually breaks. Not "could theoretically fail" — what does fail, under what conditions, with what symptoms. 5. **Depth judgment** *(include only when expansion is warranted)* — After drafting sections 1–4, assess: is the consequence self-evident from the code and test alone? If a reader would reasonably ask "why hasn't anyone noticed this?" or "does this affect all configurations equally?", expand the analysis. Trace the buggy function's callers. Show which code paths expose the bug and which mask it. Concrete expansion triggers: transport/config-dependent behavior, feature flags that mask the bug on some paths, indirect dispatch hiding callers, bugs in negotiation/initialization code that only manifest under specific runtime conditions. If the consequence is obvious from the immediate code (e.g., a null dereference, an off-by-one), keep sections 1–4 tight and omit this section. 6. **The fix** — A proposed fix as an inline diff (unified diff format), with a brief explanation of why this is the right fix. **Always include a concrete diff** — even for confirmed-open bugs without a separate `.patch` file. If the fix is a one-line change (adding a case label, fixing an argument), write the diff. If the fix requires broader changes, write the minimal diff that addresses the core defect and note what additional changes a full fix would need. The inline diff in the writeup is what makes the writeup actionable — a writeup that says "No fix patch is included" is incomplete and not useful to a maintainer. Example format: ```diff --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -3527,6 +3527,7 @@ void vring_transport_features(...) case VIRTIO_F_ORDER_PLATFORM: case VIRTIO_F_IN_ORDER: + case VIRTIO_F_RING_RESET: default: ``` 7. **The test** — What the test proves, how to run it, and what output to expect on unpatched vs patched code. 8. **Related issues** *(include only when related bugs exist)* — Other bugs in the same class, if any. Flag them even if they're not confirmed yet. Omit this section if no related issues were identified. **Include the version stamp** at the top of the writeup file (same format as all other generated files). **Writeup generation for all confirmed bugs (mandatory).** Generate a writeup at `quality/writeups/BUG-NNN.md` for every confirmed bug — both TDD-verified and confirmed-open. Use the numbered section template above (sections 1–8). 
For confirmed-open bugs, follow the same template including a proposed fix diff in section 6 (the diff is always required even without a separate `.patch` file). The writeup threshold is bug confirmation, not TDD completion. A run with confirmed bugs and no writeups directory is incomplete. **Inline diff is gate-enforced.** The `quality_gate.py` script checks that every writeup contains a ` ```diff ` block. A writeup without an inline diff will cause the gate to FAIL. Do not write "see patch file" — paste the actual diff inline in the writeup body, inside a fenced ` ```diff ` code block. This is the single most important element of the writeup because it makes the bug actionable for a maintainer reading just the writeup. ### Checkpoint: Update PROGRESS.md after artifact generation Re-read `quality/PROGRESS.md`. Update: - Mark Phase 2 complete with timestamp - Update the artifact inventory: set each generated artifact to "generated" with its file path - Add exploration summary notes if not already present **Phase 2 completion gate (mandatory).** Before proceeding to Phase 3, verify: 1. All core artifacts exist on disk (`QUALITY.md`, `CONTRACTS.md`, `REQUIREMENTS.md`, `COVERAGE_MATRIX.md`, `COMPLETENESS_REPORT.md`, `test_functional.*`, `RUN_CODE_REVIEW.md`, `RUN_INTEGRATION_TESTS.md`, `RUN_SPEC_AUDIT.md`, `RUN_TDD_TESTS.md`, `AGENTS.md`). 2. `REQUIREMENTS.md` contains requirements with specific conditions of satisfaction referencing actual code (file paths, function names, line numbers) — not abstract behavioral descriptions. 3. If dispatch/enumeration contracts exist: `quality/mechanical/verify.sh` exists and has been executed. 4. PROGRESS.md marks Phase 2 complete with timestamp. Re-read `quality/PROGRESS.md` and `quality/REQUIREMENTS.md` before starting Phase 3. The requirements are the target list for the code review — every requirement is a potential bug if the code doesn't satisfy its conditions. **End-of-phase message (mandatory — print this after Phase 2 completes, then STOP):** ``` # Phase 2 Complete — Quality Artifacts Generated I've generated the quality infrastructure for this project: [List the key artifacts created: REQUIREMENTS.md with N requirements and N use cases, QUALITY.md with N scenarios, functional tests, review protocols, etc.] The requirements are now the target list for Phase 3's code review — every requirement is a potential bug if the code doesn't satisfy it. To continue to Phase 3 (Code review with regression tests), say: Run quality playbook phase 3. Or say "keep going" to continue automatically. ``` **After printing this message, STOP. Do not proceed to Phase 3 unless the user explicitly asks.** --- ## Phase 3: Code Review and Regression Tests > **Required references for this phase:** > - `quality/REQUIREMENTS.md` — target list for the code review > - `references/review_protocols.md` — three-pass protocol and regression test conventions Run the code review protocol (all three passes) as described in File 3. After producing findings, write regression tests for every confirmed BUG per the closure mandate in `references/review_protocols.md`. **Update PROGRESS.md:** Add every confirmed BUG to the cumulative BUG tracker with source "Code Review", the file:line reference, description, severity, and closure status (regression test function name or exemption reason). Mark Phase 3 (Code review + regression tests) complete. 
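As an illustration of the regression-test convention (an expected-failure marker tied to a bug ID, plus the requirement/spec citation block), here is a minimal pytest sketch for a hypothetical Python project — the module, bug ID, requirement ID, and spec quote below are invented for the example, not taken from a real run:

```python
import pytest
from myproject.headers import parse_header  # hypothetical module under test

# BUG-003 | REQ-012 | wire-format spec §4.2: "length MUST NOT exceed the buffer size"
@pytest.mark.xfail(reason="BUG-003: oversized length field is accepted instead of rejected")
def test_regression_bug_003_oversized_length_rejected():
    # Per the spec contract, a declared length larger than the payload must be rejected.
    with pytest.raises(ValueError):
        parse_header(b"\xff\xff" + b"\x00" * 4)
```

Whether the `xfail` marker is removed during the green phase or left in place to report an expected pass is governed by the conventions in `references/review_protocols.md`.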
**End-of-phase message (mandatory — print this after Phase 3 completes, then STOP):** ``` # Phase 3 Complete — Code Review The three-pass code review is done. [Summarize: N bugs confirmed, N regression test patches generated, N fix patches generated. List the bug IDs and one-line summaries.] To continue to Phase 4 (Spec audit — Council of Three), say: Run quality playbook phase 4. Or say "keep going" to continue automatically. ``` **After printing this message, STOP. Do not proceed to Phase 4 unless the user explicitly asks.** --- ## Phase 4: Spec Audit and Triage > **Required references for this phase:** > - `references/spec_audit.md` — Council of Three protocol, triage process, verification probes Run the spec audit protocol as described in File 5. The triage report **must** include a `## Pre-audit docs validation` section (see `references/spec_audit.md` for the full template). This section is required even if `docs_gathered/` is empty — in that case, note what baseline the auditors used instead. Every verification probe in the triage must produce executable evidence (test assertions with line-number citations) per the "Verification probes must produce executable evidence" rule above. After triage, categorize each confirmed finding. **Effective council gating for enumeration checks.** If the effective council is less than 3/3 (fewer than three auditors returned usable reports) and the run includes any whitelist/enumeration/dispatch-function checks or any carried-forward seed checks, the audit may not conclude "no confirmed defects" for those checks without executed mechanical proof artifacts. An incomplete council with mechanical verification is acceptable. An incomplete council relying on prose-only validation for code-presence claims is not — escalate to "NEEDS VERIFICATION" and run the mechanical check before closing. **Pre-audit spot-checks must extract from code, not assert from docs.** When the spec audit prompt includes spot-check claims for pre-validation (e.g., "verify that function X handles constant Y at line Z"), the triage must validate each claim by extracting the actual code at the cited lines — not by confirming that the claim sounds plausible. For each spot-check claim about code contents, the pre-validation must report what the cited lines actually contain: "Line 3527 contains `default:` — NOT `case VIRTIO_F_RING_RESET:` as claimed." If the spot-check was generated from requirements or gathered docs rather than from the code itself, treat it as a hypothesis to test, not a fact to confirm. This rule prevents the contamination chain observed in v1.3.17 where a false spot-check claim ("RING_RESET at 3527-3528") was accepted as "accurate" without reading the actual lines, then propagated through the triage and into every downstream artifact. **Update PROGRESS.md:** Add every confirmed **code bug** from the spec audit to the cumulative BUG tracker with source "Spec Audit". This is critical — spec-audit bugs are systematically orphaned if they aren't added to the same tracker that the closure verification reads. ### Layer 2 — Semantic Citation Check (v1.5.2 Council sub-pass) After the main spec audit triage, each Council member runs a per-REQ verdict against every Tier 1/2 REQ's `citation_excerpt`. This is **Layer 2** of the hallucination gate: Layer 1 is the mechanical byte-equality check — `bin/citation_verifier` is invoked by `bin/reference_docs_ingest` at ingest time and re-invoked by `quality_gate.py` at gate time; the LLM never shells out to it directly. 
Layer 2 is semantic — the reviewer decides whether the excerpt actually supports the requirement as stated, or whether the requirement overreaches what the excerpt says. **Protocol.** 1. **One prompt per Council member, all Tier 1/2 REQs batched in.** Not one REQ at a time (3×N prompts is too many). Not a prose response (pattern-matching risk). The reviewer receives the full list of `(req_id, citation_excerpt, REQ description)` tuples and returns a structured per-REQ JSON response. 2. **Structured response schema (schemas.md §9.2).** For each REQ the reviewer records `{"req_id": "REQ-NNN", "reviewer": "", "verdict": "supports" | "overreaches" | "unclear", "notes": ""}`. Valid `verdict` values are enumerated in schemas.md §3.5. 3. **Batching threshold.** When a run produces more than 15 Tier 1/2 REQs, split into batches of up to 15 REQs per prompt per Council member. The same reviewer sees each batch sequentially; their response entries are concatenated into one `reviews[]` array under the same `reviewer` string. 4. **Reviewer identifier stability.** Use fixed strings like `"claude-opus-4.7"`, `"gpt-5.4"`, `"gemini-2.5-pro"`. The majority computation in schemas.md §10 invariant #17 groups on this field — a typo silently becomes a fourth reviewer and breaks the 2-of-3 majority check. 5. **Output.** Concatenate all Council members' responses into `quality/citation_semantic_check.json` using the §1.6 manifest wrapper, except the record array is named `reviews` rather than `records` (schemas.md §9.1). One file per run, regenerated on every audit pass. **Majority rule (gate-enforced).** For each Tier 1/2 REQ, the gate groups reviews by `req_id` and fails the run when ≥2 of 3 reviewers recorded `verdict == "overreaches"`. A single-member `overreaches` or `unclear` verdict surfaces as a warning but does not fail the gate. A REQ with fewer than three reviewer entries (missing reviewer, skipped batch) has insufficient evidence — the gate treats that as a fail. **No-op for Spec Gap runs.** If a run produces zero Tier 1/2 REQs, `citation_semantic_check.json` is still written with an empty `reviews` array — the file's existence is part of the artifact contract even when the check has nothing to evaluate. ### Post-spec-audit regression tests After the spec audit triage, check the cumulative BUG tracker in PROGRESS.md. Any spec-audit BUG that doesn't have a regression test yet needs one now. Write regression tests for spec-audit confirmed code bugs using the same conventions as code-review regression tests (expected-failure markers, test-finding alignment, executable source files). **Why this step exists:** Code review bugs get regression tests immediately because tests are written right after the review. Spec audit runs after the tests are written, so its confirmed bugs are orphaned — they appear in the triage report but never get tests. This step closes that gap. **Individual auditor artifacts (mandatory).** The spec audit must produce individual auditor report files at `quality/spec_audits/` with filenames containing `auditor` (canonical format: `YYYY-MM-DD-auditor-N.md`, e.g., `2026-04-12-auditor-1.md`; also accepted: `auditor__.md`). The gate globs for `*auditor*` — any conformant name will match. One file per auditor, not just the triage synthesis. Each auditor report records what that auditor found independently before triage reconciliation. 
If only the triage file exists with no individual auditor artifacts, the audit is incomplete — the triage cannot be verified because there is no record of pre-reconciliation findings. This requirement exists because a single triage file conflates discovery with reconciliation, making it impossible to tell whether a finding was independently confirmed or synthesized from a single source. **Phase 4 completion gate.** Phase 4 is not complete until a triage file exists at `quality/spec_audits/YYYY-MM-DD-triage.md` **and** individual auditor reports exist. If only auditor reports exist with no triage synthesis, mark Phase 4 as "partial — triage pending" in PROGRESS.md and complete the triage before proceeding. If only the triage exists with no individual reports, mark Phase 4 as "partial — auditor artifacts missing" and regenerate them. The PROGRESS.md checkbox must not be set until both the triage file and auditor reports are confirmed present. Update the BUG tracker entries with regression test references. Mark Phase 4 (Spec audit + triage) complete. **End-of-phase message (mandatory — print this after Phase 4 completes, then STOP):** ``` # Phase 4 Complete — Spec Audit The Council of Three spec audit is done. [Summarize: N auditors ran, N net-new bugs confirmed from triage, total bugs now at N. List any new bug IDs and summaries.] To continue to Phase 5 (Reconciliation — TDD verification, writeups, closure), say: Run quality playbook phase 5. Or say "keep going" to continue automatically. ``` **After printing this message, STOP. Do not proceed to Phase 5 unless the user explicitly asks.** --- ## Phase 5: Post-Review Reconciliation and Closure Verification > **Required references for this phase:** > - `quality/PROGRESS.md` — cumulative BUG tracker (authoritative finding list) > - `references/challenge_gate.md` — two-round challenge protocol for false-positive detection > - `references/requirements_pipeline.md` — post-review reconciliation process > - `references/review_protocols.md` — regression test cleanup after reversals > - `references/spec_audit.md` — verification probe protocol for conflicts **Phase 5 entry gate (mandatory — HARD STOP).** Before proceeding, verify ALL of the following Phase 4 artifacts exist: 1. `quality/spec_audits/` directory exists and contains at least one `*triage*` file (the triage synthesis) 2. `quality/spec_audits/` contains at least one `*auditor*` file (individual auditor reports) 3. `quality/PROGRESS.md` exists and its Phase 4 line is marked `[x]` If any of these are missing, STOP and go back to Phase 4. Do not proceed with reconciliation until the spec audit artifacts are confirmed present — reconciliation without triage data produces an incomplete closure report. Re-read `quality/PROGRESS.md` — specifically the cumulative BUG tracker. This is the authoritative list of all findings across both code review and spec audit. **Challenge gate (mandatory before reconciliation).** Before running closure verification, apply the challenge gate to every confirmed bug that matches an auto-trigger pattern. Read `references/challenge_gate.md` for the full protocol. In summary: 1. Scan the BUG tracker for bugs matching any auto-trigger pattern (security-class findings, code with design-decision comments at the cited location, findings with no spec basis, sibling code paths handling the same concern differently, findings about missing functionality). 2. For each triggered bug, run the two-round challenge using fresh sub-agents as described in the reference. 3. 
Record verdicts in `quality/challenge/BUG-NNN-challenge.md`. 4. Apply verdicts: CONFIRMED bugs proceed normally. DOWNGRADED bugs get their severity adjusted. REJECTED bugs are removed from the BUG tracker and relocated to a "Reviewed and dismissed" appendix in BUGS.md with the challenge reasoning. **Apply common sense throughout.** The challenge gate's primary purpose is to catch findings where pattern-matching overrode judgment. If a bug would make you look foolish reporting it to the upstream maintainer — a self-documenting placeholder flagged as a critical vulnerability, a documented design decision flagged as a defect, an intentional feature gap flagged as a security hole — it should not survive the challenge. The common-sense test is not one factor among many; it is the framing for the entire review. **Why this gate exists:** In v1.4.6 edgequake benchmarking, the code review confirmed 42 bugs including 7 rated CRITICAL. After manual review, the strongest finding (BUG-001, source_ids overwrite) was HIGH, not CRITICAL. Six "CRITICAL" tenant-isolation bugs were documented feature gaps with explicit WHY-OODA81 annotations. One "CRITICAL" JWT finding (BUG-041) was a self-documenting development placeholder containing the literal string "change-me-in-production." The model defended these findings through multiple rounds of pushback because its instinct was to find and defend bugs, not to apply common sense about what constitutes a defect. The challenge gate forces that common-sense review to happen before findings are finalized. 1. **Run the Post-Review Reconciliation** as described in `references/requirements_pipeline.md`. Update COMPLETENESS_REPORT.md. 2. **Run closure verification:** For every row in the BUG tracker, verify it has either a regression test reference or an explicit exemption. If any BUG lacks both, write the test or exemption now. 3. **Triage-to-BUGS.md sync gate (mandatory).** Re-read the triage report (`quality/spec_audits/*-triage.md`). For every finding confirmed as a code bug, verify it appears in `quality/BUGS.md`. If BUGS.md does not exist, create it now. If BUGS.md exists but is missing confirmed bugs from the triage, append them. A triage report with confirmed code bugs and no corresponding BUGS.md entries is non-conformant — the phase cannot be marked complete until they are synced. This gate exists because in v1.3.21 benchmarking, javalin's triage confirmed 2 bugs but BUGS.md was never created. 4. **Clean up after spec-audit reversals:** If the spec audit reclassified any code review BUG as a design choice or false positive, remove or relocate the corresponding regression test per `references/review_protocols.md`. 5. **Resolve CR vs spec-audit conflicts:** If the code review and spec audit disagree on the same finding (one says BUG, the other says design choice), deploy a verification probe per `references/spec_audit.md` and record the resolution in the BUG tracker. **TDD sidecar-to-log consistency check (mandatory).** For every bug entry in `tdd-results.json`, verify the corresponding log files exist and agree. If `tdd-results.json` contains a bug with `verdict: "TDD verified"`, then `quality/results/BUG-NNN.red.log` must exist with first line `RED` and `quality/results/BUG-NNN.green.log` must exist with first line `GREEN`. If the sidecar claims "TDD verified" but no red-phase log exists, the verdict is unsubstantiated — either create the log by running the test, or downgrade the verdict to `"confirmed open"`. 
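A minimal sketch of that consistency check, assuming the log layout described here (illustrative; the authoritative check belongs to `quality_gate.py` and this reconciliation step):

```python
import json
from pathlib import Path

def unsubstantiated_verdicts(sidecar="quality/results/tdd-results.json"):
    """Return IDs of bugs marked 'TDD verified' whose RED/GREEN logs are missing or disagree."""
    data = json.loads(Path(sidecar).read_text())
    flagged = []
    for bug in data.get("bugs", []):
        if bug.get("verdict") != "TDD verified":
            continue
        red = Path(f"quality/results/{bug['id']}.red.log")
        green = Path(f"quality/results/{bug['id']}.green.log")
        red_ok = red.exists() and red.read_text().splitlines()[:1] == ["RED"]
        green_ok = green.exists() and green.read_text().splitlines()[:1] == ["GREEN"]
        if not (red_ok and green_ok):
            flagged.append(bug["id"])  # run the test to create the logs, or downgrade the verdict
    return flagged
```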
This check exists because v1.3.46 benchmarking showed agents writing "TDD verified" verdicts in the JSON based on narrative reasoning without ever executing the test. **Executed evidence outranks narrative artifacts (contradiction gate).** Before running the terminal gate, check for contradictions between executed evidence and prose artifacts. Executed evidence includes: mechanical verification artifacts (`quality/mechanical/*`), verification receipt files (`quality/results/mechanical-verify.log`, `quality/results/mechanical-verify.exit`), regression test results (`test_regression.*` with `xfail` outcomes), TDD red-phase log files (`quality/results/BUG-NNN.red.log`), and any shell command output saved during the pipeline. Prose artifacts include: `REQUIREMENTS.md`, `CONTRACTS.md`, code reviews, spec audit triage, and `BUGS.md`. If an executed artifact shows a constant is absent (mechanical check), a test fails (regression test), or a red-phase confirms a bug (TDD traceability) — but a prose artifact claims the constant is present, the bug is fixed, or the code is compliant — the executed result wins. Re-open and correct the contradictory prose artifact before proceeding. Specifically: if `mechanical-verify.exit` contains a non-zero value, PROGRESS.md may not claim "Mechanical verification: passed" and the terminal gate may not pass — regardless of what any other artifact says. In v1.3.18, the triage claimed RING_RESET was preserved (`spec_audits/triage.md`), BUGS.md claimed "fixed in working tree," but TDD traceability showed the assertion `assert "case VIRTIO_F_RING_RESET:" in func` failed on the current source. Those three cannot all be true — the executed failure is the ground truth. This gate would have caught that contradiction. **Version stamp consistency check (mandatory).** Read the `version:` field from the SKILL.md metadata (using the reference file resolution order). Then check every generated artifact: PROGRESS.md's `Skill version:` field, every `> Generated by` attribution line, every code file header stamp, and every sidecar JSON `skill_version` field. Every version stamp must match the SKILL.md metadata exactly. A single mismatch is a benchmark failure — fix the stamp before proceeding. This check exists because in v1.3.21 benchmarking, 5 of 9 repos had version stamps from older skill versions (v1.3.16 or v1.3.20) because the PROGRESS.md template contained a hardcoded version number. **Mechanical directory conformance check.** If `quality/mechanical/` exists, it must contain at minimum a `verify.sh` file. An empty `quality/mechanical/` directory is non-conformant — it implies the step was attempted but abandoned. If no dispatch-function contracts exist in this project's scope, do not create a `mechanical/` directory at all. Instead, record in PROGRESS.md: `Mechanical verification: NOT APPLICABLE — no dispatch/registry/enumeration contracts in scope.` If dispatch contracts do exist, `verify.sh` must include one verification block per saved extraction file under `quality/mechanical/` (not just one). A verify.sh that checks only one artifact when multiple exist is incomplete. 
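One way to check the one-block-per-artifact rule mechanically — a hypothetical helper for illustration, not the actual `quality_gate.py` implementation:

```python
from pathlib import Path

def unverified_extractions(mech_dir="quality/mechanical"):
    """Return extraction artifacts in quality/mechanical/ that verify.sh never references."""
    mech = Path(mech_dir)
    verify_path = mech / "verify.sh"
    if not verify_path.exists():
        return ["verify.sh is missing"]
    verify_text = verify_path.read_text()
    extractions = [p.name for p in mech.iterdir() if p.is_file() and p.name != "verify.sh"]
    return [name for name in extractions if name not in verify_text]
```

An empty list means every saved extraction file is at least referenced by a verification block; a non-empty list means verify.sh is incomplete.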
**Verification receipt gate (mandatory before terminal gate).** If `quality/mechanical/` exists, the following receipt files must also exist before the terminal gate may run: - `quality/results/mechanical-verify.log` — full stdout/stderr from `bash quality/mechanical/verify.sh` - `quality/results/mechanical-verify.exit` — a single line containing the exit code (e.g., `0`) If either file is missing, run `bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1; echo $? > quality/results/mechanical-verify.exit` now. If the exit code is not `0`, the terminal gate fails — do not proceed until the mechanical mismatch is resolved (by fixing the extraction, not by editing verify.sh or the receipt). PROGRESS.md may not claim "Mechanical verification: passed" unless `mechanical-verify.exit` contains `0`. This gate exists because v1.3.23 PROGRESS.md claimed all verification passed when verify.sh actually returned exit 1 — the receipt file makes this claim auditable. **TDD Log Closure Gate (mandatory before terminal gate).** Before proceeding to the terminal gate, enumerate all confirmed bug IDs from `quality/BUGS.md` and verify: 1. `quality/results/BUG-NNN.red.log` exists for every confirmed bug. 2. If `quality/patches/BUG-NNN-fix.patch` exists for that bug, `quality/results/BUG-NNN.green.log` also exists. 3. The first line of each log file is one of: `RED`, `GREEN`, `NOT_RUN`, `ERROR`. If any check fails, stop and generate the missing logs now using the language-aware test execution commands from the TDD execution enforcement section. Do not proceed to the terminal gate with missing TDD logs — a bug with a "TDD verified" verdict in tdd-results.json but no corresponding red-phase log is a contradiction. **Terminal gate (mandatory before marking Phase 5 complete):** **Prerequisite check:** The terminal gate may run only if Phase 3 (code review) and Phase 4 (spec audit) are both complete, or explicitly marked skipped with rationale in PROGRESS.md. A zero-bug outcome is valid only if code review and spec audit artifacts exist (i.e., `quality/code_reviews/` and `quality/spec_audits/` directories contain report files). If these artifacts are missing and the phases are not explicitly skipped, the terminal gate fails — do not mark Phase 5 complete. **BUGS.md is always required.** Every completed run must produce `quality/BUGS.md`, regardless of whether bugs were found. If code review and spec audit confirmed zero source-code bugs, create BUGS.md with a `## Summary` stating "No confirmed source-code bugs found" and listing how many candidates were evaluated and eliminated (e.g., "Code review evaluated N candidates; spec audit evaluated M candidates; all were reclassified as design choices, test-only issues, or false positives"). This provides a positive assertion of a clean outcome rather than ambiguous file absence. A completed run with no BUGS.md is non-conformant. **BUGS.md heading format.** Each confirmed bug must use the heading level `### BUG-NNN` (e.g., `### BUG-001` or `### BUG-H1`). Both numeric IDs (`BUG-001`) and severity-prefixed IDs (`BUG-H1`, `BUG-M3`, `BUG-L6`) are valid. This is the canonical heading format — not `## BUG-001`, not `**BUG-001**`, not a bullet point. The `### BUG-NNN` heading is what downstream tools grep for when counting bugs, and what the tdd-results.json `id` field must match. Inconsistent heading levels cause machine-readable counts to disagree with the document. Re-read `quality/PROGRESS.md`. Count the BUG tracker entries. Then: 1. 
Print the following statement to the user (this is mandatory, not optional): > "BUG tracker has N entries. N have regression tests, N have exemptions, N are unresolved. Code review confirmed M bugs. Spec audit confirmed K code bugs (L net-new). Expected total: M + L." 2. Write the same statement into PROGRESS.md under a new `## Terminal Gate Verification` section (immediately after the BUG tracker table). This persists the gate into the artifact so reviewers can verify it without reading session logs. If the tracker entry count does not equal M + L, stop and reconcile — a BUG was orphaned from the tracker. Do not mark Phase 5 complete until the counts match. This gate exists because the v1.3.5 bootstrap showed that agents reliably skip the tracker update after spec audit, orphaning 30-50% of confirmed bugs. **Regression test function-name verification:** For each BUG tracker entry that references a regression test, grep for the test function name in the regression test file and confirm it exists. An agent can write a test name in the tracker without actually creating the test. If any referenced test function does not exist, write it now before passing the gate. 3. Verify the `With docs` metadata field in PROGRESS.md matches reality: if `docs_gathered/` exists and contains files, it should say `yes`; otherwise `no`. Fix it if wrong. **Artifact file-existence gate (mandatory before marking Phase 5 complete).** Before writing the Phase 5 completion checkbox, verify that every required artifact exists as a file on disk — not just mentioned in PROGRESS.md. Run these checks (use `ls` or equivalent): - `quality/BUGS.md` exists (required for all completed runs, per benchmark 34) - `quality/REQUIREMENTS.md` exists - `quality/QUALITY.md` exists - `quality/PROGRESS.md` exists (obviously — you're writing to it) - `quality/COVERAGE_MATRIX.md` exists - `quality/COMPLETENESS_REPORT.md` exists - `quality/formal_docs_manifest.json` exists (v1.5.2 — written by `bin/reference_docs_ingest.py` in Phase 1; empty `records[]` is valid when no formal docs present) - `quality/requirements_manifest.json` exists (v1.5.2 — authoritative REQ records, rendered to REQUIREMENTS.md) - `quality/use_cases_manifest.json` exists (v1.5.2 — authoritative UC records, rendered to USE_CASES.md / the REQUIREMENTS.md narrative) - `quality/citation_semantic_check.json` exists (v1.5.2 — Phase 4 Layer-2 output; empty `reviews[]` is valid for Spec Gap runs) - If Phase 3 ran: `quality/code_reviews/` contains at least one `.md` file - If Phase 4 ran: `quality/spec_audits/` contains a triage file AND individual auditor files - If Phase 0 or 0b ran: `quality/SEED_CHECKS.md` exists as a standalone file (not inlined in PROGRESS.md) - If confirmed bugs exist: `quality/bugs_manifest.json` exists (v1.5.2 — authoritative BUG records per schemas.md §8) - If confirmed bugs exist: `quality/results/tdd-results.json` exists - If confirmed bugs exist: `quality/results/BUG-NNN.red.log` exists for every confirmed bug ID in `quality/BUGS.md` - If confirmed bugs exist with fix patches: `quality/results/BUG-NNN.green.log` exists for each bug that has a `quality/patches/BUG-NNN-fix.patch` For each missing file, create it now. Do not mark Phase 5 complete with missing artifacts — the terminal gate verification in PROGRESS.md is meaningless if the files it references don't exist on disk. 
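A compact sketch of that existence check, using the always-required paths listed above (illustrative only — `quality_gate.py` performs the authoritative version, and the conditional Phase 0/3/4 artifacts are omitted here for brevity):

```python
from pathlib import Path

ALWAYS_REQUIRED = [
    "quality/BUGS.md", "quality/REQUIREMENTS.md", "quality/QUALITY.md",
    "quality/PROGRESS.md", "quality/COVERAGE_MATRIX.md", "quality/COMPLETENESS_REPORT.md",
    "quality/formal_docs_manifest.json", "quality/requirements_manifest.json",
    "quality/use_cases_manifest.json", "quality/citation_semantic_check.json",
]

def missing_artifacts(confirmed_bug_ids=()):
    """Return required files that do not exist on disk; create each one before closing Phase 5."""
    required = list(ALWAYS_REQUIRED)
    if confirmed_bug_ids:
        required += ["quality/bugs_manifest.json", "quality/results/tdd-results.json"]
        required += [f"quality/results/{bug_id}.red.log" for bug_id in confirmed_bug_ids]
    return [path for path in required if not Path(path).exists()]
```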
This gate exists because v1.3.24 benchmarking showed express completing all phases and writing a terminal gate section in PROGRESS.md, but BUGS.md, SEED_CHECKS.md, and code review/spec audit files were never written to disk. **Sidecar JSON post-write validation (mandatory).** After writing `quality/results/tdd-results.json` and/or `quality/results/integration-results.json`, immediately reopen each file and verify it contains all required keys. For `tdd-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `bugs`, `summary`. Each entry in `bugs` must have: `id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`. The `summary` object must include `confirmed_open` alongside `verified`, `red_failed`, `green_failed`. For `integration-results.json`, the required root keys are: `schema_version`, `skill_version`, `date`, `project`, `recommendation`, `groups`, `summary`, `uc_coverage`. Both files must have `schema_version: "1.1"`. If any key is missing, add it now — do not leave a non-conformant JSON file on disk. This validation exists because v1.3.25 benchmarking showed 6 of 8 repos with non-conformant sidecar JSON: httpx invented an alternate schema, serde used legacy shape, javalin omitted `summary` and per-bug fields, and others used invalid enum values. **Script-verified closure gate (mandatory, final step before marking Phase 5 complete).** Locate `quality_gate.py` using the same fallback as reference files (check `quality_gate.py`, `.claude/skills/quality-playbook/quality_gate.py`, `.github/skills/quality_gate.py` in order) and run it from the project root directory. This script mechanically validates: file existence, BUGS.md heading format, sidecar JSON required keys AND per-bug field names (`id`, `requirement`, `red_phase`, `green_phase`, `verdict`, `fix_patch_present`, `writeup_path`) AND enum values AND summary consistency, use case identifiers, terminal gate section, mechanical verification receipts, version stamps, writeup completeness, **regression-test patch presence for every confirmed bug**, and **inline fix diffs in every writeup** (every `quality/writeups/BUG-NNN.md` must contain a ` ```diff ` block). If the script reports any FAIL results, fix each failing check before proceeding — the most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs (section 6 must include a concrete diff, not just "see patch file"). Do not mark Phase 5 complete until `quality_gate.py` exits 0. Append the script's full output to `quality/results/quality-gate.log`. **v1.5.2 Layer-1 mechanical checks (schemas.md §10 invariants #1–#18).** Beyond the legacy gate checks above, `quality_gate.py` in v1.5.2 also enforces the Layer-1 invariants defined in `schemas.md` §10. A compact map of what each invariant covers: - **#1–#10 — core contract checks.** Citation tier gating, citation document existence, citation hash match, citation excerpt presence + locatability (section/line only; page never sufficient), bug→REQ resolution, forward-link resolution, disposition completeness, functional section presence, no orphan formal docs, INDEX.md field presence. 
- **#11 — citation excerpt byte-equality.** The gate re-runs `bin/citation_verifier.extract_excerpt` per schemas.md §5.4 on every Tier 1/2 citation and rejects any stored `citation_excerpt` that does not byte-equal the freshly-extracted one. This is the Layer-1 anti-hallucination mechanism — it catches fabricated or paraphrased excerpts even when the locator is real. - **#12 — legal `fix_type × disposition` combination** per schemas.md §3.4. - **#13 — manifest wrapper validity** per schemas.md §1.6. - **#14 — REQ tier bound to cited FORMAL_DOC tier** (a Tier 1 REQ cannot cite a Tier 2 FORMAL_DOC). - **#15 — ID uniqueness** within each manifest. - **#16 — redundant citation metadata** (`version`, `date`, `url`, `retrieved`) must match FORMAL_DOC when present. - **#17 — semantic-check majority rule.** ≥2 of 3 `overreaches` verdicts for the same Tier 1/2 REQ fails the gate (see Layer 2 sub-pass in Phase 4). - **#18 — array value uniqueness** in `REQ.use_cases` and `UC.formal_doc_refs`. **`citation_stale` is a gate-report marker, not a field on the citation record.** When the stored `citation.document_sha256` diverges from the live `FORMAL_DOC.document_sha256`, `quality_gate.py` writes a `citation_stale` entry into `quality_gate_report.json` (or equivalent). Do NOT write `citation_stale` onto the citation record itself — the record stays pure input, and the stale marker is gate-report output per schemas.md §5.1 / §10 invariant #3. **Do not implement the gate in this prose.** The Layer-1 check list above is a summary of what `quality_gate.py` enforces — the authoritative definitions live in schemas.md. Implementation of the gate (Phase 5 of v1.5.2 implementation) lives in `quality_gate.py`; SKILL.md describes the protocol but does not re-state the invariants. **Use case identifier format.** REQUIREMENTS.md must use canonical use case identifiers in the format `UC-01`, `UC-02`, etc. for all derived use cases. Each use case must be labeled with its identifier. This is required for machine-readable traceability — the identifier format enables `quality_gate.py` and downstream tooling to count and cross-reference use cases programmatically. Use cases written as prose paragraphs without identifiers are non-conformant. Update PROGRESS.md: mark Phase 5 complete. The BUG tracker should now show closure status for every entry. **End-of-phase message (mandatory — print this after Phase 5 completes, then STOP):** ``` # Phase 5 Complete — Reconciliation and TDD Verification All confirmed bugs now have regression tests, writeups, and TDD red-green verification. [Summarize: N total confirmed bugs, N with TDD verified status, N with fix patches. List all bug IDs with one-line summaries and their TDD verdicts.] To continue to Phase 6 (Final verification and quality gate), say: Run quality playbook phase 6. Or say "keep going" to continue automatically. ``` **After printing this message, STOP. Do not proceed to Phase 6 unless the user explicitly asks.** --- ## Phase 6: Verify > **Required references for this phase:** > - `references/verification.md` — 45 self-check benchmarks **Why a verification phase?** AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. 
The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else. **Phase 6 execution model: incremental, not monolithic.** Phase 6 runs as a series of independent verification steps, each reading only the file(s) it needs, checking one thing, and writing its result to `quality/results/phase6-verification.log` before moving to the next step. Do NOT load all artifacts into context at once. Do NOT try to hold the full verification checklist in memory while reading artifacts. Each step below is self-contained — read the file, check the condition, append the result, drop the context. ### Step 6.1: Mechanical Verification Closure (mandatory first step) If `quality/mechanical/` exists, the **literal first action** of Phase 6 is: ```bash bash quality/mechanical/verify.sh > quality/results/mechanical-verify.log 2>&1 echo $? > quality/results/mechanical-verify.exit ``` Execute this command in the shell. Do not substitute a Python script, do not read the artifact file and assert on its contents, do not skip this step. The command must be `bash quality/mechanical/verify.sh` — not `python3 -c "..."`, not `cat quality/mechanical/... | grep ...`, not any other equivalent. Record the exit code. If non-zero, **Phase 6 fails immediately.** Do not proceed to further steps. Go back to the extraction step: delete the mismatched `*_cases.txt`, re-run the extraction command with a fresh shell redirect, re-verify, and update all downstream artifacts that cited the old extraction. Record in PROGRESS.md under `## Phase 6 Mechanical Closure` and append to `quality/results/phase6-verification.log`: ``` [Step 6.1] Mechanical verification: PASS (exit 0) ``` **Why this is non-substitutable:** In v1.3.23, the model replaced `bash verify.sh` with `python3 -c "from pathlib import Path; ..."` that read the (forged) artifact file and asserted on its contents — a circular check that passed despite the artifact being fabricated. The only trustworthy verification is re-running the same shell pipeline that produced the artifact and diffing the results. Any other method can be fooled by a corrupted intermediate file. ### Step 6.2: Run quality_gate.py (script-verified checks) Run the mechanical validation gate: ```bash python3 quality_gate.py . > quality/results/quality-gate.log 2>&1 # locate via fallback: quality_gate.py, .claude/skills/quality-playbook/quality_gate.py, .github/skills/quality_gate.py echo $? >> quality/results/phase6-verification.log ``` Read `quality/results/quality-gate.log`. If it reports any FAIL results, fix each failing check before proceeding. The most common FAILs are: (1) missing `quality/patches/BUG-NNN-regression-test.patch` files, (2) non-canonical JSON field names like `bug_id` instead of `id`, (3) missing `confirmed_open` in the TDD summary, (4) writeups without inline fix diffs, (5) missing TDD red/green log files. Do not proceed until `quality_gate.py` exits 0. 
Append to `quality/results/phase6-verification.log`: ``` [Step 6.2] quality_gate.py: PASS (exit 0) — N checks passed, 0 FAIL, 0 WARN ``` This step covers verification benchmarks: 14 (sidecar JSON), 17 (test file extension), 18 (use case count), 20 (writeups), 23 (mechanical artifacts), 26 (version stamps), 27 (mechanical directory), 29 (triage-to-BUGS sync), 34 (BUGS.md exists), 38 (individual auditor reports), 39 (BUGS.md heading format), 40 (artifact file existence), 41 (sidecar JSON validation), 42 (script-verified closure), 43 (use case identifiers), 44 (regression-test patches), 45 (writeup inline diffs). **v1.5.2 Layer-1 invariants also run here.** `quality_gate.py` additionally enforces schemas.md §10 invariants #1–#18 (summarized in Phase 5 above). In particular, the script re-runs `bin/citation_verifier.extract_excerpt` per schemas.md §5.4 on every Tier 1/2 citation and rejects any stored `citation_excerpt` that does not byte-equal the freshly-extracted output — this is the post-ingest tampering catch. If any Layer-1 invariant fails here, fix the underlying manifest record (not the gate, not the excerpt) and re-run. ### Step 6.3: Test execution verification Run the functional test suite. Read only `quality/test_functional.*` to determine the test command: - **Python:** `pytest quality/test_functional.py -v 2>&1 | tail -20` - **Java:** `mvn test -Dtest=FunctionalTest` or `gradle test --tests FunctionalTest` - **Go:** `go test -v` targeting the generated test file's package - **TypeScript:** `npx jest functional.test.ts --verbose` - **Rust:** `cargo test` - **Scala:** `sbt "testOnly *FunctionalSpec"` Check for both failures AND errors. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. Expected-failure (xfail) regression tests do not count against this check. Append to `quality/results/phase6-verification.log`: ``` [Step 6.3] Functional tests: PASS — N tests, 0 failures, 0 errors ``` This covers benchmarks 8 (all tests pass) and 9 (existing tests unbroken). ### Step 6.4: Verification checklist — file-by-file checks Process the remaining verification benchmarks from `references/verification.md` in small batches. For each batch, read only the file(s) needed, check the condition, and append the result. **Do not read more than 2 files per batch.** **Batch A — QUALITY.md (benchmarks 1-2, 10):** Read `quality/QUALITY.md`. Count scenarios. Verify each scenario references real code (grep for cited function names). Append results. **Batch B — Functional test file (benchmarks 3-7):** Read `quality/test_functional.*`. Check cross-variant coverage (~30%), boundary test count, assertion depth (value checks vs presence checks), layer correctness (outcomes vs mechanisms), mutation validity. **Batch C — Protocol files (benchmarks 11-13):** Read `quality/RUN_CODE_REVIEW.md`, then `quality/RUN_INTEGRATION_TESTS.md`, then `quality/RUN_SPEC_AUDIT.md` — one at a time. Check each is self-contained and executable. Verify Field Reference Table in integration tests. **Batch D — Regression tests (benchmarks 15-16, 24):** Read `quality/test_regression.*` if it exists. Verify skip guards reference bug IDs, verify patch validation gate commands, verify source-inspection tests don't use `run=False`. **Batch E — Enumeration and triage checks (benchmarks 19, 21-22, 25, 36):** Read `quality/code_reviews/*.md` (just the enumeration sections). Read `quality/spec_audits/*triage*` (just the verification probe sections). 
Check two-list comparisons, executable probe evidence, no circular mechanical artifact references, contradiction gate. **Batch F — Continuation mode (benchmarks 32-33):** Only if `quality/SEED_CHECKS.md` exists. Read it, verify mechanical execution, verify convergence section in PROGRESS.md. Append each batch result to `quality/results/phase6-verification.log`: ``` [Step 6.4A] QUALITY.md scenarios: PASS — 8 scenarios, all reference real code [Step 6.4B] Functional test quality: PASS — 30% cross-variant, assertion depth OK [Step 6.4C] Protocol files: PASS — all self-contained and executable [Step 6.4D] Regression tests: PASS — all skip guards present [Step 6.4E] Enumeration/triage: PASS — two-list checks present, probes have assertions [Step 6.4F] Continuation mode: SKIP — no SEED_CHECKS.md ``` If any batch fails, fix the issue immediately before proceeding to the next batch. ### Step 6.5: Metadata Consistency Check Read `quality/PROGRESS.md` (just the metadata and artifact inventory sections). Then spot-check: - The requirement count is consistent across REQUIREMENTS.md header, PROGRESS.md artifact inventory, and COVERAGE_MATRIX.md header. All three must state the same number. - The `With docs` field accurately reflects whether `docs_gathered/` exists - The Terminal Gate Verification section is present and filled in Then read `quality/COMPLETENESS_REPORT.md` (just the verdict section). Verify no stale pre-reconciliation text remains — if both a `## Verdict` and an `## Updated verdict` (or `## Post-Review Reconciliation`) section exist, **delete the original `## Verdict` section entirely**. The final document must have exactly one `## Verdict` heading. Append to `quality/results/phase6-verification.log`: ``` [Step 6.5] Metadata consistency: PASS — requirement counts match, version stamps consistent ``` If any metadata is stale, fix it now. ### Checkpoint: Finalize PROGRESS.md Re-read `quality/PROGRESS.md`. Update: - Mark Phase 6 (Verification benchmarks) complete with timestamp - Verify the BUG tracker has closure for every entry - Add a final summary line: "Run complete. N BUGs found (N from code review, N from spec audit). N regression tests written. N exemptions granted." - **Print the suggested next prompt to the user (mandatory, all runs).** This applies to EVERY run, including baseline — it is not iteration-specific. Print the following block so the user can copy-paste it to start the next iteration: For a baseline run (no iteration strategy): ``` ──────────────────────────────────────────────────────── Next iteration suggestion: "Run the next iteration of the quality playbook using the gap strategy." ──────────────────────────────────────────────────────── ``` For iteration runs, use this mapping to determine the next strategy: - **gap** → suggest unfiltered - **unfiltered** → suggest parity - **parity** → suggest adversarial - **adversarial** → suggest "Run the quality playbook from scratch." (cycle complete) The completed PROGRESS.md is a permanent audit trail. It documents what the skill did, what it found, and how it resolved each finding. Users can read it to understand the run, debug failures, and compare across runs. ### Convergence Check (continuation mode only) > **Scope:** This subsection only. The suggested-next-prompt step above is unconditional and must execute on every run regardless of whether this convergence check is skipped. **This step runs only if Phase 0 executed** (i.e., `quality/SEED_CHECKS.md` exists from prior-run analysis). 
If this is a first run with no prior history, skip to Phase 7. Compare this run's bug list against the seed list:

1. **Count net-new bugs:** bugs in this run's BUGS.md that do NOT match any seed (by file:line). A bug is "net-new" if it was not found in any prior run.
2. **Count seed carryovers:** seeds that were re-confirmed in this run (FAIL result in Step 0b).
3. **Count seed resolutions:** seeds that are now passing (bug was fixed since prior run).

Write a `## Convergence` section to PROGRESS.md:

```markdown
## Convergence

Run number: N (N prior runs in quality/runs/)
Seeds from prior runs: S (S confirmed, R resolved)
Net-new bugs this run: K
Convergence: [CONVERGED | NOT CONVERGED]

Net-new bugs:
- BUG-NNN: [summary] (file:line) — not in any prior run
```

**Convergence criterion:** The run is converged if **net-new bugs = 0** — every bug found in this run was already known from a prior run. This means further runs are unlikely to find additional bugs in the declared scope.

**If CONVERGED:** Print to the user: "This run found no new bugs beyond the N already known from prior runs. Bug discovery has converged for this scope. Total confirmed bugs across all runs: T." Then proceed to Phase 7.

**If NOT converged — automatic re-iteration.** When the convergence check shows net-new bugs > 0 and the iteration count has not reached the maximum (default: 5), the skill re-iterates automatically:

1. Record the iteration number and net-new count in PROGRESS.md.
2. Archive the current `quality/` directory via `bin/run_playbook.archive_previous_run(repo_dir, timestamp)` (or `bin.archive_lib.archive_run()` at Phase 6 success). These snapshot `quality/` into `quality/runs/<timestamp>/quality/` and write the per-run `INDEX.md` plus a `RUN_INDEX.md` row.
3. Restart from **Phase 0** (which will now find the newly archived run in `quality/runs/`).
4. Print to the user: "Iteration N found K net-new bugs. Archiving and starting iteration N+1 (max M)."

The iteration counter starts at 1 for the first run. Each archive-and-restart increments it. When the counter reaches the maximum, stop iterating even if not converged and print: "Reached maximum iterations (M) without convergence. K net-new bugs found in the last run. Total confirmed bugs across all runs: T."

**Iteration limits.** The default maximum is 5 iterations. If the user's prompt includes an explicit limit (e.g., "run the playbook with 3 iterations"), use that limit instead. If the user's prompt says "single run" or "no iteration," skip re-iteration entirely and treat NOT CONVERGED the same as the pre-iteration behavior: print the net-new count and suggest re-running.

**Context window awareness.** If at any point during re-iteration you detect that your context window is substantially consumed (e.g., you are producing noticeably shorter or lower-quality output than earlier iterations), stop iterating, write the current state to PROGRESS.md, and print: "Stopping iteration due to context constraints. Completed N of M iterations. Re-run the playbook to continue — Phase 0 will pick up the seed list from quality/runs/." This is a safety valve, not a target — most codebases converge in 2-3 iterations.

**Why this matters:** A single playbook run explores a subset of the codebase non-deterministically. The first run on virtio might find BUG-001 and BUG-004 but miss BUG-005. The second run might find BUG-005 and BUG-006. By the third run, if no net-new bugs appear, the exploration has likely covered the high-value territory.
The seed list ensures previously found bugs are never lost between runs, and the convergence check tells the user when additional runs have diminishing returns. Automatic re-iteration means the skill is self-contained — callers don't need external scripts or manual re-runs to achieve convergence. **End-of-phase message (mandatory — print this after Phase 6 completes, then STOP):** ``` # Phase 6 Complete — All Phases Done The quality playbook baseline run is complete. Here's the summary: [Include: total confirmed bugs, quality gate pass/fail/warn counts, list of all bug IDs with one-line summaries and severities.] Key output files: - quality/BUGS.md — all confirmed bugs with spec basis and patches - quality/results/tdd-results.json — structured TDD verification results - quality/patches/ — regression test and fix patches for every bug You can now run iteration strategies to find additional bugs. Iterations typically add 40-60% more confirmed bugs on top of the baseline. The recommended cycle is: gap → unfiltered → parity → adversarial. To run all four iterations automatically, say: Run all iterations. I'll orchestrate each strategy as a separate sub-agent with its own context window. To run one iteration at a time, say: Run the next iteration of the quality playbook. Or ask me about the results: "Tell me about BUG-001" or "Which bugs are highest priority?" After you fix the bugs, say "recheck" to verify the fixes were applied correctly. ``` **After printing this message, STOP. Do not proceed to iterations unless the user explicitly asks.** **End-of-iteration message (mandatory — print this after each iteration completes, then STOP):** ``` # Iteration Complete — [Strategy Name] [Summarize: N net-new bugs found in this iteration, total now at N. List new bug IDs with one-line summaries.] [If there are remaining strategies in the recommended cycle, suggest the next one:] The next recommended strategy is [next strategy]. To run it, say: Run the next iteration using the [next strategy] strategy. [If all four strategies have been run:] All four iteration strategies have been run. Total confirmed bugs: N. You can review the results, ask about specific bugs, or re-run any strategy. After you fix the bugs, say "recheck" to verify the fixes were applied correctly. Or say "keep going" to run the next iteration automatically. ``` **After printing this message, STOP. Do not proceed to the next iteration unless the user explicitly asks.** --- ## Recheck Mode — Verify Bug Fixes Recheck mode is a lightweight verification pass that checks whether bugs from a previous run have been fixed. Instead of re-running the full six-phase pipeline (60-90 minutes), recheck reads the existing `quality/BUGS.md`, checks each bug against the current source tree, and reports which bugs are fixed vs. still open. A typical recheck takes 2-10 minutes. **When to use recheck mode:** After the user (or another agent) has applied fixes for bugs found by the playbook. The user says "recheck" or "verify the bug fixes" or "check which bugs are fixed." **Do not use recheck mode** as a substitute for running the full playbook. Recheck only verifies previously found bugs — it does not find new ones. ### Recheck procedure **Step 1: Read the bug inventory.** Read `quality/BUGS.md` and parse every `### BUG-NNN` entry. 
For each bug, extract:

- Bug ID (e.g., BUG-001)
- File path and line number from the `**File:**` field
- Description summary (first sentence of `**Description:**`)
- Severity
- Fix patch path from the `**Fix patch:**` field (e.g., `quality/patches/BUG-001-fix.patch`)
- Regression test path from the `**Regression test:**` field

**Step 2: Check each bug against the current source.** For each bug, perform these checks in order:

1. **Fix patch check.** If a fix patch exists at the referenced path, run `git apply --check --reverse quality/patches/BUG-NNN-fix.patch` against the current tree. If the reverse-apply succeeds (exit 0), the fix patch is already applied — the bug is likely fixed. If it fails, the fix has not been applied or the code has changed.
2. **Source inspection.** Open the file at the cited line number. Read the surrounding context (±20 lines). Compare what you see against the bug description. Has the problematic code been changed? Does the fix address the root cause described in the bug report?
3. **Regression test execution.** If a regression test patch exists:
   - Apply it: `git apply quality/patches/BUG-NNN-regression-test.patch`
   - Run the test (using the project's test runner). If the test PASSES, the bug is fixed. If it FAILS, the bug is still present.
   - Reverse the patch: `git apply -R quality/patches/BUG-NNN-regression-test.patch`

   If the regression test patch doesn't apply cleanly (because the source has changed), note this and fall back to source inspection alone.
4. **Verdict.** Assign one of these statuses:
   - **FIXED** — Fix patch is applied AND regression test passes (or source inspection confirms the fix if the test can't run)
   - **PARTIALLY_FIXED** — The problematic code has changed but the regression test still fails, or the fix addresses some but not all aspects of the bug
   - **STILL_OPEN** — The original problematic code is unchanged, or the regression test still fails
   - **INCONCLUSIVE** — Can't determine status (file moved, code heavily refactored, patches don't apply)

**Step 3: Generate recheck results.** Write `quality/results/recheck-results.json` with this schema:

Note: The recheck schema uses `"schema_version": "1.0"` (not `"1.1"`) because it has a different structure from the TDD sidecar — the `source_run` and per-bug `status`/`evidence` fields are unique to the recheck schema. The quality gate validates this value as `"1.0"`.

```json
{
  "schema_version": "1.0",
  "skill_version": "1.5.2",
  "date": "YYYY-MM-DD",
  "project": "<project name>",
  "source_run": {
    "bugs_md_date": "<date of the run that produced BUGS.md>",
    "total_bugs": <integer>
  },
  "results": [
    {
      "id": "BUG-001",
      "severity": "HIGH",
      "summary": "<one-line bug summary>",
      "status": "FIXED",
      "evidence": "<what was checked and what was found>"
    }
  ],
  "summary": {
    "total": <integer>,
    "fixed": <integer>,
    "partially_fixed": <integer>,
    "still_open": <integer>,
    "inconclusive": <integer>
  }
}
```

Also write a human-readable summary to `quality/results/recheck-summary.md`:

```markdown
# Recheck Results

> Recheck of quality/BUGS.md from <date of original run>
> Recheck run: <today's date>
> Skill version: 1.5.2

## Summary

| Status | Count |
|--------|-------|
| Fixed | N |
| Partially fixed | N |
| Still open | N |
| Inconclusive | N |
| **Total** | **N** |

## Per-Bug Results

| Bug | Severity | Status | Evidence |
|-----|----------|--------|----------|
| BUG-001 | HIGH | FIXED | Reverse-apply succeeded, regression test passes |
| BUG-002 | MEDIUM | STILL_OPEN | Original code unchanged at quality_gate.py:125 |
| ... | ... | ... | ... |

## Still Open — Details

[For each STILL_OPEN or PARTIALLY_FIXED bug, include a brief explanation of what remains to be fixed.]
```
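The mechanical parts of Steps 1 and 2 can be sketched directly; source inspection (check 2) and the final verdict remain judgment calls. A minimal sketch, assuming the `### BUG-NNN` heading and patch-path conventions shown above; the helper names are illustrative, and the test-runner command is a placeholder for whatever the project actually uses:

```python
# Illustrative sketch of the mechanical recheck checks; not one of the skill's bin/ modules.
import re
import subprocess
from pathlib import Path

BUG_HEADER = re.compile(r"^### (BUG-\d{3})", re.MULTILINE)


def list_bug_ids(bugs_md: Path) -> list[str]:
    """Step 1: collect every BUG-NNN id from quality/BUGS.md."""
    return BUG_HEADER.findall(bugs_md.read_text())


def fix_patch_applied(bug_id: str) -> bool:
    """Step 2.1: a clean reverse-apply means the fix patch is already in the tree."""
    patch = f"quality/patches/{bug_id}-fix.patch"
    return subprocess.run(["git", "apply", "--check", "--reverse", patch],
                          capture_output=True).returncode == 0


def regression_test_passes(bug_id: str, test_cmd: list[str]) -> bool | None:
    """Step 2.3: apply the regression-test patch, run the tests, then reverse the patch.
    Returns None when the patch no longer applies (fall back to source inspection)."""
    patch = f"quality/patches/{bug_id}-regression-test.patch"
    if subprocess.run(["git", "apply", patch], capture_output=True).returncode != 0:
        return None
    try:
        return subprocess.run(test_cmd, capture_output=True).returncode == 0
    finally:
        subprocess.run(["git", "apply", "-R", patch], capture_output=True)
```

In the real procedure the patch paths come from each entry's `**Fix patch:**` and `**Regression test:**` fields rather than from a naming convention; the convention is used here only to keep the sketch short.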
**Step 4: Print the recheck summary.** Print the summary table to the user, then STOP. Example:

```
# Recheck Complete

Checked 19 bugs from quality/BUGS.md against current source.

| Status | Count |
|--------|-------|
| Fixed | 17 |
| Still open | 2 |
| **Total** | **19** |

Fixed bugs: BUG-001, BUG-002, BUG-003, BUG-004, BUG-005, BUG-006, BUG-007, BUG-008, BUG-009, BUG-010, BUG-011, BUG-013, BUG-014, BUG-015, BUG-016, BUG-017, BUG-018

Still open: BUG-012 (stale .orig file still present), BUG-019 (benchmark 40 artifact list not updated)

Results saved to:
- quality/results/recheck-results.json (machine-readable)
- quality/results/recheck-summary.md (human-readable)
```

### Triggering recheck mode

Recheck mode activates when the user says any of: "recheck", "verify the bug fixes", "check which bugs are fixed", "recheck the bugs", "run recheck mode", or similar phrasing that clearly indicates they want to verify fixes rather than find new bugs. When triggered, skip Phases 1-7 entirely and execute only the recheck procedure above.

---

## Phase 7: Present, Explore, Improve (Interactive)

After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths.

**Do not skip this phase.** The autonomous output from Phases 1-6 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping nine files without explanation creates artifacts nobody reads.

### Part 1: The Summary Table

Present a single table the user can scan in 10 seconds:

```
Here's what I generated:

| File | What It Does | Key Metric | Confidence |
|------|-------------|------------|------------|
| REQUIREMENTS.md | Testable requirements with use cases | N requirements, N use cases | ██████░░ Medium — solid baseline from 5-phase pipeline, improves with refinement passes |
| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ Medium — grounded in code, but scenarios are inferred, not from real incidents |
| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant |
| RUN_CODE_REVIEW.md | Three-pass code review | 3 passes | ████████ High — structural + requirement verification + consistency |
| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning |
| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included |
| AGENTS.md | AI session bootstrap | Updated | ████████ High — factual |
| RUN_TDD_TESTS.md | TDD verification protocol | N bugs to verify | ████████ High — mechanical red-green cycle with spec traceability |
```

Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention.

**Confidence levels:**

- **High** — Derived directly from code, specs, or schemas. Unlikely to need revision.
- **Medium** — Reasonable inference, but could be wrong. Benefits from user input.
- **Low** — Best guess. Definitely needs user input to be useful.
After the table, add a "Quick Start" block with ready-to-copy prompts for executing each artifact: ``` To use these artifacts, start a new AI session and try one of these prompts: • Run a code review: "Read quality/RUN_CODE_REVIEW.md and follow its instructions to review [module or file]." • Run the functional tests: "[test runner command, e.g. pytest quality/ -v, mvn test -Dtest=FunctionalTest, etc.]" • Run the integration tests: "Read quality/RUN_INTEGRATION_TESTS.md and follow its instructions." • Start a spec audit (Council of Three): "Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]." • Run TDD verification for confirmed bugs: "Read quality/RUN_TDD_TESTS.md and follow its instructions to verify all confirmed bugs." ``` Adapt the test runner command and module names to the actual project. The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type. After the Quick Start block, add one line: > "You can ask me about any of these to see the details — for example, 'show me Scenario 3' or 'walk me through the integration test matrix.'" ### Part 2: Drill-Down on Demand When the user asks about a specific item, give a focused summary — not the whole file, but the key decisions and what you're uncertain about. Examples: - **"Tell me about Scenario 4"** → Show the scenario text, explain where it came from (which defensive pattern or domain knowledge), and flag what you inferred vs. what you know. - **"Show me the integration test matrix"** → Show the run groups, explain the parallelism strategy, and note which quality gates you derived from schemas vs. guessed at. - **"How do the functional tests work?"** → Show the three test groups, explain the mapping to specs and scenarios, and highlight any tests you're least confident about. The user may go through several drill-downs before they're ready to improve anything. That's fine — let them explore at their own pace. ### Part 3: The Improvement Menu After the user has seen the summary (and optionally drilled into details), present the improvement options: > "Five ways to make this better:" > > **1. Review requirements interactively** — Read `quality/REVIEW_REQUIREMENTS.md` for a guided walkthrough of the requirements organized by use case. You can pick specific use cases to drill into, or walk through all of them sequentially. A different model can also fact-check the completeness report (cross-model audit). Good for: finding gaps the pipeline missed. > > **2. Refine requirements with a different model** — Read `quality/REFINE_REQUIREMENTS.md` and run a refinement pass. You can run this with any AI model — Claude, GPT, Gemini — and each will catch different gaps. Run as many models as you want until you hit diminishing returns. Each pass backs up the current version and logs changes in `quality/VERSION_HISTORY.md`. Good for: pushing requirements from the baseline toward completeness. > > **3. Review and harden other items** — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases. > > **4. Guided Q&A** — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative. > > **5. 
Feed in additional documentation** — The requirements pipeline works better with more intent sources. Point me to any of these and I'll use them to refine the requirements and quality constitution: > - Exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) > - Slack or Teams channels where the project was discussed > - Email threads, Jira/Linear tickets, or GitHub issues about the project > - Design documents, architecture decision records, or meeting notes > - Newsgroup posts, forum discussions, or mailing list archives > > You can use tools like Claude Cowork, GitHub Copilot, or OpenClaw to connect to these sources and gather them into a folder, then point me at the folder. Good for: grounding scenarios and requirements in real project history instead of inference. > > "You can do any combination of these, in any order. Which would you like to start with?" ### Executing Each Improvement Path **Path 1: Review requirements interactively.** Point the user to `quality/REVIEW_REQUIREMENTS.md` and offer to walk through it together. The protocol supports self-guided (pick use cases), fully guided (sequential walkthrough), and cross-model audit (different model fact-checks the completeness report). Progress is tracked in `quality/REFINEMENT_HINTS.md` so the user can pick up where they left off. **Path 2: Refine requirements with a different model.** Point the user to `quality/REFINE_REQUIREMENTS.md`. Each refinement pass: backs up the current version to `quality/history/vX.Y/`, reads feedback from REFINEMENT_HINTS.md, makes targeted improvements, bumps the minor version, and logs changes in VERSION_HISTORY.md. The user can run this with Claude, GPT, Gemini, or any other model — each catches different blind spots. Run until diminishing returns. **Path 3: Review and harden other items.** The user picks a scenario, test, or protocol section. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change. **Path 4: Guided Q&A.** Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps: - **Incident history for scenarios.** "I found [specific defensive code]. What failure caused this? How many records were affected?" - **Quality gate thresholds.** "I'm checking that [field] contains [values]. What distribution is normal? What signals a problem?" - **Integration test scale and cost.** "The protocol runs [N] tests costing roughly $[X]. Should I increase or decrease coverage?" - **Test scope.** "I generated [N] functional tests. Your existing suite covers [other areas]. Are there gaps?" - **Model preferences for spec audit.** "Which AI models do you use? Have you noticed specific strengths?" After the user answers, revise the generated files and re-run tests. **Path 5: Feed in additional documentation.** The user points you to additional intent sources — chat history, Slack exports, email threads, Jira tickets, design docs, meeting notes, forum archives. These contain design decisions, incident history, and quality discussions that didn't make it into formal documentation. 1. Scan for index files and navigate to quality-relevant content (same approach as Step 0, but now with specific targets — you know which requirements need grounding, which scenarios need thresholds, which gaps need closing). 2. 
Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results, and behavioral contracts that weren't visible from the code alone. 3. Feed findings into `quality/REFINEMENT_HINTS.md` as new feedback items, then run a refinement pass to update the requirements. 4. Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Re-run tests after revisions. If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations, connect additional sources, or ask you to dig deeper into a particular topic. ### Iteration The user can cycle through these paths as many times as they want. Each pass makes the quality playbook more grounded. When they're satisfied, they'll move on naturally — there's no explicit "done" step. --- ## Fixture Strategy The `quality/` folder is separate from the project's unit test folder. Create the appropriate test setup for the project's language: - **Python:** `quality/conftest.py` for pytest fixtures. If fixtures are defined inline (common with pytest's `tmp_path` pattern), prefer that over shared fixtures. - **Java:** A test class with `@BeforeEach`/`@BeforeAll` setup methods, or a shared test utility class. - **Scala:** A trait mixed into test specs (e.g., `trait FunctionalTestFixtures`), or inline data builders. - **TypeScript/JavaScript:** A `quality/setup.ts` with `beforeAll`/`beforeEach` hooks, or inline test factories. - **Go:** Helper functions in the same `_test.go` file or a shared `testutil_test.go`. Use `t.Helper()` for test helpers. Go convention prefers inline test setup over shared fixtures. - **Rust:** Helper functions in a `#[cfg(test)] mod tests` block, or a shared `test_utils.rs` module. Use builder patterns for test data. Examine existing test files to understand how they set up test data. Whatever pattern the existing tests use, copy it. Study existing fixture patterns for realistic data shapes. --- ## Terminology - **Functional testing** — Does the code produce the output specs say it should? Distinct from unit testing (individual functions in isolation). - **Integration testing** — Do components work together end-to-end, including real external services? - **Spec audit** — AI models read code and compare against specs. No code executed. Catches where code doesn't match documentation. - **Coverage theater** — Tests that produce high coverage numbers but don't catch real bugs. Example: asserting a function didn't throw without checking its output. - **Fitness-to-purpose** — Does the code do what it's supposed to do under real-world conditions? A system can have 95% coverage and still lose records silently. --- ## Principles 1. Fitness-to-purpose over coverage percentages 2. Scenarios come from code exploration AND domain knowledge 3. Concrete failure modes make standards non-negotiable — abstract requirements invite rationalization 4. Guardrails transform AI review quality (line numbers, read bodies, grep before claiming) 5. Triage before fixing — many "defects" are spec bugs or design decisions 6. Structural review has a ceiling (~65%). The remaining ~35% are intent violations — absence bugs, cross-file contradictions, design gaps — invisible to any tool that only reads code. Requirements make the invisible visible. 7. The specification is the unique contribution, not the review structure. 
Focus areas and review protocols are secondary to having the right testable requirements derived from intent sources. 8. Cross-requirement consistency checking is essential. Bugs often live in the gap between two individually-correct pieces of code. Per-requirement verification alone can't find these. 9. Keep all derived requirements — do not filter. The cost of checking an extra requirement is low; the cost of missing a bug because you pruned the requirement that would have caught it is high. 10. A failing test is the strongest evidence a bug exists. Run the red-green TDD cycle (test fails on buggy code, passes on fixed code) for every confirmed bug with a fix patch. Show the FAIL→PASS output — reviewers can disagree with your fix but can't argue with a reproducing test. --- ## Reference Files Read these as you work through each phase: | File | When to Read | Contains | |------|-------------|----------| | `references/exploration_patterns.md` | Phase 1 (explore) | Pattern applicability matrix, deep-dive templates, domain-knowledge questions | | `references/defensive_patterns.md` | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios | | `references/schema_mapping.md` | Step 5b (schema types) | Field mapping format, mutation validity rules | | `references/requirements_pipeline.md` | Phase 2 (requirements) | Five-phase pipeline, versioning protocol, carry-forward rules | | `references/constitution.md` | File 1 (QUALITY.md) | Full template with section-by-section guidance | | `references/functional_tests.md` | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy | | `references/review_protocols.md` | Files 3–4 (code review, integration) | Templates for both protocols, patch validation, skip guards | | `references/spec_audit.md` | File 5 (Council of Three) | Full audit protocol, triage process, fix execution | | `references/iteration.md` | Iterations (after Phase 6) | Four iteration strategies: gap, unfiltered, parity, adversarial | | `references/verification.md` | Phase 6 (verify) | Complete self-check checklist (45 benchmarks) including structured output, patch gate, skip guard validation, pre-flight discovery, version stamps, bug writeups, enumeration completeness, triage executable evidence, code-extracted enumeration lists, mechanical verification artifacts, source-inspection test execution, contradiction gate, seed check execution, convergence tracking, sidecar JSON schema validation, script-verified closure gate, canonical use case identifiers, and writeup inline fix diffs |
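As a concrete illustration of the Python option in the Fixture Strategy section above, here is a minimal `quality/conftest.py` sketch. The fixture names and data shapes are placeholders, not part of the skill; replace them with whatever setup patterns the project's existing tests already use:

```python
# quality/conftest.py: minimal sketch of the pytest fixture option from the
# Fixture Strategy section. Fixture names and data shapes are placeholders;
# copy the setup patterns the project's existing tests already use.
import json

import pytest


@pytest.fixture
def sample_record():
    """A realistic-looking input record; mirror the shapes real project data uses."""
    return {"id": "REC-001", "status": "active", "payload": {"count": 3}}


@pytest.fixture
def records_file(tmp_path, sample_record):
    """Write a small fixture file using pytest's built-in tmp_path pattern."""
    path = tmp_path / "records.json"
    path.write_text(json.dumps([sample_record]))
    return path
```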