---
name: ai-debug
description: "Use when something is broken and you need to find out why: test failures, runtime errors, crashes, regressions, or unexpected behavior. Trigger for 'it's not working', 'something broke', 'this used to work', 'I'm getting an error', 'CI is failing', 'the output is wrong', or 'why is X happening'. Systematic 4-phase diagnosis -- never patches symptoms without finding the root cause."
effort: high
argument-hint: "[error description or file:line]"
---

# Debug

## Purpose

Systematic debugging skill. Four phases, always in order. NEVER fix symptoms -- always find and fix the root cause. After 2 failed fix attempts, escalate to the user.

## When to Use

- Test failures (expected vs actual mismatch)
- Runtime errors (exceptions, crashes, hangs)
- Regressions (worked before, broken now)
- Unexpected behavior (no error, but wrong result)

## Process

Step 0 (load contexts): per `.ai-engineering/contexts/stack-context.md`.

### Phase 1: Symptom Analysis (WHAT, WHEN, WHERE)

Gather facts before forming hypotheses:

1. **WHAT**: exact error message, stack trace, log output
2. **WHEN**: always? intermittent? after a specific change? under load?
3. **WHERE**: which file, function, line? which test? which environment?
4. **SINCE WHEN**: `git log --oneline -20` -- what changed recently?

Output: symptom report with all facts classified as KNOWN or SUSPECTED.

### Phase 2: Reproduction (MINIMAL REPRO)

Make the bug reproducible with the smallest possible case:

1. Run the failing test or reproduce the error
2. If not reproducible: document exact conditions and STOP (cannot debug what cannot be reproduced)
3. Strip to minimal repro: remove unrelated code, simplify inputs, isolate the component
4. Confirm: the minimal repro fails consistently

Output: exact command to reproduce the failure.

### Phase 3: Root Cause (WHY)

Apply the 5 Whys to move from symptom to cause:

1. **Why** does it fail? -> [immediate cause]
2. **Why** does that happen? -> [deeper cause]
3. **Why** does that happen? -> [root cause]

(Continue until you reach a cause you can fix directly)

**Techniques** (use as appropriate):

- **Binary search**: comment out code, add assertions to narrow the location
- **Git bisect**: `git bisect start HEAD ` to find the breaking commit
- **Print tracing**: add targeted print/log statements at decision points
- **Diff analysis**: `git diff ..HEAD -- ` to see what changed
- **Assumption check**: list every assumption the code makes, verify each one

**Classification**: identify the root cause category:

- Logic error (wrong condition, off-by-one, missing case)
- State corruption (mutation, shared state, race condition)
- Contract violation (caller sends wrong type, missing field)
- Environment (missing dependency, wrong version, config)
- Data (unexpected input, encoding, edge case)

Output: root cause statement (1-2 sentences, specific and testable).

### Phase 4: Solution Design (FIX + REGRESSION TEST)

1. **Design the fix**: minimal change that addresses the root cause; one logical change only. If the fix is large, the root cause analysis may be wrong -- revisit Phase 3.
2. **Write regression test**: a test that fails without the fix and passes with it
3. **Apply the fix**
4. **Verify**: regression test passes AND all existing tests pass
5. **Check for siblings**: does the same bug pattern exist elsewhere? (`grep` for similar code)

## Escalation Protocol

| Attempt | Action |
|---------|--------|
| 1st fix fails | Try a different approach (not the same thing again) |
| 2nd fix fails | STOP. Escalate to user with: symptom, repro, root cause analysis, 2 approaches tried |

Never retry the same approach. Never loop silently.

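The escalation rule can be read as a bounded loop: at most two distinct fix attempts, then a hand-off to the user. The following is a minimal Python sketch of that reading, not part of this skill. `run_fix_attempts`, `DebugOutcome`, and the `(name, apply_fix)` pairs are hypothetical names chosen for illustration, and `verify` stands in for "regression test passes AND all existing tests pass".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# From the escalation table: stop after the 2nd failed fix.
MAX_FIX_ATTEMPTS = 2


@dataclass
class DebugOutcome:
    """Hypothetical result record for a debugging session."""
    fixed: bool
    approaches_tried: List[str] = field(default_factory=list)
    escalate: bool = False


def run_fix_attempts(
    approaches: List[Tuple[str, Callable[[], None]]],
    verify: Callable[[], bool],
) -> DebugOutcome:
    """Try distinct approaches in order; escalate once the attempt budget is spent."""
    tried: List[str] = []
    for name, apply_fix in approaches:
        if name in tried:
            continue  # never retry the same approach
        if len(tried) >= MAX_FIX_ATTEMPTS:
            break  # two attempts already failed: stop here
        tried.append(name)
        apply_fix()
        if verify():
            return DebugOutcome(fixed=True, approaches_tried=tried)
    # No approach passed verification: report what was tried instead of looping.
    return DebugOutcome(fixed=False, approaches_tried=tried, escalate=True)
```

When `escalate` is set, `approaches_tried` holds the approaches to include in the report to the user, alongside the symptom, repro, and root cause analysis.
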
## 5 Whys Example

```
Symptom:    test_parse_config_handles_empty fails with TypeError
Why 1:      `"database" in config` raises TypeError because config is None
Why 2:      parse_config returns None when the file is empty
Why 3:      yaml.safe_load() returns None for empty files, not an empty dict
Root cause: missing None -> {} coercion after yaml.safe_load()
Fix:        use `config = yaml.safe_load(f) or {}` instead of `config = yaml.safe_load(f)`
```

## Common Mistakes

- Fixing the symptom (e.g., adding a try/except) instead of the root cause.
- Not writing a regression test for the fix.
- Changing multiple things at once (change one thing, verify, repeat).

## Handlers

When the error involves a build or compilation failure, load the language-specific handler for resolution patterns.

| Handler | Trigger | File |
|---------|---------|------|
| C++ | `.cpp`, `.h`, `.hpp` files or CMake/Make errors | `handlers/cpp.md` |
| Go | `.go` files or `go build` errors | `handlers/go.md` |
| Java | `.java` files or Maven/Gradle errors | `handlers/java.md` |
| Kotlin | `.kt` files or Gradle/KSP errors | `handlers/kotlin.md` |
| Python Build | `pyproject.toml`, build failures, dependency resolution | `handlers/python-build.md` |
| PyTorch | CUDA, tensor, GPU-related errors | `handlers/pytorch.md` |
| Rust | `.rs` files or `cargo` errors | `handlers/rust.md` |
| TypeScript Build | `.ts`, `.tsx` files or tsc/webpack/vite errors | `handlers/typescript-build.md` |

## Integration

- **Called by**: `/ai-dispatch` (debug tasks), `ai-build` agent (when tests fail), user directly
- **Calls**: test runners (to reproduce), `/ai-test` (regression test)
- **Transitions to**: `ai-build` (fix implementation), `/ai-commit` (after verified fix)
- **See also**: `/ai-test` (reproduce failures before fixing), `/ai-postmortem` (for production incidents)

$ARGUMENTS