---
name: systematic-debugging
description: Root cause investigation methodology, loaded by agents when debugging failures to enforce evidence-first diagnosis over guess-and-fix approaches
---

# Systematic Debugging

Never skip to fixing. Understand the cause first. A fix applied without understanding the root cause is a coin flip: it may mask the symptom while leaving the disease.

## 4-Phase Investigation

### Phase 1: OBSERVE

Gather evidence before forming any theories. The goal is to build a factual picture of what is happening.

- **Read error messages completely.** The first line is the symptom; the stack trace is the geography; the last frame before your code is where to look.
- **Reproduce the failure.** If you cannot reproduce it, you cannot verify your fix. Document the exact reproduction steps.
- **Collect multiple data points.** One error message is an anecdote. Three error messages are a pattern. Gather logs, stack traces, test output, and runtime state.
- **Note what IS working.** The boundary between working and broken code narrows the search space dramatically.
- **Record timestamps and sequence.** When did it start failing? What changed just before? Check git log, deployment history, and dependency updates.

Do not hypothesize during OBSERVE. Just collect.

### Phase 2: HYPOTHESIZE

Form theories that explain ALL the observed evidence. A hypothesis that explains only some observations is incomplete.

- **Generate multiple hypotheses.** Premature commitment to a single theory creates confirmation bias. List at least two plausible explanations.
- **Rank by likelihood.** Common causes before exotic ones. Configuration before code. Environment before logic.
- **Check consistency.** Each hypothesis must explain why the failure occurs AND why related functionality still works.
- **State what each hypothesis predicts.** If hypothesis A is correct, what else should be true? What should be false? These predictions become your tests.
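
The consistency check and the prediction step above can be sketched as a small hypothesis ledger. This is only an illustration: the `Hypothesis` class, both helper functions, and every observation and check name are hypothetical, invented for this example rather than taken from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One candidate explanation, the evidence it covers, and what it predicts."""
    name: str
    explains: set[str]  # ids of observations this theory accounts for
    predictions: dict[str, bool] = field(default_factory=dict)  # check -> expected result


def unexplained(hypothesis: Hypothesis, observations: set[str]) -> set[str]:
    """Return the observations the hypothesis does not account for."""
    return observations - hypothesis.explains


def distinguishing_checks(a: Hypothesis, b: Hypothesis) -> list[str]:
    """Return the checks on which the two hypotheses predict opposite results."""
    shared = a.predictions.keys() & b.predictions.keys()
    return sorted(c for c in shared if a.predictions[c] != b.predictions[c])


# Hypothetical evidence gathered during OBSERVE.
observations = {"timeout_in_ci", "passes_locally", "started_after_dep_bump"}

stale_cache = Hypothesis(
    name="stale CI cache",
    explains={"timeout_in_ci", "passes_locally"},
    predictions={"fails_with_clean_cache": False, "fails_on_old_lockfile": True},
)
dep_regression = Hypothesis(
    name="dependency regression",
    explains={"timeout_in_ci", "passes_locally", "started_after_dep_bump"},
    predictions={"fails_with_clean_cache": True, "fails_on_old_lockfile": False},
)

for h in (stale_cache, dep_regression):
    gaps = unexplained(h, observations)
    status = "complete" if not gaps else f"incomplete, missing {sorted(gaps)}"
    print(f"{h.name}: {status}")

print("worth testing:", distinguishing_checks(stale_cache, dep_regression))
```

In this made-up scenario the cache theory leaves one observation unexplained, so it is incomplete, and the two checks on which the theories disagree are the natural candidates to run next.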

### Phase 3: TEST

Validate or eliminate hypotheses through targeted experiments. Test by elimination, not confirmation.

- **Design discriminating tests.** A good test eliminates at least one hypothesis regardless of the outcome. A test that can only confirm is subject to confirmation bias.
- **Change one variable at a time.** Multiple simultaneous changes make it impossible to attribute the result.
- **Record results immediately.** Note what you tried, what you expected, and what actually happened, even for negative results.
- **Eliminate definitively.** If a hypothesis is disproven, cross it off and do not revisit it unless new evidence emerges.

### Phase 4: CONCLUDE

Identify the root cause and design the fix.

- **Root cause, not proximate cause.** The proximate cause is "this variable is null." The root cause is "this function is called before initialization completes." Fix the root cause.
- **Verify the fix addresses the root cause.** After the fix, running the original reproduction steps must no longer trigger the failure. No other behavior should change.
- **Check for related instances.** If the root cause is a pattern (e.g., missing null check), search for the same pattern elsewhere in the codebase.
- **Document what you found.** Future debuggers (including yourself) will benefit from knowing what was investigated and ruled out.

## Escalation Rules

### After 3 Failed Hypotheses

If three hypotheses have been tested and eliminated, the investigation scope is too narrow. Expand:

- Widen the search to adjacent systems (database, network, OS, dependencies)
- Check for environmental differences (local vs CI, dev vs production)
- Re-examine assumptions made during OBSERVE: is the evidence itself reliable?
- Look for interactions between components that were assumed to be independent

### When to Escalate to the User

Escalate when you have exhausted reasonable investigation:

- All plausible hypotheses have been eliminated
- The failure depends on environment or configuration you cannot inspect
- The failure is intermittent and you cannot establish a reliable reproduction
- The root cause is in a third-party dependency or system outside your control

When escalating, provide:

1. What you observed (evidence)
2. What you hypothesized (theories)
3. What you tested and eliminated (experiments)
4. What you believe the remaining possibilities are (next steps)

Never escalate with "I don't know what's wrong." Always escalate with "Here is what I've ruled out, and here is where I think the answer lies."
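
The four-part escalation report can be sketched as a small structure that refuses to render while any part is empty. This is a minimal illustration, not a prescribed format: the `EscalationReport` class, its field names, and the sample entries are all hypothetical, and treating an empty section as an error is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class EscalationReport:
    """The four parts of an escalation, in the order they are presented."""
    observed: list[str]      # evidence
    hypothesized: list[str]  # theories
    eliminated: list[str]    # experiments that ruled a theory out
    remaining: list[str]     # where the answer may still lie

    def render(self) -> str:
        sections = [
            ("What I observed", self.observed),
            ("What I hypothesized", self.hypothesized),
            ("What I tested and eliminated", self.eliminated),
            ("What I believe the remaining possibilities are", self.remaining),
        ]
        # Assumption of this sketch: an empty section means the report is not ready.
        missing = [title for title, items in sections if not items]
        if missing:
            raise ValueError(f"incomplete escalation, empty sections: {missing}")
        lines: list[str] = []
        for number, (title, items) in enumerate(sections, start=1):
            lines.append(f"{number}. {title}")
            lines.extend(f"   - {item}" for item in items)
        return "\n".join(lines)


# Hypothetical example entries.
report = EscalationReport(
    observed=["Request times out in CI only", "Started after the dependency bump"],
    hypothesized=["Stale CI cache", "Dependency regression"],
    eliminated=["Stale CI cache: failure persists with a clean cache"],
    remaining=["Regression in the bumped dependency, which I cannot inspect"],
)
print(report.render())
```

Raising on an empty section is one way to make the "never escalate with 'I don't know what's wrong'" rule mechanical; a softer design could emit a warning instead.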