---
name: root-cause-analysis
description: Performs systematic root cause analysis to identify the true source of bugs, errors, and unexpected behavior through structured investigation phases — not just treating symptoms. Use when a user reports a bug, crash, error, or broken behavior and needs to debug, troubleshoot, or investigate why something is not working; especially for complex or intermittent issues across multiple components. Applies the Five Whys method, hypothesis-driven testing, stack trace analysis, git blame/log evidence gathering, and causal chain documentation to isolate and confirm root causes before applying any fix.
version: 1.0.0
triggers:
  - root cause
  - debugging
  - find the bug
  - why is this happening
  - investigate issue
  - bug investigation
tags:
  - debugging
  - investigation
  - problem-solving
  - systematic
difficulty: intermediate
estimatedTime: 20
relatedSkills:
  - debugging/trace-and-isolate
  - debugging/hypothesis-testing
---

# Root Cause Analysis

You are performing systematic root cause analysis to find the true source of a bug. Do not apply fixes until you understand WHY the bug exists.

## Core Principle

**Never fix a symptom. Always find and fix the root cause.**

## The Five Whys Method

Ask "Why?" repeatedly to drill down to the root cause:

1. **Why** did the API return an error? → The database query failed
2. **Why** did the database query fail? → The connection pool was exhausted
3. **Why** was the pool exhausted? → **ROOT CAUSE:** Missing `finally` block to close connections

## Investigation Phases

### Phase 1: Reproduce the Bug

Before investigating:

1. **Reproduce consistently** - If you can't reproduce it, you can't verify a fix
2. **Document reproduction steps** - Exact sequence of actions
3. **Note environment details** - OS, versions, configuration
4. **Identify minimal reproduction** - Smallest case that shows the bug

Questions to answer:
- Does it happen every time or intermittently?
- Does it happen in all environments?
- When did it start happening? (recent changes)

### Phase 2: Gather Evidence

Collect information before forming theories:

- Error messages and stack traces
- Log files (application, system, database)
- Recent code changes (git log, blame)
- User reports and reproduction steps
- Monitoring data (metrics, APM)
- Related issues (search issue tracker)

Do NOT:
- Make changes while gathering evidence
- Assume you know the cause without evidence
- Ignore related symptoms

### Phase 3: Form Hypotheses

Based on evidence, create ranked hypotheses:

| Priority | Hypothesis | Evidence | Test Plan |
|----------|------------|----------|-----------|
| 1 | Connection leak in UserService | Stack trace shows connection pool | Add logging, check usage |
| 2 | Query timeout too short | Occurs under load | Test with longer timeout |
| 3 | Database server overload | Correlates with peak hours | Check DB metrics |

For each hypothesis:
- What evidence supports it?
- What evidence contradicts it?
- How can we test it?

### Phase 4: Test Hypotheses

Test each hypothesis systematically:

1. **Start with highest probability**
2. **Design a definitive test** - Should clearly confirm or reject
3. **Make ONE change at a time**
4. **Document results**

If hypothesis is rejected:
- Cross it off the list
- Re-evaluate remaining hypotheses
- Consider if new evidence suggests new hypotheses

### Phase 5: Verify Root Cause

Before declaring root cause found:

- [ ] Can you explain the full causal chain?
- [ ] Does fixing it consistently prevent the bug?
- [ ] Does it explain ALL observed symptoms?
- [ ] Is there nothing earlier in the chain that could be fixed?

## Common Root Cause Categories

- **Code Defects:** logic errors, boundary conditions, race conditions, resource leaks, null/undefined handling
- **Design Issues:** missing error handling, inadequate validation, poor state management, coupling
- **Environment:** configuration errors, resource constraints, version mismatches, network issues
- **Data Issues:** invalid input, data corruption, schema mismatches, encoding problems

## Evidence Collection Commands

```bash
# Recent changes to relevant files
git log --oneline -20 -- path/to/file

# Who changed this line
git blame path/to/file

# Changes since last working version
git diff v1.2.3..HEAD -- src/

# Search for related error handling
grep -r "catch\|error\|throw" --include="*.ts" src/
```

## Red Flags - You Haven't Found Root Cause

- "I'm not sure why, but this fix works"
- "The bug went away after I restarted"
- "I added a check to prevent this case"
- "It's probably a race condition somewhere"

These suggest symptom treatment, not root cause resolution.

## Documentation Template

When root cause is found, document:

```markdown
## Bug: [Description]

### Root Cause
[Clear explanation of why the bug occurred]

### Evidence
- [Evidence 1]
- [Evidence 2]

### Causal Chain
1. [Initial trigger]
2. [Intermediate cause]
3. [Root cause]
4. [Observed symptom]

### Fix
[Description of the fix and why it addresses root cause]

### Prevention
[How to prevent similar issues in the future]
```

## Integration with Other Skills

After finding root cause:
- Use **testing/red-green-refactor** to write a test that exposes the bug
- Use **planning/verification-gates** to validate the fix
- Consider **collaboration/structured-review** for complex fixes