--- name: root-cause-analysis description: Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements. --- # Root Cause Analysis ## Overview Root cause analysis (RCA) identifies underlying reasons for failures, enabling permanent solutions rather than temporary fixes. ## When to Use - Production incidents - Customer-impacting issues - Repeated problems - Unexpected failures - Performance degradation ## Instructions ### 1. **The 5 Whys Technique** ```yaml Example: Website Down Symptom: Website returned 503 Service Unavailable Why 1: Why was website down? Answer: Database connection pool exhausted Why 2: Why was connection pool exhausted? Answer: Queries taking too long, connections not released Why 3: Why were queries slow? Answer: Missing index on frequently queried column Why 4: Why was index missing? Answer: Performance testing didn't use production-like data volume Why 5: Why wasn't production-like data used? Answer: Load testing environment doesn't mirror production Root Cause: Load testing environment under-provisioned Solution: Update load testing environment with production-like data Prevention: Establish environment parity requirements ``` ### 2. **Systematic RCA Process** ```yaml Step 1: Gather Facts - When did issue occur? - Who detected it? - How many users affected? - What error messages? - What system changes deployed? - Check logs, metrics, alerts - Determine impact scope Step 2: Reproduce - Can we reproduce consistently? - What are the exact steps? - What environment (prod, staging)? - Can we isolate to component? - Set up test case Step 3: Identify Contributing Factors - Direct cause - Indirect/enabling factors - System vulnerabilities - Procedural gaps - Knowledge gaps Step 4: Determine Root Cause - Use 5 Whys technique - Ask "why did this control fail?" - Look for systemic issues - Separate root cause from symptoms Step 5: Develop Solutions - Immediate: Fix the symptom - Short-term: Prevent recurrence - Long-term: Systemic fix - Prioritize by impact/effort Step 6: Implement & Verify - Implement solutions - Test in staging - Deploy carefully - Verify improvement - Monitor metrics Step 7: Document & Share - Write RCA report - Document lesson learned - Share with team - Update procedures - Training if needed ``` ### 3. **RCA Report Template** ```yaml RCA Report: Incident: Database connection failure (2024-01-15, 14:30-15:15) Impact: - Duration: 45 minutes - Users affected: 5,000 (10% of user base) - Revenue lost: ~$2,000 - Severity: P1 (Critical) Timeline: 14:30: Automated monitoring alert: High error rate (20%) 14:32: On-call engineer notified 14:35: Identified database connection error in logs 14:40: Restarted database connection pool 14:42: Service recovered, error rate returned to 0.1% 14:50: Incident declared resolved 15:15: Full recovery verified Root Cause: Poorly optimized query introduced in release 2.5.0 caused queries to take 10x longer. Connection pool exhausted as connections weren't released quickly. Contributing Factors: 1. No query performance testing pre-deployment 2. Load testing environment doesn't match production volume 3. No alerting on query duration 4. Connection pool timeout set too high Solutions: Immediate (Done): - Rolled back problematic query optimization Short-term (1 week): - Added query performance alerts (>1s) - Added index for slow query - Set query timeout to 5 seconds Long-term (1 month): - Updated load testing with production-like data - Implement performance benchmarks in CI/CD - Improve monitoring for connection pool health - Training on query optimization Prevention: - Query performance regression tests - Load testing with production data - Connection pool metrics monitoring - Code review of database changes ``` ### 4. **Root Cause Analysis Techniques** ```yaml Fishbone Diagram: Main problem: Slow API Response Branches: Code: - Inefficient algorithm - Missing cache - Unnecessary queries Data: - Large dataset - Missing index - Slow database Infrastructure: - Low CPU capacity - Slow network - Disk I/O bottleneck Process: - No monitoring - No load testing - Manual deployments People: - Lack of knowledge - Lack of tools - No peer review --- Systemic vs. Individual Causes: Individual: "Developer used inefficient code" Fix: Training Risk: Happens again with different person Systemic: "No code review process" Fix: Implement mandatory code review Risk: Prevents similar issues Prefer systemic solutions for prevention ``` ### 5. **Follow-Up & Prevention** ```yaml After RCA: 1. Track Action Items - Assign owner - Set deadline - Follow up in retrospective 2. Prevent Recurrence - Automated tests - Monitoring/alerts - Procedural changes - Training 3. Monitor Metrics - Track similar incidents - Verify fix effectiveness - Monitor preventive measures - Catch early warnings 4. Share Learnings - Document incident - Share with team - Industry sharing if relevant - Update procedures --- Checklist: [ ] Incident details documented [ ] Timeline established [ ] Logs reviewed [ ] Metrics analyzed [ ] Root cause identified (via 5 Whys) [ ] Contributing factors listed [ ] Immediate actions completed [ ] Short-term solutions planned [ ] Long-term solutions identified [ ] Solutions prioritized [ ] RCA report written [ ] Team debriefing scheduled [ ] Action items assigned [ ] Prevention measures planned [ ] Follow-up scheduled ``` ## Key Points - Distinguish symptom from root cause - Use 5 Whys technique systematically - Look for systemic issues, not individual blame - Focus on prevention, not just fixing - Document thoroughly for team learning - Assign clear ownership for solutions - Follow up to verify effectiveness - Use RCA to drive improvements