---
name: pre-mortem
description: |
  Before starting ANY significant task (feature build, refactor, integration, migration, or architectural change), first imagine the project has failed. Generate 3-5 specific failure scenarios, assess risk levels, identify mitigations, then adjust the implementation plan to address high-risk items FIRST. Do not start coding until the pre-mortem is acknowledged.
allowed-tools: |
  bash: find, grep, cat, head, ls, python
  file: read
---

# Pre-mortem

Elite engineers instinctively ask "how could this fail?" before writing code. Claude does the opposite - jumps to implementation, discovers problems mid-way, patches reactively. This skill encodes prospective hindsight: imagine failure, then prevent it.

## Why This Works

Gary Klein popularized the pre-mortem, building on 1989 research by Mitchell, Russo, and Pennington which found that prospective hindsight (imagining that an event has already occurred) increases the ability to correctly identify reasons for future outcomes by 30%. The mental shift from "how do we succeed?" to "why did we fail?" unlocks different thinking.

## Instructions

Activate before ANY task involving:

- New feature implementation
- External service integration (payments, auth, APIs)
- Database migrations or schema changes
- Refactoring across multiple files
- Dependency upgrades
- Infrastructure changes
- Anything the user calls "significant" or "important"

### Step 1: Pause Before Implementing

Do NOT start writing code. First, run analysis:

```bash
python scripts/analyse_risk.py --task "" --path .
```

Or manually assess by examining:

- Files that will be touched
- External dependencies involved
- Data/state that could be corrupted
- User-facing impact if broken

### Step 2: Generate Failure Scenarios

Imagine it's 2 weeks later. The task failed. Ask: **"Why did it fail?"**

Generate 3-5 specific failure scenarios.
For each:

| Component | Description |
|-----------|-------------|
| Scenario | Concrete failure description |
| Likelihood | HIGH / MEDIUM / LOW |
| Impact | What breaks if this happens |
| Detection | How would we know it failed |
| Mitigation | How to prevent or reduce risk |

Common failure patterns to consider:

**Integration failures:**

- Auth/credentials misconfigured
- Webhook signatures not validated
- Rate limits not handled
- Timeout/retry logic missing
- Test vs production environment confusion

**Data failures:**

- Migration corrupts existing data
- Missing null checks
- Race conditions on concurrent writes
- No rollback path
- Foreign key violations

**Logic failures:**

- Edge cases not handled
- Off-by-one errors in loops/pagination
- Timezone/locale assumptions
- Currency/precision errors
- Invalid state machine transitions

**Operational failures:**

- No logging for debugging
- Secrets committed to repo
- No monitoring/alerting
- Deployment breaks rollback
- Missing feature flags

### Step 3: Present Pre-mortem

Format output as:

```
Pre-mortem Analysis: [Task Name]

If this fails in 2 weeks, it's probably because:

1. HIGH: [Scenario]
   Impact: [What breaks]
   Detection: [How we would know]
   Mitigation: [How to prevent]

2. HIGH: [Scenario]
   Impact: [What breaks]
   Detection: [How we would know]
   Mitigation: [How to prevent]

3. MED: [Scenario]
   Impact: [What breaks]
   Detection: [How we would know]
   Mitigation: [How to prevent]

4. MED: [Scenario]
   Impact: [What breaks]
   Detection: [How we would know]
   Mitigation: [How to prevent]

5. LOW: [Scenario]
   Impact: [What breaks]
   Detection: [How we would know]
   Mitigation: [How to prevent]

Adjusted Plan:
[Reordered implementation addressing HIGH risks first]

Proceed with this plan?
```

### Step 4: Adjust Implementation Order

Reorder the implementation plan to:

1. Address HIGH risk mitigations first
2. Build verification/detection early
3. Create rollback path before risky changes
4. Add logging before complex logic
5. Then proceed with main implementation

### Step 5: Get Acknowledgment

Do NOT proceed until user confirms.
They may:

- Accept the plan
- Add risks you missed
- Deprioritise certain mitigations
- Ask for more detail on specific risks

## Risk Severity Guide

**HIGH** - Must mitigate before proceeding:

- Data loss or corruption possible
- Security vulnerability
- Money/billing affected
- No rollback path
- Silent failure (looks fine, actually broken)

**MEDIUM** - Mitigate or explicitly accept risk:

- Poor user experience
- Performance degradation
- Technical debt accumulation
- Manual intervention needed if broken

**LOW** - Note but proceed:

- Cosmetic issues
- Minor inconvenience
- Easy to fix post-launch
- Caught by existing tests

## NEVER

- Start significant tasks without pre-mortem
- Generate generic risks (be specific to THIS task)
- Skip high-risk mitigations to save time
- Assume "it probably won't happen"
- Proceed without user acknowledgment

## ALWAYS

- Be specific: "webhook signature validation" not "security issues"
- Include detection method for each risk
- Reorder plan based on risk findings
- Surface risks even if they're awkward
- Acknowledge when you're uncertain about a risk

## Examples

### Example 1: Payment Integration

User: "Add Stripe payment integration"

```
Pre-mortem Analysis: Stripe Payment Integration

If this fails in 2 weeks, it's probably because:

1. HIGH: Webhook signature validation wrong
   Impact: Payments process but our system never knows - orders stuck in limbo
   Detection: Customer complaints about "paid but not delivered"
   Mitigation: Test webhooks with Stripe CLI locally before writing business logic

2. HIGH: No idempotency keys
   Impact: Network retry causes duplicate charges
   Detection: Customer charged twice, refund requests
   Mitigation: Implement idempotency keys from the start, not retrofitted

3. HIGH: Test vs live key confusion
   Impact: Real charges in development, or test mode in production
   Detection: Angry customers or "payments not working" in prod
   Mitigation: Env validation on startup, assert key prefix matches environment

4. MED: Missing error states in UI
   Impact: User sees spinner forever, retries, double charge
   Detection: Support tickets, rage clicks in analytics
   Mitigation: Design loading/error/success states before API integration

5. MED: No payment logging
   Impact: Can't debug failed payments, refund disputes unwinnable
   Detection: "We have no record of this" situations
   Mitigation: Log every payment event with correlation ID

Adjusted Plan:
1. Set up Stripe CLI + webhook endpoint (test signature validation)
2. Implement idempotency key generation
3. Add environment validation (key prefix checks)
4. Build payment service with full logging
5. Create UI states (loading/error/success)
6. Then: actual payment flow implementation
7. End-to-end test with Stripe test cards

Proceed with this plan?
```

### Example 2: Database Migration

User: "Add soft delete to all models"

```
Pre-mortem Analysis: Soft Delete Migration

If this fails in 2 weeks, it's probably because:

1. HIGH: Existing queries return "deleted" records
   Impact: Users see deleted data, privacy/legal issues
   Detection: "Why is this showing up? I deleted it"
   Mitigation: Add default scope, audit ALL existing queries before migration

2. HIGH: Unique constraints now broken
   Impact: Can't recreate record with same email/username after soft delete
   Detection: "Email already exists" on new signup
   Mitigation: Make unique constraints exclude deleted records, or use different approach

3. HIGH: No rollback path
   Impact: If migration fails midway, data integrity unknown
   Detection: 500 errors, inconsistent state
   Mitigation: Test on production clone first, have restore procedure ready

4. MED: Cascade deletes still hard-delete related records
   Impact: Parent soft-deleted but children gone forever
   Detection: Missing related data after restore
   Mitigation: Audit all ON DELETE CASCADE, convert to soft delete

5. MED: Analytics/reporting now wrong
   Impact: Counts include deleted records, metrics inflated
   Detection: "Numbers don't match the UI"
   Mitigation: Update all aggregate queries to filter deleted

Adjusted Plan:
1. Audit all existing queries (grep for affected tables)
2. Audit unique constraints and cascades
3. Design soft delete schema (deleted_at column approach)
4. Create migration with rollback script
5. Test on staging with production data clone
6. Update queries one-by-one with tests
7. Then: deploy migration
8. Monitor for "deleted showing" errors

Proceed with this plan?
```

### Example 3: Minor Task (Pre-mortem Not Needed)

User: "Fix the typo in the error message"

```
This is a trivial change (single string edit). Pre-mortem not needed. Proceeding directly.
```

What DOESN'T work:

- **Generic risks**: "Security issues" is useless. "Webhook signature validation wrong" is actionable.
- **Skipping for "simple" tasks**: The "simple" migration that corrupts production data.
- **Listing risks without mitigations**: Identifying problems without solutions is just anxiety.
- **Not reordering the plan**: If you identify HIGH risks but still implement in original order, you've wasted the exercise.
- **Perfunctory pre-mortems**: Going through motions without real imagination. Ask "how would this ACTUALLY fail?"
- **Only technical risks**: Missing "team doesn't understand the change" or "deploy on Friday before vacation".
- **Assuming user will read carefully**: Present the most critical risks prominently, not buried in a list.
- **Pre-mortem paralysis**: 15 risks identified, nothing gets built. Focus on HIGH risks, accept some LOW risk.
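
To make the Step 3 output format and the HIGH-first ordering from Step 4 concrete, here is a minimal Python sketch. It is not the contents of `scripts/analyse_risk.py`; the `Risk` dataclass, `SEVERITY_RANK`, `order_risks`, and `render_premortem` are hypothetical names introduced only for illustration, and the sketch leaves out the "Adjusted Plan" section, which still has to be written by hand for each task.

```python
from dataclasses import dataclass

# Severity labels follow the pre-mortem output format; the numeric
# ranking is an assumption made for this sketch.
SEVERITY_RANK = {"HIGH": 0, "MED": 1, "LOW": 2}


@dataclass(frozen=True)
class Risk:
    """One failure scenario (hypothetical structure for illustration)."""
    severity: str  # "HIGH", "MED", or "LOW"
    scenario: str
    impact: str
    detection: str
    mitigation: str


def order_risks(risks):
    """Return risks sorted HIGH -> MED -> LOW, keeping input order within a level."""
    for risk in risks:
        if risk.severity not in SEVERITY_RANK:
            raise ValueError(f"unknown severity: {risk.severity!r}")
    # sorted() is stable, so risks with equal severity keep their relative order.
    return sorted(risks, key=lambda r: SEVERITY_RANK[r.severity])


def render_premortem(task, risks):
    """Format risks in the 'Pre-mortem Analysis' layout, highest severity first."""
    lines = [
        f"Pre-mortem Analysis: {task}",
        "",
        "If this fails in 2 weeks, it's probably because:",
        "",
    ]
    for number, risk in enumerate(order_risks(risks), start=1):
        lines += [
            f"{number}. {risk.severity}: {risk.scenario}",
            f"   Impact: {risk.impact}",
            f"   Detection: {risk.detection}",
            f"   Mitigation: {risk.mitigation}",
            "",
        ]
    lines.append("Proceed with this plan?")
    return "\n".join(lines)


if __name__ == "__main__":
    example_risks = [
        Risk("MED", "No payment logging",
             "Can't debug failed payments",
             '"We have no record of this" situations',
             "Log every payment event with correlation ID"),
        Risk("HIGH", "No idempotency keys",
             "Network retry causes duplicate charges",
             "Customer charged twice, refund requests",
             "Implement idempotency keys from the start"),
    ]
    print(render_premortem("Stripe Payment Integration", example_risks))
```

Requiring a `detection` field on every risk mirrors the "Include detection method for each risk" rule, and rejecting unknown severity labels keeps a mistyped level from silently sorting to the wrong place.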