--- name: chaos-plan description: Design chaos engineering experiments for a system - identifies failure modes, creates experiment hypotheses, and generates GameDay plans allowed-tools: Read, Glob, Grep, Task, AskUserQuestion argument-hint: --- # Chaos Plan Command This command helps design chaos engineering experiments and GameDay plans for a system. ## Purpose Generate comprehensive chaos engineering plans including: 1. System resilience assessment 2. Failure mode identification 3. Experiment hypotheses and designs 4. GameDay planning 5. Safety measures and rollback procedures ## Workflow ### Phase 1: System Discovery First, understand the system: **If a system/service name is provided:** - Search the codebase for service definitions - Identify dependencies and integrations - Look for existing resilience patterns (circuit breakers, retries) - Check for monitoring and alerting configuration **Analyze architecture:** ```text System Analysis Checklist: □ Service boundaries and responsibilities □ External dependencies (databases, APIs, queues) □ Internal service dependencies □ Data flows and critical paths □ Current resilience patterns in place □ Existing monitoring and observability ``` ### Phase 2: Failure Mode Identification Identify potential failure modes: **Infrastructure failures:** - Instance/container crashes - Network partitions between services - Disk exhaustion - Memory pressure - CPU saturation **Application failures:** - Service unavailability - Slow responses / latency spikes - Error responses - Connection pool exhaustion - Resource leaks **Dependency failures:** - Database failover/unavailability - Cache miss/unavailability - External API timeout/failure - Message queue backup **Data failures:** - Corrupted data - Stale data (replication lag) - Schema incompatibility ### Phase 3: Steady State Definition Define what "healthy" looks like: **Identify key metrics:** ```text Steady State Metrics: Request-based (RED): - Request rate: [baseline] requests/sec - Error rate: < [threshold]% - Duration (p99): < [threshold]ms Resource-based (USE): - CPU utilization: < [threshold]% - Memory utilization: < [threshold]% - Queue depth: < [threshold] Business metrics: - [Metric 1]: [baseline/threshold] - [Metric 2]: [baseline/threshold] ``` ### Phase 4: Experiment Design Design experiments for identified failure modes: **For each priority failure mode, create:** ```text Experiment: [Name] Hypothesis: "When [fault condition] occurs, [system component] will [expected behavior] because [reasoning]." Fault Injection: - Type: [Latency/Error/Termination/Partition/etc.] - Target: [Service/instance/dependency] - Magnitude: [Degree of fault] - Duration: [How long] Blast Radius: - Affected components: [List] - User impact estimate: [Percentage/description] Abort Conditions: - Error rate > [threshold] - Latency p99 > [threshold] - [Business metric] breached - Customer complaints received Rollback Steps: 1. [Step to revert fault] 2. [Step to verify recovery] Success Criteria: □ [Metric] remains within [bounds] □ [Recovery] happens within [time] □ [Alerts] fire as expected ``` ### Phase 5: Experiment Prioritization Prioritize experiments by: **Risk-based prioritization:** | Factor | Weight | |--------|--------| | Likelihood of failure | High | | Impact if it occurs | High | | Current uncertainty | Medium | | Ease of testing | Low | **Recommended order:** 1. High impact + high uncertainty 2. High impact + low uncertainty (validate assumptions) 3. Medium impact + high uncertainty 4. Lower priority items ### Phase 6: GameDay Planning If multiple experiments or team exercise desired: ```text GameDay Plan: [Title] Date: [Proposed date] Duration: [Hours] Participants: [Teams/roles needed] Objectives: 1. Validate [resilience pattern/assumption] 2. Practice [incident response/coordination] 3. Test [runbook/recovery procedure] Pre-GameDay Checklist: □ Stakeholder approval □ Participant briefing scheduled □ Monitoring dashboards verified □ Kill switches tested □ Rollback procedures documented □ Communication channels set up Schedule: [Time] - Pre-brief and role assignment [Time] - Baseline capture [Time] - Scenario 1: [Name] [Time] - Debrief / break [Time] - Scenario 2: [Name] [Time] - Hot debrief [Time] - Cleanup and verification Scenarios: Scenario 1: [Name] - Objective: [What we're testing] - Hypothesis: [Expected behavior] - Injection: [Fault details] - Duration: [Time] - Success criteria: [Metrics] Scenario 2: [Name] [Same structure] Safety: - Kill switch: [How to immediately stop] - Rollback: [How to revert all changes] - Communication: [Primary channel] - Escalation: [Who to contact if real incident] Roles: - GameDay Lead: [Responsibilities] - Scenario Executor: [Responsibilities] - Observers: [Responsibilities] - Scribe: [Responsibilities] Post-GameDay: - Hot debrief: Same day - Formal postmortem: Within 1 week - Action items tracked in: [System] ``` ### Phase 7: Output Generation Generate deliverables: 1. **Resilience Assessment** - Current state analysis 2. **Experiment Catalog** - Prioritized list of experiments 3. **Detailed Experiment Plans** - Ready-to-execute designs 4. **GameDay Plan** - If requested, full GameDay documentation 5. **Implementation Checklist** - Steps to execute safely ## Usage Examples ```bash # Plan chaos for a specific service /sd:chaos-plan order-service # Plan with architecture context /sd:chaos-plan @docs/architecture/payment-system.md # Plan GameDay for entire system /sd:chaos-plan "e-commerce platform" --gameday ``` ## Interactive Elements Use `AskUserQuestion` to: - Clarify system boundaries - Validate failure mode priorities - Confirm blast radius acceptability - Review experiment designs - Finalize GameDay parameters ## Output The command produces: 1. **Resilience Assessment Report** 2. **Prioritized Experiment Catalog** 3. **Detailed Experiment Designs** 4. **GameDay Plan** (if applicable) 5. **Safety Checklist** ## Related Skills This command leverages: - `chaos-engineering-fundamentals` - Experiment design principles - `resilience-patterns` - Patterns to test and validate - `gameday-planning` - GameDay execution guidance - `incident-response` - Handling discovered issues ## Related Agent For ongoing chaos engineering consultation: - `chaos-engineer` - Resilience testing expertise