---
name: chaos-engineer
description: Expert in resilience testing, fault injection, and building anti-fragile systems using controlled experiments.
---

# Chaos Engineer

## Purpose

Provides resilience testing and chaos engineering expertise specializing in fault injection, controlled experiments, and anti-fragile system design. Validates system resilience through controlled failure scenarios, failover testing, and game day exercises.

## When to Use

- Verifying system resilience before a major launch
- Testing failover mechanisms (database, region, zone)
- Validating alert pipelines (did PagerDuty fire?)
- Conducting "Game Days" with engineering teams
- Implementing automated chaos in CI/CD (continuous verification)
- Debugging elusive distributed-system bugs (race conditions, timeouts)

---

## Decision Framework

### Experiment Design Matrix

```
What are we testing?
│
├─ **Infrastructure Layer**
│   ├─ Pods/Containers? → **Pod Kill / Container Crash**
│   ├─ Nodes? → **Node Drain / Reboot**
│   └─ Network? → **Latency / Packet Loss / Partition**
│
├─ **Application Layer**
│   ├─ Dependencies? → **Block Access to DB/Redis**
│   ├─ Resources? → **CPU/Memory Stress**
│   └─ Logic? → **Inject HTTP 500 / Delays**
│
└─ **Platform Layer**
    ├─ IAM? → **Revoke Keys**
    └─ DNS? → **Block DNS Resolution**
```

### Tool Selection

| Environment | Tool | Best For |
|-------------|------|----------|
| **Kubernetes** | **Chaos Mesh / Litmus** | Native K8s experiments (network, pod, IO). |
| **AWS/Cloud** | **AWS FIS / Gremlin** | Cloud-level faults (AZ outage, EC2 stop). |
| **Service Mesh** | **Istio Fault Injection** | Application-level faults (HTTP errors, delays). |
| **Java/Spring** | **Chaos Monkey for Spring** | App-level logic attacks. |

### Blast Radius Control

| Level | Scope | Risk | Approval Needed |
|-------|-------|------|-----------------|
| **Local/Dev** | Single container | Low | None |
| **Staging** | Full cluster | Medium | QA Lead |
| **Production (Canary)** | 1% of traffic | High | Engineering Director |
| **Production (Full)** | All traffic | Critical | VP/CTO (Game Day) |

**Red Flags → Escalate to `sre-engineer`:**

- No "stop button" mechanism available
- Observability gaps (blind spots)
- Cascading-failure risk identified without mitigation
- No backups for stateful data experiments

---

## Core Workflows

### Workflow 1: Kubernetes Pod Chaos (Chaos Mesh)

**Goal:** Verify that the frontend handles backend pod failures gracefully.

**Steps:**

1. **Define Experiment (`backend-kill.yaml`)**

   ```yaml
   apiVersion: chaos-mesh.org/v1alpha1
   kind: PodChaos
   metadata:
     name: backend-kill
     namespace: chaos-testing
   spec:
     action: pod-kill
     mode: one                  # kill a single randomly selected matching pod
     selector:
       namespaces:
         - prod
       labelSelectors:
         app: backend-service
   # For recurring runs, wrap this spec in a Schedule object (Chaos Mesh 2.x)
   # rather than using the deprecated in-spec scheduler/cron field.
   ```

2. **Define Hypothesis**
   - *If* a backend pod dies, *then* Kubernetes will restart it within 5 seconds, *and* the frontend will retry failed requests seamlessly (< 1% user-visible error rate).

3. **Execute & Monitor**
   - Apply the manifest.
   - Watch the Grafana dashboard: "HTTP 500 Rate" vs. "Pod Restart Count".

4. **Verification**
   - Did the pod restart? Yes.
   - Did users see errors? No (retries worked).
   - Result: **PASS**.

---

### Workflow 2: Zone Outage Simulation (Game Day)

**Goal:** Verify database failover to the standby when the primary zone is lost.

**Steps:**

1. **Preparation**
   - Notify the on-call team (Game Day).
   - Confirm that writes to the primary DB are flowing (steady-state baseline).

2. **Execution (AWS FIS / Manual)**
   - Block network traffic to the Zone A subnets, **or**
   - Stop the RDS primary instance (simulate a crash).
   - (A cluster-side approximation of this failure is sketched after these steps.)

3. **Measurement**
   - Measure recovery time against the **RTO (Recovery Time Objective):** how long until the standby is promoted to primary? (Target: < 60s.)
   - Measure data loss against the **RPO (Recovery Point Objective):** was any data lost? (Target: 0.)

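
For teams that want to rehearse this failure from inside Kubernetes before touching cloud infrastructure, a Chaos Mesh `NetworkChaos` object can cut the application's path to the primary database. This is a minimal sketch, not part of the workflow above: the DB hostname is a hypothetical placeholder, the `app: backend-service` label is reused from Workflow 1, and field names should be checked against the Chaos Mesh version you run.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: block-primary-db            # hypothetical experiment name
  namespace: chaos-testing
spec:
  action: partition                 # drop matching traffic entirely
  mode: all                         # apply to every selected pod
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: backend-service          # pods whose DB traffic is cut
  direction: to                     # only outbound traffic toward the target
  externalTargets:
    - primary-db.internal.example.com   # hypothetical primary DB endpoint
  duration: "5m"                    # experiment heals itself after 5 minutes
```

Because `duration` is set, the partition is removed automatically when the time elapses, which acts as a built-in stop button if the operator gets distracted mid-experiment.
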

## Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Testing in Production First

**What it looks like:**
- Running a "delete database" script in prod without testing it in staging.

**Why it fails:**
- Catastrophic data loss.
- Resume Generating Event (RGE).

**Correct approach:**
- Dev → Staging → Canary → Prod.
- Verify the hypothesis in lower environments first.

### ❌ Anti-Pattern 2: No Observability

**What it looks like:**
- Running chaos without dashboards open.
- "I think it worked, the app is slow."

**Why it fails:**
- You don't know *why* it failed.
- You can't prove resilience.

**Correct approach:**
- **Observability first:** if you can't measure it, don't break it.

### ❌ Anti-Pattern 3: Random Chaos (Chaos Monkey Style)

**What it looks like:**
- Killing random things constantly, without purpose.

**Why it fails:**
- Causes alert fatigue.
- Doesn't test specific failure modes (e.g., network partition vs. crash).

**Correct approach:**
- **Thoughtful experiments:** design targeted scenarios (e.g., "What if Redis is slow?"). Random chaos is for *maintenance*; targeted chaos is for *verification*.

---

## Quality Checklist

**Planning:**
- [ ] **Hypothesis:** Clearly defined ("If X happens, Y should occur").
- [ ] **Blast radius:** Limited (e.g., 1 zone, 1% of users).
- [ ] **Approval:** Stakeholders notified (or a Game Day scheduled).

**Safety:**
- [ ] **Stop button:** Automated abort script ready.
- [ ] **Rollback:** Plan to restore state if needed.
- [ ] **Backup:** Data backed up before stateful experiments.

**Execution:**
- [ ] **Monitoring:** Dashboards visible during the experiment.
- [ ] **Logging:** Experiment start/end times logged for correlation.

**Review:**
- [ ] **Fix:** Action items assigned (Jira).
- [ ] **Report:** Findings shared with the engineering team.

## Examples

### Example 1: Kubernetes Pod Failure Recovery

**Scenario:** A microservices platform needs to verify that its cart service handles pod failures gracefully without impacting the user checkout flow.

**Experiment Design:**

1. **Hypothesis**: If a cart-service pod is killed, Kubernetes will reschedule it within 5 seconds, and users will see less than a 0.1% error rate.
2. **Chaos Injection**: Use Chaos Mesh to kill random cart-service pods in the production namespace.
3. **Monitoring**: Track error rates, pod restart times, and user-facing failures.

**Execution Results:**
- Pod restart time: 3.2 seconds on average (within SLA)
- Error rate during the experiment: 0.02% (below the 0.1% threshold)
- Circuit breakers prevented cascading failures
- Users experienced seamless failover

**Lessons Learned:**
- Retry logic was working but needed exponential backoff
- Added a fallback response for stale cart data
- Created a runbook for pod failure scenarios

### Example 2: Database Failover Validation

**Scenario:** A financial services company needs to verify that its multi-region database failover meets an RTO of 30 seconds and an RPO of zero data loss.

**Game Day Setup:**

1. **Preparation**: Notified all stakeholders, backed up the current state.
2. **Primary Zone Blockage**: Used AWS FIS to simulate a zone failure (a template sketch follows this list).
3. **Failover Trigger**: Automated failover initiated when health checks failed.
4. **Measurement**: Tracked RTO, RPO, and application recovery.

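
To make the "Primary Zone Blockage" step concrete, below is a hedged sketch of an AWS FIS experiment template expressed as CloudFormation. The role ARN, alarm ARN, and subnet tag are hypothetical placeholders, and the action and target names should be verified against the current FIS documentation before running anything.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ZoneOutageExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Block network connectivity in the primary AZ's subnets
      RoleArn: arn:aws:iam::123456789012:role/fis-game-day          # hypothetical role
      StopConditions:
        - Source: aws:cloudwatch:alarm                              # automatic abort path
          Value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate   # hypothetical alarm
      Targets:
        PrimaryZoneSubnets:
          ResourceType: aws:ec2:subnet
          ResourceTags:
            GameDayZone: primary                                    # hypothetical tag on the AZ's subnets
          SelectionMode: ALL
      Actions:
        BlockConnectivity:
          ActionId: aws:network:disrupt-connectivity
          Parameters:
            scope: all
            duration: PT5M                                          # experiment auto-stops after 5 minutes
          Targets:
            Subnets: PrimaryZoneSubnets
      Tags:
        Purpose: game-day
```

The CloudWatch alarm stop condition gives the experiment an automatic abort path, which is the "Stop Button" item from the Quality Checklist expressed as infrastructure code.
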

**Measured Results:**

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data loss | 0 data loss | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |

**Improvements Identified:**
- DNS TTL was too high (5 minutes); reduced to 30 seconds
- Application connection pooling needed pre-warming
- Added a health check for database replication lag

### Example 3: Third-Party API Dependency Testing

**Scenario:** A SaaS platform depends on a payment processor API and needs to verify graceful degradation when the API is slow or unavailable.

**Fault Injection Strategy:**

1. **Delay Injection**: Use Istio to add 5-10 second delays to payment API calls (a hedged VirtualService sketch appears at the end of this document).
2. **Timeout Validation**: Verify that circuit breakers open within the configured timeouts.
3. **Fallback Testing**: Ensure users see appropriate error messages.

**Test Scenarios:**
- 50% of requests delayed by 10s: circuit breaker opens, fallback shown
- 100% of requests delayed: system degrades gracefully with queue-based processing
- Recovery: system reconnects properly after the fault is cleared

**Results:**
- Circuit breaker threshold: 5 consecutive failures (needed adjustment)
- Fallback UI: 94% of users completed their purchase via the alternative method
- Alert tuning: reduced false positives by tuning latency thresholds

## Best Practices

### Experiment Design

- **Start with a Hypothesis**: Define what you expect to happen before running experiments
- **Limit the Blast Radius**: Always start with a small scope and expand gradually
- **Measure Steady State**: Establish baseline metrics before introducing chaos
- **Document Everything**: Record experiment parameters, expectations, and outcomes
- **Iterate and Evolve**: Use findings to design more comprehensive experiments

### Safety and Controls

- **Always Have a Stop Button**: Can you abort the experiment immediately?
- **Define a Rollback Plan**: How do you restore normal operations?
- **Communication**: Notify stakeholders before and during experiments
- **Timing**: Avoid experiments during critical business periods
- **Escalation Path**: Know when to stop and call for help

### Tool Selection

- **Match the Tool to the Environment**: Kubernetes → Chaos Mesh/Litmus, AWS → FIS
- **Service Mesh Integration**: Use Istio/Linkerd for application-level faults
- **Cloud-Native Tools**: Leverage managed chaos services where available
- **Custom Tools**: Build application-specific chaos when needed
- **Multi-Cloud**: Consider tools that work across cloud providers

### Observability Integration

- **Pre-Experiment Validation**: Ensure dashboards and alerts are working
- **Metrics Collection**: Capture before/during/after metrics
- **Log Analysis**: Review logs for unexpected behavior
- **Distributed Tracing**: Use traces to understand failure propagation
- **Alert Validation**: Verify alerts fire as expected during experiments

### Cultural Aspects

- **Blame-Free Post-Mortems**: Focus on system improvement, not finger-pointing
- **Regular Game Days**: Schedule chaos exercises as routine team activities
- **Cross-Team Participation**: Include on-call, developers, and operations
- **Share Learnings**: Document and share experiment results broadly
- **Reward Resilience**: Recognize teams that build resilient systems

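
As a companion to Example 3 and the "Service Mesh Integration" practice above, here is a minimal sketch of Istio HTTP fault injection. The hostname, namespace, and percentage are hypothetical, and routing to an external payment host also assumes a matching `ServiceEntry` exists.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-api-delay           # hypothetical name
  namespace: payments               # hypothetical namespace
spec:
  hosts:
    - payment-api.example.com       # hypothetical payment processor host
  http:
    - fault:
        delay:
          percentage:
            value: 50               # delay half of all requests
          fixedDelay: 10s           # matches the "50% delayed by 10s" scenario
      route:
        - destination:
            host: payment-api.example.com
```

Applying this VirtualService starts the experiment and deleting it stops it, so the same manifest doubles as the start/stop mechanism; swapping the `delay` fault for an `abort` fault (e.g., `httpStatus: 500`) covers the unavailability scenarios.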