---
name: runbooks-troubleshooting-guides
user-invocable: false
description: Use when creating troubleshooting guides and diagnostic procedures for operational issues. Covers problem diagnosis, root cause analysis, and systematic debugging.
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Grep
  - Glob
---

# Runbooks - Troubleshooting Guides

Creating effective troubleshooting guides for diagnosing and resolving operational issues.

## Troubleshooting Framework

### The 5-Step Method

1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply solution
5. **Verify** - Confirm resolution

## Basic Troubleshooting Guide

```markdown
# Troubleshooting: [Problem Statement]

## Symptoms

What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts

## Quick Checks (< 2 minutes)

### 1. Is the service running?
```bash
kubectl get pods -n production | grep api-server
```

**Expected:** STATUS = Running

### 2. Are recent deploys the cause?

```bash
kubectl rollout history deployment/api-server
```

**Check:** Did we deploy in the last 30 minutes?

### 3. Is this affecting all users?

Check error rate in Datadog:

- If < 5%: Isolated issue, may be client-specific
- If > 50%: Widespread issue, likely infrastructure

## Common Causes

| Symptom | Likely Cause | Quick Fix |
|---------|-------------|-----------|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory | Memory leak | Restart pods |

## Detailed Diagnosis

### Hypothesis 1: Database Connection Issues

**Test:**

```bash
# Check database connections
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "SELECT count(*) FROM pg_stat_activity"
```

**If connections > 90:** Pool is saturated.
**Next step:** Increase pool size or investigate slow queries.

### Hypothesis 2: High Traffic Spike

**Test:**

```bash
# Check request rate
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/query?query=sum:nginx.requests{*}"
```

**If requests 3x normal:** Traffic spike.
**Next step:** Scale up pods or enable rate limiting.

### Hypothesis 3: External Service Degradation

**Test:**

```bash
# Check third-party API
curl -w "@curl-format.txt" https://api.stripe.com/v1/charges
```

**If response time > 2s:** External service slow.
**Next step:** Implement circuit breaker or increase timeouts.

## Resolution Steps

### Solution A: Immediate (< 5 minutes)

Restart affected pods:

```bash
kubectl rollout restart deployment/api-server -n production
```

**When to use:** Quick mitigation while investigating root cause.

### Solution B: Short-term (< 30 minutes)

Scale up resources:

```bash
kubectl scale deployment/api-server --replicas=10 -n production
```

**When to use:** Traffic spike or resource exhaustion.

### Solution C: Long-term (< 2 hours)

Fix root cause:

1. Identify slow database query
2. Add database index
3. Deploy code optimization

**When to use:** After immediate pressure is relieved.

## Validation

- [ ] Error rate < 1%
- [ ] Response time p95 < 200ms
- [ ] CPU usage < 70%
- [ ] No active alerts

## Prevention

How to prevent this issue in the future:

- Add monitoring alert for connection pool saturation
- Implement auto-scaling based on request rate
- Set up load testing to find capacity limits

```

## Decision Tree Format

```markdown
# Troubleshooting: Slow API Responses

## Start Here

```

                    Check response time
                           |
            ┌──────────────┴──────────────┐
            │                             │
        < 500ms                       > 500ms
            │                             │
       NOT THIS RUNBOOK            Continue below

```

## Step 1: Locate the Slowness

```bash
# Check which service is slow
curl -w "@timing.txt" https://api.example.com/users
```

**Decision:**

- Time to first byte > 2s → Database slow (go to Step 2)
- Time to first byte < 100ms → Network slow (go to Step 3)
- Timeout → Service down (go to Step 4)

## Step 2: Database Diagnosis

```bash
# Check active queries
psql -c "SELECT query, state, query_start FROM pg_stat_activity WHERE state != 'idle'"
```

**Decision:**

- Query running > 5s → Slow query (Solution A)
- Many idle in transaction → Connection leak (Solution B)
- High connection count → Pool exhausted (Solution C)

### Solution A: Optimize Slow Query

1. Identify slow query from above
2. Run EXPLAIN ANALYZE
3. Add missing index or optimize query

### Solution B: Fix Connection Leak

1. Restart application pods
2. Review code for unclosed connections
3. Add connection timeout

### Solution C: Increase Connection Pool

1. Edit database config
2. Increase max_connections
3. Update application pool size

## Step 3: Network Diagnosis

... (continue with network troubleshooting)

```

## Layered Troubleshooting

### Layer 1: Application

```markdown
## Application Layer Issues

### Check Application Health

1. **Health endpoint:**
   ```bash
   curl https://api.example.com/health
   ```

1. **Application logs:**

   ```bash
   kubectl logs deployment/api-server --tail=100 | grep ERROR
   ```

2. **Application metrics:**
   - Request rate
   - Error rate
   - Response time percentiles

### Common Application Issues

**Memory Leak**

- **Symptom:** Memory usage climbing over time
- **Test:** Check memory metrics in Datadog
- **Fix:** Restart pods, investigate with heap dump

**Thread Starvation**

- **Symptom:** Slow responses, high CPU
- **Test:** Thread dump analysis
- **Fix:** Increase thread pool size

**Code Bug**

- **Symptom:** Specific endpoints fail
- **Test:** Review recent deploys
- **Fix:** Rollback or hotfix

```

### Layer 2: Infrastructure

```markdown
## Infrastructure Layer Issues

### Check Infrastructure Health

1. **Node resources:**
   ```bash
   kubectl top nodes
   ```

1. **Pod resources:**

   ```bash
   kubectl top pods -n production
   ```

2. **Network connectivity:**

   ```bash
   kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping database.internal
   ```

### Common Infrastructure Issues

**Node Under Pressure**

- **Symptom:** Pods evicted, slow scheduling
- **Test:** `kubectl describe node` for pressure conditions
- **Fix:** Scale node pool or add nodes

**Network Partition**

- **Symptom:** Intermittent timeouts
- **Test:** MTR between pods and destination
- **Fix:** Check security groups, routing tables

**Disk I/O Saturation**

- **Symptom:** Slow database, high latency
- **Test:** Check IOPS metrics in CloudWatch
- **Fix:** Increase provisioned IOPS

```

### Layer 3: External Dependencies

```markdown
## External Dependencies Issues

### Check External Services

1. **Third-party APIs:**
   ```bash
   curl -w "@timing.txt" https://api.stripe.com/health
   ```

1. **Status pages:**
   - Check status.stripe.com
   - Check status.aws.amazon.com

2. **DNS resolution:**

   ```bash
   nslookup api.stripe.com
   dig api.stripe.com
   ```

### Common External Issues

**API Rate Limiting**

- **Symptom:** 429 responses from external service
- **Test:** Check rate limit headers
- **Fix:** Implement backoff, cache responses

**Service Degradation**

- **Symptom:** Slow external API responses
- **Test:** Check their status page
- **Fix:** Implement circuit breaker, use fallback

**DNS Failure**

- **Symptom:** Cannot resolve hostname
- **Test:** DNS queries
- **Fix:** Check DNS config, try alternative resolver

```

## Systematic Debugging

### Use the Scientific Method

```markdown
# Debugging: Database Connection Failures

## 1. Observation

**What we know:**
- Error: "connection refused" in logs
- Started: 2025-01-15 14:30 UTC
- Frequency: Every database query fails
- Scope: All pods affected

## 2. Hypothesis

**Possible causes:**
1. Database instance is down
2. Security group blocking traffic
3. Network partition
4. Wrong credentials

## 3. Test Each Hypothesis

### Test 1: Database instance status

```bash
aws rds describe-db-instances --db-instance-identifier prod-db | jq '.DBInstances[0].DBInstanceStatus'
```

**Result:** "available"
**Conclusion:** Database is running ✗ Hypothesis 1 rejected

### Test 2: Security group rules

```bash
aws ec2 describe-security-groups --group-ids sg-abc123 | jq '.SecurityGroups[0].IpPermissions'
```

**Result:** Port 5432 open only to 10.0.0.0/16
**Pod IP:** 10.1.0.5
**Conclusion:** Pod IP not in allowed range ✓ **ROOT CAUSE FOUND**

## 4. Fix

Update security group:

```bash
aws ec2 authorize-security-group-ingress \
  --group-id sg-abc123 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.1.0.0/16
```

## 5. Verify

Test connection from pod:

```bash
kubectl exec -it api-server-abc -- psql -h prod-db.rds.amazonaws.com -c "SELECT 1"
```

**Result:** Success ✓

```

## Time-Boxed Investigation

```markdown
# Troubleshooting: Production Outage

**Time Box:** Spend MAX 15 minutes investigating before escalating.

## First 5 Minutes: Quick Wins

- [ ] Check pod status
- [ ] Check recent deploys
- [ ] Check external status pages
- [ ] Review monitoring dashboards

**If issue persists:** Continue to next phase.

## Minutes 5-10: Common Causes

- [ ] Restart pods (quick mitigation)
- [ ] Check database connectivity
- [ ] Review application logs
- [ ] Check resource limits

**If issue persists:** Continue to next phase.

## Minutes 10-15: Deep Dive

- [ ] Enable debug logging
- [ ] Capture thread dump
- [ ] Check for memory leaks
- [ ] Review network traces

**If issue persists:** ESCALATE to senior engineer.

## Escalation

**Escalate to:** Platform Team Lead
**Provide:**
- Timeline of issue
- Tests performed
- Current error rate
- Mitigation attempts
```

## Common Troubleshooting Patterns

### Binary Search

```markdown
## Finding Which Service is Slow

Using binary search to narrow down the problem:

1. **Check full request:** 5000ms total
2. **Check first half (API → Database):** 4900ms
   → Problem is in database query
3. **Check database:** Query takes 4800ms
4. **Check query plan:** Sequential scan on large table
5. **Root cause:** Missing index

**Fix:** Add index on frequently queried column.
```

### Correlation Analysis

```markdown
## Finding Related Events

Look for patterns and correlations:

**Timeline:**
- 14:25 - Deploy completed
- 14:30 - Error rate spike
- 14:35 - Database CPU at 100%
- 14:40 - Requests timing out

**Correlation:** Deploy introduced N+1 query.

**Evidence:**
- No config changes
- No infrastructure changes
- Only code deploy
- Error coincides with deploy

**Action:** Rollback deploy.
```

## Anti-Patterns

### Don't Skip Obvious Checks

```markdown
# Bad: Jump to complex solutions
## Database Slow

Must be a query optimization issue. Let's analyze query plans...

# Good: Check basics first
## Database Slow

1. Is the database actually running?
2. Can we connect to it?
3. Are there any locks?
4. What does the slow query log show?
```

### Don't Guess Randomly

```markdown
# Bad: Random changes
## API Errors

Let's try:
- Restarting the database
- Scaling to 100 pods
- Changing the load balancer config
- Updating the kernel

# Good: Systematic approach
## API Errors

1. What is the actual error message?
2. When did it start?
3. What changed before it started?
4. Can we reproduce it?
```

### Don't Skip Documentation

```markdown
# Bad: No notes
## Fixed It

I restarted some pods and now it works.

# Good: Document findings
## Resolution

**Root Cause:** Memory leak in worker process
**Evidence:** Pod memory climbing linearly over 6 hours
**Temporary Fix:** Restarted pods
**Long-term Fix:** PR #1234 fixes memory leak
**Prevention:** Added memory usage alerts
```

## Related Skills

- **runbook-structure**: Organizing operational documentation
- **incident-response**: Handling production incidents