---
name: actionable-alerting-runbook-design
version: "1.0"
description: >
  Designing effective alerts and runbooks for incident response. PROACTIVELY activate for: (1) Creating alerting rules, (2) Writing runbooks, (3) Reducing alert fatigue, (4) On-call escalation setup, (5) Incident response procedures. Triggers: "alerting", "runbook", "on-call", "pagerduty", "incident", "alert fatigue", "escalation", "playbook"
core-integration:
  techniques:
    primary: ["systematic_analysis"]
    secondary: ["structured_evaluation"]
  contracts:
    input: "none"
    output: "none"
  patterns: "none"
  rubrics: "none"
---

# Actionable Alerting and Runbook Design

This skill provides expertise in designing alerts and runbooks for effective incident response.

## Overview

Good alerting enables quick incident detection and resolution. Bad alerting causes fatigue and missed issues.

## Alerting Principles

### What Makes an Alert Actionable?

1. **Specific**: Clear about what's wrong
2. **Contextual**: Includes relevant information
3. **Timely**: Fires before users notice
4. **Actionable**: Recipient can do something about it
5. **Linked**: Points to runbook or dashboard

### Alert Anti-Patterns

- **Flapping alerts**: Constantly firing and resolving
- **Too sensitive**: Alerts on normal variance
- **No runbook**: Alert with no remediation guidance
- **Wrong audience**: Alerting people who can't help

## Runbook Structure

```markdown
# Alert: High API Error Rate

## Summary
API error rate exceeds 5% for 5 minutes

## Impact
Users experiencing failed requests

## Diagnosis Steps
1. Check error logs: [link]
2. Check recent deployments: [link]
3. Check database health: [link]

## Remediation Steps
1. If recent deployment, rollback: `kubectl rollout undo...`
2. If database issue, scale: `gcloud sql instances patch...`
3. If unknown, escalate to: @team-leads

## Escalation
- L1: On-call engineer
- L2: Team lead (if not resolved in 15min)
- L3: VP Engineering (if customer impact > 30min)
```

## Best Practices

1. Alert on symptoms, not causes
2. Use multi-window alerting to reduce noise
3. Include dashboards and runbook links in alerts
4. Review and prune alerts quarterly
5. Track alert-to-incident ratio

[Content to be expanded based on plugin_spec_agentient-observability.md specifications]
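
Multi-window alerting (best practice 2) can be illustrated with a minimal sketch: the alert fires only when the error rate exceeds the threshold over both a long and a short window. A brief spike fails the long-window check, and an incident that has already recovered fails the short-window check. The 5% threshold mirrors the example runbook above. The `Sample` type, the `should_alert` function, and the 60-minute and 5-minute window sizes are hypothetical names and values chosen for illustration, not part of any specific monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One per-minute measurement (hypothetical shape): request and error counts."""
    total: int
    errors: int


def error_rate(samples: Sequence[Sample]) -> float:
    """Fraction of failed requests across the samples (0.0 when there is no traffic)."""
    total = sum(s.total for s in samples)
    if total == 0:
        return 0.0
    return sum(s.errors for s in samples) / total


def should_alert(
    samples: Sequence[Sample],
    threshold: float = 0.05,  # 5%, as in the example runbook
    long_window: int = 60,    # assumed: minutes of history for the long window
    short_window: int = 5,    # assumed: minutes of history for the short window
) -> bool:
    """Return True only when both windows exceed the threshold.

    `samples` is ordered oldest to newest, one entry per minute.
    """
    if len(samples) < long_window:
        return False  # not enough history to evaluate the long window
    long_rate = error_rate(samples[-long_window:])
    short_rate = error_rate(samples[-short_window:])
    return long_rate > threshold and short_rate > threshold


if __name__ == "__main__":
    brief_spike = [Sample(1000, 10)] * 59 + [Sample(1000, 900)]
    sustained = [Sample(1000, 100)] * 60
    recovered = [Sample(1000, 200)] * 55 + [Sample(1000, 0)] * 5
    print(should_alert(brief_spike))  # one bad minute: long window stays under 5%
    print(should_alert(sustained))    # both windows at 10%
    print(should_alert(recovered))    # short window shows the issue has stopped
```

Requiring both windows trades a little detection latency for fewer pages on transient noise. The window sizes and threshold should be tuned against real traffic rather than taken from this sketch.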