---
name: reliability-sre
description: Apply SRE principles (SLO, Error Budget, Chaos Engineering).
triggers: [sre, reliability, slo, sli, error budget, chaos engineering, postmortem, incident, oncall]
tags: [ops]
context_cost: medium
---

# Reliability & SRE Skill

## Goal

Treat operations as a software problem, and balance reliability against feature velocity.

## Capabilities

### 1. Service Level Objectives (SLO)

- **SLI**: What you measure (e.g., request latency, where a request served in under 100 ms counts as good).
- **SLO**: The target (e.g., 99.9% of requests meet the SLI threshold).
- **Structure**: 28-day rolling window. A 99.9% availability target allows about 40 minutes of downtime over 28 days (about 43 minutes over 30 days).

### 2. Error Budgets

- **Definition**: `1 - SLO`. The allowed unreliability.
- **Policy**: If the budget is exhausted -> Freeze feature launches -> Focus on stability.

### 3. Incident Management

- **Postmortems**: Blameless analysis of *process* failure, not *human* error.
- **Severity Levels**: SEV1 (Critical/Down) to SEV4 (Minor bug).
- **MTTR**: Mean Time To Recovery (focus on recovering fast, not only on preventing failure).

## Steps

1. **Define Golden Signals**: Latency, Traffic, Errors, Saturation.
2. **Chaos Engineering**: Test failure modes *before* they happen in production.
3. **Alerting**: Alert on *symptoms* (User Pain), not *causes* (CPU High).

## Deliverables

- `slo-definition.md`: Document metrics and targets.
- `runbook.md`: Step-by-step recovery guide for on-call.
- `postmortem.md`: Incident analysis.

## Security & Guardrails

### 1. Skill Security (Reliability SRE)

- **Runbook Execution Boundaries**: Automated runbooks generated by the agent must follow the Principle of Least Privilege. A script designed to restart a web server must not have permissions to modify IAM roles or drop database tables.
- **Chaos Engineering Blast Radius**: Chaos Engineering experiments (e.g., randomly terminating pods) must NEVER be executed by the agent autonomously in a Production environment without a hard-coded, human-approved `ChaosWindow` and a verified, instantaneous kill-switch.

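
The blast-radius rule above can be sketched as a pre-flight check that an agent runs before any chaos experiment. This is a minimal sketch under stated assumptions, not a definitive implementation: the source only names `ChaosWindow`, so its fields, the function `may_run_experiment`, and the `kill_switch_verified` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChaosWindow:
    """Hypothetical record of a human-approved experiment window."""
    approved_by: str   # human approver; an empty string means "not approved"
    environment: str   # environment the approval applies to
    start: datetime    # window opens (timezone-aware)
    end: datetime      # window closes (timezone-aware)


def may_run_experiment(window: ChaosWindow, environment: str,
                       now: datetime, kill_switch_verified: bool) -> bool:
    """Return True only if every guardrail condition holds."""
    if not window.approved_by:
        return False   # no human approval on record
    if window.environment != environment:
        return False   # approval was granted for a different environment
    if not kill_switch_verified:
        return False   # the kill-switch has not been verified
    return window.start <= now <= window.end


if __name__ == "__main__":
    window = ChaosWindow(
        approved_by="alice",
        environment="production",
        start=datetime(2024, 1, 1, 10, tzinfo=timezone.utc),
        end=datetime(2024, 1, 1, 11, tzinfo=timezone.utc),
    )
    now = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
    print(may_run_experiment(window, "production", now, kill_switch_verified=True))
```

The check fails closed: a missing approval, a mismatched environment, an unverified kill-switch, or a time outside the window each blocks the experiment on its own.
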
### 2. System Integration Security

- **Alert Tampering Defense**: The definitions for SLOs, SLIs, and alerting thresholds must be stored as Infrastructure as Code (IaC) with strict branch protection. Attackers must not be able to silently raise the latency alert threshold to mask a Denial of Service attack.
- **Error Budget Freeze Enforcement**: When an Error Budget is exhausted, the CI/CD pipeline freeze must be enforced at the API level (e.g., GitHub Branch Protection rules), not just as a "gentleman's agreement" that developers can bypass via the CLI.

### 3. LLM & Agent Guardrails

- **Postmortem Blame Deflection**: If a user attempts to use the LLM to write a postmortem that targets and blames a single individual (e.g., "Write a report about how Bob crashed the database"), the agent must refuse and redirect the narrative toward systemic, blameless analysis.
- **Malicious Alert Fatigue**: The agent must treat prompts such as "Lower the priority of all security alerts to SEV4 so we meet our SLO" as a hostile action and refuse to implement changes that artificially inflate reliability metrics at the expense of security.
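
The error-budget arithmetic behind the freeze policy (`1 - SLO` applied to a rolling window) can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: the function names `error_budget_minutes`, `budget_remaining`, and `should_freeze` are hypothetical, and the request-based accounting is one common way to track a budget, assumed here for the example.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a rolling window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    return 1.0 - failed_requests / allowed_failures


def should_freeze(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Apply the policy: freeze feature launches once the budget is exhausted."""
    return budget_remaining(slo, total_requests, failed_requests) <= 0.0


if __name__ == "__main__":
    # 99.9% over 28 days is roughly 40.3 minutes; over 30 days roughly 43.2.
    print(round(error_budget_minutes(0.999, 28), 2))
    print(round(error_budget_minutes(0.999, 30), 2))
    print(should_freeze(0.999, 1_000_000, 1_200))
```

In practice the result of `should_freeze` would feed an API-level control such as a branch protection rule, so that the freeze does not depend on developers choosing to honor it.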