# 🚨 Skill: Alerting & Incident Management

## 📋 Metadata

| Attribute | Value |
|-----------|-------|
| **ID** | `sre-alerting-incident-management` |
| **Level** | 🔴 Advanced |
| **Version** | 1.0.0 |
| **Keywords** | `alerting`, `incident-management`, `pagerduty`, `opsgenie`, `oncall`, `runbooks` |
| **Reference** | [Google SRE - On-Call](https://sre.google/workbook/on-call/), [PagerDuty](https://www.pagerduty.com/) |

## 🔑 Invocation Keywords

- `alerting`
- `incident-management`
- `pagerduty`
- `opsgenie`
- `oncall`
- `runbooks`
- `incident-response`
- `@skill:alerting`

### Example Prompts

```
Implement alerting with PagerDuty and runbooks
```

```
Set up incident management and on-call rotation
```

```
Set up runbooks and on-call rotation
```

```
@skill:alerting - Complete alerting and incident management system
```

## 📖 Description

Effective alerting and incident management are critical to keeping services reliable. This skill covers designing effective alerts, on-call rotations, runbooks, and incident response workflows.

### ✅ When to Use This Skill

- Production services
- On-call teams
- Critical SLAs
- Distributed systems
- Frequent incidents
- Compliance requirements

### ❌ When NOT to Use This Skill

- Local development only
- Non-critical services without an SLA
- Teams without on-call capacity

## 🏗️ Alerting Architecture

```
┌──────────────┐
│  Prometheus  │
│  (Metrics)   │
└──────┬───────┘
       │
       │ Alert Rules
       │
┌──────▼───────┐
│ Alertmanager │
└──────┬───────┘
       │
       ├───────────┬───────────┬───────────┐
       │           │           │           │
┌──────▼───┐  ┌────▼───┐  ┌────▼───┐  ┌───▼────┐
│ PagerDuty│  │ Slack  │  │ Email  │  │ Webhook│
└─────┬────┘  └────────┘  └────────┘  └────────┘
      │
┌─────▼───────────────┐
│  On-Call Engineer   │
│ (Incident Response) │
└─────────────────────┘
```
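The diagram assumes Prometheus evaluates the alert rules and forwards firing alerts to Alertmanager. A minimal sketch of that wiring in `prometheus.yml` (hostnames and file paths are placeholders, not part of the configs below):

```yaml
# prometheus/prometheus.yml (excerpt) -- illustrative wiring for the diagram above
rule_files:
  - /etc/prometheus/alerts/*.yml         # e.g. service-alerts.yml from section 2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # default Alertmanager port
```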
## 💻 Implementation

### 1. Alertmanager Configuration

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    # Critical alerts → PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 24h

    # Service-specific routes
    - match:
        service: payment-service
      receiver: 'payment-team'
      group_wait: 5s

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: 'critical'
        client: 'Prometheus'
        client_url: 'http://prometheus:9090'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'payment-team'
    pagerduty_configs:
      - service_key: 'PAYMENT_TEAM_KEY'
        description: 'Payment Service Alert'
    slack_configs:
      - channel: '#payment-team'
```
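The routing tree is easy to misconfigure, so it is worth validating before reloading Alertmanager. A quick check with `amtool` (it ships with Alertmanager); the route-test invocation is a sketch that assumes the file path above:

```bash
# Validate syntax, receivers, and the routing tree
amtool check-config alertmanager/alertmanager.yml

# Dry-run routing: which receiver would an alert with these labels hit?
amtool config routes test --config.file=alertmanager/alertmanager.yml \
  severity=critical service=payment-service
```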
### 2. Alert Rules (Prometheus)

```yaml
# prometheus/alerts/service-alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # Error Budget Exhaustion
      - alert: ErrorBudgetExhaustion
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.001
          and
          (
            avg_over_time(
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
              )[28d:]
            ) > 0.001
          )
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget exhausted for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} has exceeded its error budget threshold.
            Current error rate: {{ $value | humanizePercentage }}
            Runbook: https://runbooks.example.com/error-budget

      # High Latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
          runbook: https://runbooks.example.com/high-latency

      # Service Down
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service has been unreachable for more than 2 minutes"
          runbook: https://runbooks.example.com/service-down

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~".+"}
            /
            container_spec_memory_limit_bytes{pod=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
          runbook: https://runbooks.example.com/high-memory

      # Disk Space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
           /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # CPU Throttling
      - alert: CPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling detected in {{ $labels.pod }}"
          description: "Container is being CPU throttled"

      # Connection Pool Exhaustion
      - alert: ConnectionPoolExhaustion
        expr: |
          (
            db_connections_active{service=~".+"}
            /
            db_connections_max{service=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion in {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of connections in use"
```
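Alert expressions are easy to get subtly wrong, so it helps to unit-test them with `promtool` before shipping. A minimal sketch exercising the `ServiceDown` rule above; file names are assumptions, and the expected annotations must match the rule's templates exactly:

```yaml
# prometheus/alerts/service-alerts_test.yml -- run with:
#   promtool test rules service-alerts_test.yml
rule_files:
  - service-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulate a scrape target that is down for 10 minutes straight
    input_series:
      - series: 'up{job="app-api", instance="app-api-0"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m   # past the 2m "for" duration, so the alert is firing
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: app-api
              instance: app-api-0
            exp_annotations:
              summary: "Service app-api is down"
              description: "Service has been unreachable for more than 2 minutes"
              runbook: https://runbooks.example.com/service-down
```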
### 3. Runbook Template

````markdown
# Runbook: High Error Rate

## Alert Name
`HighErrorRate`

## Severity
Critical

## Description
Service error rate exceeds the threshold (5% for 5 minutes).

## Symptoms
- High HTTP 5xx response rate
- User complaints
- Increasing error logs

## Immediate Actions

1. **Acknowledge Alert**
   - Acknowledge in PagerDuty/OpsGenie
   - Notify the team in Slack

2. **Check Service Health**
   ```bash
   kubectl get pods -l app=service-name
   kubectl logs -l app=service-name --tail=100
   ```

3. **Check Metrics**
   - Grafana dashboard: `/d/service-overview`
   - Error rate graph
   - Latency graphs

4. **Identify Root Cause**
   - Check recent deployments
   - Review error logs
   - Check dependencies (DB, APIs)

## Resolution Steps

### If caused by a code deployment:
```bash
# Rollback deployment
kubectl rollout undo deployment/service-name
```

### If caused by resource exhaustion:
```bash
# Scale up
kubectl scale deployment/service-name --replicas=5
```

### If caused by a dependency failure:
1. Check the dependency service's status
2. Implement a circuit breaker
3. Enable fallback mechanisms

## Post-Incident
- Document in the incident log
- Update the runbook if needed

## Escalation
- If not resolved in 15 min → escalate to a senior engineer
- If not resolved in 30 min → escalate to the engineering manager
- If the service is completely down → escalate to the CTO
````

### 4. On-Call Rotation (PagerDuty)

```yaml
# pagerduty/escalation-policies.yml
escalation_policies:
  - name: "Primary On-Call"
    description: "Primary escalation for production services"
    num_loops: 3
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"  # Primary on-call
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user_reference"
            id: "PYYYYYYYY"  # Secondary on-call
      - escalation_delay_in_minutes: 30
        targets:
          - type: "schedule_reference"
            id: "PZZZZZZZZ"  # Manager on-call

  - name: "Critical Alerts Only"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "escalation_policy_reference"
            id: "EPXXXXXXX"  # Manager escalation

schedules:
  - name: "Primary On-Call Schedule"
    time_zone: "America/New_York"
    layers:
      - name: "Layer 1"
        start: "2024-01-01T00:00:00"
        rotation_virtual_start: "2024-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user_id: "PXXXXXXXX"
          - user_id: "PYYYYYYYY"
          - user_id: "PZZZZZZZZ"
        restrictions:
          - type: "daily_restriction"
            start_time_of_day: "09:00:00"
            duration_seconds: 32400  # 9 hours
```

### 5. Incident Response Workflow

```python
# incident_response/workflow.py
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from enum import Enum


class IncidentSeverity(Enum):
    SEV1 = "critical"  # Service down
    SEV2 = "high"      # Major degradation
    SEV3 = "medium"    # Minor issues
    SEV4 = "low"       # Informational


@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    status: str  # open, investigating, mitigated, resolved
    created_at: datetime
    resolved_at: Optional[datetime]
    assigned_to: str
    affected_services: List[str]
    description: str
    root_cause: Optional[str] = None
    resolution: Optional[str] = None


class IncidentResponseWorkflow:
    def __init__(self):
        self.incidents: List[Incident] = []

    def create_incident(
        self,
        title: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        description: str
    ) -> Incident:
        incident = Incident(
            id=f"INC-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            title=title,
            severity=severity,
            status="open",
            created_at=datetime.now(),
            resolved_at=None,
            assigned_to=self._assign_oncall(),
            affected_services=affected_services,
            description=description
        )
        self.incidents.append(incident)
        self._notify_team(incident)
        self._create_incident_channel(incident)
        return incident

    def _assign_oncall(self) -> str:
        # Logic to assign to the current on-call engineer
        return "engineer@example.com"

    def _notify_team(self, incident: Incident):
        # Send notifications via PagerDuty, Slack, etc.
        pass

    def _create_incident_channel(self, incident: Incident):
        # Create a dedicated Slack channel for the incident
        channel_name = f"incident-{incident.id.lower()}"
        # Channel creation logic
        pass

    def update_status(self, incident_id: str, status: str, notes: str):
        incident = self._find_incident(incident_id)
        if incident:
            incident.status = status
            if status == "resolved":
                incident.resolved_at = datetime.now()

    def _find_incident(self, incident_id: str) -> Optional[Incident]:
        return next((i for i in self.incidents if i.id == incident_id), None)
```
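A short usage sketch of the workflow above; the import path follows the `incident_response/workflow.py` comment, and the service names and notes are illustrative:

```python
# incident_response/example_usage.py -- illustrative driver for the workflow above
from incident_response.workflow import IncidentResponseWorkflow, IncidentSeverity

workflow = IncidentResponseWorkflow()

# Open a SEV2 incident for a major degradation
incident = workflow.create_incident(
    title="Elevated 5xx rate on checkout",
    severity=IncidentSeverity.SEV2,
    affected_services=["payment-service", "checkout-api"],
    description="Error rate climbed above 2% after the 14:05 deploy.",
)
print(incident.id, incident.assigned_to)

# Progress the incident through its lifecycle
workflow.update_status(incident.id, "investigating", notes="Rollback in progress")
workflow.update_status(incident.id, "resolved", notes="Rolled back to previous release")
```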
### 6. Alert Fatigue Prevention

```yaml
# alertmanager/alert-fatigue-prevention.yml
# Strategies to prevent alert fatigue

# 1. Alert Grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 10s       # Wait before sending the initial notification
  group_interval: 5m    # Wait before sending an updated notification
  repeat_interval: 12h  # Minimum time between repeat notifications

# 2. Alert Inhibition
inhibit_rules:
  # Don't alert on a warning if the matching critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']

  # Don't alert on individual instances if all instances are down
  - source_match:
      alertname: 'AllInstancesDown'
    target_match_re:
      alertname: '.*InstanceDown'

# 3. Alert Suppression (Silence Rules)
# Use the Alertmanager UI or API to create silences for known issues

# 4. Threshold Tuning
# Use error budgets and SLOs to set meaningful thresholds
# Example: alert only when the error budget is at risk

# 5. Alert Classification
# Only page on actionable alerts
# Use different channels for different severities
```

## 🎯 Best Practices

### 1. Alert Design

✅ **DO:**
- Alert on symptoms, not causes
- Make alerts actionable
- Include runbook links
- Use appropriate severity levels
- Test alerts regularly

❌ **DON'T:**
- Alert on every metric
- Create alerts that require investigation just to understand
- Alert on things you can't fix
- Duplicate alerts across systems

### 2. On-Call

✅ **DO:**
- Maintain clear rotation schedules
- Provide context in handoffs
- Limit on-call duration (max 1 week)
- Compensate on-call time
- Track on-call load

❌ **DON'T:**
- Have people on-call 24/7
- Make on-call mandatory without compensation
- Skip handoffs between rotations

### 3. Incident Response

✅ **DO:**
- Follow runbooks
- Communicate frequently
- Document decisions
- Implement action items

❌ **DON'T:**
- Skip incident documentation
- Point fingers
- Ignore post-mortem action items

## 🚨 Troubleshooting

### Too Many Alerts
1. Review alert rules
2. Increase thresholds where appropriate
3. Implement better grouping
4. Use alert inhibition

### Alerts Not Firing
1. Check Prometheus query syntax
2. Verify alert rule evaluation
3. Check the Alertmanager configuration
4. Verify notification channel configs

### On-Call Burnout
1. Review alert volume
2. Reduce non-actionable alerts
3. Improve runbooks
4. Rotate more frequently

## 📚 Additional Resources

- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Google SRE - On-Call Handbook](https://sre.google/workbook/on-call/)

---

**Version:** 1.0.0
**Last updated:** December 2025
**Total lines:** 1,100+