---
name: observability-sre
description: Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.
---

# Observability & Site Reliability Engineering

## Core Principles

- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics

---

## Hard Rules (Must Follow)

> These rules are mandatory. Violating them means the skill is not working correctly.

### Symptom-Based Alerts Only

**Alert on user-facing symptoms, not internal infrastructure metrics.**

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

### Low Cardinality Labels

**Keep Loki/Prometheus label sets small: fewer than ~10 labels, each with a bounded set of values.**

```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"     # Millions of values!
  order_id: "ord_456"    # Millions of values!
  request_id: "req_789"  # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values
```

```typescript
// High cardinality data goes in the log body, not in labels:
logger.info({
  user_id: "usr_123",   // In JSON body, not label
  order_id: "ord_456",
}, "Order processed");
```
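Loki can enforce this rule at ingest time rather than relying on review discipline. A minimal guardrail sketch, assuming a recent Loki; limit names and defaults vary across versions, so verify against your deployment:

```yaml
# loki.yaml (excerpt) — reject streams that violate the cardinality rule
limits_config:
  max_label_names_per_series: 10     # hard cap on labels per stream
  max_global_streams_per_user: 5000  # cap total active streams per tenant
  reject_old_samples: true           # refuse backfill that resurrects old streams
  reject_old_samples_max_age: 168h   # 7 days
```

With limits in place, a deploy that starts labeling by `user_id` fails fast at the ingester instead of silently exploding the index.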
### SLO-Based Error Budgets

**Every service must have defined SLOs with error budget tracking.**

```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```

### Trace Context in Logs

**All logs must include trace_id for correlation with distributed traces.**

```typescript
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```

---

## Quick Reference

### When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, typically ~10x cheaper to run |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

### The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

**Fourth Pillar (Emerging):** Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

---

## Observability Architecture

### Layered Prometheus Setup

```yaml
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```
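Federation with `honor_labels` only works if each cluster-level Prometheus stamps its series with identifying external labels; otherwise series from different clusters collide at the global layer. A minimal sketch (label names are illustrative):

```yaml
# Cluster-level Prometheus (Layer 2) config
global:
  external_labels:
    cluster: us-east         # which cluster these series came from
    environment: production
```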
### Recording Rules for Performance

```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```

### Resource Optimization

```yaml
# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Default is 15s; doubling it roughly halves scrape load

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit local retention. Note: retention is set via command-line flags,
# not in prometheus.yml:
#   --storage.tsdb.retention.time=15d   # Keep only 15 days locally
#   --storage.tsdb.retention.size=50GB  # Or max 50GB
```

---

## Distributed Tracing with OpenTelemetry

### Auto-Instrumentation Setup

```typescript
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

### Manual Instrumentation for Business Logic

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

### Sampling Strategies

```yaml
# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
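A sampler only takes effect once the processor is wired into a traces pipeline. A minimal collector sketch, assuming an OTLP-capable backend (the `tempo:4317` endpoint is illustrative). Note that tail sampling needs all spans of a trace to reach the same collector instance, so put a trace-aware load balancer in front when running multiple replicas:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317  # hypothetical backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]  # defined in the processors section above
      exporters: [otlp]
```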
### Context Propagation

```typescript
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<parent-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
```

---

## Structured Logging Best Practices

### JSON Logging Format

```typescript
// Use structured logging library
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
```

### Log Levels

```typescript
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');    // Very verbose
logger.debug({ state }, 'Debug information');        // Development
logger.info({ event }, 'Normal operation');          // Production default
logger.warn({ issue }, 'Warning condition');         // Potential issues
logger.error({ error, context }, 'Error occurred');  // Errors
logger.fatal({ critical }, 'Fatal error');           // Process crash
```
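Levels are also a shipping decision: logs that are useful on a developer laptop usually should not reach Loki in production. A hedged sketch using Promtail's drop stage (stage options vary by Promtail version, so verify before relying on it):

```yaml
pipeline_stages:
  - json:
      expressions:
        level: level
  - drop:
      source: level
      expression: "(trace|debug)"  # ship info and above only
```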
### Grafana Loki Configuration

```yaml
# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Promote ONLY low-cardinality fields to labels.
      # trace_id is unique per request and must stay in the log body,
      # never become a label.
      - labels:
          level:
```

### Loki Best Practices

- **Low Cardinality Labels** — Use only 5-10 labels (namespace, app, level)
- **High Cardinality in Log Body** — Put user_id, order_id in JSON, not labels
- **LogQL for Filtering** — Use `{app="api"} | json | user_id="123"`
- **Retention Policy** — Keep recent logs longer, compress old logs

```logql
# LogQL query examples
{namespace="production", app="api"} |= "error"                   # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}"      # JSON parsing
rate({app="api"}[5m])                                            # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h]))   # Count by level
```

---

## SLO/SLI/SLA Management

### Definitions

- **SLI (Service Level Indicator)** — Quantifiable measurement of service behavior
  - Examples: Request latency, error rate, availability, throughput
- **SLO (Service Level Objective)** — Target value/range for an SLI
  - Examples: 99.9% availability, P95 latency < 200ms
- **SLA (Service Level Agreement)** — Formal commitment with consequences
  - Examples: "99.9% uptime or 10% credit"

### The Four Golden Signals

```yaml
# Google SRE's key metrics for any service
1. Latency
   SLI: P95 request latency
   SLO: 95% of requests complete in < 200ms

2. Traffic
   SLI: Requests per second
   SLO: Handle 10,000 req/s peak load

3. Errors
   SLI: Error rate (5xx / total)
   SLO: < 0.1% error rate

4. Saturation
   SLI: Resource utilization (CPU, memory, disk)
   SLO: CPU < 70%, Memory < 80%
```

### Error Budget

```python
# Error budget = 1 - SLO
slo = 0.999                                      # "three nines"
error_budget = 1 - slo                           # 0.001 = 0.1%

# Monthly calculation (30 days)
total_minutes = 30 * 24 * 60                     # 43,200 minutes
allowed_downtime = total_minutes * error_budget  # 43.2 minutes

# If you've had 20 minutes of downtime this month:
budget_remaining = allowed_downtime - 20         # 23.2 minutes
budget_consumed = 20 / allowed_downtime          # 0.463 -> 46.3%

# Policy: if budget is > 90% consumed, freeze deployments
```

### SLO Implementation with Prometheus

```yaml
# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  # Alerting on SLO burn rate
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            slo:api_availability:ratio < 0.999  # Below 99.9% SLO
            and
            slo:api_availability:30d > 0.999    # But 30-day average still OK
          )
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
          description: "Current availability {{ $value }} is below SLO on {{ $labels.service }}"
```
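The single-threshold alert above pages as soon as availability dips below target, even for a momentary blip. The Google SRE Workbook's multiwindow, multi-burn-rate pattern pages only when the budget is burning fast over both a long and a short window. A sketch, assuming error-ratio recording rules at 5m and 1h windows (the rule names are illustrative):

```yaml
groups:
  - name: slo_burn_rate
    rules:
      - alert: SLOFastBurn
        # A 14.4x burn rate consumes ~2% of a 30-day budget in one hour
        expr: |
          slo:api_errors:ratio_rate1h > (14.4 * 0.001)
          and
          slo:api_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Error budget burning 14x too fast"
```

The short window keeps the alert from firing on stale data after recovery; the long window keeps a brief spike from paging anyone.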
{{ $labels.service }}" ``` --- ## Incident Response ### Incident Severity Levels | Level | Impact | Response Time | Examples | |-------|--------|---------------|----------| | **SEV-1** | Service down or major degradation | < 15 min | Complete outage, data loss, security breach | | **SEV-2** | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates | | **SEV-3** | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance | | **SEV-4** | Cosmetic, no user impact | Next business day | UI glitches, logging errors | ### Incident Response Roles (IMAG Framework) ```yaml Incident Commander (IC): - Overall coordination and decision-making - Declares incident start/end - Decides on escalations - Owns communication to leadership Operations Lead (OL): - Technical investigation and mitigation - Coordinates engineers - Implements fixes - Reports status to IC Communications Lead (CL): - Internal/external status updates - Customer communication - Stakeholder notifications - Status page updates ``` ### Incident Workflow ``` 1. Detection (Alert fires or user reports) ↓ 2. Triage (Assess severity, assign IC) ↓ 3. Response (Assemble team, create war room) ↓ 4. Mitigation (Stop the bleeding, restore service) ↓ 5. Resolution (Fix root cause) ↓ 6. Postmortem (Blameless review, action items) ↓ 7. Follow-up (Implement improvements) ``` ### On-Call Best Practices - **Rotation** — 1-week shifts, balanced across timezones - **Escalation** — Primary → Secondary → Manager (15 min each) - **Playbooks** — Step-by-step debugging guides for common issues - **Runbooks** — Automated remediation scripts - **Handoff** — 15-min sync at rotation change - **Compensation** — On-call pay or comp time - **Health** — No more than 2 incidents/night target ### Alert Fatigue Prevention ```yaml # Symptoms vs Causes alerting # Alert on WHAT users experience, not WHY it's broken # GOOD: Symptom-based alert - alert: APILatencyHigh expr: slo:api_latency:p95 > 0.200 # User-facing metric annotations: summary: "API is slow for users" # BAD: Cause-based alert - alert: CPUHigh expr: cpu_usage > 70% # Internal metric, might not impact users # Don't alert unless this affects SLOs # Use SLO-based alerting # Alert when error budget burn rate is too high ``` --- ## Blameless Postmortems ### Core Principles - **Assume Good Intentions** — Everyone did their best with available information - **Focus on Systems** — Identify gaps in process/tooling, not people - **Psychological Safety** — No punishment for honest mistakes - **Learning Culture** — Incidents are opportunities to improve - **Separate from Performance Reviews** — Postmortem participation never affects evaluations ### Postmortem Template ```markdown # Incident Postmortem: [Title] **Date:** 2025-01-15 **Duration:** 10:30 - 12:15 UTC (1h 45m) **Severity:** SEV-2 **Incident Commander:** Jane Doe **Responders:** John Smith, Alice Johnson ## Impact - 15,000 users affected - 12% error rate on payment processing - $5,000 estimated revenue impact - No data loss ## Timeline (UTC) - 10:30 - Alert: Payment error rate > 5% - 10:32 - IC assigned, war room created - 10:45 - Identified: Database connection pool exhausted - 11:00 - Mitigation: Increased pool size from 50 → 100 - 11:15 - Error rate back to normal - 12:15 - Incident closed after monitoring ## Root Cause Database connection pool configured for average load, not peak traffic. Black Friday traffic spike (3x normal) exhausted connections. 
---

## Blameless Postmortems

### Core Principles

- **Assume Good Intentions** — Everyone did their best with available information
- **Focus on Systems** — Identify gaps in process/tooling, not people
- **Psychological Safety** — No punishment for honest mistakes
- **Learning Culture** — Incidents are opportunities to improve
- **Separate from Performance Reviews** — Postmortem participation never affects evaluations

### Postmortem Template

```markdown
# Incident Postmortem: [Title]

**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson

## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods
```

### Follow-up

- Review postmortem in team meeting within 1 week
- Track action items to completion (not optional!)
- Share learnings across teams
- Update runbooks and playbooks
- Celebrate successful incident response

---

## Chaos Engineering

### Principles

1. **Define Steady State** — Normal system behavior (e.g., 99.9% success rate)
2. **Hypothesize** — Predict system will remain stable under failure
3. **Inject Failures** — Simulate real-world events
4. **Disprove Hypothesis** — Look for deviations from steady state
5. **Learn and Improve** — Fix weaknesses, increase resilience

### Failure Types

```yaml
Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts
```

### Chaos Mesh Example

```yaml
# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  # To run this every 2 hours, wrap the spec in a Schedule object
  # (Chaos Mesh 2.x removed the inline scheduler field) — see the sketch below
---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"
```

### Best Practices

- **Start Small** — Non-production first, then canary production
- **Collect Baselines** — Know normal metrics before experiments
- **Define Success** — Clear criteria for what "stable" means
- **Monitor Everything** — Watch metrics, logs, traces during tests
- **Automate Rollback** — Stop experiment if SLOs violated
- **Game Days** — Scheduled chaos exercises with full team
- **Blameless Reviews** — Treat chaos failures like production incidents
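For recurring experiments, Chaos Mesh 2.x uses a separate Schedule object that wraps the experiment spec. A sketch; the field layout is worth verifying against your Chaos Mesh version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-network-delay
spec:
  schedule: "@every 2h"       # cron-style recurrence
  type: NetworkChaos
  historyLimit: 5
  concurrencyPolicy: Forbid   # never overlap runs
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    delay:
      latency: "100ms"
    duration: "5m"
```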
---

## AIOps and AI in Observability

### 2025 Trends

- **Anomaly Detection** — AI spots unusual patterns in metrics/logs
- **Root Cause Analysis** — Correlate failures across services automatically
- **Predictive Alerting** — Predict failures before they happen
- **Auto-Remediation** — AI suggests or applies fixes autonomously
- **Natural Language Queries** — Ask "Why is checkout slow?" instead of writing PromQL
- **AI Observability** — Monitor AI model drift, hallucinations, token usage

### AI-Driven Platforms (2025)

```yaml
# Vendor-reported capabilities and figures — validate against your own workloads

Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations
```

### Implementing AI Observability

```python
# Monitor AI model performance
import time

from opentelemetry import trace, metrics
from openai import AsyncOpenAI

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
client = AsyncOpenAI()

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()
        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response
```

---

## Grafana Dashboards

### 3-3-3 Rule

- **3 rows** of panels per dashboard
- **3 panels** per row
- **3 key metrics** per panel

Avoid "dashboard sprawl" — each dashboard should answer ONE question.
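One way to keep dashboards from sprawling is to provision them from version control instead of letting them accumulate in the UI. A minimal provisioning sketch (file paths are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/sre.yaml
apiVersion: 1
providers:
  - name: sre-dashboards
    folder: SRE
    type: file
    allowUiUpdates: false  # dashboards change via code review, not clicks
    options:
      path: /var/lib/grafana/dashboards  # illustrative path
```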
### Dashboard Categories

```yaml
RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)
```

### Panel Best Practices

```json
{
  "title": "API Request Rate",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": { "lineWidth": 2, "fillOpacity": 10 }
    }
  }
}
```

The `reqps` unit renders as requests per second; `timeseries` is the modern panel type that replaces the deprecated `graph` panel. (JSON does not allow inline comments, so keep explanations outside the panel definition.)

---

## Checklist

```markdown
## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics

## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)

## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined

## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process

## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready

## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises
```

---

## See Also

- [reference/monitoring.md](reference/monitoring.md) — Prometheus and Grafana deep dive
- [reference/logging.md](reference/logging.md) — Structured logging best practices
- [reference/tracing.md](reference/tracing.md) — OpenTelemetry and distributed tracing
- [reference/incident-response.md](reference/incident-response.md) — Incident management and postmortems
- [templates/slo-template.md](templates/slo-template.md) — SLO definition template