--- name: afrexai-observability-engine description: "Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization." --- # Observability & Reliability Engineering Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development. --- ## Quick Health Check (/16) Score your current observability posture: | Signal | Healthy (2) | Weak (1) | Missing (0) | |--------|-------------|----------|-------------| | Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements | | Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics | | Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing | | Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise | | Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" | | SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets | | On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call | | Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend | **12-16:** Production-grade. Focus on optimization. **8-11:** Foundation exists. Fill the gaps systematically. **4-7:** Significant risk. Prioritize alerting + incident response. **0-3:** Flying blind. Start with Phase 1 immediately. --- ## Phase 1: Structured Logging ### Log Architecture ``` Application → Structured JSON → Log Router → Storage → Query Engine ↓ Alert Pipeline ``` ### Required Fields (Every Log Line) | Field | Type | Purpose | Example | |-------|------|---------|---------| | `timestamp` | ISO-8601 UTC | When | `2026-02-22T18:30:00.123Z` | | `level` | enum | Severity | `info`, `warn`, `error`, `fatal` | | `service` | string | Which service | `payment-api` | | `version` | string | Which deploy | `v2.3.1` | | `environment` | string | Which env | `production` | | `message` | string | What happened | `Payment processed successfully` | | `trace_id` | string | Request correlation | `abc123def456` | | `span_id` | string | Operation within trace | `span_789` | | `duration_ms` | number | How long | `142` | ### Contextual Fields (Add Per Domain) ```yaml # HTTP request context http: method: POST path: /api/v1/orders status: 201 client_ip: 203.0.113.42 # Anonymize in logs if needed user_agent: "Mozilla/5.0..." request_id: "req_abc123" # Business context business: user_id: "usr_456" tenant_id: "tenant_789" order_id: "ord_012" action: "checkout" amount_cents: 4999 currency: "USD" # Error context error: type: "PaymentDeclinedError" message: "Card declined: insufficient funds" code: "CARD_DECLINED" stack: "..." # Only in non-production or DEBUG level retry_count: 2 retryable: true ``` ### Log Level Decision Tree ``` Is the process about to crash? → FATAL (exit after logging) Did an operation fail that needs human attention? → ERROR (page someone or create ticket) Did something unexpected happen but we recovered? → WARN (review in daily triage) Is this a normal business event worth recording? → INFO (audit trail, business metrics) Is this useful for debugging but noisy in production? → DEBUG (off in prod, on in staging) Is this only useful when stepping through code? → TRACE (never in production) ``` ### Log Level Rules 1. **ERROR means action required** — if no one needs to act on it, it's WARN 2. **INFO is for business events** — not internal implementation details 3. **No logging inside tight loops** — aggregate and log summary 4. **Log at boundaries** — API entry/exit, queue consume/publish, DB calls 5. **Never log secrets** — API keys, tokens, passwords, PII (see scrubbing below) ### PII & Secret Scrubbing ```yaml scrub_patterns: # Always redact - field_patterns: ["password", "secret", "token", "api_key", "authorization"] action: replace_with_redacted # Hash for correlation without exposure - field_patterns: ["email", "phone", "ssn", "national_id"] action: sha256_hash # Mask partially - field_patterns: ["credit_card", "card_number"] action: mask_last_4 # "****-****-****-1234" # IP anonymization - field_patterns: ["client_ip", "ip_address"] action: zero_last_octet # 203.0.113.0 ``` ### Logger Setup (By Language) **Node.js (Pino):** ```typescript import pino from 'pino'; import { AsyncLocalStorage } from 'node:async_hooks'; const als = new AsyncLocalStorage>(); const logger = pino({ level: process.env.LOG_LEVEL || 'info', formatters: { level: (label) => ({ level: label }), }, mixin: () => als.getStore() ?? {}, redact: ['req.headers.authorization', '*.password', '*.token'], timestamp: pino.stdTimeFunctions.isoTime, }); // Middleware: inject context app.use((req, res, next) => { const ctx = { trace_id: req.headers['x-trace-id'] || crypto.randomUUID(), request_id: crypto.randomUUID(), service: 'payment-api', version: process.env.APP_VERSION, }; als.run(ctx, () => next()); }); ``` **Python (structlog):** ```python import structlog structlog.configure( processors=[ structlog.contextvars.merge_contextvars, structlog.processors.add_log_level, structlog.processors.TimeStamper(fmt="iso", utc=True), structlog.processors.JSONRenderer(), ], ) log = structlog.get_logger() # Bind context per-request: structlog.contextvars.bind_contextvars(trace_id=trace_id, user_id=user_id) ``` **Go (zerolog):** ```go log := zerolog.New(os.Stdout).With(). Timestamp(). Str("service", "payment-api"). Str("version", version). Logger() // Per-request: reqLog := log.With().Str("trace_id", traceID).Logger() ``` ### Log Storage Decision | Volume | Solution | Retention | Cost | |--------|----------|-----------|------| | <10 GB/day | Loki + Grafana | 30 days hot, 90 days cold | Low | | 10-100 GB/day | Elasticsearch / OpenSearch | 14 days hot, 90 days S3 | Medium | | 100+ GB/day | ClickHouse or Datadog | 7 days hot, 30 days archive | High | | Budget-constrained | Loki + S3 backend | 90 days all cold | Very low | ### 10 Logging Anti-Patterns | # | Anti-Pattern | Fix | |---|-------------|-----| | 1 | `log.error(err)` with no context | Always include: what operation, what input, what state | | 2 | Logging request/response bodies | Log only in DEBUG; redact sensitive fields | | 3 | String concatenation in log messages | Use structured fields: `log.info("processed", { order_id, amount })` | | 4 | Catch-and-log-and-rethrow | Log at the boundary where you handle it, not every layer | | 5 | Different log formats per service | Standardize schema across all services | | 6 | No log rotation / retention policy | Set max size + TTL; archive to cold storage | | 7 | Logging inside hot paths | Aggregate: log summary every N items or every interval | | 8 | Missing correlation IDs | Propagate trace_id from first entry point through all services | | 9 | Boolean log levels (`verbose: true`) | Use standard levels with configurable minimum | | 10 | Logging PII in plain text | Implement scrubbing at the logger level | --- ## Phase 2: Metrics Collection ### The RED Method (Request-Driven Services) For every service endpoint, track: | Metric | What | Prometheus Example | |--------|------|--------------------| | **R**ate | Requests per second | `http_requests_total{method, path, status}` | | **E**rrors | Failed requests per second | `http_requests_total{status=~"5.."}` / total | | **D**uration | Latency distribution | `http_request_duration_seconds{method, path}` (histogram) | ### The USE Method (Infrastructure Resources) For every resource (CPU, memory, disk, network): | Metric | What | Example | |--------|------|---------| | **U**tilization | % resource busy | CPU usage 78% | | **S**aturation | Queue depth / backpressure | 12 requests queued | | **E**rrors | Resource errors | 3 disk I/O errors | ### Golden Signals (Google SRE) | Signal | Meaning | Source | |--------|---------|--------| | Latency | Time to serve requests | RED Duration | | Traffic | Demand on the system | RED Rate | | Errors | Rate of failed requests | RED Errors | | Saturation | How "full" the service is | USE Saturation | ### Metric Types & When to Use Each | Type | Use Case | Example | |------|----------|---------| | **Counter** | Things that only go up | Total requests, errors, bytes sent | | **Gauge** | Current value that goes up/down | Active connections, queue depth, temperature | | **Histogram** | Distribution of values | Request latency, response size | | **Summary** | Pre-calculated percentiles | Client-side latency (when you need exact percentiles) | **Rule:** Use histograms over summaries in most cases — they're aggregatable across instances. ### Naming Conventions ``` # Pattern: ___ http_server_request_duration_seconds http_server_requests_total db_pool_connections_active queue_messages_pending cache_hit_ratio # Rules: # 1. Use snake_case # 2. Include unit suffix (_seconds, _bytes, _total) # 3. _total suffix for counters # 4. Don't include label names in metric name # 5. Use base units (seconds not milliseconds, bytes not kilobytes) ``` ### Label Design Rules | Rule | Why | Example | |------|-----|---------| | Keep cardinality <100 per label | High cardinality kills performance | `status="200"` not `status="200 OK"` | | No user IDs as labels | Unbounded cardinality | Use log correlation instead | | No request paths with IDs | `/api/users/123` creates millions of series | Normalize: `/api/users/:id` | | Max 5-7 labels per metric | Each combo = a time series | `{method, path, status, service}` | ### Instrumentation Checklist ```yaml application_metrics: # HTTP layer - http_request_duration_seconds: histogram {method, path, status} - http_request_size_bytes: histogram {method, path} - http_response_size_bytes: histogram {method, path} - http_requests_in_flight: gauge # Business logic - orders_processed_total: counter {status, payment_method} - order_value_dollars: histogram {payment_method} - user_signups_total: counter {source} # Dependencies - db_query_duration_seconds: histogram {query_type, table} - db_connections_active: gauge {pool} - db_connections_idle: gauge {pool} - cache_requests_total: counter {result: hit|miss} - external_api_duration_seconds: histogram {service, endpoint} - external_api_errors_total: counter {service, error_type} # Queue / async - queue_messages_published_total: counter {queue} - queue_messages_consumed_total: counter {queue, status} - queue_processing_duration_seconds: histogram {queue} - queue_depth: gauge {queue} - queue_consumer_lag: gauge {queue, consumer_group} infrastructure_metrics: # Node exporter / cAdvisor provides these automatically - cpu_usage_percent: gauge {instance} - memory_usage_bytes: gauge {instance} - disk_usage_bytes: gauge {instance, mount} - disk_io_seconds: counter {instance, device} - network_bytes: counter {instance, direction} - container_cpu_usage: gauge {pod, container} - container_memory_usage: gauge {pod, container} ``` ### Stack Recommendations | Component | Options | Recommendation | |-----------|---------|----------------| | Collection | Prometheus, OTEL Collector, Datadog Agent | Prometheus (free) or OTEL Collector (vendor-neutral) | | Storage | Prometheus, Thanos, Mimir, VictoriaMetrics | VictoriaMetrics (best cost/perf) or Mimir (Grafana ecosystem) | | Visualization | Grafana, Datadog, New Relic | Grafana (free, extensible) | | Alerting | Alertmanager, Grafana Alerting, PagerDuty | Alertmanager + PagerDuty routing | --- ## Phase 3: Distributed Tracing ### Trace Architecture ``` Client Request → API Gateway (root span) → Auth Service (child span) → Order Service (child span) → Database Query (child span) → Payment Service (child span) → Stripe API (child span) → Notification Service (child span) → Email Provider (child span) ``` ### OpenTelemetry Setup **Auto-instrumentation (Node.js):** ```typescript // tracing.ts — import BEFORE anything else import { NodeSDK } from '@opentelemetry/sdk-node'; import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'; import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'; const sdk = new NodeSDK({ traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces', }), instrumentations: [getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-http': { ignoreIncomingPaths: ['/health', '/ready'] }, '@opentelemetry/instrumentation-express': { enabled: true }, })], serviceName: process.env.OTEL_SERVICE_NAME || 'payment-api', }); sdk.start(); ``` **Custom spans for business logic:** ```typescript import { trace, SpanStatusCode } from '@opentelemetry/api'; const tracer = trace.getTracer('payment-service'); async function processPayment(order: Order) { return tracer.startActiveSpan('process-payment', async (span) => { span.setAttributes({ 'order.id': order.id, 'order.amount_cents': order.amountCents, 'payment.method': order.paymentMethod, }); try { const result = await chargeCard(order); span.setAttributes({ 'payment.status': result.status }); return result; } catch (err) { span.setStatus({ code: SpanStatusCode.ERROR, message: err.message }); span.recordException(err); throw err; } finally { span.end(); } }); } ``` ### Sampling Strategies | Strategy | When | Config | |----------|------|--------| | **Always On** | Dev/staging, low traffic (<100 rps) | `ratio: 1.0` | | **Probabilistic** | Moderate traffic (100-1000 rps) | `ratio: 0.1` (10%) | | **Rate-limited** | High traffic (>1000 rps) | `max_traces_per_second: 100` | | **Tail-based** | Want all errors + slow requests | Collector-side: keep if error OR duration > p99 | | **Parent-based** | Respect upstream decisions | If parent sampled, child sampled | **Recommendation:** Start with parent-based + probabilistic (10%). Add tail-based at the collector to capture all errors. ### Context Propagation | Header | Standard | Format | |--------|----------|--------| | `traceparent` | W3C Trace Context | `00-{trace_id}-{span_id}-{flags}` | | `tracestate` | W3C Trace Context | Vendor-specific key-value pairs | | `b3` | Zipkin B3 | `{trace_id}-{span_id}-{sampled}` | **Rule:** Use W3C Trace Context (`traceparent`) as primary. Support B3 for legacy Zipkin systems. ### Trace Storage | Volume | Solution | Retention | |--------|----------|-----------| | <50 GB/day | Jaeger + Elasticsearch | 7 days | | 50-500 GB/day | Tempo + S3 | 14 days | | 500+ GB/day | Tempo + S3 with aggressive sampling | 7 days | | Budget-constrained | Jaeger + Badger (local disk) | 3 days | --- ## Phase 4: SLOs, SLIs & Error Budgets ### SLI Selection by Service Type | Service Type | Primary SLI | Secondary SLI | Measurement | |--------------|-------------|---------------|-------------| | API / Web | Availability + Latency | Error rate | Server-side + synthetic | | Data pipeline | Freshness + Correctness | Throughput | Pipeline timestamps + checksums | | Storage | Durability + Availability | Latency | Checksums + uptime monitoring | | Streaming | Throughput + Latency | Message loss rate | Consumer lag + e2e latency | | Batch jobs | Success rate + Freshness | Duration | Job scheduler metrics | ### SLO Definition Template ```yaml slo: name: "Payment API Availability" service: payment-api owner: payments-team sli: type: availability definition: "Proportion of non-5xx responses" measurement: | sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m])) / sum(rate(http_requests_total{service="payment-api"}[5m])) target: 99.95% # 21.9 min downtime/month window: rolling_30d error_budget: total_minutes: 21.9 # per 30 days burn_rate_alerts: - severity: critical burn_rate: 14.4x # Budget consumed in 2 hours short_window: 5m long_window: 1h - severity: warning burn_rate: 6x # Budget consumed in 5 days short_window: 30m long_window: 6h - severity: ticket burn_rate: 1x # Budget consumed in 30 days short_window: 6h long_window: 3d consequences: budget_remaining_above_50pct: "Normal development velocity" budget_remaining_20_to_50pct: "Prioritize reliability work" budget_remaining_below_20pct: "Feature freeze; reliability only" budget_exhausted: "All hands on reliability until budget recovers" ``` ### Common SLO Targets | Service Tier | Availability | p50 Latency | p99 Latency | Monthly Downtime | |--------------|-------------|-------------|-------------|------------------| | Tier 0 (payments, auth) | 99.99% | <100ms | <500ms | 4.3 min | | Tier 1 (core API) | 99.95% | <200ms | <1s | 21.9 min | | Tier 2 (non-critical) | 99.9% | <500ms | <2s | 43.8 min | | Tier 3 (internal tools) | 99.5% | <1s | <5s | 3.6 hours | | Batch / pipeline | 99% (success rate) | N/A | N/A | N/A | ### Error Budget Tracking ```yaml # Weekly error budget review template error_budget_review: week: "2026-W08" service: payment-api slo_target: 99.95% budget: total_minutes_this_period: 21.9 consumed_minutes: 8.2 remaining_minutes: 13.7 remaining_percent: 62.6% incidents_consuming_budget: - date: "2026-02-18" duration_minutes: 5.1 cause: "Database connection pool exhaustion" preventable: true action: "Increase pool size + add saturation alert" - date: "2026-02-20" duration_minutes: 3.1 cause: "Upstream payment provider timeout" preventable: false action: "Add circuit breaker with fallback" velocity_decision: "Normal — 62.6% budget remaining" reliability_work_this_week: - "Add connection pool saturation alert" - "Implement circuit breaker for payment provider" ``` --- ## Phase 5: Alert Design ### Alert Quality Principles 1. **Every alert must be actionable** — if no one needs to act, it's not an alert 2. **Every alert needs a runbook** — linked directly in the alert annotation 3. **Symptom-based over cause-based** — alert on "users can't checkout" not "CPU high" 4. **Multi-window burn rate** — not static thresholds (see SLO alerts above) 5. **Alert on absence, not just presence** — "no orders in 15 min" catches silent failures ### Alert Severity Levels | Severity | Response Time | Channel | Who | Example | |----------|--------------|---------|-----|---------| | **P0 — Critical** | <5 min | Page (PagerDuty/Opsgenie) | On-call engineer | Payment system down | | **P1 — High** | <30 min | Page during business hours, Slack 24/7 | On-call | Error rate >5% for 10 min | | **P2 — Medium** | <4 hours | Slack channel | Team | p99 latency degraded 2x | | **P3 — Low** | Next business day | Ticket auto-created | Team backlog | Disk usage >80% | | **Info** | N/A | Dashboard only | No one | Deploy completed | ### Alerting Anti-Patterns | Anti-Pattern | Problem | Fix | |-------------|---------|-----| | Static CPU/memory thresholds | Noisy, not user-impacting | Use SLO-based burn rate alerts | | Alert per instance | 50 instances = 50 alerts for same issue | Aggregate: alert on service-level error rate | | No deduplication | Same alert fires 100 times | Group by service + alert name; set repeat interval | | Missing runbook | Engineer gets paged, doesn't know what to do | Every alert links to a runbook | | Threshold too sensitive | Fires on brief spikes | Use `for: 5m` to require sustained condition | | Too many P0s | Alert fatigue → ignoring real incidents | Audit monthly; demote or remove noisy alerts | ### Alert Template (Prometheus Alertmanager) ```yaml groups: - name: payment-api-slo rules: - alert: PaymentAPIHighErrorRate expr: | ( sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="payment-api"}[5m])) ) > 0.01 for: 5m labels: severity: critical service: payment-api team: payments annotations: summary: "Payment API error rate {{ $value | humanizePercentage }} (>1%)" description: "5xx error rate has exceeded 1% for 5 minutes" runbook: "https://wiki.internal/runbooks/payment-api-errors" dashboard: "https://grafana.internal/d/payment-api" - alert: PaymentAPINoTraffic expr: | sum(rate(http_requests_total{service="payment-api"}[15m])) == 0 for: 5m labels: severity: critical service: payment-api annotations: summary: "Payment API receiving zero traffic for 5 minutes" runbook: "https://wiki.internal/runbooks/payment-api-no-traffic" - alert: PaymentAPILatencyHigh expr: | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="payment-api"}[5m])) by (le) ) > 2 for: 10m labels: severity: warning annotations: summary: "Payment API p99 latency {{ $value }}s (>2s for 10min)" runbook: "https://wiki.internal/runbooks/payment-api-latency" ``` ### Runbook Template ```markdown # Runbook: PaymentAPIHighErrorRate ## What This Alert Means The payment API is returning >1% 5xx errors over a 5-minute window. Users are likely failing to complete checkouts. ## Impact - Users cannot process payments - Revenue loss: ~$X per minute (based on average traffic) - SLO: Payment API availability (target: 99.95%) ## Immediate Actions 1. Check the error dashboard: [link] 2. Check recent deploys: `kubectl rollout history deployment/payment-api` 3. Check upstream dependencies: - Database: [dashboard link] - Stripe API: [status page] - Redis cache: [dashboard link] 4. Check application logs: ``` kubectl logs -l app=payment-api --since=10m | jq 'select(.level=="error")' ``` ## Common Causes & Fixes | Cause | Diagnosis | Fix | |-------|-----------|-----| | Bad deploy | Errors started at deploy time | `kubectl rollout undo deployment/payment-api` | | DB connection exhaustion | `db_connections_active` at max | Restart pods (rolling) + increase pool size | | Stripe outage | Stripe status page red | Enable fallback payment processor | | Memory leak | Memory climbing, OOMKilled events | Rolling restart + investigate | ## Escalation - If unresolved after 15 min: page payment team lead - If revenue impact >$10K: page VP Engineering - If Stripe outage: communicate to support team for customer messaging ## Resolution - Confirm error rate <0.1% for 10 min - Post in #incidents: root cause + duration + impact - Schedule post-mortem if downtime >5 min ``` --- ## Phase 6: Dashboard Architecture ### Dashboard Hierarchy ``` L1: Executive / Business Dashboard (non-technical stakeholders) ↓ L2: Service Overview Dashboard (on-call, quick triage) ↓ L3: Service Deep-Dive Dashboard (debugging specific service) ↓ L4: Infrastructure Dashboard (resource-level details) ``` ### L1: Business Dashboard ```yaml panels: - title: "Revenue per Minute" type: stat query: "sum(rate(orders_total{status='completed'}[5m])) * avg(order_value_dollars)" - title: "Active Users (5min)" type: stat query: "count(count by (user_id) (http_requests_total{...}[5m]))" - title: "Checkout Success Rate" type: gauge query: "sum(rate(checkout_total{status='success'}[1h])) / sum(rate(checkout_total[1h]))" thresholds: [95, 98, 99.5] - title: "Error Budget Remaining" type: gauge query: "1 - (error_budget_consumed / error_budget_total)" ``` ### L2: Service Overview Dashboard Every service gets one of these with identical layout: ```yaml row_1_traffic: - "Request Rate (rps)" — timeseries, by status code - "Error Rate (%)" — timeseries, threshold line at SLO - "Active Requests" — gauge row_2_latency: - "Latency Distribution" — heatmap - "p50 / p95 / p99" — timeseries, threshold lines - "Latency by Endpoint" — table, sorted by p99 row_3_dependencies: - "Downstream Latency" — timeseries per dependency - "Downstream Error Rate" — timeseries per dependency - "Database Query Duration" — timeseries by query type row_4_resources: - "CPU Usage" — timeseries per pod - "Memory Usage" — timeseries per pod - "Pod Restarts" — stat row_5_business: - "Business Metric 1" — service-specific - "Business Metric 2" — service-specific ``` ### Dashboard Rules 1. **Time range default: last 1 hour** — most debugging happens in recent time 2. **Variable selectors at top**: environment, service, instance 3. **Consistent color coding**: green=good, yellow=degraded, red=bad across all dashboards 4. **Link alerts to dashboards** — every alert annotation includes dashboard URL 5. **No more than 15 panels per dashboard** — split into L3 if needed 6. **Include "as of" timestamp** — so screenshots in incidents are unambiguous 7. **Dashboard as code** — store Grafana JSON in git, provision via API --- ## Phase 7: Incident Response ### Incident Severity Classification | Severity | Criteria | Response | Communication | |----------|----------|----------|---------------| | **SEV-1** | Service down, data loss risk, security breach | All hands, war room | Status page update every 15 min | | **SEV-2** | Degraded service, SLO at risk, partial outage | On-call + backup | Status page update every 30 min | | **SEV-3** | Minor degradation, workaround exists | On-call during hours | Internal Slack update | | **SEV-4** | Cosmetic, low impact | Next sprint | None | ### Incident Roles | Role | Responsibility | Who | |------|---------------|-----| | **Incident Commander (IC)** | Owns the incident. Coordinates. Makes decisions. | On-call lead | | **Technical Lead** | Diagnoses and fixes. Communicates technical status to IC. | Senior engineer | | **Communications Lead** | Updates status page, Slack, stakeholders. | Product/support | | **Scribe** | Documents timeline, actions, decisions in real-time. | Anyone available | ### Incident Response Workflow ``` 1. DETECT - Alert fires → on-call paged - Customer report → support escalates - Internal discovery → engineer reports 2. TRIAGE (first 5 minutes) - Confirm the issue is real (not false alert) - Classify severity (SEV-1 through SEV-4) - Open incident channel: #inc-YYYY-MM-DD-short-description - Assign roles (IC, Tech Lead, Comms) 3. MITIGATE (next 5-30 minutes) - Goal: STOP THE BLEEDING, not find root cause - Options (try in order): a. Rollback last deploy b. Scale up / restart pods c. Toggle feature flag off d. Redirect traffic / enable fallback e. Manual data fix - Document every action with timestamp 4. STABILIZE - Confirm mitigation is working (metrics back to normal) - Monitor for 15-30 min for recurrence - Update status page: "Monitoring fix" 5. RESOLVE - Confirm all metrics healthy for 30+ min - Update status page: "Resolved" - Schedule post-mortem (within 48 hours for SEV-1/2) - Send internal summary to stakeholders ``` ### Incident Channel Template ``` 📋 Incident: Payment API 5xx Errors 🔴 Severity: SEV-2 🕐 Started: 2026-02-22 14:23 UTC 👤 IC: @alice 🔧 Tech Lead: @bob 📢 Comms: @charlie Status: MITIGATING Impact: ~5% of checkout requests failing Customer-facing: Yes Timeline: 14:23 — Alert fired: PaymentAPIHighErrorRate 14:25 — IC assigned: @alice, confirmed real via dashboard 14:28 — Tech Lead: error logs show connection pool exhaustion post-deploy 14:31 — Rolled back deployment v2.3.1 → v2.3.0 14:35 — Error rate dropping, monitoring 14:50 — Error rate <0.1%, marking resolved ``` --- ## Phase 8: Post-Mortem Framework ### Blameless Post-Mortem Template ```yaml post_mortem: title: "Payment API Connection Pool Exhaustion" date: "2026-02-22" severity: SEV-2 duration: 27 minutes (14:23 — 14:50 UTC) authors: ["@alice", "@bob"] reviewers: ["@engineering-leads"] status: action_items_in_progress summary: | A deployment at 14:15 introduced a connection leak in the payment API. Connection pool was exhausted by 14:23, causing 5xx errors for ~5% of checkout requests. Rolled back at 14:31; recovered by 14:50. impact: user_impact: "~340 users saw checkout failures over 27 minutes" revenue_impact: "$2,100 estimated (based on average order value × failed checkouts)" slo_impact: "Consumed 5.1 min of 21.9 min monthly error budget (23%)" data_impact: "No data loss. 12 orders failed; users could retry successfully." timeline: - time: "14:15" event: "Deploy v2.3.1 rolled out (3/3 pods updated)" - time: "14:23" event: "PaymentAPIHighErrorRate alert fired" - time: "14:25" event: "IC assigned, confirmed via dashboard" - time: "14:28" event: "Root cause identified: new ORM query not releasing connections" - time: "14:31" event: "Rollback initiated: v2.3.1 → v2.3.0" - time: "14:35" event: "Error rate declining" - time: "14:50" event: "Resolved: error rate <0.1% sustained" root_cause: | The v2.3.1 deploy introduced a new database query in the order validation path. The query used a raw connection instead of the pool's managed client, so connections were acquired but never released. Under load, the pool exhausted within 8 minutes. contributing_factors: - "No integration test for connection pool behavior under load" - "Connection pool saturation metric existed but had no alert" - "Code review didn't catch raw connection usage" what_went_well: - "Alert fired within 8 minutes of deploy" - "IC assigned in 2 minutes" - "Root cause identified in 3 minutes (clear in logs)" - "Rollback executed cleanly" what_went_wrong: - "8-minute detection gap after deploy" - "No canary deployment to catch before full rollout" - "Connection pool saturation had no alert" action_items: - action: "Add connection pool saturation alert (>80% for 2 min)" owner: "@bob" priority: P1 due: "2026-02-25" status: in_progress ticket: "ENG-1234" - action: "Enable canary deployments for payment-api" owner: "@alice" priority: P1 due: "2026-03-01" ticket: "ENG-1235" - action: "Add linting rule: no raw DB connections in application code" owner: "@charlie" priority: P2 due: "2026-03-07" ticket: "ENG-1236" - action: "Load test payment-api connection pool in staging" owner: "@bob" priority: P2 due: "2026-03-07" ticket: "ENG-1237" lessons_learned: - "Resource saturation metrics need alerts, not just dashboards" - "Canary deployments are mandatory for Tier 0 services" - "ORM abstractions don't guarantee connection safety — review raw queries" ``` ### Post-Mortem Meeting Agenda (60 minutes) ``` 1. (5 min) Context setting — IC reads the summary 2. (15 min) Timeline walkthrough — what happened, when, by whom 3. (15 min) Root cause deep-dive — 5 Whys exercise 4. (5 min) What went well — celebrate good response 5. (15 min) Action items — assign owners, priorities, due dates 6. (5 min) Wrap-up — review date for action item check-in ``` ### 5 Whys Exercise ``` Problem: 5xx errors in payment API Why 1: Database connections were exhausted Why 2: A new query acquired connections without releasing them Why 3: The query used a raw connection instead of the pool manager Why 4: The ORM's raw query API doesn't auto-release (by design) Why 5: We don't have a linting rule or code review checklist item for this Root cause: Missing guard against raw connection usage in application code Systemic fix: Linting rule + connection pool saturation alerting ``` --- ## Phase 9: On-Call Operations ### On-Call Structure ```yaml on_call: rotation: weekly handoff_day: Monday 10:00 UTC primary: response_time: 5 minutes (SEV-1/2), 30 minutes (SEV-3) escalation_after: 15 minutes no-ack secondary: response_time: 15 minutes (SEV-1), 1 hour (SEV-2/3) escalation_after: 30 minutes no-ack manager_escalation: trigger: SEV-1 unresolved after 30 minutes handoff_checklist: - Review open incidents and active alerts - Check error budget status for all services - Read post-mortems from previous week - Verify PagerDuty schedule and contact info - Test alert routing (send test page) ``` ### On-Call Health Metrics | Metric | Healthy | Needs Attention | Unhealthy | |--------|---------|-----------------|-----------| | Pages per week | <5 | 5-15 | >15 | | After-hours pages per week | <2 | 2-5 | >5 | | False positive rate | <10% | 10-30% | >30% | | Mean time to acknowledge | <5 min | 5-15 min | >15 min | | Mean time to resolve | <30 min | 30-120 min | >120 min | | Toil ratio (manual vs automated) | <30% | 30-60% | >60% | ### Weekly On-Call Review Template ```yaml on_call_review: week: "2026-W08" engineer: "@bob" incidents: total: 7 sev_1: 0 sev_2: 1 sev_3: 4 false_positives: 2 after_hours: 3 time_spent: incident_response: "4.5 hours" toil_automation: "2 hours" runbook_updates: "1 hour" improvements_made: - "Silenced noisy disk alert on dev servers" - "Added auto-remediation for pod restart threshold" improvements_needed: - "Cache expiry alert fires every Tuesday at 03:00 — needs investigation" - "Payment retry logic needs circuit breaker (caused 3 alerts)" handoff_notes: | Watch payment-api p99 latency — it's been creeping up since Wednesday. Stripe changed their sandbox endpoints; staging may throw errors. ``` --- ## Phase 10: Chaos Engineering & Reliability Testing ### Chaos Principles 1. Start with a hypothesis: "If X fails, the system should Y" 2. Run in production (start small — one instance, one AZ) 3. Minimize blast radius with automatic rollback 4. Build confidence incrementally: staging → canary → production ### Chaos Experiment Template ```yaml chaos_experiment: name: "Payment DB failover" hypothesis: "If the primary database becomes unavailable, traffic should failover to the replica within 30 seconds with <1% error rate spike" steady_state: - metric: "checkout_success_rate" expected: ">99.5%" - metric: "db_query_duration_p99" expected: "<200ms" injection: type: "network_partition" target: "payment-db-primary" duration: "5 minutes" blast_radius: "single AZ" abort_conditions: - "checkout_success_rate < 95% for > 60 seconds" - "revenue_per_minute drops > 50%" - "any SEV-1 incident declared" results: failover_time: "22 seconds" error_spike: "0.3% for 25 seconds" hypothesis_confirmed: true follow_up_actions: - "Document failover behavior in runbook" - "Add failover time as SLI (target: <30s)" ``` ### Chaos Engineering Maturity Levels | Level | What You Test | Tools | |-------|--------------|-------| | 1: Manual | Kill a pod, see what happens | `kubectl delete pod` | | 2: Automated | Scheduled pod kills, network delays | Chaos Monkey, Litmus | | 3: Game Days | Multi-failure scenarios with team exercise | Custom scripts + coordination | | 4: Continuous | Automated chaos in production with auto-rollback | Gremlin, Chaos Mesh | --- ## Phase 11: Observability Cost Optimization ### Cost Drivers (Ranked) | # | Driver | Typical % of Bill | Optimization | |---|--------|-------------------|-------------| | 1 | Log volume | 40-60% | Reduce verbosity, drop DEBUG, sample repetitive | | 2 | Metric cardinality | 15-25% | Drop unused metrics, limit labels | | 3 | Trace volume | 10-20% | Sampling, tail-based sampling | | 4 | Retention | 10-15% | Tiered storage (hot → warm → cold) | | 5 | Query cost | 5-10% | Optimize dashboard queries, set max scan limits | ### Cost Reduction Checklist ```yaml cost_optimization: logs: - action: "Drop DEBUG/TRACE in production" savings: "30-50% of log volume" - action: "Sample health check logs (1:100)" savings: "5-15% of log volume" - action: "Deduplicate identical error bursts" savings: "10-20% during incidents" - action: "Move logs older than 7 days to S3/cold storage" savings: "60-80% of storage cost" - action: "Drop request/response body logging" savings: "20-40% of log volume" metrics: - action: "Audit unused metrics (no dashboard, no alert)" savings: "10-30% of series" - action: "Reduce histogram bucket count (default 11 → 8)" savings: "~27% of histogram series" - action: "Remove high-cardinality labels" savings: "Variable — can be massive" - action: "Increase scrape interval for non-critical metrics (15s → 60s)" savings: "75% of data points for those metrics" traces: - action: "Implement tail-based sampling" savings: "80-95% of trace volume" - action: "Drop internal health check traces" savings: "5-20% of trace volume" - action: "Reduce span attribute size (truncate long strings)" savings: "10-30% of trace storage" general: - action: "Review and right-size retention policies quarterly" - action: "Set query timeouts and result limits on dashboards" - action: "Use recording rules for expensive queries" ``` ### Monthly Cost Review Template ```yaml observability_cost_review: month: "February 2026" total_cost: "$X,XXX" breakdown: logs: { volume: "X TB", cost: "$X", pct: "X%" } metrics: { series: "X million", cost: "$X", pct: "X%" } traces: { volume: "X TB", cost: "$X", pct: "X%" } infrastructure: { instances: X, cost: "$X", pct: "X%" } cost_per: request: "$0.000X" service: "$X average" engineer: "$X per engineer" optimizations_applied: [] optimizations_planned: [] budget_status: "on_track | over_budget | under_budget" ``` --- ## Phase 12: Advanced Patterns ### Correlation: Connecting the Three Pillars ``` Every log line includes: trace_id, span_id Every trace span includes: service, operation Every metric includes: service label Correlation paths: Alert fires (metric) → Click → Dashboard (metric) → Filter by time window → Trace search (same service + time) → Find failing trace → Logs (filter by trace_id) → See exact error Support ticket (user report) → Find request_id in logs → Extract trace_id → View full trace → Identify slow span → Check span's service metrics → Confirm pattern ``` ### Synthetic Monitoring ```yaml synthetic_checks: - name: "Checkout flow" type: browser frequency: 5m locations: [us-east, eu-west, ap-southeast] steps: - navigate: "https://app.example.com/products" - click: "Add to Cart" - click: "Checkout" - assert: "Order confirmation page loads in <3s" alert_on: "2 consecutive failures from same location" - name: "API health" type: api frequency: 1m endpoints: - url: "https://api.example.com/health" expected_status: 200 max_latency_ms: 500 - url: "https://api.example.com/v1/products?limit=1" expected_status: 200 max_latency_ms: 1000 ``` ### Feature Flag Observability ```yaml # Correlate feature flags with metrics feature_flag_monitoring: - flag: "new_checkout_flow" metrics_to_compare: - "checkout_conversion_rate" # by flag variant - "checkout_error_rate" - "checkout_latency_p99" alerts: - "If error rate for new variant > 2x control, auto-disable flag" ``` ### Observability Maturity Model | Dimension | Level 1 | Level 2 | Level 3 | Level 4 | |-----------|---------|---------|---------|---------| | Logging | Unstructured logs | Structured JSON, centralized | Correlated with traces | Automated log analysis | | Metrics | Basic infra metrics | RED/USE for services | SLO-based with error budgets | Predictive (anomaly detection) | | Tracing | No tracing | Key services instrumented | Full distributed tracing | Trace-driven testing | | Alerting | Static thresholds | Multi-signal alerts | Burn-rate based on SLOs | Auto-remediation | | Incident Response | Ad hoc | Defined process + roles | Post-mortems with action tracking | Chaos engineering in prod | | Culture | "Ops team handles it" | Shared ownership (you build it, you run it) | SLO-driven development velocity | Reliability as a feature | --- ## Quality Scoring Rubric (0-100) | Dimension | Weight | 0 | 5 | 10 | |-----------|--------|---|---|-----| | Logging quality | 15% | Unstructured, no correlation | Structured JSON, missing fields | Full schema, trace correlation, PII scrubbing | | Metrics coverage | 15% | No metrics | RED or USE, not both | RED + USE + business metrics + custom | | Tracing completeness | 10% | No tracing | Key services | Full path, sampling strategy, tail-based | | SLO maturity | 15% | No reliability targets | Informal targets | SLOs with error budgets, burn-rate alerts, weekly review | | Alert quality | 15% | Noisy/missing | Actionable, some runbooks | SLO-based, full runbooks, low false positive | | Incident response | 10% | Ad hoc | Defined process | Full process, roles, post-mortems, chaos engineering | | Dashboard design | 10% | No dashboards | Basic panels | Hierarchical L1-L4, consistent, linked to alerts | | Cost efficiency | 10% | Unknown cost | Tracked | Optimized, reviewed monthly, within budget | **90-100:** World-class. Teach others. **70-89:** Production-ready. Fill specific gaps. **50-69:** Functional but fragile. **<50:** Significant reliability risk. --- ## 10 Observability Commandments 1. **Structured or it didn't happen** — unstructured logs are technical debt 2. **Correlate everything** — trace_id connects logs, traces, and metrics 3. **Alert on symptoms, not causes** — users don't care about CPU, they care about latency 4. **Every alert gets a runbook** — no runbook = no alert 5. **SLOs drive velocity** — error budgets decide when to ship vs stabilize 6. **Dashboards have hierarchy** — executives don't need pod CPU graphs 7. **Blameless post-mortems always** — blame prevents learning 8. **Cost is a feature** — observability that bankrupts you isn't observability 9. **You build it, you run it** — the team that ships code owns its observability 10. **Practice failure** — chaos engineering builds confidence --- ## 12 Natural Language Commands | Command | What It Does | |---------|-------------| | "Audit our observability" | Run the /16 health check, score each dimension, prioritize gaps | | "Design logging for [service]" | Generate structured log schema with context fields for the service | | "Set up metrics for [service]" | Create RED + USE + business metric instrumentation plan | | "Create SLOs for [service]" | Define SLIs, targets, error budgets, and burn-rate alert rules | | "Design alerts for [service]" | Create alert rules with severity, thresholds, and runbook templates | | "Build dashboard for [service]" | Design L2 service overview dashboard with panel specifications | | "Write a runbook for [alert]" | Generate structured runbook with diagnosis steps and fixes | | "Run post-mortem for [incident]" | Generate blameless post-mortem document with timeline and action items | | "Set up on-call for [team]" | Design rotation, escalation policy, handoff checklist | | "Plan chaos experiment for [scenario]" | Design experiment with hypothesis, injection, abort conditions | | "Optimize observability costs" | Audit current spend, identify top savings, create reduction plan | | "Design tracing for [system]" | Create OpenTelemetry instrumentation plan with sampling strategy | --- ## ⚡ Level Up Your Observability This skill gives you the methodology. For industry-specific implementation patterns: - **SaaS companies:** [AfrexAI SaaS Context Pack ($47)](https://afrexai-cto.github.io/context-packs/) — includes SaaS-specific SLOs, multi-tenant monitoring, and usage-based billing observability - **Fintech:** [AfrexAI Fintech Context Pack ($47)](https://afrexai-cto.github.io/context-packs/) — compliance audit logging, transaction monitoring, fraud detection signals - **Healthcare:** [AfrexAI Healthcare Context Pack ($47)](https://afrexai-cto.github.io/context-packs/) — HIPAA audit trails, PHI access logging, uptime requirements ### 🔗 More Free Skills by AfrexAI - `afrexai-devops-engine` — CI/CD, infrastructure, deployment strategies - `afrexai-api-architect` — API design, security, versioning - `afrexai-database-engineering` — Schema design, query optimization, migrations - `afrexai-code-reviewer` — Code review methodology with SPEAR framework - `afrexai-prompt-engineering` — System prompt design, testing, optimization **Browse all AfrexAI skills:** [clawhub.com](https://clawhub.com) | [Full storefront](https://afrexai-cto.github.io/context-packs/)