---
name: observability-patterns
description: Use when implementing observability strategy, correlating signals, or designing monitoring systems. Covers the three pillars (logs, metrics, traces) and their integration.
allowed-tools: Read, Glob, Grep
---

# Observability Patterns

Patterns for implementing comprehensive observability including logs, metrics, traces, and their correlation.

## When to Use This Skill

- Designing observability strategy
- Implementing the three pillars
- Correlating signals across systems
- Choosing observability tools
- Building monitoring dashboards

## What is Observability?

```text
Observability = Ability to understand internal state from external outputs

Not just monitoring (known-unknowns)
But understanding (unknown-unknowns)

Traditional monitoring: "Is CPU > 80%?"
Observability: "Why are users experiencing latency?"
```

## The Three Pillars

### Overview

```text
┌─────────────────────────────────────────────────────────┐
│                      OBSERVABILITY                      │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │   LOGS   │    │ METRICS  │    │  TRACES  │           │
│  │          │    │          │    │          │           │
│  │ Events   │    │ Counters │    │ Requests │           │
│  │ Details  │    │ Gauges   │    │ Spans    │           │
│  │ Context  │    │ Trends   │    │ Flow     │           │
│  └──────────┘    └──────────┘    └──────────┘           │
│       │               │               │                 │
│       └───────────────┼───────────────┘                 │
│                       │                                 │
│              ┌────────┴────────┐                        │
│              │   CORRELATION   │                        │
│              │   (trace_id)    │                        │
│              └─────────────────┘                        │
└─────────────────────────────────────────────────────────┘

Each pillar answers different questions:
- Logs: What happened? (events)
- Metrics: How much/many? (aggregates)
- Traces: Where? (request flow)
```

### Logs

```text
Purpose: Discrete events with context

Structure:
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Payment failed",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "12345",
  "order_id": "ORD-789",
  "error": {
    "code": "CARD_DECLINED",
    "message": "Insufficient funds"
  }
}

Best for:
- Debugging specific issues
- Audit trails
- Error details
- Business events

Challenges:
- High volume → storage costs
- Unstructured → hard to query
- No aggregation → not for trends
```

### Metrics

```text
Purpose: Numeric measurements over time

Types:
┌─────────────────────────────────────────────────────────┐
│ Counter: Cumulative, only increases                     │
│   - http_requests_total                                 │
│   - errors_total                                        │
│   - bytes_transferred                                   │
├─────────────────────────────────────────────────────────┤
│ Gauge: Point-in-time value, can go up/down              │
│   - current_connections                                 │
│   - queue_depth                                         │
│   - temperature                                         │
├─────────────────────────────────────────────────────────┤
│ Histogram: Distribution of values                       │
│   - request_duration_seconds                            │
│   - response_size_bytes                                 │
│   Provides: count, sum, buckets                         │
├─────────────────────────────────────────────────────────┤
│ Summary: Similar to histogram, calculates quantiles     │
│   - request_latency_seconds (p50, p90, p99)             │
└─────────────────────────────────────────────────────────┘

Best for:
- Trends and patterns
- Alerting on thresholds
- Dashboards
- Capacity planning

Challenges:
- No event details
- Cardinality limits
- Not request-level
```

### Traces

```text
Purpose: Request flow across services

Structure:
Trace (end-to-end request)
├── Span (API Gateway) - 200ms
│   ├── Span (Auth) - 20ms
│   └── Span (OrderService) - 150ms
│       ├── Span (Database) - 50ms
│       └── Span (PaymentService) - 80ms
│           └── Span (External API) - 60ms

Best for:
- Understanding request flow
- Finding bottlenecks
- Debugging distributed issues
- Service dependencies

Challenges:
- Storage intensive
- Requires sampling
- Complex to implement
```

## Signal Correlation

### Why Correlate?

```text
Without correlation:
- Metrics: "Error rate is high"
- Logs: "Error logs from somewhere"
- Traces: "Some traces show errors"
→ Hard to connect the dots

With correlation:
- Metrics: "Error rate spike at 10:30"
  └── Click to see: Exemplar trace
      └── Click to see: Related logs
→ Full picture in seconds
```

### Correlation Methods

```text
1. Trace ID injection:
   Logs and traces carry trace_id; metrics reference it via exemplars
   Log:    {"trace_id": "abc123", "message": "..."}
   Metric: http_requests_total, exemplar trace_id=abc123
           (not a label: trace_id has unbounded cardinality)
   Trace:  TraceID = abc123

2. Exemplars:
   Metrics point to sample traces
   request_latency = 2.5s
   └── exemplar: trace_id=abc123
   → "Show me a slow request"

3. Time correlation:
   Align signals by timestamp
   Metric spike at 10:30
   → Query logs around 10:30
   → Query traces around 10:30
```

### Unified Query Example

```text
Investigation flow:

1. Dashboard shows latency spike
   http_request_duration_p99 = 3s

2. Click on spike → exemplar trace
   trace_id: abc123

3. View trace → slow database span
   db.query: SELECT * FROM orders... (2.5s)

4. Query logs with trace_id
   {"trace_id":"abc123","query":"SELECT...","rows":50000}

5. Root cause identified
   Missing index causing full table scan
```

## OpenTelemetry Unified Approach

```text
OpenTelemetry provides a unified API for all signals:

                   Application Code
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                  OpenTelemetry SDK                  │
│        ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│        │ Tracer  │  │  Meter  │  │ Logger  │        │
│        │Provider │  │Provider │  │Provider │        │
│        └────┬────┘  └────┬────┘  └────┬────┘        │
│             │            │            │             │
│             └────────────┼────────────┘             │
│                          │                          │
│                  ┌───────┴───────┐                  │
│                  │   Exporters   │                  │
│                  └───────────────┘                  │
└─────────────────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      ┌─────────┐    ┌──────────┐     ┌─────────┐
      │  Tempo  │    │Prometheus│     │  Loki   │
      │(Traces) │    │(Metrics) │     │ (Logs)  │
      └─────────┘    └──────────┘     └─────────┘
```

## Logging Patterns

### Structured Logging

```text
Unstructured (bad):
"User 12345 failed to login: invalid password"

Structured (good):
{
  "event": "login_failed",
  "user_id": "12345",
  "reason": "invalid_password",
  "timestamp": "2024-01-15T10:30:00Z",
  "trace_id": "abc123"
}

Benefits:
- Queryable: user_id:12345 AND event:login_failed
- Parseable: Automated analysis
- Correlatable: trace_id links to traces
```

### Log Levels

```text
Level     | When to use
----------|------------------------------------------
TRACE     | Very detailed, development only
DEBUG     | Development, verbose
INFO      | Normal operations, audit events
WARN      | Degraded, recoverable issues
ERROR     | Failures requiring attention
FATAL     | Application cannot continue

Production typically: INFO and above
Debug mode: DEBUG and above
```

### Log Aggregation Architecture

```text
┌─────────────────────────────────────────────────────────┐
│                    Application Pods                     │
│  ┌──────┐  ┌──────┐  ┌──────┐                           │
│  │ App  │  │ App  │  │ App  │  → stdout/stderr          │
│  └──────┘  └──────┘  └──────┘                           │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│        Log Collector (Fluentd/Vector/Fluent Bit)        │
│  - Parse logs                                           │
│  - Add metadata (pod, namespace, etc.)                  │
│  - Transform/filter                                     │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│         Storage (Elasticsearch/Loki/CloudWatch)         │
│  - Index for search                                     │
│  - Retention policies                                   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│            Query Interface (Kibana/Grafana)             │
│  - Search and filter                                    │
│  - Dashboards                                           │
└─────────────────────────────────────────────────────────┘
```

## Metrics Patterns

### Naming Conventions

```text
Format: [namespace]_[subsystem]_[name]_[unit]

Examples:
http_requests_total
http_request_duration_seconds
http_response_size_bytes
process_cpu_seconds_total
db_connections_current

Guidelines:
- Use snake_case
- Include unit suffix (_seconds, _bytes); counters end in _total
- Use base units (seconds not milliseconds)
- Be consistent across services
```

### Labels/Dimensions

```text
Metrics with labels:
http_requests_total{
  method="GET",
  path="/api/users",
  status="200"
}

Cardinality warning:
http_requests_total{user_id="..."}  // BAD: High cardinality

Keep labels low cardinality:
- status: ~5 values (200, 4xx, 5xx...)
- method: ~10 values
- service: ~100 values
- user_id: millions → TOO MANY
```

### RED Method

```text
For request-based services:

R - Rate: Requests per second
    http_requests_total

E - Errors: Failed requests per second
    http_requests_total{status=~"5.."}

D - Duration: Latency distribution
    http_request_duration_seconds
```

### USE Method

```text
For resources (CPU, memory, disk):

U - Utilization: % of resource used
    cpu_usage_percent

S - Saturation: Queued work
    thread_pool_queued_tasks

E - Errors: Error count
    disk_errors_total
```

## Dashboards and Alerts

### Dashboard Design

```text
Dashboard hierarchy:

1. Overview (executive level)
   - Key SLOs
   - Error rates
   - Traffic trends

2. Service dashboards
   - RED metrics
   - Dependencies
   - Resource usage

3. Debug dashboards
   - Detailed metrics
   - Component breakdown
   - Query performance
```

### Alert Design

```text
Good alerts:
- Actionable: Someone can do something
- Meaningful: Reflects user impact
- Urgent: Needs attention now

Bad alerts:
- CPU > 80% (maybe fine)
- Disk > 90% (too late?)
- Any single error (noise)

Better approach: SLO-based alerting
- "Error budget burning too fast"
- Directly tied to user impact
```

## Tool Selection

### Open Source Stack

```text
Metrics: Prometheus + Grafana
Logs: Loki + Grafana
Traces: Jaeger/Tempo + Grafana

Alternative:
Metrics: VictoriaMetrics + Grafana
Logs: Elasticsearch + Kibana
Traces: Zipkin
```

### Cloud Native

```text
AWS:
- CloudWatch (metrics, logs)
- X-Ray (traces)

GCP:
- Cloud Monitoring (metrics)
- Cloud Logging (logs)
- Cloud Trace (traces)

Azure:
- Azure Monitor (metrics, logs)
- Application Insights (traces)
```

### Commercial Platforms

```text
Full stack:
- Datadog
- New Relic
- Dynatrace
- Splunk

Benefits: Unified, managed, feature-rich
Costs: Price, vendor lock-in
```

## Best Practices

```text
1. Structured logging from day one
   Don't retrofit later

2. Consistent trace context
   Propagate trace_id everywhere

3. Metric cardinality awareness
   Monitor and limit label values

4. Correlation by default
   trace_id in logs, exemplars in metrics

5. Alert on symptoms, not causes
   "Users affected" not "CPU high"

6. Regular observability review
   Are we seeing what we need?
```

## Related Skills

- `distributed-tracing` - Deep dive on traces
- `slo-sli-error-budget` - SLO-based observability
- `incident-response` - Using observability in incidents
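
As a sketch of the structured logging and trace ID injection patterns described under Logging Patterns and Signal Correlation, the following uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` and `emit_example` helpers, and the chosen set of context fields are illustrative assumptions, not part of any particular library. In a real service the `trace_id` would normally come from the active trace context rather than being passed by hand.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical correlation fields; adjust to your own log schema.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "order_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # "Production typically: INFO and above"
    logger.propagate = False
    return logger


def emit_example() -> dict:
    """Log one error with correlation IDs and return the parsed JSON entry."""
    buffer = io.StringIO()
    logger = build_logger("order-service", buffer)
    logger.error(
        "Payment failed",
        extra={"trace_id": "abc123", "span_id": "def456", "order_id": "ORD-789"},
    )
    return json.loads(buffer.getvalue())
```

Calling `emit_example()` returns a dictionary whose `trace_id` can be used to look up the matching trace, which is the join key the correlation section relies on.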
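
To make the cardinality warning from the Labels/Dimensions section concrete: the worst-case number of time series for one metric name is the product of the distinct values of each label. A minimal sketch, using the approximate counts given in that section; the helper name `series_count` is hypothetical, and "millions" of user IDs is taken as 1,000,000 purely for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series: product of distinct values per label."""
    return prod(label_values.values())


# Approximate distinct-value counts from the label guidance.
low_cardinality = {"status": 5, "method": 10, "service": 100}

# Assumption for illustration: "millions" of users taken as 1,000,000.
with_user_id = {**low_cardinality, "user_id": 1_000_000}

print(series_count(low_cardinality))  # 5000
print(series_count(with_user_id))     # 5000000000
```

Adding a single unbounded label multiplies the series count by its number of values, which is why identifiers such as `user_id` or `trace_id` belong in logs, traces, and exemplars rather than in metric labels.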