---
name: observability-sre
description: Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.
---

# Observability & Site Reliability Engineering

## Core Principles

- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics

---

## Hard Rules (Must Follow)

> These rules are mandatory. Violating them means the skill is not working correctly.

### Symptom-Based Alerts Only

**Alert on user-facing symptoms, not internal infrastructure metrics.**

```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

### Low Cardinality Labels

**Keep Loki/Prometheus label sets small: fewer than ~10 labels, each with a bounded set of values.**

```yaml
# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"     # Millions of values!
  order_id: "ord_456"    # Millions of values!
  request_id: "req_789"  # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values
```

```typescript
// High cardinality data goes in the log body, not in labels:
logger.info({
  user_id: "usr_123",   // In JSON body, not label
  order_id: "ord_456",
}, "Order processed");
```
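Loki can enforce this rule at ingest time rather than relying on review discipline. A minimal guardrail sketch, assuming a recent Loki; limit names and defaults vary across versions, so verify against your deployment:

```yaml
# loki.yaml (excerpt) — reject streams that violate the cardinality rule
limits_config:
  max_label_names_per_series: 10     # hard cap on labels per stream
  max_global_streams_per_user: 5000  # cap total active streams per tenant
  reject_old_samples: true           # refuse backfill that resurrects old streams
  reject_old_samples_max_age: 168h   # 7 days
```

With limits in place, a deploy that starts labeling by `user_id` fails fast at the ingester instead of silently exploding the index.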
### SLO-Based Error Budgets

**Every service must have defined SLOs with error budget tracking.**

```yaml
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```

### Trace Context in Logs

**All logs must include trace_id for correlation with distributed traces.**

```typescript
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```

---

## Quick Reference

### When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, typically ~10x cheaper to run |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

### The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| **Metrics** | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| **Logs** | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| **Traces** | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

**Fourth Pillar (Emerging):** Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

---

## Observability Architecture

### Layered Prometheus Setup

```yaml
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```
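Federation with `honor_labels` only works if each cluster-level Prometheus stamps its series with identifying external labels; otherwise series from different clusters collide at the global layer. A minimal sketch (label names are illustrative):

```yaml
# Cluster-level Prometheus (Layer 2) config
global:
  external_labels:
    cluster: us-east         # which cluster these series came from
    environment: production
```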
### Recording Rules for Performance

```yaml
# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```

### Resource Optimization

```yaml
# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Default is 15s; doubling it roughly halves scrape load

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit local retention. Note: retention is set via command-line flags,
# not in prometheus.yml:
#   --storage.tsdb.retention.time=15d   # Keep only 15 days locally
#   --storage.tsdb.retention.size=50GB  # Or max 50GB
```

---

## Distributed Tracing with OpenTelemetry

### Auto-Instrumentation Setup

```typescript
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

### Manual Instrumentation for Business Logic

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

### Sampling Strategies

```yaml
# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
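A sampler only takes effect once the processor is wired into a traces pipeline. A minimal collector sketch, assuming an OTLP-capable backend (the `tempo:4317` endpoint is illustrative). Note that tail sampling needs all spans of a trace to reach the same collector instance, so put a trace-aware load balancer in front when running multiple replicas:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317  # hypothetical backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]  # defined in the processors section above
      exporters: [otlp]
```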
### Context Propagation

```typescript
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<parent-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
```

---

## Structured Logging Best Practices

### JSON Logging Format

```typescript
// Use structured logging library
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
```

### Log Levels

```typescript
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');    // Very verbose
logger.debug({ state }, 'Debug information');        // Development
logger.info({ event }, 'Normal operation');          // Production default
logger.warn({ issue }, 'Warning condition');         // Potential issues
logger.error({ error, context }, 'Error occurred');  // Errors
logger.fatal({ critical }, 'Fatal error');           // Process crash
```
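Levels are also a shipping decision: logs that are useful on a developer laptop usually should not reach Loki in production. A hedged sketch using Promtail's drop stage (stage options vary by Promtail version, so verify before relying on it):

```yaml
pipeline_stages:
  - json:
      expressions:
        level: level
  - drop:
      source: level
      expression: "(trace|debug)"  # ship info and above only
```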
### Grafana Loki Configuration

```yaml
# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Promote ONLY low-cardinality fields to labels.
      # trace_id is unique per request and must stay in the log body,
      # never become a label.
      - labels:
          level:
```

### Loki Best Practices

- **Low Cardinality Labels** — Use only 5-10 labels (namespace, app, level)
- **High Cardinality in Log Body** — Put user_id, order_id in JSON, not labels
- **LogQL for Filtering** — Use `{app="api"} | json | user_id="123"`
- **Retention Policy** — Keep recent logs longer, compress old logs

```logql
# LogQL query examples
{namespace="production", app="api"} |= "error"                   # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}"      # JSON parsing
rate({app="api"}[5m])                                            # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h]))   # Count by level
```

---

## SLO/SLI/SLA Management

### Definitions

- **SLI (Service Level Indicator)** — Quantifiable measurement of service behavior
  - Examples: Request latency, error rate, availability, throughput
- **SLO (Service Level Objective)** — Target value/range for an SLI
  - Examples: 99.9% availability, P95 latency < 200ms
- **SLA (Service Level Agreement)** — Formal commitment with consequences
  - Examples: "99.9% uptime or 10% credit"

### The Four Golden Signals

```yaml
# Google SRE's key metrics for any service
1. Latency
   SLI: P95 request latency
   SLO: 95% of requests complete in < 200ms

2. Traffic
   SLI: Requests per second
   SLO: Handle 10,000 req/s peak load

3. Errors
   SLI: Error rate (5xx / total)
   SLO: < 0.1% error rate

4. Saturation
   SLI: Resource utilization (CPU, memory, disk)
   SLO: CPU < 70%, Memory < 80%
```

### Error Budget

```python
# Error budget = 1 - SLO
slo = 0.999                                      # "three nines"
error_budget = 1 - slo                           # 0.001 = 0.1%

# Monthly calculation (30 days)
total_minutes = 30 * 24 * 60                     # 43,200 minutes
allowed_downtime = total_minutes * error_budget  # 43.2 minutes

# If you've had 20 minutes of downtime this month:
budget_remaining = allowed_downtime - 20         # 23.2 minutes
budget_consumed = 20 / allowed_downtime          # 0.463 -> 46.3%

# Policy: if budget is > 90% consumed, freeze deployments
```

### SLO Implementation with Prometheus

```yaml
# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  # Alerting on SLO burn rate
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            slo:api_availability:ratio < 0.999  # Below 99.9% SLO
            and
            slo:api_availability:30d > 0.999    # But 30-day average still OK
          )
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
          description: "Current availability {{ $value }} is below SLO on {{ $labels.service }}"
```
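The single-threshold alert above pages as soon as availability dips below target, even for a momentary blip. The Google SRE Workbook's multiwindow, multi-burn-rate pattern pages only when the budget is burning fast over both a long and a short window. A sketch, assuming error-ratio recording rules at 5m and 1h windows (the rule names are illustrative):

```yaml
groups:
  - name: slo_burn_rate
    rules:
      - alert: SLOFastBurn
        # A 14.4x burn rate consumes ~2% of a 30-day budget in one hour
        expr: |
          slo:api_errors:ratio_rate1h > (14.4 * 0.001)
          and
          slo:api_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Error budget burning 14x too fast"
```

The short window keeps the alert from firing on stale data after recovery; the long window keeps a brief spike from paging anyone.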
{{ $labels.service }}" ``` --- ## Incident Response ### Incident Severity Levels | Level | Impact | Response Time | Examples | |-------|--------|---------------|----------| | **SEV-1** | Service down or major degradation | < 15 min | Complete outage, data loss, security breach | | **SEV-2** | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates | | **SEV-3** | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance | | **SEV-4** | Cosmetic, no user impact | Next business day | UI glitches, logging errors | ### Incident Response Roles (IMAG Framework) ```yaml Incident Commander (IC): - Overall coordination and decision-making - Declares incident start/end - Decides on escalations - Owns communication to leadership Operations Lead (OL): - Technical investigation and mitigation - Coordinates engineers - Implements fixes - Reports status to IC Communications Lead (CL): - Internal/external status updates - Customer communication - Stakeholder notifications - Status page updates ``` ### Incident Workflow ``` 1. Detection (Alert fires or user reports) ↓ 2. Triage (Assess severity, assign IC) ↓ 3. Response (Assemble team, create war room) ↓ 4. Mitigation (Stop the bleeding, restore service) ↓ 5. Resolution (Fix root cause) ↓ 6. Postmortem (Blameless review, action items) ↓ 7. Follow-up (Implement improvements) ``` ### On-Call Best Practices - **Rotation** — 1-week shifts, balanced across timezones - **Escalation** — Primary → Secondary → Manager (15 min each) - **Playbooks** — Step-by-step debugging guides for common issues - **Runbooks** — Automated remediation scripts - **Handoff** — 15-min sync at rotation change - **Compensation** — On-call pay or comp time - **Health** — No more than 2 incidents/night target ### Alert Fatigue Prevention ```yaml # Symptoms vs Causes alerting # Alert on WHAT users experience, not WHY it's broken # GOOD: Symptom-based alert - alert: APILatencyHigh expr: slo:api_latency:p95 > 0.200 # User-facing metric annotations: summary: "API is slow for users" # BAD: Cause-based alert - alert: CPUHigh expr: cpu_usage > 70% # Internal metric, might not impact users # Don't alert unless this affects SLOs # Use SLO-based alerting # Alert when error budget burn rate is too high ``` --- ## Blameless Postmortems ### Core Principles - **Assume Good Intentions** — Everyone did their best with available information - **Focus on Systems** — Identify gaps in process/tooling, not people - **Psychological Safety** — No punishment for honest mistakes - **Learning Culture** — Incidents are opportunities to improve - **Separate from Performance Reviews** — Postmortem participation never affects evaluations ### Postmortem Template ```markdown # Incident Postmortem: [Title] **Date:** 2025-01-15 **Duration:** 10:30 - 12:15 UTC (1h 45m) **Severity:** SEV-2 **Incident Commander:** Jane Doe **Responders:** John Smith, Alice Johnson ## Impact - 15,000 users affected - 12% error rate on payment processing - $5,000 estimated revenue impact - No data loss ## Timeline (UTC) - 10:30 - Alert: Payment error rate > 5% - 10:32 - IC assigned, war room created - 10:45 - Identified: Database connection pool exhausted - 11:00 - Mitigation: Increased pool size from 50 → 100 - 11:15 - Error rate back to normal - 12:15 - Incident closed after monitoring ## Root Cause Database connection pool configured for average load, not peak traffic. Black Friday traffic spike (3x normal) exhausted connections. 
---

## Blameless Postmortems

### Core Principles

- **Assume Good Intentions** — Everyone did their best with available information
- **Focus on Systems** — Identify gaps in process/tooling, not people
- **Psychological Safety** — No punishment for honest mistakes
- **Learning Culture** — Incidents are opportunities to improve
- **Separate from Performance Reviews** — Postmortem participation never affects evaluations

### Postmortem Template

```markdown
# Incident Postmortem: [Title]

**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson

## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods
```

### Follow-up

- Review postmortem in team meeting within 1 week
- Track action items to completion (not optional!)
- Share learnings across teams
- Update runbooks and playbooks
- Celebrate successful incident response

---

## Chaos Engineering

### Principles

1. **Define Steady State** — Normal system behavior (e.g., 99.9% success rate)
2. **Hypothesize** — Predict system will remain stable under failure
3. **Inject Failures** — Simulate real-world events
4. **Disprove Hypothesis** — Look for deviations from steady state
5. **Learn and Improve** — Fix weaknesses, increase resilience

### Failure Types

```yaml
Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts
```

### Chaos Mesh Example

```yaml
# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  # To run this every 2 hours, wrap the spec in a Schedule object
  # (Chaos Mesh 2.x removed the inline scheduler field) — see the sketch below
---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"
```

### Best Practices

- **Start Small** — Non-production first, then canary production
- **Collect Baselines** — Know normal metrics before experiments
- **Define Success** — Clear criteria for what "stable" means
- **Monitor Everything** — Watch metrics, logs, traces during tests
- **Automate Rollback** — Stop experiment if SLOs violated
- **Game Days** — Scheduled chaos exercises with full team
- **Blameless Reviews** — Treat chaos failures like production incidents
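For recurring experiments, Chaos Mesh 2.x uses a separate Schedule object that wraps the experiment spec. A sketch; the field layout is worth verifying against your Chaos Mesh version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-network-delay
spec:
  schedule: "@every 2h"       # cron-style recurrence
  type: NetworkChaos
  historyLimit: 5
  concurrencyPolicy: Forbid   # never overlap runs
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    delay:
      latency: "100ms"
    duration: "5m"
```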
---

## AIOps and AI in Observability

### 2025 Trends

- **Anomaly Detection** — AI spots unusual patterns in metrics/logs
- **Root Cause Analysis** — Correlate failures across services automatically
- **Predictive Alerting** — Predict failures before they happen
- **Auto-Remediation** — AI suggests or applies fixes autonomously
- **Natural Language Queries** — Ask "Why is checkout slow?" instead of writing PromQL
- **AI Observability** — Monitor AI model drift, hallucinations, token usage

### AI-Driven Platforms (2025)

```yaml
# Vendor-reported capabilities and figures — validate against your own workloads

Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations
```

### Implementing AI Observability

```python
# Monitor AI model performance
import time

from opentelemetry import trace, metrics
from openai import AsyncOpenAI

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
client = AsyncOpenAI()

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()
        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response
```

---

## Grafana Dashboards

### 3-3-3 Rule

- **3 rows** of panels per dashboard
- **3 panels** per row
- **3 key metrics** per panel

Avoid "dashboard sprawl" — each dashboard should answer ONE question.
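One way to keep dashboards from sprawling is to provision them from version control instead of letting them accumulate in the UI. A minimal provisioning sketch (file paths are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/sre.yaml
apiVersion: 1
providers:
  - name: sre-dashboards
    folder: SRE
    type: file
    allowUiUpdates: false  # dashboards change via code review, not clicks
    options:
      path: /var/lib/grafana/dashboards  # illustrative path
```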
### Dashboard Categories

```yaml
RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)
```

### Panel Best Practices

```json
{
  "title": "API Request Rate",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": { "lineWidth": 2, "fillOpacity": 10 }
    }
  }
}
```

The `reqps` unit renders as requests per second; `timeseries` is the modern panel type that replaces the deprecated `graph` panel. (JSON does not allow inline comments, so keep explanations outside the panel definition.)

---

## Checklist

```markdown
## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics

## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)

## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined

## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process

## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready

## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises
```

---

## See Also

- [reference/monitoring.md](reference/monitoring.md) — Prometheus and Grafana deep dive
- [reference/logging.md](reference/logging.md) — Structured logging best practices
- [reference/tracing.md](reference/tracing.md) — OpenTelemetry and distributed tracing
- [reference/incident-response.md](reference/incident-response.md) — Incident management and postmortems
- [templates/slo-template.md](templates/slo-template.md) — SLO definition template