---
name: building-with-observability
description: Build Kubernetes observability stacks with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki. Use when implementing metrics, tracing, logging, SRE practices, or cost engineering for cloud-native applications.
allowed-tools: Read, Grep, Glob, Edit, Write, Bash, WebSearch, WebFetch
model: claude-sonnet-4-20250514
---

# Building Observability Stacks for Kubernetes

## Persona

You are a Site Reliability Engineer (SRE) specializing in Kubernetes observability and FinOps. You've deployed production observability stacks at scale and understand the trade-offs between different tools. You follow Google's SRE principles and can implement the full observability stack: metrics (Prometheus), tracing (OpenTelemetry + Jaeger), logging (Loki), and cost monitoring (OpenCost).

## When to Use This Skill

Activate when the user mentions:
- Prometheus, PromQL, metrics collection
- Grafana dashboards, alerting
- OpenTelemetry, OTel, distributed tracing
- Jaeger, Zipkin, trace visualization
- Loki, LogQL, centralized logging
- SLIs, SLOs, SLAs, error budgets
- FinOps, Kubecost, OpenCost, cost allocation
- Kubernetes monitoring, observability

## Core Concepts

### The Three Pillars of Observability

| Pillar | Tool | Query Language | Purpose |
|--------|------|----------------|---------|
| **Metrics** | Prometheus | PromQL | Aggregated numerical data over time |
| **Traces** | Jaeger | - | Request flow across services |
| **Logs** | Loki | LogQL | Detailed event records |

### Prometheus Metrics Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   App Pod   │     │ Prometheus  │     │  Grafana    │
│  /metrics   │◄────│   Scrape    │────►│  Dashboard  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                  │
       ▼                  ▼
  ServiceMonitor     PrometheusRule
  (what to scrape)   (alerting rules)
```

### OpenTelemetry Tracing Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   FastAPI   │     │    OTel     │     │   Jaeger    │
│   + OTel    │────►│  Collector  │────►│    UI       │
│   SDK       │     │   (OTLP)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
```

## Decision Logic

### When to Use Each Tool

| Scenario | Tool | Why |
|----------|------|-----|
| "Service response times" | Prometheus + Grafana | Histograms with percentiles |
| "Why is this request slow?" | Jaeger traces | See full request path |
| "What happened at 3am?" | Loki logs | Event-level detail |
| "Are we meeting SLOs?" | Prometheus + SLO rules | Error budget tracking |
| "Which team is spending most?" | OpenCost | Cost allocation by namespace |

### Alerting Strategy Decision Tree

```
Is it customer-impacting?
├── Yes → Alert on SLO burn rate
│         (multi-window, multi-burn-rate)
└── No → Is it a leading indicator?
         ├── Yes → Warning alert, page if trend continues
         └── No → Dashboard only, no alert
```
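
The decision tree above can be written as a small function, which is handy when reviewing proposed alerts in bulk. This is a minimal sketch; the `AlertAction` enum and the `choose_alert_action` name are illustrative and not part of any library.

```python
from enum import Enum


class AlertAction(Enum):
    """Possible outcomes of the alerting decision tree."""
    PAGE_ON_SLO_BURN = "alert on SLO burn rate (multi-window, multi-burn-rate)"
    WARNING = "warning alert, page if trend continues"
    DASHBOARD_ONLY = "dashboard only, no alert"


def choose_alert_action(customer_impacting: bool, leading_indicator: bool) -> AlertAction:
    """Walk the tree: customer impact is checked first, then leading indicators."""
    if customer_impacting:
        return AlertAction.PAGE_ON_SLO_BURN
    if leading_indicator:
        return AlertAction.WARNING
    return AlertAction.DASHBOARD_ONLY


# Example: disk filling up slowly is not yet customer-impacting, but it is a leading indicator.
print(choose_alert_action(customer_impacting=False, leading_indicator=True).value)
```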

### SLO Target Selection

| Service Type | Typical SLO | Error Budget (30 days) |
|-------------|-------------|------------------------|
| User-facing API | 99.9% | 43.2 minutes |
| Internal service | 99.5% | 3.6 hours |
| Batch jobs | 99.0% | 7.2 hours |
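
The error budgets in the table follow directly from the SLO target and the window length. A minimal sketch of the arithmetic, assuming a time-based availability SLO (the function name is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.9, 99.5, 99.0):
    minutes = error_budget_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

For the three targets above this gives roughly 43.2 minutes, 216 minutes (3.6 hours), and 432 minutes (7.2 hours).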

## Workflow: Full Stack Setup

### 1. Install Prometheus + Grafana Stack

```bash
# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

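# Install kube-prometheus-stack (includes Grafana).
# The two *SelectorNilUsesHelmValues=false flags let Prometheus pick up
# ServiceMonitors and PrometheusRules that lack the chart's release label.
# adminPassword=admin is for local testing only; change it anywhere else.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin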
```

### 2. Create ServiceMonitor for Your App

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: task-api        # label on the Service, not the Pod
  namespaceSelector:
    matchNames: [default]
  endpoints:
  - port: http             # name of the Service port
    path: /metrics
    interval: 30s
```

### 3. Install Loki for Logging

```bash
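# The loki-stack chart lives in the Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true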
```

### 4. Install Jaeger for Tracing

```bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true
```

### 5. Instrument Python/FastAPI with OpenTelemetry

```python
# requirements.txt (install with pip):
#   fastapi
#   opentelemetry-api
#   opentelemetry-sdk
#   opentelemetry-instrumentation-fastapi
#   opentelemetry-exporter-otlp

# main.py
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracing: batch spans and export them over OTLP/gRPC.
# Jaeger runs in the `monitoring` namespace, so use the namespaced service name.
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger-collector.monitoring:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
```

### 6. Install OpenCost for Cost Monitoring

```bash
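helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

# Point OpenCost at the Prometheus service created by kube-prometheus-stack
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set opencost.prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
  --set opencost.prometheus.internal.namespaceName=monitoring \
  --set opencost.prometheus.internal.port=9090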
```

## Key Patterns

### PromQL Queries for Kubernetes

```promql
# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

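# CPU usage by pod (container!="" skips the pod-level cgroup series to avoid double counting)
sum(rate(container_cpu_usage_seconds_total{namespace="default",container!=""}[5m])) by (pod)

# Memory usage as a percentage of limits (only meaningful for containers that set a memory limit)
sum(container_memory_usage_bytes{namespace="default",container!=""}) by (pod) /
sum(container_spec_memory_limit_bytes{namespace="default",container!=""}) by (pod) * 100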
```
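
To run any of these queries from a script instead of the Grafana UI, Prometheus exposes an HTTP API with an instant-query endpoint at `/api/v1/query`. The sketch below only builds the request URL. The base URL is an assumption derived from the Helm release name used earlier and the default Prometheus port 9090; adjust it to match your cluster.

```python
from urllib.parse import urlencode

# Assumed in-cluster address of the kube-prometheus-stack Prometheus service.
PROMETHEUS_URL = "http://prometheus-kube-prometheus-prometheus.monitoring:9090"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for a Prometheus instant query, URL-encoding the PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url("sum(rate(http_requests_total[5m])) by (service)")
print(url)
```

The URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)` from inside the cluster or through `kubectl port-forward`.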

### LogQL Queries for Loki

```logql
# All logs from namespace
{namespace="default"}

# Error logs only
{namespace="default"} |= "error"

# Parse JSON and filter
{namespace="default"} | json | level="error"

# Count errors per minute (rate() would return a per-second value instead)
sum(count_over_time({namespace="default"} |= "error" [1m])) by (pod)
```
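
The `| json | level="error"` query only works if the application writes one JSON object per log line with a `level` field. A minimal sketch using only the standard library; the field names (`level`, `logger`, `message`) are assumptions chosen to match the query above, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),  # "error" matches level="error" in LogQL
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Send JSON logs to stderr, where the container runtime captures them for Promtail."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)


configure_json_logging()
logging.getLogger("task-api").error("failed to save task %s", 42)
```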

### SLO Alert Rules (Multi-Burn-Rate)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
spec:
  groups:
  - name: task-api-slo
    rules:
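    # Error ratio (fraction of 5xx responses) over a short and a long window
    - record: task_api:error_ratio:rate5m
      expr: |
        1 - (
          sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="task-api"}[5m]))
        )
    - record: task_api:error_ratio:rate1h
      expr: |
        1 - (
          sum(rate(http_requests_total{service="task-api",status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="task-api"}[1h]))
        )

    # Fast burn: a 14.4x burn rate against a 99.9% SLO (budget 0.001) spends
    # 2% of the 30-day budget in 1 hour, so page. Both windows must agree.
    - alert: TaskAPIHighErrorBudgetBurn
      expr: |
        task_api:error_ratio:rate1h > 14.4 * 0.001
        and
        task_api:error_ratio:rate5m > 14.4 * 0.001
      for: 2m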
      labels:
        severity: critical
      annotations:
        summary: "Task API burning error budget rapidly"
        description: "Error rate {{ $value | humanizePercentage }} exceeds SLO"
```
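
The `14.4` factor in the alert is not arbitrary: it is the burn rate at which 2% of a 30-day error budget is consumed in 1 hour. A minimal sketch of that arithmetic (function names are illustrative):

```python
def burn_rate(budget_fraction: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(slo_percent: float, rate: float) -> float:
    """Error ratio at which the given burn rate is reached for an SLO target."""
    return rate * (100 - slo_percent) / 100


fast = burn_rate(0.02, alert_window_hours=1)   # 2% of budget in 1 hour
slow = burn_rate(0.05, alert_window_hours=6)   # 5% of budget in 6 hours
print(round(fast, 2), round(slow, 2))
print(round(error_ratio_threshold(99.9, fast), 4))
```

For a 99.9% SLO the fast-burn threshold works out to an error ratio of about 0.0144 (1.44%), which is the `14.4 * 0.001` in the rule. A second, slower pair of windows (for example 5% of budget in 6 hours, a burn rate of 6) is a common companion alert at lower severity.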

### Dapr Observability Integration

```yaml
# dapr-config.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
spec:
  tracing:
    samplingRate: "1"   # 100% sampling; lower it (e.g. "0.1") for high-traffic production
    otel:
      endpointAddress: jaeger-collector.monitoring:4317
      isSecure: false
      protocol: grpc
  metric:
    enabled: true
```

## Cost Engineering Patterns

### Resource Tagging for Cost Allocation

```yaml
# Add cost allocation labels to all deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: "platform"
    team: "agents"
    environment: "production"
```

### Right-Sizing Resources

```yaml
# Start conservative, let VPA recommend
resources:
  requests:
    cpu: "100m"      # Start low
    memory: "128Mi"
  limits:
    cpu: "500m"      # 5x headroom for bursts
    memory: "256Mi"  # 2x headroom
```
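
To check the headroom ratios in the comments above across many manifests, the quantities can be parsed and compared programmatically. This is a simplified sketch: it handles only the `m` CPU suffix and the `Ki`/`Mi`/`Gi` memory suffixes, not the full Kubernetes quantity grammar, and the helper names are illustrative.

```python
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "100m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" to bytes (Ki/Mi/Gi or plain bytes only)."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def headroom(request: int, limit: int) -> float:
    """Ratio of limit to request; 1.0 means no burst headroom."""
    return limit / request


print(headroom(cpu_millicores("100m"), cpu_millicores("500m")))   # 5.0
print(headroom(memory_bytes("128Mi"), memory_bytes("256Mi")))     # 2.0
```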

### OpenCost PromQL Queries

```promql
# Cost per namespace (daily)
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24) by (namespace)

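# Idle CPU cores (waste): allocated cores minus cores actually used.
# container_cpu_usage_seconds_total is a counter, so take its rate first.
sum(container_cpu_allocation) by (namespace)
  - sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)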
```

## Safety & Guardrails

### NEVER
- Alert on every metric (alert fatigue kills teams)
- Set SLOs at 100% (impossible to maintain, blocks all releases)
- Skip retention configuration (storage costs explode)
- Use sampling rate 1.0 in high-traffic production (performance impact)
- Expose metrics endpoints publicly (security risk)

### ALWAYS
- Start with 4 golden signals: latency, traffic, errors, saturation
- Use multi-window burn rate alerting for SLOs
- Configure retention policies for all telemetry data
- Use sampling in high-traffic scenarios (0.1 for prod, 1.0 for dev)
- Secure metrics/tracing endpoints with NetworkPolicies

### Cost Engineering Guardrails
- Set budget alerts at 80% and 100% of monthly budget
- Review right-sizing recommendations weekly
- Tag ALL resources for cost allocation
- Schedule non-production environments (40h vs 168h per week ≈ 76% savings)
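
The scheduling and budget-alert guardrails above reduce to simple arithmetic. A minimal sketch, assuming cost scales linearly with running hours; the function names and the default thresholds tuple are illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_savings(active_hours_per_week: float) -> float:
    """Fraction of cost saved by running only `active_hours_per_week` instead of 24/7."""
    if not 0 <= active_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("active hours must be between 0 and 168")
    return 1 - active_hours_per_week / HOURS_PER_WEEK


def budget_alerts(spend: float, monthly_budget: float, thresholds=(0.8, 1.0)) -> list[float]:
    """Return the budget thresholds (as fractions) that current spend has reached."""
    return [t for t in thresholds if spend >= t * monthly_budget]


print(f"{schedule_savings(40):.0%}")   # business hours only, about 76%
print(budget_alerts(spend=850, monthly_budget=1000))
```

Running a non-production environment 40 hours a week instead of 168 saves about 76% under this linear assumption; storage and other always-on charges reduce the real figure.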

## TaskManager Example

Complete observability setup for Task API:

### 1. Add Prometheus Metrics (FastAPI)

```python
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()  # or reuse the app instrumented in the tracing setup

# Metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total",
    "Total requests",
    ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds",
    "Request latency",
    ["method", "endpoint"]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    latency = time.time() - start

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
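
One caveat with the middleware above: using the raw `request.url.path` as a label creates a separate time series for every distinct path (`/tasks/1`, `/tasks/2`, ...), which can inflate Prometheus cardinality. A possible mitigation is to collapse ID-like segments before labeling. The sketch below is one hypothetical approach; `normalize_path` and its regex are assumptions, and using the matched route template is an alternative where the framework exposes it.

```python
import re

# Matches a path segment that is purely numeric or a UUID.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Replace numeric and UUID path segments with a fixed placeholder."""
    return _ID_SEGMENT.sub("/{id}", path)


print(normalize_path("/tasks/42/comments/7"))   # /tasks/{id}/comments/{id}
print(normalize_path("/metrics"))               # /metrics
```

In the middleware, `endpoint=normalize_path(request.url.path)` would then replace `endpoint=request.url.path`.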

### 2. Kubernetes Deployment with Observability

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: platform
spec:
  selector:            # required for apps/v1 Deployments; must match the pod labels
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "task-api"
        dapr.io/config: "dapr-observability"
    spec:
      containers:
      - name: task-api
        image: task-api:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://jaeger-collector.monitoring:4317"
        - name: OTEL_SERVICE_NAME
          value: "task-api"
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
```

### 3. SLO Dashboard (Grafana JSON)

```json
{
  "title": "Task API SLO Dashboard",
  "panels": [
    {
      "title": "Availability (SLO: 99.9%)",
      "type": "gauge",
      "targets": [{
        "expr": "sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])) * 100"
      }],
      "thresholds": [{"value": 99.9, "color": "green"}, {"value": 99.5, "color": "yellow"}]
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "targets": [{
        "expr": "1 - ((1 - (sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])))) / 0.001)"
      }]
    }
  ]
}
```
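
The "Error Budget Remaining" expression can be sanity-checked offline with plain request counts. A minimal sketch mirroring the PromQL above, assuming a 99.9% target (the function name is illustrative):

```python
def error_budget_remaining(good: float, total: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent, negative is overspent."""
    if total <= 0:
        return 1.0  # no traffic, no budget consumed
    error_ratio = 1 - good / total
    budget = 1 - slo_target  # 0.001 for a 99.9% SLO
    return 1 - error_ratio / budget


# 50 failed requests out of 100,000 uses about half of a 99.9% budget.
print(round(error_budget_remaining(good=99_950, total=100_000), 3))
```

A negative result means the SLO has already been missed for the window, which is the signal to pause risky releases.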

## References

For detailed patterns, see:
- `references/promql-patterns.md` - PromQL query examples
- `references/otel-fastapi.md` - OpenTelemetry FastAPI integration
- `references/slo-alerting.md` - SRE alerting patterns
- `references/cost-queries.md` - OpenCost PromQL queries