---
name: observability
description: Production observability from Google SRE and Netflix. Use when implementing structured logging, setting up metrics collection (Prometheus, Datadog, CloudWatch), configuring distributed tracing (Jaeger, OpenTelemetry), creating dashboards (Grafana), defining alert rules, or building observability pipelines. Triggers on logging, monitoring, tracing, metrics, alerts, dashboards, Prometheus, Grafana, OpenTelemetry, or production observability.
---

# Observability - Logging, Monitoring & Tracing

**Purpose**: Implement comprehensive observability for production systems with logs, metrics, and distributed tracing
**Agent**: Google SRE / Netflix Backend Architect
**Use When**: Setting up monitoring, debugging production issues, or ensuring system reliability

---

## Three Pillars of Observability

### 1. Logging (What happened?)
### 2. Metrics (How much/how fast?)
### 3. Tracing (Where did it happen?)

---

## 1. Structured Logging

```typescript
import crypto from 'node:crypto'
import pino from 'pino'

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label })
  }
})

// Structured logs (JSON)
logger.info({ userId: 123, action: 'login' }, 'User logged in')
logger.error({ error: err, userId: 123 }, 'Failed to process payment')

// Request logging middleware
app.use((req, res, next) => {
  req.startTime = Date.now()
  req.log = logger.child({
    requestId: crypto.randomUUID(),
    method: req.method,
    url: req.url,
    ip: req.ip
  })

  req.log.info('Request started')

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration: Date.now() - req.startTime
    }, 'Request completed')
  })

  next()
})
```

**Best Practices**:
- Use JSON format for easy parsing
- Include context (requestId, userId, etc.)
- Log levels: ERROR, WARN, INFO, DEBUG
- Don't log sensitive data (passwords, tokens)
- Use correlation IDs across services

---

## 2. Metrics (Prometheus + Grafana)

```typescript
import { register, Counter, Histogram, Gauge } from 'prom-client'

// HTTP request counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
})

// HTTP request duration
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
})

// Active connections
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
})

// Metrics middleware
app.use((req, res, next) => {
  const start = Date.now()
  activeConnections.inc()

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    })
    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    }, duration)
    activeConnections.dec()
  })

  next()
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
```

**Key Metrics to Track**:
- Request rate (requests/second)
- Error rate (errors/second)
- Response time (p50, p95, p99)
- Database query time (see the sketch below)
- Cache hit rate (see the sketch below)
- CPU/Memory usage
- Active connections
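Database query time and cache hit rate are not captured by the HTTP middleware above. A minimal sketch of instrumenting them with `prom-client`, assuming a Prisma-style `db` client and a Redis client as in the later examples; the `measureQuery` helper and metric names are illustrative, not a library API:

```typescript
import { Counter, Histogram } from 'prom-client'

// Database query duration, labelled by operation (assumed label set)
const dbQueryDuration = new Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration in seconds',
  labelNames: ['operation'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 3]
})

// Cache hits and misses; the hit rate is derived at query time
const cacheHits = new Counter({ name: 'cache_hits_total', help: 'Cache hits' })
const cacheMisses = new Counter({ name: 'cache_misses_total', help: 'Cache misses' })

// Hypothetical helper: time any async DB call
async function measureQuery<T>(operation: string, fn: () => Promise<T>): Promise<T> {
  const end = dbQueryDuration.startTimer({ operation })
  try {
    return await fn()
  } finally {
    end()
  }
}

// Usage
const user = await measureQuery('user.findUnique', () =>
  db.user.findUnique({ where: { id: 123 } })
)

const cached = await redis.get('user:123')
if (cached) cacheHits.inc()
else cacheMisses.inc()
```

In Grafana, the hit rate can then be charted as `rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))`.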
## 3. Distributed Tracing (OpenTelemetry)

```typescript
import { NodeTracerProvider, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express'
import { JaegerExporter } from '@opentelemetry/exporter-jaeger'

// Set up tracer
const provider = new NodeTracerProvider()
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({ endpoint: 'http://localhost:14268/api/traces' })
  )
)
provider.register()

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
})

// Manual instrumentation
import { trace, context, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')

app.post('/api/orders', async (req, res) => {
  const span = tracer.startSpan('create-order')

  try {
    span.setAttribute('userId', req.user.id)
    span.setAttribute('orderTotal', req.body.total)

    // Create order
    const order = await db.order.create({ data: req.body })

    // Child span for payment, parented via the active context
    const paymentSpan = tracer.startSpan(
      'process-payment',
      undefined,
      trace.setSpan(context.active(), span)
    )
    await processPayment(order.id)
    paymentSpan.end()

    span.setStatus({ code: SpanStatusCode.OK })
    res.json(order)
  } catch (error) {
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    res.status(500).json({ error: 'Failed to create order' })
  } finally {
    span.end()
  }
})
```

---

## 4. Application Performance Monitoring (APM)

**Popular Tools**:
- Datadog APM
- New Relic
- Elastic APM
- Sentry (error tracking)

```typescript
import * as Sentry from '@sentry/node'

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1 // Sample 10% of transactions
})

// Sentry middleware
app.use(Sentry.Handlers.requestHandler())
app.use(Sentry.Handlers.tracingHandler())

// Error handler (must be last)
app.use(Sentry.Handlers.errorHandler())

// Manual error tracking
try {
  await dangerousOperation()
} catch (error) {
  Sentry.captureException(error, {
    user: { id: userId, email: userEmail },
    tags: { operation: 'payment' },
    extra: { orderId: order.id }
  })
}
```

---

## 5. Health Checks

```typescript
// Liveness probe (is the app running?)
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok' })
})

// Readiness probe (is the app ready to serve traffic?)
app.get('/health/ready', async (req, res) => {
  try {
    // Check database
    await db.$queryRaw`SELECT 1`

    // Check Redis
    await redis.ping()

    // Check external APIs (abort after 2s)
    await fetch('https://api.example.com/health', { signal: AbortSignal.timeout(2000) })

    res.json({ status: 'ok', checks: { db: 'ok', redis: 'ok', api: 'ok' } })
  } catch (error) {
    res.status(503).json({ status: 'error', error: error.message })
  }
})
```

---

## 6. Alerting

```yaml
# Prometheus alert rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"

      - alert: SlowResponse
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        annotations:
          summary: "95th percentile response time > 1s"
```

---

## Best Practices

1. **Use correlation IDs** across logs, metrics, and traces (see the sketch after this list)
2. **Sample traces** in production (not 100%)
3. **Set up alerts** for SLOs (error rate, latency)
4. **Dashboard for each service** (Grafana)
5. **Centralized logging** (Elasticsearch, CloudWatch)
6. **Monitor business metrics** (orders/hour, revenue)
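One concrete way to correlate logs with traces (practice 1) is to stamp the active OpenTelemetry trace and span IDs onto every log line. A minimal sketch, assuming the pino logger, Express `app`, and Prisma-style `db` client from the earlier sections; `tracedLogger` is an illustrative helper, not a library API:

```typescript
import pino from 'pino'
import { trace, context } from '@opentelemetry/api'

const logger = pino()

// Returns a child logger carrying the current trace/span IDs,
// so a log line can be looked up against its trace in Jaeger.
function tracedLogger() {
  const span = trace.getSpan(context.active())
  if (!span) return logger
  const { traceId, spanId } = span.spanContext()
  return logger.child({ traceId, spanId })
}

// Usage inside a request handler (hypothetical route)
app.get('/api/orders/:id', async (req, res) => {
  const log = tracedLogger()
  log.info({ orderId: req.params.id }, 'Fetching order')
  const order = await db.order.findUnique({ where: { id: req.params.id } })
  res.json(order)
})
```

With the HTTP/Express auto-instrumentation from section 3 registered, the handler typically runs inside the request's span, so `context.active()` resolves to it.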
---

**Remember**: You can't fix what you can't see. Implement observability from day one.

**Created**: 2026-02-04
**Maintained By**: Google SRE