---
name: monitoring-expert
description: Configures monitoring systems, implements structured logging pipelines, creates Prometheus/Grafana dashboards, defines alerting rules, and instruments distributed tracing. Implements Prometheus/Grafana stacks, conducts load testing, performs application profiling, and plans infrastructure capacity. Use when setting up application monitoring, adding observability to services, debugging production issues with logs/metrics/traces, running load tests with k6 or Artillery, profiling CPU/memory bottlenecks, or forecasting capacity needs.
license: MIT
metadata:
  author: https://github.com/Jeffallan
  version: "1.1.0"
  domain: devops
  triggers: monitoring, observability, logging, metrics, tracing, alerting, Prometheus, Grafana, DataDog, APM, performance testing, load testing, profiling, capacity planning, bottleneck
  role: specialist
  scope: implementation
  output-format: code
  related-skills: devops-engineer, debugging-wizard, architecture-designer
---

# Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

## Core Workflow

1. **Assess**: Identify what needs monitoring (SLIs, critical paths, business metrics).
2. **Instrument**: Add logging, metrics, and traces to the application (see Quick-Start Examples).
3. **Collect**: Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify that data arrives before proceeding.
4. **Visualize**: Build dashboards using the RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) method.
5. **Alert**: Define threshold and anomaly alerts on critical paths; confirm the rules do not produce a flood of false positives before shipping.

## Quick-Start Examples

### Structured Logging (Node.js / Pino)

```js
import pino from 'pino';

const logger = pino({ level: 'info' });

// `req` and `elapsed` come from the surrounding request handler.

// Good: structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

// Bad: string interpolation, no correlation
console.log(`Order created for user ${userId}`);
```

### Prometheus Metrics (Node.js)

```js
import { Counter, Histogram, register } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

// Instrument every request (assumes an Express `app`)
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    // Label with the matched route pattern (e.g. /orders/:id) rather than the
    // raw path, so label cardinality stays bounded. Requests that matched no
    // route share one fixed label value.
    const route = req.route?.path ?? 'unmatched';
    httpRequests.inc({ method: req.method, route, status: res.statusCode });
    end({ method: req.method, route });
  });
  next();
});

// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

### OpenTelemetry Tracing (Node.js)

```js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();

// Manual span around a critical operation
const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    // Always end the span, on success or failure
    span.end();
  }
}
```

### Prometheus Alerting Rule

```yaml
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by route so that 5xx requests are divided by
        # all requests for the same route (without the aggregation, each
        # series would only match itself and the ratio would always be 1).
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (route) (rate(http_requests_total[5m]))
            > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"
```

### k6 Load Test

```js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 }, // ramp up
    { duration: '5m', target: 50 }, // sustained load
    { duration: '1m', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Logging | `references/structured-logging.md` | Pino, JSON logging |
| Metrics | `references/prometheus-metrics.md` | Counter, Histogram, Gauge |
| Tracing | `references/opentelemetry.md` | OpenTelemetry, spans |
| Alerting | `references/alerting-rules.md` | Prometheus alerts |
| Dashboards | `references/dashboards.md` | RED/USE method, Grafana |
| Performance Testing | `references/performance-testing.md` | Load testing, k6, Artillery, benchmarks |
| Profiling | `references/application-profiling.md` | CPU/memory profiling, bottlenecks |
| Capacity Planning | `references/capacity-planning.md` | Scaling, forecasting, budgets |

## Constraints

### MUST DO

- Use structured logging (JSON)
- Include request IDs for correlation
- Set up alerts for critical paths
- Monitor business metrics, not just technical ones
- Use appropriate metric types (counter/gauge/histogram)
- Implement health check endpoints

### MUST NOT DO

- Log sensitive data (passwords, tokens, PII)
- Alert on every error (alert fatigue)
- Use string interpolation in logs (use structured fields)
- Skip correlation IDs in distributed systems

[Documentation](https://jeffallan.github.io/claude-skills/skills/devops/monitoring-expert/)
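
The rule against logging sensitive data (passwords, tokens, PII) can be enforced in code instead of being left to convention. The following is a minimal, library-agnostic sketch: `redactSensitive` and the `SENSITIVE_KEYS` list are hypothetical names chosen for illustration, not part of Pino or of this skill, and the key list is an assumption to adapt to your own schema. It handles plain JSON-like objects and arrays only. Pino also ships a built-in `redact` option that may be a better fit when Pino is already in use.

```js
// Hypothetical helper: the key names are assumptions, adjust them to your schema.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'email', 'ssn']);

// Returns a deep copy of a JSON-like value with sensitive fields masked.
// Key matching is case-insensitive; the input is never mutated.
function redactSensitive(value) {
  if (Array.isArray(value)) return value.map(redactSensitive);
  if (value === null || typeof value !== 'object') return value;

  const out = {};
  for (const [key, val] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? '[REDACTED]'
      : redactSensitive(val);
  }
  return out;
}

// Example: mask fields before handing the object to the logger.
const event = redactSensitive({
  requestId: 'req-123',
  user: { id: 42, email: 'a@example.com' },
  headers: { Authorization: 'Bearer abc' },
});

console.log(JSON.stringify(event));
// prints {"requestId":"req-123","user":{"id":42,"email":"[REDACTED]"},"headers":{"Authorization":"[REDACTED]"}}
```

Passing the result to the logger, for example `logger.info(redactSensitive(fields), 'order.created')`, keeps the correlation fields intact while masking the values that should never reach log storage.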