---
name: monitoring-observability
license: MIT
compatibility: "Claude Code 2.1.34+."
description: Monitoring and observability patterns for Prometheus metrics, Grafana dashboards, Langfuse LLM tracing, and drift detection. Use when adding logging, metrics, distributed tracing, LLM cost tracking, or quality drift monitoring.
tags: [monitoring, observability, prometheus, grafana, langfuse, tracing, metrics, drift-detection, logging]
context: fork
agent: metrics-architect
version: 2.0.0
author: OrchestKit
user-invocable: false
complexity: medium
metadata:
  category: document-asset-creation
---

# Monitoring & Observability

Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in `rules/` loaded on-demand.

## Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| [Infrastructure Monitoring](#infrastructure-monitoring) | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| [LLM Observability](#llm-observability) | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| [Drift Detection](#drift-detection) | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| [Silent Failures](#silent-failures) | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |

**Total: 12 rules across 4 categories**

## Quick Start

```python
# Prometheus metrics with RED method
from prometheus_client import Counter, Histogram

http_requests = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
http_duration = Histogram('http_request_duration_seconds', 'Request latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])
```

```python
# Langfuse LLM tracing
from langfuse import observe, get_client

@observe()
async def analyze_content(content: str):
    get_client().update_current_trace(
        user_id="user_123",
        session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    return await llm.generate(content)
```

```python
# PSI drift detection
import numpy as np

psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```

## Infrastructure Monitoring

Prometheus metrics, Grafana dashboards, and alerting for application health.

| Rule | File | Key Pattern |
|------|------|-------------|
| Prometheus Metrics | `rules/monitoring-prometheus.md` | RED method, counters, histograms, cardinality |
| Grafana Dashboards | `rules/monitoring-grafana.md` | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | `rules/monitoring-alerting.md` | Severity levels, grouping, escalation, fatigue prevention |

## LLM Observability

Langfuse-based tracing, cost tracking, and evaluation for LLM applications.

| Rule | File | Key Pattern |
|------|------|-------------|
| Langfuse Traces | `rules/llm-langfuse-traces.md` | @observe decorator, OTEL spans, agent graphs |
| Cost Tracking | `rules/llm-cost-tracking.md` | Token usage, spend alerts, Metrics API |
| Eval Scoring | `rules/llm-eval-scoring.md` | Custom scores, evaluator tracing, quality monitoring |

## Drift Detection

Statistical and quality drift detection for production LLM systems.

| Rule | File | Key Pattern |
|------|------|-------------|
| Statistical Drift | `rules/drift-statistical.md` | PSI, KS test, KL divergence, EWMA |
| Quality Drift | `rules/drift-quality.md` | Score regression, baseline comparison, canary prompts |
| Drift Alerting | `rules/drift-alerting.md` | Dynamic thresholds, correlation, anti-patterns |

## Silent Failures

Detection and alerting for silent failures in LLM agents.

| Rule | File | Key Pattern |
|------|------|-------------|
| Tool Skipping | `rules/silent-tool-skipping.md` | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | `rules/silent-degraded-quality.md` | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | `rules/silent-alerting.md` | Loop detection, token spikes, escalation workflow |

## Key Decisions

| Decision | Recommendation | Rationale |
|----------|----------------|-----------|
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | `@observe` + `get_client()` | OTEL-native, automatic span creation |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/](references/) | Logging, metrics, tracing, Langfuse, drift analysis guides |
| [checklists/](checklists/) | Implementation checklists for monitoring and Langfuse setup |
| [examples/](examples/) | Real-world monitoring dashboard and trace examples |
| [scripts/](scripts/) | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |

## Related Skills

- `defense-in-depth` - Layer 8 observability as part of security architecture
- `devops-deployment` - Observability integration with CI/CD and Kubernetes
- `resilience-patterns` - Monitoring circuit breakers and failure scenarios
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `caching` - Caching strategies that reduce costs tracked by Langfuse
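
The PSI check in the Quick Start calls `calculate_psi` without defining it in this file. As a rough illustration, the following is a minimal sketch of one common way to compute the Population Stability Index with NumPy. The function body, the `bins` and `eps` parameters, and the choice of quantile bins taken from the baseline are assumptions made for this example, and not necessarily what `rules/drift-statistical.md` prescribes.

```python
import numpy as np


def calculate_psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D score samples (illustrative sketch)."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    # Assumption: bin edges are baseline quantiles, so each bin holds
    # roughly the same share of the baseline sample.
    edges = np.unique(np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1)))

    # Use interior edges only, so values outside the baseline range
    # fall into the first or last bin instead of being dropped.
    interior = edges[1:-1]
    n_bins = len(interior) + 1
    base_idx = np.searchsorted(interior, baseline, side="right")
    curr_idx = np.searchsorted(interior, current, side="right")

    base_pct = np.bincount(base_idx, minlength=n_bins) / baseline.size
    curr_pct = np.bincount(curr_idx, minlength=n_bins) / current.size

    # Assumption: floor empty bins at eps to avoid log(0) and division by zero.
    base_pct = np.clip(base_pct, eps, None)
    curr_pct = np.clip(curr_pct, eps, None)

    # PSI = sum over bins of (current - baseline) * ln(current / baseline)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))
```

With a helper along these lines, the Quick Start comparison `psi_score >= 0.25` can be applied to any two arrays of quality scores. Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the baseline.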