---
name: agent-observability
description: Strategies for agent observability (logging, tracing, metrics). Use this to instrument agents for debugging, performance tracking, and quality assurance.
---

# Agent Observability Strategies

## Goal

Move beyond simple monitoring ("Is it running?") to deep observability ("How is it thinking?"), enabling the diagnosis of complex failures in non-deterministic systems.

## The Three Pillars of Observability

### 1. Structured Logging (The Diary)

* **Definition:** Immutable, timestamped records of discrete events.
* **Best Practice:** Use structured JSON logs to capture the full context: prompt/response pairs, intermediate reasoning (chain of thought), and tool inputs/outputs.
* **Pattern:** Record the *intent* before an action and the *outcome* after, to distinguish decision failures from execution failures.

### 2. Distributed Tracing (The Narrative)

* **Definition:** A visual "yarn" connecting individual log entries (spans) into a single end-to-end task execution.
* **Usage:** Essential for root-cause analysis: it reveals whether a bad final answer was caused by a retrieval (RAG) failure, a tool error, or an LLM hallucination.
* **Standard:** Use OpenTelemetry to link spans across services.

### 3. Metrics (The Scorecard)

Aggregated data points for tracking health over time. Separate these into two dashboards:

#### System Metrics (Operational Health)

* **Audience:** SREs / DevOps.
* **Key Metrics:** P99 latency, error rate (traces with `error=true`), token consumption, and API cost per run.

#### Quality Metrics (Decision Health)

* **Audience:** Product / Data Science.
* **Key Metrics:**
  * **Trajectory Adherence:** Did the agent follow the ideal path?
  * **Hallucination Rate:** Frequency of ungrounded statements.
  * **Task Completion Rate:** Percentage of traces reaching a "success" state.

## Operational Best Practices

* **Dynamic Sampling:** To control costs, log 100% of errors but sample only 10% of successful traces in production.
* **PII Redaction:** Integrate PII scrubbing directly into the logging pipeline to sanitize user inputs before storage.
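
The intent/outcome logging pattern above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `log_event` helper and its field names (`trace_id`, `phase`, `tool`) are hypothetical, not part of any particular framework:

```python
import json
import time
import uuid


def log_event(trace_id: str, phase: str, **fields) -> None:
    """Emit one structured JSON log line for a discrete agent event."""
    record = {"ts": time.time(), "trace_id": trace_id, "phase": phase, **fields}
    print(json.dumps(record))


trace_id = str(uuid.uuid4())

# Record the *intent* before the action...
log_event(trace_id, "intent", tool="web_search", args={"query": "weather in Paris"})

# ...and the *outcome* after, so a wrong tool choice (decision failure)
# stays distinguishable from a tool crash (execution failure).
log_event(trace_id, "outcome", tool="web_search", status="ok", latency_ms=120)
```

In production you would route these records to a log pipeline (stdout collection, OTLP, etc.) rather than `print`, and include the PII scrubbing step before the record is written.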
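
The dynamic-sampling rule (keep 100% of errors, ~10% of successes) reduces to a head-based sampling decision. A minimal sketch, assuming a boolean error flag is known at decision time; the function name and rate constants are illustrative, not from any specific SDK:

```python
import random

ERROR_SAMPLE_RATE = 1.0    # always export failed traces
SUCCESS_SAMPLE_RATE = 0.1  # export ~10% of successful traces


def should_export(trace_errored: bool, rng=random.random) -> bool:
    """Decide whether to keep a trace: errors always, successes at 10%."""
    rate = ERROR_SAMPLE_RATE if trace_errored else SUCCESS_SAMPLE_RATE
    return rng() < rate
```

Injecting `rng` keeps the decision testable; real tracing SDKs (e.g. OpenTelemetry) expose samplers where equivalent logic can be plugged in.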