# Observability Guide **Version**: v1.0.2 ## Table of Contents - [Overview](#overview) - [Observability Modes](#observability-modes) - [Prerequisites](#prerequisites) - [Quick Start](#quick-start) - [Configuration](#configuration) - [Common Tasks](#common-tasks) - [Use Embedded Storage](#use-embedded-storage) - [Send Traces to Hawk Service](#send-traces-to-hawk-service) - [Track LLM Costs](#track-llm-costs) - [Enable Privacy Redaction](#enable-privacy-redaction) - [Use No-Op Tracer for Development](#use-no-op-tracer-for-development) - [Examples](#examples) - [Example 1: Embedded SQLite Storage](#example-1-embedded-sqlite-storage) - [Example 2: Hawk Service Export](#example-2-hawk-service-export) - [Example 3: Production Configuration](#example-3-production-configuration) - [Query Examples](#query-examples) - [Troubleshooting](#troubleshooting) ## Overview Loom provides comprehensive observability with distributed tracing and metrics. Choose from three modes based on your needs: 1. **Embedded Mode**: In-process storage (memory or SQLite) - no external dependencies 2. **Service Mode**: HTTP export to Hawk or other observability services 3. **None Mode**: Zero overhead for testing ## Observability Modes ### Embedded Mode (Recommended) **Use when**: You want local trace storage without external services **Features**: - ✅ Zero external dependencies - ✅ Memory or SQLite storage options - ✅ Query traces with SQL - ✅ Offline capability - ❌ Single-server only (no centralized aggregation) ### Service Mode **Use when**: You need centralized observability across multiple servers **Features**: - ✅ Centralized trace aggregation - ✅ Hawk UI for visualization - ✅ Multi-server support - ❌ Requires external Hawk service - ❌ Requires `-tags hawk` build flag ### None Mode **Use when**: Testing or when observability is not needed **Features**: - ✅ Zero overhead - ✅ Always available - ❌ No trace collection ## Prerequisites **For all modes**: - Loom v1.0.2+ - Build with FTS5 tag: `go build -tags fts5` **For service mode only**: - Hawk service running - Build with hawk tag: `go build -tags fts5,hawk` **For embedded mode**: - No additional prerequisites (always available) ## Quick Start ### Embedded Mode (Recommended for Getting Started) ```yaml # looms.yaml observability: enabled: true mode: embedded storage_type: sqlite sqlite_path: ./traces.db ``` Start server: ```bash looms serve --config looms.yaml ``` Query traces: ```bash sqlite3 ./traces.db "SELECT * FROM eval_metrics;" ``` ### Service Mode (For Production with Hawk) ```yaml # looms.yaml observability: enabled: true mode: service hawk_endpoint: http://localhost:9090/v1/traces hawk_api_key: ${HAWK_API_KEY} ``` Start server (requires `-tags hawk` build): ```bash looms serve --config looms.yaml ``` ### None Mode (For Testing) ```yaml # looms.yaml observability: enabled: false ``` Or explicitly: ```yaml observability: enabled: true mode: none ``` ## Configuration ### HawkConfig Options ```go type HawkConfig struct { Endpoint string // Hawk API endpoint (required) APIKey string // Bearer token (optional) BatchSize int // Spans per batch (default: 100) FlushInterval time.Duration // Auto-flush interval (default: 10s) MaxRetries int // Max retry attempts (default: 3) RetryBackoff time.Duration // Initial backoff (default: 1s) Privacy PrivacyConfig // Privacy settings } type PrivacyConfig struct { RedactCredentials bool // Remove passwords, API keys RedactPII bool // Redact emails, phones, SSNs AllowedAttributes []string // Keys that bypass redaction } ``` ### YAML Configuration ```yaml apiVersion: loom/v1 kind: Agent metadata: name: my-agent version: "1.0.0" spec: observability: enabled: true hawk_endpoint: http://localhost:9090 ``` ## Common Tasks ### Use Embedded Storage Store traces locally without external services: **Configuration**: ```yaml observability: enabled: true mode: embedded storage_type: sqlite # or "memory" sqlite_path: ./traces.db flush_interval: 30s ``` **Query traces**: ```bash # View sessions sqlite3 ./traces.db " SELECT id, name, status, datetime(created_at, 'unixepoch') as created FROM evals ORDER BY created_at DESC LIMIT 10; " # View metrics sqlite3 ./traces.db " SELECT eval_id, total_runs, successful_runs, ROUND(success_rate, 4) as success_rate, ROUND(avg_execution_time_ms, 2) as avg_ms FROM eval_metrics; " ``` Embedded storage includes: - Session tracking - Span storage with timing - Aggregated metrics (success rate, avg execution time) - Cost tracking (token usage) ### Send Traces to Hawk Service Export traces to centralized Hawk service: **Configuration**: ```yaml observability: enabled: true mode: service hawk_endpoint: http://localhost:9090/v1/traces hawk_api_key: ${HAWK_API_KEY} ``` **Build requirement**: Requires `-tags fts5,hawk`: ```bash go build -tags fts5,hawk -o bin/looms ./cmd/looms ``` Traces include: - LLM calls with token counts - Tool executions with timing - Conversation history - Error patterns ### Track LLM Costs Costs are tracked automatically. Access them in responses: ```go response, _ := agent.Chat(ctx, sessionID, query) fmt.Printf("Cost: $%.4f\n", response.Usage.CostUSD) fmt.Printf("Tokens: %d\n", response.Usage.TotalTokens) ``` Query costs in Hawk: ```bash hawk query --metric llm.cost --group-by session.id --timerange 24h ``` ### Enable Privacy Redaction Redact sensitive data before export: ```go tracer, _ := observability.NewHawkTracer(observability.HawkConfig{ Endpoint: "http://localhost:9090/v1/traces", Privacy: observability.PrivacyConfig{ RedactCredentials: true, RedactPII: true, AllowedAttributes: []string{ "session.id", "llm.model", "tool.name", }, }, }) ``` Redaction patterns: - Emails: `user@example.com` -> `[EMAIL_REDACTED]` - Phones: `555-123-4567` -> `[PHONE_REDACTED]` - SSNs: `123-45-6789` -> `[SSN_REDACTED]` - Credit cards: `1234-5678-9012-3456` -> `[CARD_REDACTED]` ### Use No-Op Tracer for Development Disable tracing without code changes: ```go tracer := observability.NewNoOpTracer() agent := loom.NewInstrumentedAgent(backend, llmProvider, tracer) ``` ## Examples ### Example 1: Instrumented Agent ```go package main import ( "context" "log" "os" "time" "github.com/teradata-labs/loom" "github.com/teradata-labs/loom/pkg/llm/anthropic" "github.com/teradata-labs/loom/pkg/observability" ) func main() { ctx := context.Background() // Create tracer tracer, err := observability.NewHawkTracer(observability.HawkConfig{ Endpoint: "http://localhost:9090/v1/traces", BatchSize: 100, FlushInterval: 10 * time.Second, }) if err != nil { log.Fatal(err) } defer tracer.Close() // Create LLM provider llm := anthropic.NewClient(anthropic.Config{ APIKey: os.Getenv("ANTHROPIC_API_KEY"), Model: "claude-sonnet-4-5-20250929", }) // Create instrumented agent agent := loom.NewInstrumentedAgent(backend, llm, tracer) // Use agent response, err := agent.Chat(ctx, "session-123", "Hello!") if err != nil { log.Fatal(err) } log.Printf("Response: %s", response.Content) log.Printf("Cost: $%.4f", response.Usage.CostUSD) // Flush before exit tracer.Flush(ctx) } ``` ### Example 2: Production Configuration ```go tracer, _ := observability.NewHawkTracer(observability.HawkConfig{ Endpoint: os.Getenv("HAWK_ENDPOINT"), APIKey: os.Getenv("HAWK_API_KEY"), BatchSize: 100, FlushInterval: 10 * time.Second, MaxRetries: 3, RetryBackoff: 1 * time.Second, Privacy: observability.PrivacyConfig{ RedactCredentials: true, RedactPII: true, AllowedAttributes: []string{ "session.id", "llm.provider", "llm.model", "tool.name", }, }, HTTPClient: &http.Client{ Timeout: 30 * time.Second, Transport: &http.Transport{ MaxIdleConns: 100, MaxIdleConnsPerHost: 10, IdleConnTimeout: 90 * time.Second, }, }, }) ``` ## Hawk Query Examples ### Cost Analysis ```bash # Total cost by session hawk query --metric llm.cost --group-by session.id --timerange 24h # Cost by LLM provider hawk query --metric llm.cost --group-by llm.provider --timerange 7d # Most expensive sessions hawk query --metric llm.cost --sort desc --limit 10 ``` ### Performance Analysis ```bash # LLM latency by model hawk query --metric llm.latency --group-by llm.model --timerange 24h # Slow tool executions hawk query --span tool.execute --where "duration_ms > 5000" ``` ### Error Tracking ```bash # LLM error rate hawk query --metric llm.errors.total --group-by error.type --timerange 24h # Failed tool executions hawk query --span tool.execute --where "status = error" --timerange 24h ``` ## Troubleshooting ### Traces Not Appearing 1. Check endpoint reachability: ```bash curl -X POST http://localhost:9090/v1/traces ``` 2. Verify API key (if required): ```bash export HAWK_API_KEY=your-key-here ``` 3. Force flush to see immediate results: ```go tracer.Flush(context.Background()) ``` ### High Memory Usage Reduce buffer size or flush more frequently: ```go tracer, _ := observability.NewHawkTracer(observability.HawkConfig{ BatchSize: 50, FlushInterval: 5 * time.Second, }) ``` ### Export Failures Increase retry attempts: ```go tracer, _ := observability.NewHawkTracer(observability.HawkConfig{ MaxRetries: 5, RetryBackoff: 2 * time.Second, }) ``` Check network connectivity: ```bash ping your-hawk-endpoint ```