--- name: observability-guidelines description: Observability guidelines for distributed systems using OpenTelemetry, tracing, metrics, and structured logging --- # Observability Guidelines Apply these observability principles to ensure comprehensive visibility into distributed systems and microservices. ## Core Observability Principles - Guide the development of idiomatic, maintainable, and high-performance code with built-in observability - Enforce modular design and separation of concerns through Clean Architecture - Promote test-driven development and robust observability from the start ## OpenTelemetry Integration - Use OpenTelemetry for distributed tracing, metrics, and structured logging - Start and propagate tracing spans across all service boundaries - Use otel.Tracer for creating spans and otel.Meter for collecting metrics - Export data to OpenTelemetry Collector, Jaeger, or Prometheus - Configure appropriate sampling rates for production environments ## Distributed Tracing - Trace all incoming requests and propagate context through internal calls - Use middleware to instrument HTTP and gRPC endpoints automatically - Include trace context in all downstream service calls - Create child spans for significant operations within a service - Add relevant attributes to spans for debugging and analysis ## Metrics Collection Monitor these key metrics across all services: - **Request latency**: Track p50, p90, p95, and p99 percentiles - **Throughput**: Measure requests per second by endpoint - **Error rate**: Track 4xx and 5xx responses separately - **Resource usage**: Monitor CPU, memory, disk, and network utilization - **Custom business metrics**: Track domain-specific KPIs ## Structured Logging - Include unique request IDs and trace context in all logs for correlation - Use structured logging formats (JSON) for machine parseability - Include relevant context: timestamp, service name, trace ID, span ID - Log at appropriate levels: DEBUG, INFO, WARN, ERROR - Avoid logging sensitive information (PII, credentials) ## Architecture Patterns - Apply Clean Architecture with handlers, services, repositories, and domain models - Use domain-driven design principles for clear boundaries - Prioritize interface-driven development with explicit dependency injection - Prefer composition over inheritance; favor small, purpose-specific interfaces ## Correlation and Context - Propagate context through the entire request lifecycle - Use correlation IDs for request tracking across services - Include service version and deployment information in telemetry - Tag traces with relevant business context for filtering - Enable trace-to-log and log-to-trace correlation ## Alerting and Dashboards - Create dashboards for service health and business metrics - Set up alerts based on SLOs and error budgets - Use anomaly detection for proactive issue identification - Document runbooks for common alert scenarios - Review and tune alerts regularly to reduce noise ## Instrumentation Best Practices - Instrument at service boundaries (entry/exit points) - Add custom spans for database operations and external calls - Include relevant attributes (user ID, request type, etc.) - Avoid over-instrumentation that creates noise - Use semantic conventions for consistent attribute naming ## Production Considerations - Configure appropriate sampling rates to balance visibility and cost - Use head-based sampling for consistent trace capture - Implement tail-based sampling for capturing errors - Set retention policies based on debugging needs - Monitor observability infrastructure health