---
name: monitoring-guidelines
description: Monitoring guidelines for applications and infrastructure including metrics collection, alerting strategies, and SLO-based monitoring
---

# Monitoring Guidelines

Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.

## Core Monitoring Principles

- Monitor the four golden signals: latency, traffic, errors, and saturation
- Implement monitoring as code for reproducibility
- Design monitoring around user experience and business impact
- Use SLOs (Service Level Objectives) to guide alerting decisions
- Balance comprehensive coverage with actionable insights

## Key Metrics to Monitor

### Application Metrics

- Request rate (requests per second)
- Error rate (percentage of failed requests)
- Response time (p50, p90, p95, p99 latencies)
- Active connections and concurrent users
- Queue depths and processing times

### Infrastructure Metrics

- CPU utilization and load average
- Memory usage and available memory
- Disk I/O and available storage
- Network throughput and error rates
- Container and pod health (for Kubernetes)

### Business Metrics

- Transaction volumes and values
- User signups and conversions
- Feature usage and adoption rates
- Revenue-impacting events
- Customer satisfaction indicators

## Alerting Strategy

### Alert Design Principles

- Alert on symptoms, not causes
- Make alerts actionable with clear remediation steps
- Set appropriate severity levels (critical, warning, info)
- Avoid alert fatigue through proper threshold tuning
- Include runbook links in alert notifications

### SLO-Based Alerting

- Define SLOs for critical user journeys
- Calculate error budgets and burn rates
- Alert when error budget consumption is high
- Use multi-window, multi-burn-rate alerts
- Review and adjust SLOs quarterly

### Alert Configuration

- Set meaningful thresholds based on baseline data
- Use hysteresis to prevent flapping alerts
- Implement alert dependencies to reduce noise
- Route alerts to appropriate teams
- Configure escalation policies

## Dashboard Design

### Effective Dashboards

- Create overview dashboards for service health
- Build detailed dashboards for debugging
- Use consistent layouts and naming conventions
- Include time range selectors and drill-down capabilities
- Display SLO status prominently

### Dashboard Content

- Show current state and recent trends
- Include comparison to baseline or previous periods
- Display deployment markers for correlation
- Add annotations for significant events
- Include links to related dashboards and logs

## Monitoring Tools Integration

### Data Collection

- Use agents or sidecars for metric collection
- Implement service discovery for dynamic environments
- Configure appropriate scrape intervals
- Use push vs pull based on use case
- Ensure metric cardinality is manageable

### Data Storage and Retention

- Set retention periods based on use case
- Implement downsampling for long-term storage
- Use appropriate storage backends for scale
- Plan for disaster recovery of monitoring data
- Monitor your monitoring infrastructure

## Health Checks and Probes

- Implement liveness probes for crash detection
- Use readiness probes for traffic management
- Create deep health checks that verify dependencies
- Expose health endpoints in a standard format
- Monitor health check latency as a metric

## Incident Response

- Use monitoring data to detect incidents early
- Correlate metrics, logs, and traces during investigation
- Document findings and update monitoring post-incident
- Track MTTR (Mean Time to Recovery) metrics
- Conduct regular monitoring reviews and improvements

## Capacity Planning

- Track resource utilization trends
- Set alerts for approaching capacity limits
- Use forecasting for proactive scaling
- Document capacity requirements and headroom
- Review capacity quarterly
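
As an illustration of the SLO-Based Alerting guidance above (error budgets, burn rates, and multi-window, multi-burn-rate alerts), here is a minimal Python sketch. It assumes error ratios per time window have already been computed by your metrics backend. The names `BurnRateRule`, `RULES`, `burn_rate`, and `evaluate` are hypothetical, and the window pairs and thresholds are commonly cited starting points for a 99.9% SLO, not values prescribed by these guidelines; tune them per service.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One multi-window rule: both windows must exceed the threshold."""
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Illustrative window pairs and thresholds (assumed, not from the guidelines).
RULES = [
    BurnRateRule("1h", "5m", 14.4, "critical"),
    BurnRateRule("6h", "30m", 6.0, "critical"),
    BurnRateRule("3d", "6h", 1.0, "warning"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Speed of error budget consumption; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def evaluate(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the severity of every rule whose long AND short windows both burn too fast."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        # Requiring the short window too lets the alert clear soon after recovery.
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule.severity)
    return fired


# Hypothetical observed error ratios, keyed by window.
observed = {"5m": 0.03, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
print(evaluate(observed, slo_target=0.999))  # -> ['critical']
```

Pairing each long window with a shorter one is what keeps these alerts actionable: the long window confirms that a meaningful share of the budget is being spent, while the short window confirms the problem is still happening.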