---
name: eks-observability
description: EKS observability with metrics, logging, and tracing. Use when setting up monitoring, configuring logging pipelines, implementing distributed tracing, building production dashboards, troubleshooting EKS issues, optimizing observability costs, or establishing SLOs.
---

# EKS Observability

## Overview

Complete observability solution for Amazon EKS using AWS-native managed services and open-source tools. This skill implements the three-pillar approach (metrics, logs, traces) with 2025 best practices, including ADOT, Amazon Managed Prometheus, Fluent Bit, and OpenTelemetry.

**Keywords**: EKS monitoring, CloudWatch Container Insights, Prometheus, Grafana, ADOT, Fluent Bit, X-Ray, OpenTelemetry, distributed tracing, log aggregation, metrics collection, observability stack

**Status**: Production-ready with 2025 best practices

## When to Use This Skill

- Setting up monitoring for EKS clusters
- Implementing centralized logging pipelines
- Configuring distributed tracing
- Building production dashboards in Grafana
- Troubleshooting application performance
- Establishing SLOs and error budgets
- Optimizing observability costs
- Migrating from X-Ray SDKs to OpenTelemetry
- Correlating metrics, logs, and traces
- Setting up alerting and on-call runbooks

## The Three-Pillar Approach (2025 Recommendation)

### 1. Metrics

**CloudWatch Container Insights + Amazon Managed Prometheus (AMP)**

- Dual monitoring provides complete visibility
- CloudWatch for AWS-native integration and quick setup
- Prometheus for advanced queries and community dashboards
- Amazon Managed Grafana for visualization

### 2. Logs

**Fluent Bit → CloudWatch Logs**

- Lightweight log forwarder (AWS deprecated Fluentd in February 2025)
- DaemonSet deployment for automatic collection
- Structured logging with JSON parsing
- Optional aggregation to OpenSearch for analytics

### 3. Traces

**ADOT → AWS X-Ray**

- OpenTelemetry standard (X-Ray SDKs entering maintenance mode in 2026)
- ADOT Collector converts OTLP to X-Ray format (see the pod-spec sketch below)
- Distributed tracing across microservices
- Integration with CloudWatch ServiceLens
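In this model, applications export OTLP straight to the ADOT Collector. A minimal sketch of that wiring, assuming a hypothetical `my-app` Deployment and image; the collector Service name assumes the OpenTelemetry Operator's `<name>-collector` convention for a collector named `adot-collector` in the `adot` namespace:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest  # hypothetical application image
          env:
            # Standard OTel SDK env vars: export OTLP over gRPC to the
            # ADOT Collector (Service name assumes the Operator's
            # "<collector-name>-collector" convention in the adot namespace).
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://adot-collector-collector.adot.svc.cluster.local:4317"
            - name: OTEL_SERVICE_NAME
              value: "my-app"
```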
## Quick Start Workflow

### Step 1: Enable CloudWatch Container Insights

**Using the EKS Add-on (Recommended):**

```bash
# Create IRSA for the CloudWatch agent; the AWS-managed
# CloudWatchAgentServerPolicy is attached directly, so no custom
# policy needs to be created first
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --override-existing-serviceaccounts

# Install the Container Insights add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/CloudWatchAgentRole
```

**Verify Installation:**

```bash
# Check add-on status
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

# Verify pods are running
kubectl get pods -n amazon-cloudwatch
```

**What You Get:**

- Node-level metrics (CPU, memory, disk, network)
- Pod-level metrics (resource usage, restart counts)
- Namespace-level aggregations
- Automatic CloudWatch Logs integration
- Pre-built CloudWatch dashboards

### Step 2: Deploy Amazon Managed Prometheus

**Create an AMP Workspace:**

```bash
# Create the workspace
aws amp create-workspace \
  --alias my-cluster-metrics \
  --region us-west-2

# Get the workspace ID
WORKSPACE_ID=$(aws amp list-workspaces \
  --alias my-cluster-metrics \
  --query 'workspaces[0].workspaceId' \
  --output text)

# Create IRSA for AMP ingestion
eksctl create iamserviceaccount \
  --name amp-ingest \
  --namespace prometheus \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve
```

**Deploy kube-prometheus-stack:**

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with AMP remote write
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --set prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write \
  --set prometheus.prometheusSpec.remoteWrite[0].sigv4.region=us-west-2 \
  --set prometheus.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::ACCOUNT_ID:role/AMPIngestRole"
```

**What You Get:**

- Prometheus Operator for CRD-based monitoring (see the ServiceMonitor sketch below)
- Node Exporter for hardware metrics
- kube-state-metrics for cluster state
- Alertmanager for alert routing
- 100+ pre-built Grafana dashboards
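With the Prometheus Operator in place, new scrape targets are declared as `ServiceMonitor` resources rather than static scrape configs. A minimal sketch, assuming a hypothetical Service labeled `app: my-app` that exposes a named `metrics` port; by default the chart's Prometheus only discovers monitors carrying the Helm release label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: prometheus
  labels:
    release: kube-prometheus-stack  # must match the Helm release name
spec:
  selector:
    matchLabels:
      app: my-app          # labels on the target Service
  namespaceSelector:
    matchNames:
      - default            # namespace where the Service lives
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s
```

Apply it with `kubectl apply -f`, then confirm the new target shows up under Status → Targets in the Prometheus UI.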
### Step 3: Deploy Fluent Bit for Logging

**Create IRSA for Fluent Bit:**

```bash
eksctl create iamserviceaccount \
  --name fluent-bit \
  --namespace logging \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve
```

**Deploy Fluent Bit:**

```bash
# The aws-for-fluent-bit chart (eks-charts repo) exposes the
# cloudWatch.* values used below
helm repo add eks https://aws.github.io/eks-charts
helm install fluent-bit eks/aws-for-fluent-bit \
  --namespace logging \
  --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::ACCOUNT_ID:role/FluentBitRole" \
  --set cloudWatch.enabled=true \
  --set cloudWatch.region=us-west-2 \
  --set cloudWatch.logGroupName=/aws/eks/my-cluster/logs \
  --set cloudWatch.autoCreateGroup=true
```

**What You Get:**

- Automatic log collection from all pods
- Structured JSON log parsing
- CloudWatch Logs integration
- Multi-line log support
- Kubernetes metadata enrichment

### Step 4: Deploy ADOT for Distributed Tracing

**Install the ADOT Operator:**

```bash
# Create IRSA for the ADOT Collector
eksctl create iamserviceaccount \
  --name adot-collector \
  --namespace adot \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve

# Install the ADOT add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name adot \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/ADOTCollectorRole
```

**Deploy the ADOT Collector:**

```yaml
# adot-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: adot
spec:
  mode: deployment
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 30s
        send_batch_size: 50
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      awsxray:
        region: us-west-2
      awsemf:
        region: us-west-2
        namespace: EKS/Observability
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsemf]
```

```bash
kubectl apply -f adot-collector.yaml
```

**What You Get:**

- OTLP receiver for OpenTelemetry traces
- Automatic X-Ray integration
- Service map visualization
- Trace sampling and filtering (see the sampling sketch below)
- CloudWatch ServiceLens integration
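The production checklist later recommends sampling 5-10% of traces on high-traffic clusters. One way to do that head-based sampling is the Collector's `probabilistic_sampler` processor; a hedged excerpt of the manifest above, with the 10% rate as an illustrative value:

```yaml
# Excerpt of the collector config above, with head-based sampling added.
# The probabilistic_sampler processor ships with the ADOT Collector
# distribution; 10% is an illustrative rate -- tune per traffic volume.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, sampling next, batch last
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [awsxray]
```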
### Step 5: Set Up Amazon Managed Grafana

**Create an AMG Workspace:**

```bash
# Creating the workspace via the AWS Console is recommended,
# or use the AWS CLI:
aws grafana create-workspace \
  --workspace-name my-cluster-grafana \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED
```

**Add Data Sources:**

1. Navigate to the AMG workspace URL
2. Configuration → Data Sources → Add data source
3. Add **Amazon Managed Service for Prometheus**
   - Region: us-west-2
   - Workspace: select your AMP workspace
4. Add **CloudWatch**
   - Default region: us-west-2
   - Namespaces: ContainerInsights, EKS/Observability
5. Add **AWS X-Ray**
   - Default region: us-west-2

**Import Dashboards:**

Grafana.com community dashboard IDs:

- EKS Container Insights: 16028
- Node Exporter Full: 1860
- Kubernetes Cluster Monitoring: 15760

## Production Deployment Checklist

### Infrastructure

- [ ] CloudWatch Container Insights enabled (EKS add-on)
- [ ] Amazon Managed Prometheus workspace created
- [ ] kube-prometheus-stack deployed with remote write
- [ ] Fluent Bit DaemonSet running on all nodes
- [ ] ADOT Collector deployed (Deployment or DaemonSet)
- [ ] Amazon Managed Grafana workspace created
- [ ] All IRSA roles configured with least-privilege policies

### Configuration

- [ ] Prometheus scrape configs include all targets
- [ ] Fluent Bit log groups created and structured
- [ ] ADOT sampling configured (5-10% for high traffic)
- [ ] Grafana data sources connected (AMP, CloudWatch, X-Ray)
- [ ] Log retention policies set (7-90 days typical)
- [ ] Metric retention configured (AMP default: 150 days)

### Dashboards

- [ ] Cluster overview dashboard (nodes, pods, namespaces)
- [ ] Application performance dashboard (latency, errors, throughput)
- [ ] Resource utilization dashboard (CPU, memory, disk)
- [ ] Cost monitoring dashboard (resource waste, right-sizing)
- [ ] Network performance dashboard (CNI metrics)

### Alerting

- [ ] Critical alerts: pod crash loops, node not ready (see the PrometheusRule sketch after this checklist)
- [ ] Performance alerts: high latency, error-rate spikes
- [ ] Resource alerts: CPU/memory pressure, disk full
- [ ] Cost alerts: budget thresholds, waste detection
- [ ] SNS topics configured for notifications
- [ ] PagerDuty/Opsgenie integration (optional)

### Application Instrumentation

- [ ] OpenTelemetry SDK integrated in applications
- [ ] Trace context propagation configured
- [ ] Custom metrics exported via OTLP
- [ ] Structured logging with JSON format
- [ ] Log correlation with trace IDs
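For the critical alerts above, kube-prometheus-stack evaluates any `PrometheusRule` resource that carries the Helm release label. A minimal crash-loop sketch; the threshold and namespace are illustrative assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-alerts
  namespace: prometheus
  labels:
    release: kube-prometheus-stack  # must match the Helm release name
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # kube-state-metrics restart counter; fires on more than
          # 3 restarts within 15 minutes, sustained for 5 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```

Alertmanager (installed with the stack) then routes the alert to SNS, PagerDuty, or Opsgenie per the receivers you configure.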
## Modern Observability Stack (2025)

```
┌──────────────────────────────────────────────────────────────┐
│                         EKS Cluster                          │
│                                                              │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │ Application  │   │ Application  │   │ Application  │    │
│   │ + OTel SDK   │   │ + OTel SDK   │   │ + OTel SDK   │    │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘    │
│          │                  │                  │            │
│          └──────────────────┴──────────────────┘            │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │ ADOT Collector  │                      │
│                    │     (OTel)      │                      │
│                    └────────┬────────┘                      │
│                             │                               │
│          ┌──────────────────┼──────────────────┐            │
│          │                  │                  │            │
│    ┌─────▼────┐        ┌────▼─────┐       ┌────▼─────┐      │
│    │Prometheus│        │Fluent Bit│       │Container │      │
│    │ (local)  │        │DaemonSet │       │ Insights │      │
│    └─────┬────┘        └────┬─────┘       └────┬─────┘      │
└──────────┼──────────────────┼──────────────────┼────────────┘
           │                  │                  │
     ┌─────▼─────┐       ┌────▼─────┐       ┌────▼─────┐
     │    AMP    │       │CloudWatch│       │  X-Ray   │
     │ (Managed  │       │   Logs   │       │          │
     │Prometheus)│       └────┬─────┘       └────┬─────┘
     └─────┬─────┘            │                  │
           │                  │                  │
           └──────────────────┴──────────────────┘
                              │
                     ┌────────▼────────┐
                     │ Amazon Managed  │
                     │    Grafana      │
                     └─────────────────┘
```

## Detailed Documentation

For comprehensive guides on each observability component:

- **Metrics Collection**: [references/metrics.md](references/metrics.md)
  - CloudWatch Container Insights setup
  - Amazon Managed Prometheus configuration
  - kube-prometheus-stack deployment
  - Custom metrics and ServiceMonitors
  - Cost optimization strategies
- **Centralized Logging**: [references/logging.md](references/logging.md)
  - Fluent Bit configuration and parsers
  - CloudWatch Logs integration
  - OpenSearch aggregation (optional)
  - Log retention and lifecycle policies
  - Troubleshooting log collection
- **Distributed Tracing**: [references/tracing.md](references/tracing.md)
  - ADOT Collector deployment patterns
  - OpenTelemetry SDK instrumentation
  - X-Ray integration and migration
  - Trace sampling strategies
  - ServiceLens and trace analysis

## Cost Optimization

### Metrics

- Sample high-cardinality metrics (5-10% of labels)
- Use metric relabeling to drop unnecessary labels
- Aggregate metrics before remote write to AMP
- Set appropriate retention periods (30-90 days typical)

### Logs

- Implement log sampling for verbose applications
- Use CloudWatch Logs Insights instead of exporting to S3
- Set aggressive retention for debug logs (7 days)
- Keep audit logs longer (90+ days)

### Traces

- Sample traces based on traffic (5-10% default)
- Increase sampling for errors (100%)
- Use tail-based sampling for important transactions
- Clean up old X-Ray traces (default retention is 30 days)

**Typical Monthly Costs:**

- Small cluster (10 nodes): $50-150/month
- Medium cluster (50 nodes): $200-500/month
- Large cluster (200+ nodes): $1,000-2,000/month

## Integration Patterns

### Correlation Between Pillars

**Metrics → Logs:**

```promql
# Find pods with high error rates
rate(http_requests_total{status=~"5.."}[5m]) > 0.1

# Then search CloudWatch Logs for those pod names
```

**Logs → Traces:** include the trace ID in structured logs:

```json
{
  "timestamp": "2025-01-27T10:30:00Z",
  "level": "error",
  "message": "Database connection failed",
  "trace_id": "1-67a2f3b1-123456789abcdef012345678",
  "span_id": "abcdef0123456789"
}
```

**Traces → Metrics:**

- Use trace data to identify slow endpoints
- Create SLIs from trace latency percentiles
- Alert on trace error rates

### CloudWatch ServiceLens

A unified view combining:

- X-Ray traces (request flow)
- CloudWatch metrics (performance)
- CloudWatch Logs (detailed context)

ServiceLens is enabled automatically once Container Insights and X-Ray are both receiving data; you access it through the CloudWatch console (service map and trace views) rather than a dedicated CLI command.

## Troubleshooting Quick Reference

| Issue | Cause | Fix |
|-------|-------|-----|
| No metrics in AMP | Missing IRSA or remote-write config | Check Prometheus pod logs; verify the IAM role |
| Logs not appearing | Fluent Bit not running or wrong IAM | `kubectl logs -n logging fluent-bit-xxx` |
| Traces not in X-Ray | ADOT not deployed or app not instrumented | Verify ADOT pods; check OTel SDK setup |
| High costs | Too much data ingestion | Enable sampling; reduce log verbosity |
| Missing pod metrics | kube-state-metrics not running | Check kube-prometheus-stack installation |
| Grafana can't connect | Data-source IAM permissions | Add CloudWatch/AMP read policies to the AMG role |

## Production Runbooks

### Incident Response

1. **Check the Grafana overview dashboard** - identify affected services
2. **Review the X-Ray service map** - find the bottleneck in the request flow
3. **Query CloudWatch Logs Insights** - get detailed error messages
4. **Correlate with the metrics spike** - understand timeline and scope
5. **Execute remediation** - scale, restart, or roll back

### Performance Investigation

1. **Start with RED metrics** (Rate, Errors, Duration)
2. **Check USE metrics** (Utilization, Saturation, Errors) for infrastructure
3. **Analyze trace percentiles** (p50, p95, p99)
4. **Review log patterns** during slow periods
5. **Identify optimization opportunities**

## SLO Implementation

**Define SLIs (Service Level Indicators):**

```yaml
# Availability SLI
- metric: probe_success
  target: 99.9%
  window: 30d

# Latency SLI
- metric: http_request_duration_seconds
  percentile: p99
  target: < 500ms
  window: 30d

# Error Rate SLI
- metric: http_requests_total{status=~"5.."}
  target: < 0.1%
  window: 30d
```

**Calculate the Error Budget:**

```
Error Budget = 100% - SLO Target
Example: 99.9% SLO = 0.1% error budget ≈ 43.2 minutes of downtime per 30-day month
```

**Burn-Rate Alerts:**

```promql
# Fast burn (5% of the budget in 1 hour)
(1 - slo:availability:ratio_rate_1h) > 0.05

# Slow burn (10% of the budget in 6 hours)
(1 - slo:availability:ratio_rate_6h) > 0.1
```
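The burn-rate expressions above reference `slo:availability:ratio_rate_*` series, which come from recording rules rather than raw metrics. A minimal sketch of those rules, assuming the same `http_requests_total` metric with a `status` label used in the SLI definitions:

```yaml
# Hypothetical recording rules; adjust the metric and labels to
# match your services.
groups:
  - name: slo-availability
    rules:
      # Fraction of requests that succeeded (non-5xx) over 1 hour
      - record: slo:availability:ratio_rate_1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h]))
      # Same ratio over 6 hours, for slow-burn detection
      - record: slo:availability:ratio_rate_6h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
          sum(rate(http_requests_total[6h]))
```

Load these the same way as the alerting rules, e.g. via the kube-prometheus-stack `additionalPrometheusRulesMap` value or a release-labeled `PrometheusRule` resource.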
## Best Practices Summary

1. **Use Dual Monitoring**: CloudWatch Container Insights + Prometheus
2. **Standardize on OpenTelemetry**: Future-proof instrumentation
3. **Enable IRSA for Everything**: No node IAM roles
4. **Deploy the ADOT Collector**: Vendor-neutral observability
5. **Sample Intelligently**: 5-10% of traces, 100% of errors
6. **Structure Your Logs**: JSON format with trace correlation
7. **Set Retention Policies**: Balance cost and compliance
8. **Build Actionable Dashboards**: Focus on SLIs and anomalies
9. **Implement Progressive Alerting**: Warn before critical
10. **Regularly Review Costs**: Optimize based on actual usage

---

**Stack**: CloudWatch Container Insights, AMP, Fluent Bit, ADOT, AMG, X-Ray
**Standards**: OpenTelemetry, IRSA, EKS Add-ons
**Last Updated**: January 2025 (2025 best practices)