---
name: grafana
description: |
  Observability visualization with Grafana and LGTM stack. Dashboard design, panel configuration, alerting, variables/templating, and data sources.
  USE WHEN: Creating Grafana dashboards, configuring panels and visualizations, writing LogQL/TraceQL queries, setting up Grafana data sources, configuring dashboard variables and templates, building Grafana alerts.
  DO NOT USE: For writing PromQL queries (use /prometheus), for alerting rule strategy (use /prometheus), for general observability architecture (use senior-infrastructure-engineer).
  TRIGGERS: grafana, dashboard, panel, visualization, logql, traceql, loki, tempo, mimir, data source, annotation, variable, template, row, stat, graph, table, heatmap, gauge, bar chart, pie chart, time series, logs panel, traces panel, LGTM stack.
triggers:
  - grafana
  - dashboard
  - panel
  - visualization
  - logql
  - traceql
  - loki
  - tempo
  - mimir
  - data source
  - annotation
  - variable
  - template
  - row
  - stat
  - graph
  - table
  - heatmap
  - gauge
  - bar chart
  - pie chart
  - time series
  - logs panel
  - traces panel
  - LGTM stack
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---

# Grafana and LGTM Stack Skill

## Overview

The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:

- **Loki**: Log aggregation and querying (LogQL)
- **Grafana**: Visualization, dashboarding, alerting, and exploration
- **Tempo**: Distributed tracing (TraceQL)
- **Mimir**: Long-term metrics storage (Prometheus-compatible)

This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

## When to Use This Skill

### Primary Use Cases

- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior

### Who Uses This Skill

- **senior-infrastructure-engineer** (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture
- **software-engineer**: Application dashboards, service metrics visualization
- **devops-engineer**: Infrastructure monitoring, deployment dashboards

## LGTM Stack Components

### Loki - Log Aggregation

#### Architecture - Loki

- Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL (see the sketch below)
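Because only labels are indexed, a LogQL query typically narrows the search by labels first and then filters log content. A minimal sketch (the `namespace` and `app` label names are assumptions about your label scheme):

```logql
# only the label matchers hit the index; the |= filter scans log content
{namespace="production", app="api"} |= "error"
```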
#### Key Concepts - Loki

- Labels for indexing (low cardinality)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters

### Grafana - Visualization

#### Features

- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization

### Tempo - Distributed Tracing

#### Architecture - Tempo

- Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible

### Mimir - Metrics Storage

#### Architecture - Mimir

- Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible

## Dashboard Design and Best Practices

### Dashboard Organization Principles

1. **Hierarchy**: Overview -> Service -> Component -> Deep Dive
2. **Golden Signals**: Latency, Traffic, Errors, Saturation (RED/USE method)
3. **Variable-driven**: Use templates for flexibility across environments
4. **Consistent Layouts**: Grid alignment (24-column grid), logical top-to-bottom flow
5. **Performance**: Limit queries, use query caching, appropriate time intervals

### Panel Types and When to Use Them

| Panel Type | Use Case | Best For |
|------------|----------|----------|
| **Time Series / Graph** | Trends over time | Request rates, latency, resource usage |
| **Stat** | Single metric value | Error rates, current values, percentages |
| **Gauge** | Progress toward a limit | CPU usage, memory, disk space |
| **Bar Gauge** | Comparative values | Top N items, distributions |
| **Table** | Structured data | Service lists, error details, resource inventory |
| **Pie Chart** | Proportions | Traffic distribution, error breakdown |
| **Heatmap** | Distribution over time | Latency percentiles, request patterns |
| **Logs** | Log streams | Error investigation, debugging |
| **Traces** | Distributed tracing | Performance analysis, dependency mapping |

### Panel Configuration Best Practices

#### Titles and Descriptions

- **Clear, descriptive titles**: Include units and metric context
- **Tooltips**: Add description fields for panel documentation
- Examples:
  - Good: "P95 Latency (seconds) by Endpoint"
  - Bad: "Latency"

#### Legends and Labels

- Show legends only when needed (multiple series)
- Use `{{label}}` format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N

#### Axes and Units

- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges

#### Thresholds and Colors

- Use thresholds for visual cues (green/yellow/red), as in the sketch after this list
- Standard threshold pattern:
  - Green: Normal operation
  - Yellow: Warning (action may be needed)
  - Red: Critical (immediate attention required)
- Examples:
  - Error rate: 0% (green), 1% (yellow), 5% (red)
  - P95 latency: <1s (green), 1-3s (yellow), >3s (red)
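A sketch of how the error-rate thresholds above map onto a panel's `fieldConfig`. The step values are fractions because the unit is `percentunit` (Grafana renders 0.05 as 5%):

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.01 },
          { "color": "red", "value": 0.05 }
        ]
      }
    }
  }
}
```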
#### Links and Drilldowns

- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels

### Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

#### Variable Types

| Type | Purpose | Example |
|------|---------|---------|
| **Query** | Populate from data source | Namespaces, services, pods |
| **Custom** | Static list of options | Environments (prod/staging/dev) |
| **Interval** | Time interval selection | Auto-adjusted query intervals |
| **Datasource** | Switch between data sources | Multiple Prometheus instances |
| **Constant** | Hidden values for queries | Cluster name, region |
| **Text box** | Free-form input | Custom filters |

#### Common Variable Patterns

```json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}
```

#### Variable Usage in Queries

Variables are referenced with `$variable_name` or `${variable_name}` syntax:

```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in the query, label in the legend
sum(rate(http_requests_total{app="$app"}[5m])) by (method)
# Legend format: "{{method}}"

# Using the interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```

#### Advanced Variable Techniques

**Regex filtering**:

```json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
```

**All option with custom value**:

```json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
```

**Dependent variables** (variable chain; applied end-to-end in the sketch after this list):

1. `$datasource` (datasource type)
2. `$cluster` (query: depends on datasource)
3. `$namespace` (query: depends on cluster)
4. `$app` (query: depends on namespace)
5. `$pod` (query: depends on app)
### Annotations

Annotations display events as vertical markers on time series panels:

```json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}
```

### Dashboard Performance Optimization

#### Query Optimization

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage `$__interval` for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available

#### Panel Performance

- Set max data points to reasonable values
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards

## Dashboard as Code and Provisioning

### Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

#### Provisioning Provider Configuration

File: `/etc/grafana/provisioning/dashboards/dashboards.yaml`

```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: 'application'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure
```

#### Dashboard JSON Structure

Complete dashboard JSON with metadata and provisioning:

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
```

#### Kubernetes ConfigMap Provisioning

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }
```

#### Grafana Operator (CRD)

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }
```

### Data Source Provisioning

#### Loki Data Source

File: `/etc/grafana/provisioning/datasources/loki.yaml`

```yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false
```
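The derived field above turns matching log lines into links to Tempo. For illustration, given a hypothetical logfmt line like the following, `matcherRegex` captures `abc123` and `$${__value.raw}` passes it as the trace ID to the Tempo data source:

```text
# hypothetical application log line (logfmt)
level=info msg="request complete" status=200 duration=42ms trace_id=abc123
```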
["job", "instance", "pod", "namespace"] mappedTags: [{ key: "service.name", value: "service" }] spanStartTimeShift: "1h" spanEndTimeShift: "1h" tracesToMetrics: datasourceUid: prometheus_uid tags: [{ key: "service.name", value: "service" }] serviceMap: datasourceUid: prometheus_uid nodeGraph: enabled: true editable: false ``` #### Mimir/Prometheus Data Source File: `/etc/grafana/provisioning/datasources/mimir.yaml` ```yaml apiVersion: 1 datasources: - name: Mimir type: prometheus access: proxy url: http://mimir:8080/prometheus uid: prometheus_uid jsonData: httpMethod: POST exemplarTraceIdDestinations: - datasourceUid: tempo_uid name: trace_id prometheusType: Mimir prometheusVersion: 2.40.0 cacheLevel: "High" incrementalQuerying: true incrementalQueryOverlapWindow: 10m editable: false ``` ## Alerting ### Alert Rule Configuration Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing. #### Prometheus/Mimir Alert Rule File: `/etc/grafana/provisioning/alerting/rules.yaml` ```yaml apiVersion: 1 groups: - name: application_alerts interval: 1m rules: - uid: error_rate_high title: High Error Rate condition: A data: - refId: A queryType: "" relativeTimeRange: from: 300 to: 0 datasourceUid: prometheus_uid model: expr: | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05 intervalMs: 1000 maxDataPoints: 43200 noDataState: NoData execErrState: Error for: 5m annotations: description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)' summary: Application error rate is above threshold runbook_url: https://wiki.company.com/runbooks/high-error-rate labels: severity: critical team: platform isPaused: false - uid: high_latency title: High P95 Latency condition: A data: - refId: A datasourceUid: prometheus_uid model: expr: | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint) ) > 2 for: 10m annotations: description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}" runbook_url: https://wiki.company.com/runbooks/high-latency labels: severity: warning ``` #### Loki Alert Rule ```yaml apiVersion: 1 groups: - name: log_based_alerts interval: 1m rules: - uid: error_spike title: Error Log Spike condition: A data: - refId: A queryType: "" datasourceUid: loki_uid model: expr: | sum(rate({app="api"} | json | level="error" [5m])) > 10 for: 2m annotations: description: "Error log rate is {{ $values.A.Value }} logs/sec" summary: Spike in error logs detected labels: severity: warning - uid: critical_error_pattern title: Critical Error Pattern Detected condition: A data: - refId: A datasourceUid: loki_uid model: expr: | sum(count_over_time({app="api"} |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m] )) > 0 for: 0m annotations: description: "Critical error pattern found in logs" labels: severity: critical page: true ``` ### Contact Points and Notification Policies File: `/etc/grafana/provisioning/alerting/contactpoints.yaml` ```yaml apiVersion: 1 contactPoints: - orgId: 1 name: slack-critical receivers: - uid: slack_critical type: slack settings: url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL title: "{{ .GroupLabels.alertname }}" text: | {{ range .Alerts }} *Alert:* {{ .Labels.alertname }} *Summary:* {{ .Annotations.summary }} *Description:* {{ .Annotations.description }} *Severity:* {{ .Labels.severity }} {{ end }} disableResolveMessage: false - orgId: 1 name: pagerduty-oncall receivers: - uid: pagerduty_oncall type: pagerduty settings: 
### Contact Points and Notification Policies

File: `/etc/grafana/provisioning/alerting/contactpoints.yaml`

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

policies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true
      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h
```

## LogQL Query Patterns

### Basic Log Queries

#### Stream Selection

```logql
# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}
```

#### Line Filters

```logql
# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"
```

### Parsing and Extraction

#### JSON Parsing

```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON (nested keys are flattened with underscores)
{app="api"} | json | line_format "{{.response_status}}"
```

#### Logfmt Parsing

```logql
# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"
```

#### Pattern Parsing

```logql
# Extract with a pattern (nginx access log)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```

### Aggregations and Metrics

#### Count Queries

```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
```

#### Extracted Metrics

```logql
# Average duration
avg_over_time({app="api"} | logfmt | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95,
  {app="api"} | regexp `duration=(?P<duration>[0-9.]+)ms` | unwrap duration [5m]
) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"} | json | level="error" [1h])
  )
)
```

## TraceQL Query Patterns

### Basic Trace Queries

```traceql
# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }
```

### Advanced TraceQL

```traceql
# Parent-child relationship (direct child)
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```
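TraceQL can also aggregate over the spans matched within each trace. A sketch (attribute names assume OpenTelemetry semantic conventions, as in the examples above):

```traceql
# traces where the api service emitted more than 3 slow PostgreSQL spans
{ .service.name = "api" && .db.system = "postgresql" && duration > 500ms } | count() > 3

# traces whose average matched-span duration in the checkout service exceeds 1s
{ .service.name = "checkout" } | avg(duration) > 1s
```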
"type": "query", "datasource": "Mimir", "query": "label_values(up, app)", "current": { "selected": false, "text": "api", "value": "api" } }, { "name": "namespace", "type": "query", "datasource": "Mimir", "query": "label_values(up{app=\"$app\"}, namespace)", "multi": true, "includeAll": true } ] }, "panels": [ { "id": 1, "title": "Request Rate", "type": "graph", "datasource": "Mimir", "targets": [ { "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)", "legendFormat": "{{method}} - {{status}}" } ], "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }, "yaxes": [ { "format": "reqps", "label": "Requests/sec" } ] }, { "id": 2, "title": "P95 Latency", "type": "graph", "datasource": "Mimir", "targets": [ { "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))", "legendFormat": "{{endpoint}}" } ], "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }, "yaxes": [ { "format": "s", "label": "Duration" } ], "thresholds": [ { "value": 1, "colorMode": "critical", "fill": true, "line": true, "op": "gt" } ] }, { "id": 3, "title": "Error Rate", "type": "graph", "datasource": "Mimir", "targets": [ { "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))", "legendFormat": "Error %" } ], "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }, "yaxes": [ { "format": "percentunit", "max": 1, "min": 0 } ], "alert": { "conditions": [ { "evaluator": { "params": [0.01], "type": "gt" }, "operator": { "type": "and" }, "query": { "params": ["A", "5m", "now"] }, "reducer": { "type": "avg" }, "type": "query" } ], "frequency": "1m", "handler": 1, "name": "Error Rate Alert", "noDataState": "no_data", "notifications": [] } }, { "id": 4, "title": "Recent Error Logs", "type": "logs", "datasource": "Loki", "targets": [ { "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"", "refId": "A" } ], "options": { "showTime": true, "showLabels": false, "showCommonLabels": false, "wrapLogMessage": true, "dedupStrategy": "none", "enableLogDetails": true }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 } } ], "links": [ { "title": "Explore Logs", "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}", "type": "link", "icon": "doc" }, { "title": "Explore Traces", "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}", "type": "link", "icon": "gf-traces" } ] } } ``` ## LGTM Stack Configuration ### Loki Configuration File: `loki.yaml` ```yaml auth_enabled: false server: http_listen_port: 3100 grpc_listen_port: 9096 log_level: info common: path_prefix: /loki storage: filesystem: chunks_directory: /loki/chunks rules_directory: /loki/rules replication_factor: 1 ring: kvstore: store: inmemory schema_config: configs: - from: 2024-01-01 store: tsdb object_store: s3 schema: v13 index: prefix: index_ period: 24h storage_config: aws: s3: s3://us-east-1/my-loki-bucket s3forcepathstyle: true tsdb_shipper: active_index_directory: /loki/tsdb-index cache_location: /loki/tsdb-cache shared_store: s3 limits_config: retention_period: 744h # 31 days ingestion_rate_mb: 10 ingestion_burst_size_mb: 20 max_query_series: 500 max_query_lookback: 30d reject_old_samples: true 
### Tempo Configuration

File: `tempo.yaml`

```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```

## Production Best Practices

### Performance Optimization

#### Query Optimization

- Use label filters before line filters
- Limit time ranges for expensive queries
- Use `unwrap` instead of parsing when possible
- Cache query results with the query frontend

#### Dashboard Performance

- Limit the number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use `$__interval` for adaptive sampling

#### Storage Optimization

- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth

### Security Best Practices

#### Authentication

- Enable auth (`auth_enabled: true` in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation

#### Authorization

- Configure RBAC in Grafana
- Limit data source access by team
- Use folder permissions for dashboards

#### Network Security

- TLS for all components
- Network policies in Kubernetes
- Rate limiting at ingress

### Troubleshooting

#### Common Issues

1. **High cardinality**: Too many unique label combinations
   - Solution: Reduce label dimensions; use log parsing instead
2. **Query timeouts**: Complex queries on large datasets
   - Solution: Reduce the time range, use aggregations, add query limits
3. **Storage growth**: Unbounded retention
   - Solution: Configure retention policies, enable compaction
4. **Missing traces**: Incomplete trace data
   - Solution: Check sampling rates, verify instrumentation

## Resources

- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [Tempo Documentation](https://grafana.com/docs/tempo/latest/)
- [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
- [LogQL Cheat Sheet](https://grafana.com/docs/loki/latest/logql/)
- [TraceQL Guide](https://grafana.com/docs/tempo/latest/traceql/)
- [Grafana Operator](https://github.com/grafana-operator/grafana-operator)