---
name: check-logs
description: Query and analyze logs using Grafana Loki for the Kagenti platform, search for errors, and investigate issues
---

# Check Logs Skill

This skill helps you query and analyze logs from the Kagenti platform using Loki via Grafana.

## When to Use

- User asks "show me logs for X"
- Investigating errors or failures
- After deployments to check for issues
- Debugging pod crashes or restarts
- Analyzing application behavior

## What This Skill Does

1. **Query Logs**: Search logs by namespace, pod, container, or log level
2. **Error Detection**: Find errors and warnings in logs
3. **Log Aggregation**: View logs across multiple pods
4. **Time-based Queries**: Query logs for specific time ranges
5. **Log Patterns**: Detect common issues from log patterns

## Examples

### Query Logs in Grafana UI

**Access Grafana**: https://grafana.localtest.me:9443

**Navigate**: **Explore** → Select **Loki** datasource

**Log Dashboard**: https://grafana.localtest.me:9443/d/loki-logs/loki-logs

**Query Examples in Grafana Explore**:

```logql
# All logs from observability namespace
{kubernetes_namespace_name="observability"}

# Logs from specific pod
{kubernetes_pod_name=~"prometheus.*"}

# Logs with errors
{kubernetes_namespace_name="observability"} |= "error"

# Logs with level=error (set the time range, e.g. last 5 minutes, in Grafana)
{kubernetes_namespace_name="observability"} | json | level="error"

# Count errors per namespace over the last 5 minutes
sum by (kubernetes_namespace_name) (count_over_time({kubernetes_namespace_name=~".+"} |= "error" [5m]))
```

### Query Logs via CLI (Promtail/Loki)

```bash
# Query Loki for errors in the observability namespace over the last 5 minutes
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="observability"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode 'start='$(( $(date -u +%s) - 300 ))000000000 \
  --data-urlencode 'end='$(date -u +%s)000000000 | python3 -m json.tool
```

### Check Logs for Specific Pod

```bash
# Get logs for a specific pod using kubectl
kubectl logs -n observability deployment/prometheus --tail=100

# Get logs from previous container (if crashed)
kubectl logs -n observability pod/prometheus-xxx --previous

# Follow logs in real-time
kubectl logs -n observability deployment/grafana -f --tail=20

# Get logs from specific container in pod
kubectl logs -n observability pod/alertmanager-xxx -c alertmanager --tail=50
```

### Search for Errors Across Platform

```bash
# Get recent error logs from all namespaces
for ns in observability keycloak oauth2-proxy istio-system kiali-system; do
  echo "=== Errors in $ns ==="
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50 2>&1
  done | grep -i "error\|fatal\|exception" | head -5
  echo
done
```

### Check Logs for Failed Pods

```bash
# Find pods with issues and check their logs
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull" | while read ns pod rest; do
  echo "=== Logs for $pod in $ns ==="
  kubectl logs -n $ns $pod --tail=30 --previous 2>/dev/null || kubectl logs -n $ns $pod --tail=30
  echo
done
```

### Query Log Volume by Namespace

```logql
# In Grafana Explore (Loki datasource)
sum by (kubernetes_namespace_name) (
  rate({kubernetes_namespace_name=~".+"}[5m])
)
```

### Search for Specific Error Pattern

```logql
# Find connection errors
{kubernetes_namespace_name="observability"} |~ "connection (refused|timeout|reset)"

# Find authentication failures
{kubernetes_namespace_name=~"keycloak|oauth2-proxy"} |~ "auth.*fail|unauthorized|forbidden"

# Find OOM kills
{kubernetes_namespace_name=~".+"} |~ "OOM|out of memory|oom.*kill"
```
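### Run Error-Pattern Queries via Port-Forward

The same LogQL patterns can also be run from your workstation without exec'ing into the Grafana pod. The block below is a minimal sketch, assuming the Loki service is exposed as `svc/loki` on port 3100 in the `observability` namespace (the same address used by the in-cluster curl example above) and that `curl` and `python3` are available locally.

```bash
# Port-forward the Loki HTTP API to localhost (assumed service: svc/loki, port 3100)
kubectl port-forward -n observability svc/loki 3100:3100 &
PF_PID=$!
sleep 2  # give the port-forward a moment to establish

# Query the last 15 minutes for OOM kills across all namespaces
curl -s -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name=~".+"} |~ "OOM|out of memory|oom.*kill"' \
  --data-urlencode 'limit=100' \
  --data-urlencode 'start='$(( $(date -u +%s) - 900 ))000000000 \
  --data-urlencode 'end='$(date -u +%s)000000000 | python3 -m json.tool

# Clean up the port-forward
kill $PF_PID
```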
## Log Levels and Filtering

### Standard Log Levels

- **error**: Critical errors requiring attention
- **warn/warning**: Warnings that may indicate issues
- **info**: Informational messages
- **debug**: Detailed debugging information
- **trace**: Very detailed trace information

### Filter by Log Level

```logql
# Only errors
{kubernetes_namespace_name="observability"} | json | level="error"

# Errors and warnings
{kubernetes_namespace_name="observability"} | json | level=~"error|warn"

# Everything except debug
{kubernetes_namespace_name="observability"} | json | level!="debug"
```

## Common Log Queries for Platform Components

### Prometheus Logs

```bash
kubectl logs -n observability deployment/prometheus --tail=100

# Check for scrape errors
kubectl logs -n observability deployment/prometheus | grep -i "scrape\|error"
```

### Grafana Logs

```bash
kubectl logs -n observability deployment/grafana --tail=100

# Check for datasource errors
kubectl logs -n observability deployment/grafana | grep -i "datasource\|error"
```

### Keycloak Logs

```bash
kubectl logs -n keycloak statefulset/keycloak --tail=100

# Check for authentication errors
kubectl logs -n keycloak statefulset/keycloak | grep -i "auth\|login\|error"
```

### Istio Proxy (Sidecar) Logs

```bash
# Check sidecar logs for a specific pod
POD=$(kubectl get pod -n observability -l app=alertmanager -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n observability $POD -c istio-proxy --tail=50
```

### AlertManager Logs

```bash
kubectl logs -n observability deployment/alertmanager -c alertmanager --tail=100

# Check for notification errors
kubectl logs -n observability deployment/alertmanager -c alertmanager | grep -i "notif\|error\|fail"
```

## Log Analysis Patterns

### Detect Crash Loops

```bash
# Find pods restarting frequently (with -A, RESTARTS is column 5)
kubectl get pods -A | awk '$5 > 5'

# Check the logs from before the last crash
kubectl logs -n <namespace> <pod> --previous | tail -50
```

### Find HTTP Errors

```logql
{kubernetes_namespace_name=~".+"} |~ "HTTP.*[45]\\d{2}"
```

### Find Timeout Errors

```logql
{kubernetes_namespace_name=~".+"} |~ "timeout|timed out|deadline exceeded"
```

### Find Database Connection Issues

```logql
{kubernetes_namespace_name=~".+"} |~ "database.*error|connection.*refused|SQL.*error"
```

## Troubleshooting with Logs

### Issue: Service Not Starting

1. Check pod events: `kubectl describe pod <pod> -n <namespace>`
2. Check container logs: `kubectl logs <pod> -n <namespace>`
3. Check init container logs: `kubectl logs <pod> -n <namespace> -c <init-container>`

### Issue: High Error Rate

1. Query error logs: `{kubernetes_namespace_name="X"} |= "error"` over the last 5 minutes
2. Group by component: `sum by (kubernetes_pod_name) (count_over_time({...} |= "error" [5m]))` (see the CLI example below)
3. Identify recurring patterns in the error messages

### Issue: Performance Degradation

1. Check for warnings: `{kubernetes_namespace_name="X"} |= "warn"`
2. Look for timeout messages
3. Check for resource exhaustion messages
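### Example: Count Errors per Pod via the Loki API

To see which pod is driving a high error rate without opening Grafana, the grouped query from step 2 above can be issued as a Loki instant query. This is a minimal sketch, assuming the Loki address used earlier (`http://loki.observability.svc:3100`) and running curl from inside the Grafana pod as in the CLI example; swap the namespace label for the component you are investigating.

```bash
# Count error lines per pod over the last 5 minutes (Loki instant query)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query' \
  --data-urlencode 'query=sum by (kubernetes_pod_name) (count_over_time({kubernetes_namespace_name="observability"} |= "error" [5m]))' \
  | python3 -m json.tool
```

The pods with the highest values in the response are the first candidates for `kubectl logs --tail` and `--previous` inspection.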
## Grafana Loki Dashboard Features

**Loki Logs Dashboard**: https://grafana.localtest.me:9443/d/loki-logs/loki-logs

**Features**:

- **Namespace filter**: Select specific namespace
- **Pod filter**: Filter by pod name
- **Log level**: Filter by error/warn/info/debug
- **Time range**: Select time window
- **Log volume graphs**: See log rate over time
- **Log table**: Browse actual log lines

**Panels**:

1. **Log Volume by Level**: See errors vs warnings over time
2. **Log Volume by Namespace**: Compare activity across namespaces
3. **Logs per Second**: Current log ingestion rate
4. **Log Lines**: Actual log content with search

## Related Documentation

- [Loki Documentation](https://grafana.com/docs/loki/)
- [LogQL Query Language](https://grafana.com/docs/loki/latest/logql/)
- [CLAUDE.md Troubleshooting](../../../CLAUDE.md#troubleshooting)
- [Alert Runbooks](../../../docs/runbooks/alerts/) - many of the runbooks reference logs

## Pro Tips

1. **Use time ranges**: Always specify a time range to limit the amount of data scanned
2. **Filter early**: Add namespace/pod label filters before line and level filters (more efficient)
3. **Use regex carefully**: Complex regex can be slow on large log volumes
4. **Check both current and previous**: For crashed pods, use `--previous`
5. **Tail first**: Use `--tail=N` to limit output, then increase if needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)