---
name: debug:kubernetes
description: Debug Kubernetes clusters and workloads systematically with this comprehensive troubleshooting skill. Covers CrashLoopBackOff, ImagePullBackOff, OOMKilled, pending pods, service connectivity issues, PVC binding failures, and RBAC permission errors. Provides a structured four-phase debugging methodology with kubectl commands, ephemeral debug containers, and essential one-liners for diagnosing pod, service, network, and storage problems across namespaces.
---

# Kubernetes Debugging Guide

A systematic approach to diagnosing and resolving Kubernetes issues. Always start with the basics: check events and logs first.

## Common Error Patterns

### CrashLoopBackOff

**What it means:** The container repeatedly crashes and fails to start. Kubernetes restarts it with exponential backoff (10s, 20s, 40s... up to 5 minutes).

**Common causes:**
- Insufficient memory/CPU resources
- Missing dependencies in the container image
- Misconfigured liveness/readiness probes
- Application code errors or misconfigurations
- Missing environment variables or secrets

**Debug steps:**
```bash
# Check pod events and status
kubectl describe pod <pod-name> -n <namespace>

# View current and previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check resource limits vs actual usage
kubectl top pod <pod-name> -n <namespace>
```

**Solutions:**
- Tune probe `initialDelaySeconds` and `timeoutSeconds`
- Increase resource limits if hitting memory/CPU caps
- Fix missing dependencies in the Dockerfile
- Review application startup code for errors

---

### ImagePullBackOff

**What it means:** Kubernetes cannot pull the container image. It retries with increasing delay (5s, 10s, 20s... up to 5 minutes).
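Before digging into the causes, it helps to see what a correctly wired private-registry pod looks like. This is a minimal sketch; the secret name `regcred`, the registry host, and the image path are all placeholders, not values from your cluster:

```yaml
# Hypothetical example: pulling from a private registry.
# Create the secret first (server and credentials are placeholders):
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  imagePullSecrets:
    - name: regcred                               # must match the secret name above
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0  # placeholder image reference
```

If the secret is missing, misnamed, or in a different namespace than the pod, the pull fails and the pod enters ImagePullBackOff.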
**Common causes:**
- Incorrect image name or tag
- Missing registry authentication credentials
- Private registry without imagePullSecrets configured
- Network connectivity issues to the registry
- Image does not exist in the registry

**Debug steps:**
```bash
# Check pod events for the specific error
kubectl describe pod <pod-name> -n <namespace>

# Verify image name in deployment
kubectl get deployment <deployment-name> -n <namespace> -o yaml | grep image:

# Check if imagePullSecrets are configured
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets

# Test pulling the image from a node (if you have node access)
docker pull <image>:<tag>
```

**Solutions:**
- Correct the image name/tag in the deployment spec
- Create and attach an imagePullSecret for private registries
- Verify network access to the container registry
- Check that registry credentials haven't expired

---

### Pending Pods (Scheduling Failures)

**What it means:** The pod cannot be scheduled onto any node.

**Common causes:**
- Insufficient cluster resources (CPU/memory)
- Node selectors or affinity rules cannot be satisfied
- Taints without matching tolerations
- PersistentVolumeClaim not bound
- Resource quotas exceeded

**Debug steps:**
```bash
# Check why the pod is pending
kubectl describe pod <pod-name> -n <namespace>

# View cluster events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes

# Check PVC status if using persistent storage
kubectl get pvc -n <namespace>
```

**Solutions:**
- Scale up the cluster or reduce resource requests
- Adjust nodeSelector/affinity rules
- Add tolerations for node taints
- Create or fix PersistentVolume bindings
- Increase namespace resource quotas

---

### OOMKilled

**What it means:** The container was forcefully terminated (SIGKILL, exit code 137) for exceeding its memory limit.
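For reference, the limit in question lives in the container's `resources` stanza. A sketch with illustrative values (the pod name, image, and sizes are placeholders):

```yaml
# Illustrative values only: the kubelet OOM-kills the container
# when its memory usage exceeds limits.memory.
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
    - name: app
      image: myapp:latest      # placeholder image
      resources:
        requests:
          memory: "256Mi"      # used by the scheduler for placement
          cpu: "250m"
        limits:
          memory: "512Mi"      # exceeding this triggers the OOMKill (exit code 137)
          cpu: "500m"
```

Note that requests affect scheduling while limits are enforced at runtime, so a pod can schedule fine and still be OOMKilled later.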
**Common causes:**
- Memory limit set too low for the application
- Memory leak in application code
- Processing large files or datasets
- High concurrency causing memory spikes
- JVM/runtime heap misconfiguration

**Debug steps:**
```bash
# Check termination reason
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"

# View logs before termination
kubectl logs <pod-name> -n <namespace> --previous

# Check memory limits vs usage
kubectl top pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 resources:
```

**Solutions:**
- Increase memory limits in the deployment spec
- Profile the application for memory leaks
- Configure application memory settings (e.g., JVM -Xmx)
- Implement memory-efficient processing patterns
- Add horizontal pod autoscaling to distribute load

---

### Service Not Reachable

**What it means:** Cannot connect to the service from within or outside the cluster.

**Common causes:**
- Service selector doesn't match pod labels
- Pod not ready (failing readiness probe)
- NetworkPolicy blocking traffic
- Service port mismatch with container port
- Ingress/LoadBalancer misconfiguration

**Debug steps:**
```bash
# Check service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Verify pod labels match the service selector
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector

# Test connectivity from within the cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service-name>.<namespace>:<port>

# Check network policies
kubectl get networkpolicy -n <namespace>
```

**Solutions:**
- Fix the service selector to match pod labels
- Ensure pods are passing readiness probes
- Update the NetworkPolicy to allow the required traffic
- Verify port configurations match

---

### PVC Binding Failures

**What it means:** A PersistentVolumeClaim cannot bind to a PersistentVolume.
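A claim binds only when a PV (or dynamic provisioner) satisfies all of its constraints at once. A minimal sketch, assuming a StorageClass named `standard` exists in your cluster (names and sizes are illustrative):

```yaml
# Illustrative PVC: storageClassName, accessModes, and requested capacity
# must ALL be satisfiable by some PV or provisioner for binding to succeed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: standard   # must match an existing class; name is a placeholder
  accessModes:
    - ReadWriteOnce            # a PV offering only ReadWriteMany will not bind
  resources:
    requests:
      storage: 10Gi            # PV capacity must be >= this
```

Mismatches in any one of these three fields are the most common reasons a PVC stays Pending.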
**Common causes:**
- No PV available matching the PVC requirements
- StorageClass not found or misconfigured
- Access mode mismatch (RWO vs RWX)
- Insufficient storage capacity
- Zone/region constraints not met

**Debug steps:**
```bash
# Check PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# List available PVs
kubectl get pv

# Check StorageClass
kubectl get storageclass
kubectl describe storageclass <storageclass-name>

# View provisioner events
kubectl get events -n <namespace> --field-selector reason=ProvisioningFailed
```

**Solutions:**
- Create a matching PersistentVolume manually
- Fix the StorageClass name or create the required class
- Adjust access mode or capacity requirements
- Enable dynamic provisioning if available

---

### RBAC Permission Denied

**What it means:** The service account lacks the permissions required for API operations.

**Common causes:**
- Missing Role or ClusterRole
- RoleBinding not created for the service account
- Wrong namespace for the RoleBinding
- Insufficient permissions in the Role

**Debug steps:**
```bash
# Check which service account the pod uses
kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName

# Test permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<service-account>

# List roles and bindings
kubectl get roles,rolebindings -n <namespace>
kubectl get clusterroles,clusterrolebindings | grep <service-account>

# Describe a specific binding
kubectl describe rolebinding <binding-name> -n <namespace>
```

**Solutions:**
- Create a Role/ClusterRole with the required permissions
- Create a RoleBinding/ClusterRoleBinding
- Verify the binding references the correct service account
- Use namespace-scoped roles when possible

---

## Debugging Tools Reference

### kubectl describe

Get detailed information about any resource, including events.

```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>
kubectl describe svc <service-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
```

### kubectl logs

View container stdout/stderr logs.
```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> -c <container-name>  # specific container
kubectl logs <pod-name> -n <namespace> --previous           # previous instance
kubectl logs <pod-name> -n <namespace> -f                   # follow/stream
kubectl logs <pod-name> -n <namespace> --tail=100           # last 100 lines
kubectl logs -l app=myapp -n <namespace>                    # by label selector
```

### kubectl exec

Execute commands inside a running container.

```bash
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
kubectl exec <pod-name> -n <namespace> -- cat /etc/config/app.conf
kubectl exec <pod-name> -n <namespace> -c <container-name> -- env  # specific container
```

### kubectl debug

Debug nodes or pods with ephemeral containers (K8s 1.23+).

```bash
# Debug a running pod with an ephemeral container
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name>

# Debug a node
kubectl debug node/<node-name> -it --image=busybox

# Create a debug copy of a pod with a different image
kubectl debug <pod-name> -it --copy-to=debug-pod --container=app --image=busybox
```

### kubectl get events

View cluster events for troubleshooting.

```bash
kubectl get events -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events -n <namespace> --field-selector type=Warning
kubectl get events -n <namespace> -w                         # watch for new events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20  # cluster-wide recent
```

### kubectl top

View resource usage metrics (requires metrics-server).

```bash
kubectl top pods -n <namespace>
kubectl top pods -n <namespace> --sort-by=memory
kubectl top nodes
kubectl top pod <pod-name> -n <namespace> --containers
```

---

## The Four Phases of Kubernetes Debugging

### Phase 1: Gather Information

Start broad, then narrow down. Never assume the cause.

```bash
# Quick status overview
kubectl get pods,svc,deploy,rs -n <namespace>

# Recent events (often reveals the issue immediately)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Describe the problematic resource
kubectl describe <resource-type> <resource-name> -n <namespace>
```

### Phase 2: Check Logs and Metrics

Logs reveal application-level issues; metrics reveal resource issues.
```bash
# Application logs
kubectl logs <pod-name> -n <namespace> --tail=200
kubectl logs <pod-name> -n <namespace> --previous  # if crashed

# Resource metrics
kubectl top pod <pod-name> -n <namespace>
kubectl top nodes
```

### Phase 3: Interactive Investigation

Get inside the environment when logs aren't enough.

```bash
# Shell into the container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Use an ephemeral debug container
kubectl debug -it <pod-name> -n <namespace> --image=nicolaka/netshoot

# Check connectivity from inside
wget -qO- http://service-name:port
nslookup service-name
curl -v http://endpoint
```

### Phase 4: Validate and Fix

Make changes, verify they work, document the solution.

```bash
# Apply the fix
kubectl apply -f fixed-manifest.yaml

# Watch for success
kubectl get pods -n <namespace> -w
kubectl get events -n <namespace> -w

# Verify health
kubectl logs <pod-name> -n <namespace> -f
```

---

## Quick Reference Commands

### Essential One-Liners

```bash
# Get all pods with their status across namespaces
kubectl get pods -A -o wide

# Find pods not in Running state
kubectl get pods -A --field-selector=status.phase!=Running

# Get pod restart counts
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'

# Show pod resource requests and limits
kubectl get pods -n <namespace> -o=custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'

# Get events sorted by time (most recent last)
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Watch pods in real time
kubectl get pods -n <namespace> -w

# Get logs from all pods with a label
kubectl logs -l app=myapp -n <namespace> --all-containers=true

# Check endpoints for a service
kubectl get endpoints <service-name> -n <namespace>

# Test DNS resolution
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup <service-name>.<namespace>
# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<service-account>
```

### Debugging Network Issues

```bash
# Run a network debug pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# Test service connectivity
kubectl run curl --rm -it --image=curlimages/curl -- curl -v http://<service-name>:<port>

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Trace network path
kubectl exec -it <pod-name> -- traceroute <destination>
```

### Debugging Storage Issues

```bash
# List all PVCs with status
kubectl get pvc -A

# Describe a PVC for binding issues
kubectl describe pvc <pvc-name> -n <namespace>

# Check storage provisioner logs
kubectl logs -n kube-system -l app=<provisioner-label>

# Verify the mount inside the pod
kubectl exec -it <pod-name> -n <namespace> -- df -h
kubectl exec -it <pod-name> -n <namespace> -- ls -la /path/to/mount
```

---

## Useful Debug Images

| Image | Use Case |
|-------|----------|
| `busybox` | Basic shell, networking tools |
| `nicolaka/netshoot` | Comprehensive network debugging |
| `curlimages/curl` | HTTP testing |
| `alpine` | Minimal Linux with package manager |
| `gcr.io/kubernetes-e2e-test-images/jessie-dnsutils` | DNS debugging |

---

## Prevention Best Practices

1. **Always set resource requests and limits** - Prevents noisy-neighbor issues and OOMKilled
2. **Configure proper health probes** - Liveness, readiness, and startup probes with appropriate delays
3. **Use namespaces** - Isolate workloads for easier debugging
4. **Label everything** - Makes filtering and selection reliable
5. **Implement monitoring** - Prometheus, Grafana, or an ELK stack for visibility
6. **Validate manifests before deployment** - Use kubeval, kube-linter, or similar tools
7. **Use GitOps** - Track changes and enable rollback
8. **Document runbooks** - For common issues specific to your applications
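The first four practices can be sketched in a single deployment skeleton. This is illustrative only: the app name, image, port, and probe thresholds are placeholders to adapt to your workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp                 # label everything for reliable selection
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp             # must match the service/deployment selectors
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:1.0.0  # placeholder image
          ports:
            - containerPort: 8080
          resources:           # requests and limits prevent noisy neighbors / OOMKills
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:      # gates traffic until the app can serve
            httpGet:
              path: /healthz   # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:       # restarts the container if it wedges
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
```

A skeleton like this, deployed into its own namespace and tracked in Git, gives the debugging phases above far less surface area to cover.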