---
name: kubernetes-troubleshooting
description: |
  Comprehensive Kubernetes and OpenShift cluster health analysis and troubleshooting.
  Use this skill when: (1) Proactive cluster health assessment and security analysis
  (2) Analyzing pod/container logs for errors or issues (3) Interpreting cluster events
  (kubectl get events) (4) Debugging pod failures: CrashLoopBackOff, ImagePullBackOff,
  OOMKilled (5) Diagnosing networking issues: DNS, Service connectivity, Ingress/Route
  problems (6) Investigating storage issues: PVC pending, mount failures (7) Analyzing
  node problems: NotReady, resource pressure, taints (8) Troubleshooting OCP-specific
  issues: SCCs, Routes, Operators, Builds (9) Performance analysis and resource
  optimization (10) Security vulnerability assessment and RBAC validation
metadata:
  author: cluster-skills
  version: "1.0.0"
---

# Kubernetes / OpenShift Troubleshooting Guide

Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.

## Current Versions & Tools (January 2026)

| Platform | Version | Key Changes |
|----------|---------|-------------|
| **Kubernetes** | 1.31.x | Sidecar containers GA, Pod lifecycle improvements |
| **OpenShift** | 4.17.x | OVN-Kubernetes default, enhanced web terminal |
| **EKS** | 1.31 | Pod Identity, Auto Mode, Karpenter 1.x |
| **AKS** | 1.31 | Cilium CNI, Workload Identity GA |
| **GKE** | 1.31 | Autopilot improvements, Gateway API GA |

### Troubleshooting Tools

| Tool | Install | Purpose |
|------|---------|---------|
| **k9s** | `brew install k9s` | Terminal UI |
| **stern** | `brew install stern` | Multi-pod log tailing |
| **kubectx/kubens** | `brew install kubectx` | Context switching |
| **kubectl-node-shell** | `kubectl krew install node-shell` | Node access |

## Command Usage Convention

**IMPORTANT**: This skill uses `kubectl` as the primary command. When working with:

- **OpenShift/ARO clusters**: Replace `kubectl` with `oc`
- **Standard Kubernetes (AKS, EKS, GKE)**: Use `kubectl` as shown

## Cluster Health Scoring (Popeye-Style)

Health scores range from 0-100. Issues reduce the score based on severity:

- **BOOM (Critical)**: -50 points - Security vulnerabilities, resource exhaustion, failed services
- **WARN (Warning)**: -20 points - Configuration inefficiencies, best practice violations
- **INFO (Informational)**: -5 points - Non-critical issues, optimization opportunities

### Quick Cluster Health Assessment

```bash
#!/bin/bash
# cluster-health-check.sh

echo "=== CLUSTER HEALTH ASSESSMENT ==="

# 1. Node Health (Critical)
echo "### NODE HEALTH ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
  echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

# 2. Pod Issues (Critical)
echo -e "\n### POD HEALTH ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ $POD_ISSUES -gt 0 ]; then
  echo "WARN: $POD_ISSUES pods not running"
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
  echo "✓ All pods running"
fi

# 3. Security (Critical)
echo -e "\n### SECURITY ASSESSMENT ###"
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $PRIVILEGED -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

# 4. Resource Configuration (Warning)
echo -e "\n### RESOURCE CONFIGURATION ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ $NO_LIMITS -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

# 5. Storage (Warning)
echo -e "\n### STORAGE HEALTH ###"
PENDING_PVC=$(kubectl get pvc -A --field-selector=status.phase!=Bound --no-headers | wc -l)
[ $PENDING_PVC -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"

# OpenShift: Cluster Operators
if command -v oc &> /dev/null; then
  echo -e "\n### OPENSHIFT OPERATORS ###"
  DEGRADED=$(oc get clusteroperators --no-headers | grep -c -E "False.*True|False.*False")
  [ $DEGRADED -gt 0 ] && echo "BOOM: $DEGRADED operators degraded!" || echo "✓ All operators healthy"
fi
```
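The script above only prints findings; to turn them into a Popeye-style number, tally each severity with the weights from the list above. A minimal sketch, assuming you wire the `BOOMS`/`WARNS`/`INFOS` counters up to the individual checks (the values shown here are placeholders):

```bash
#!/bin/bash
# health-score.sh - convert severity counts into a 0-100 score
# Placeholder counts; in practice, increment these inside each check above.
BOOMS=1
WARNS=3
INFOS=2

SCORE=$(( 100 - 50*BOOMS - 20*WARNS - 5*INFOS ))
(( SCORE < 0 )) && SCORE=0   # floor at zero so many findings don't go negative
echo "Cluster health score: ${SCORE}/100"
```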
## Quick Diagnostic Commands

```bash
# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide

# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

# Multi-pod log streaming
stern -n ${NAMESPACE} ${POD_PREFIX}
stern -A -l app=${APP_NAME} --since 1h

# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}

# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes
```

## Pod Status Interpretation

### Pod Phase States

| Phase | Meaning | Action |
|-------|---------|--------|
| `Pending` | Not scheduled or pulling images | Check events, node resources, PVC status |
| `Running` | At least one container running | Check container statuses if issues |
| `Succeeded` | All containers completed successfully | Normal for Jobs |
| `Failed` | All containers terminated, at least one failed | Check logs, exit codes |
| `Unknown` | Cannot determine state | Node communication issue |

### Container Waiting States

| Reason | Cause | Resolution |
|--------|-------|------------|
| `ContainerCreating` | Setting up container | Check events, volume mounts |
| `ImagePullBackOff` | Cannot pull image | Verify image name, registry access, credentials |
| `ErrImagePull` | Image pull failed | Check image exists, network, ImagePullSecrets |
| `CreateContainerConfigError` | Config error | Check ConfigMaps, Secrets exist |
| `CrashLoopBackOff` | Container repeatedly crashing | Check `logs --previous`, fix application |

### Container Exit Codes

| Exit Code | Signal | Cause | Resolution |
|-----------|--------|-------|------------|
| 0 | - | Normal exit | Expected for Jobs |
| 1 | - | Application error | Check logs for stack trace |
| 126 | - | Command not executable | Fix permissions |
| 127 | - | Command not found | Fix command path |
| 137 | SIGKILL | OOM or forced termination | Increase memory limit |
| 143 | SIGTERM | Graceful shutdown | Normal during updates |

## Event Analysis

### Critical Events to Monitor

#### Scheduling Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| `FailedScheduling` | Cannot place pod | Check node resources, taints, affinity |
| `Unschedulable` | No suitable node | Add nodes, adjust requirements |

**FailedScheduling Messages:**

```
"Insufficient cpu"                → Reduce requests or add capacity
"Insufficient memory"             → Reduce requests or add capacity
"node(s) had taint"               → Add toleration or remove taint
"node(s) didn't match selector"   → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name
```
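When `FailedScheduling` points at a taint, it helps to see exactly which taints each node carries before choosing between adding a toleration and removing the taint. A quick check (the taint key in the removal command is a placeholder):

```bash
# List every node with its taints (blank second column = no taints)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'

# Remove a taint by key - the trailing "-" deletes it
kubectl taint nodes ${NODE_NAME} ${TAINT_KEY}-
```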
taint" → Add toleration or remove taint "node(s) didn't match selector" → Fix nodeSelector/affinity "persistentvolumeclaim not found" → Create PVC or fix name ``` #### Image Events | Event | Meaning | Resolution | |-------|---------|------------| | `BackOff` | Repeated pull failures | Check image name, registry, auth | | `ErrImageNeverPull` | Image not local | Change imagePullPolicy or pre-pull | **ImagePullBackOff Diagnosis:** ```bash # Check image name kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}' # Verify ImagePullSecrets kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}' kubectl get secret ${SECRET} -n ${NAMESPACE} ``` #### Volume Events | Event | Meaning | Resolution | |-------|---------|------------| | `FailedMount` | Cannot mount volume | Check PVC, storage class | | `FailedAttachVolume` | Cannot attach | Check cloud provider, volume exists | **PVC Pending Diagnosis:** ```bash kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE} kubectl get storageclass kubectl get pv ``` ## Log Analysis Patterns ### Common Error Patterns ```bash # Search for errors kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)" # Java OOM java.lang.OutOfMemoryError → Increase memory, tune JVM heap # Connection refused ECONNREFUSED, Connection refused → Dependency not available # DNS failure ENOTFOUND, getaddrinfo → DNS resolution failed, check service name # Permission denied Permission denied → Check securityContext, runAsUser, fsGroup ``` ### Memory Issues (OOMKilled) ``` Last State: Terminated Reason: OOMKilled Exit Code: 137 → Solutions: 1. Increase memory limit 2. Profile application memory usage 3. For JVM: Set -Xmx < container limit (leave ~25% headroom) ``` ## Node Troubleshooting ### Node Conditions | Condition | Status | Meaning | |-----------|--------|---------| | `Ready` | True | Node healthy | | `Ready` | False | Kubelet not healthy | | `Ready` | Unknown | No heartbeat | | `MemoryPressure` | True | Low memory | | `DiskPressure` | True | Low disk space | | `PIDPressure` | True | Too many processes | ### Node NotReady Diagnosis ```bash kubectl describe node ${NODE_NAME} # On the node (SSH or debug) systemctl status kubelet journalctl -u kubelet -f # Check resources df -h free -m top ``` ## Networking Troubleshooting ### DNS Issues ```bash # Test DNS resolution kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \ nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local # Check CoreDNS kubectl get pods -n kube-system -l k8s-app=kube-dns kubectl logs -n kube-system -l k8s-app=kube-dns ``` ### Service Connectivity ```bash # Verify service and endpoints kubectl get svc ${SERVICE} -n ${NS} kubectl get endpoints ${SERVICE} -n ${NS} # Test from debug pod kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \ curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT} ``` ### Ingress/Route Issues ```bash # Check Ingress kubectl describe ingress ${INGRESS} -n ${NS} # Ingress controller logs kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx # OpenShift Route oc describe route ${ROUTE} -n ${NS} oc get pods -n openshift-ingress ``` ## OpenShift-Specific Troubleshooting ### Cluster Operators ```bash # Check overall health oc get clusteroperators # Investigate degraded operator oc describe clusteroperator ${OPERATOR} oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator ``` ### Security Context Constraints (SCC) ```bash # List SCCs oc get scc # Check which SCC a pod is using oc get pod ${POD} -n ${NS} -o 
### Build Failures

```bash
# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
oc logs build/${BUILD} -n ${NS}
```

## Cloud Provider Troubleshooting

### EKS (AWS)

```bash
aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
eksctl get nodegroup --cluster ${CLUSTER}
```

### AKS (Azure)

```bash
az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}
```

### GKE (Google Cloud)

```bash
gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10
```

## Diagnostic Decision Tree

### Pod Not Starting

```
Pod Phase = Pending?
├── Yes → Check Scheduling
│   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
│   ├── "node(s) had taint" → Add toleration
│   ├── "PVC not found" → Create PVC
│   └── No events → Check API server
│
└── No → Check Container Status
    ├── ImagePullBackOff → Fix image name/auth
    ├── CrashLoopBackOff → Check logs --previous
    ├── CreateContainerConfigError → Fix ConfigMap/Secret
    └── Running but not ready → Check readiness probe
```

### Application Not Responding

```
Can reach Service?
├── No → Check Service
│   ├── No endpoints → Fix selector labels
│   ├── Wrong port → Fix targetPort
│   └── NetworkPolicy blocking → Adjust policy
│
└── Yes → Check Pod
    ├── Probe failing → Fix probe or application
    ├── High latency → Check resources, dependencies
    └── Errors in logs → Fix application
```

## Performance Analysis

### Resource Optimization

```bash
# Compare usage vs requests
kubectl top pods -n ${NS}
kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory

# Find pods without limits
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"'
```

### Right-Sizing Recommendations

| Symptom | Indication | Action |
|---------|------------|--------|
| CPU throttling | CPU limit too low | Increase CPU limit |
| OOMKilled | Memory limit too low | Increase memory limit |
| Low utilization | Over-provisioned | Reduce requests |
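To act on the table above, the sketch below pairs live usage from `kubectl top` with each pod's requests so over- and under-provisioned workloads stand out. It is a rough pass: it assumes metrics-server is installed and, for brevity, only inspects each pod's first container.

```bash
#!/bin/bash
# right-size-scan.sh - rough usage-vs-request comparison (first container only)
NS=${1:?usage: right-size-scan.sh <namespace>}

# kubectl top prints: NAME CPU(cores) MEMORY(bytes)
kubectl top pods -n "$NS" --no-headers | while read -r pod cpu mem; do
  cpu_req=$(kubectl get pod "$pod" -n "$NS" \
    -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  mem_req=$(kubectl get pod "$pod" -n "$NS" \
    -o jsonpath='{.spec.containers[0].resources.requests.memory}')
  printf '%-45s cpu=%s (req %s)  mem=%s (req %s)\n' \
    "$pod" "$cpu" "${cpu_req:-none}" "$mem" "${mem_req:-none}"
done
```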