---
name: k8s-incident
description: Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
license: Apache-2.0
metadata:
  author: rohitg00
  version: "1.0.0"
  tools: 15
  category: observability
---

# Kubernetes Incident Response

Runbooks and diagnostic workflows for common Kubernetes incidents.

## When to Apply

Use this skill when:
- User mentions: "incident", "outage", "emergency", "down", "not working"
- Operations: emergency response, production issues, service degradation
- Keywords: "urgent", "broken", "fix", "restore", "recover"

## Priority Rules

| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | `get_pods(namespace="kube-system")` |
| 2 | Assess node health | CRITICAL | `get_nodes` |
| 3 | Gather events before changes | HIGH | `get_events` |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | `rollback_deployment` |

## Quick Reference

| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | `get_pod_logs(previous=True)` | `describe_pod`, `get_events` |
| Node down | `describe_node` | Check kubelet logs |
| Service unreachable | `get_endpoints` | `get_network_policies` |
| Control plane | `get_pods(namespace="kube-system")` | Check API server logs |

## Incident Triage

### Quick Health Check

```python
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
```

### Severity Assessment

| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |

## Runbook: Pod Failures

### CrashLoopBackOff

```python
# Logs from the previous (crashed) container instance
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
# Events scoped to the failing pod
get_events(namespace, field_selector="involvedObject.name=<pod-name>")
get_pod_metrics(name, namespace)
```

**Common Causes:**
- OOMKilled → Increase memory limits
- Exit code 1 → Application error; check the logs
- Exit code 137 → SIGKILL (128 + 9), often the OOM killer
- Exit code 143 → SIGTERM (128 + 15), a graceful termination request

### ImagePullBackOff

```python
describe_pod(name, namespace)
get_secrets(namespace)
```

### Pending Pod

```python
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
```

## Runbook: Node Issues

### Node NotReady

```python
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node-name>")
node_logs_tool(name, "kubelet")
```

### Node DiskPressure

```python
describe_node(name)
# Pods scheduled on the affected node
get_pods(field_selector="spec.nodeName=<node-name>")
```

## Runbook: Network Issues

### Service Not Accessible

```python
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
```

### DNS Resolution Failures

```python
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
```

### With Cilium

```python
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
```

### With Istio

```python
istio_analyze_tool(namespace)
istio_proxy_status_tool()
```

## Runbook: Storage Issues

### PVC Pending

```python
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
```

### Pod Stuck in ContainerCreating

```python
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
```

## Runbook: Control Plane Issues

### API Server Unavailable

```python
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
```

### etcd Issues

```python
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
```

## Emergency Actions

### Force Delete Pod

```python
# Skips graceful termination; use only when the pod is stuck
delete_pod(name, namespace, grace_period=0, force=True)
```

### Rollback Deployment

```python
rollback_deployment(name, namespace, revision=0)
```

### Helm Rollback

```python
rollback_helm_release(name, namespace, revision=1)
```

## Diagnostic Collection Script

For comprehensive incident diagnostics, see [scripts/collect-diagnostics.py](scripts/collect-diagnostics.py).

## Multi-Cluster Incident Response

Check all clusters:

```python
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
```

## Post-Incident

### Document Timeline

1. When did the incident start?
2. What was the impact?
3. What was the root cause?
4. What fixed it?

### Prevent Recurrence

- Add monitoring/alerting
- Improve resource limits
- Add readiness probes
- Document runbook

## Related Skills

- [k8s-troubleshoot](../k8s-troubleshoot/SKILL.md) - Detailed debugging
- [k8s-security](../k8s-security/SKILL.md) - Security incidents
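
## Example: Decoding Exit Codes

The exit codes listed under CrashLoopBackOff follow the common convention that a code above 128 means the process was terminated by a signal, with the signal number equal to the code minus 128. The sketch below turns that rule into a small triage helper. `explain_exit_code` and `SIGNAL_NAMES` are hypothetical names introduced for illustration and are not part of this skill's tools; the signal numbers are the standard Linux values.

```python
# Hypothetical helper, not one of this skill's tools.
# Signal numbers are the standard Linux values; only a few are listed.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Return a short triage hint for a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Convention: exit code = 128 + number of the terminating signal.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        if signum == 9:
            return f"{name}: killed, possibly by the OOM killer"
        if signum == 15:
            return f"{name}: asked to terminate gracefully"
        return f"{name}: terminated by signal"
    return "application error: check the previous container logs"


# Print hints for the exit codes named in the CrashLoopBackOff runbook.
for code in (1, 137, 143):
    print(code, "->", explain_exit_code(code))
```

A hint of this kind only narrows the search. Confirm the actual termination reason with `describe_pod` and `get_pod_logs(name, namespace, previous=True)` before changing limits or rolling back.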