--- name: k8s description: Kubernetes ops skill for deploying, operating, and troubleshooting services on Kubernetes. Use for tasks like writing manifests/Helm, configuring deployments/services/ingress, autoscaling, observability, RBAC, secrets/configmaps, rollout/rollback, incident debugging, and production readiness checks. --- # k8s Use this skill for Kubernetes 运维与发布相关工作。 ## Defaults / assumptions to confirm - Cluster type: managed (EKS/GKE/ACK) vs self-hosted - Packaging: raw YAML vs Helm vs Kustomize - Ingress: NGINX/ALB/APISIX/Istio - Observability stack: Prometheus/Grafana, Loki/ELK, tracing ## Workflow 1) Understand service requirements - Ports, protocols, health checks, resources (CPU/mem), storage needs. - SLOs: latency, availability, RPO/RTO. - Dependencies: DB, cache, MQ, external APIs. 2) Deployment design - Use `Deployment` for stateless; `StatefulSet` for stable identities/storage. - Define `readinessProbe` and `livenessProbe` (and `startupProbe` if needed). - Set `resources.requests/limits` and choose appropriate QoS. - Use `PodDisruptionBudget` for availability during maintenance. 3) Config & secrets - Config: `ConfigMap` (non-sensitive), mounted or env. - Secrets: `Secret` (sensitive) + external secret manager if available. - Never commit plaintext secrets; prefer sealed/external secrets. 4) Networking - `Service` types and DNS. - `Ingress`/Gateway routing, TLS termination, timeouts. - NetworkPolicy if cluster enforces it. 5) Scaling & resilience - `HPA` based on CPU/memory/custom metrics. - Graceful shutdown (`preStop`, terminationGracePeriodSeconds). - Retry/backoff at client; avoid retry storms. 6) Observability - Standard logs with correlation IDs. - Metrics: RPS, p95 latency, error rate, saturation. - Alerts and dashboards; runbook links. 7) Release operations - Rolling updates, canary/blue-green if needed. - `kubectl rollout status` + rollback plan. - Post-deploy verification checks and smoke tests. 8) Troubleshooting checklist - `kubectl get/describe` pods, events, and `logs`. - Check probes, image pull, env/config, DNS, network, and resource throttling. - For performance: node pressure, HPA behavior, GC/heap, connection pool limits. ## Output expectations when making changes - Provide manifests (or Helm values/templates) + brief deployment notes. - Include resource sizing rationale and probe settings. - Include rollback instructions and verification steps.