---
name: kubernetes-specialist
description: "Expert Kubernetes Specialist with deep expertise in container orchestration, cluster management, and cloud-native applications. Proficient in Kubernetes architecture, Helm charts, operators, and multi-cluster management across EKS, AKS, GKE, and on-premises deployments."
---

# Kubernetes Specialist

## Purpose

Provides expert guidance on Kubernetes orchestration and cloud-native applications, with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.

## When to Use

- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments
- Setting up a service mesh (Istio, Linkerd) and observability
- Implementing Kubernetes security and RBAC policies

## Quick Start

**Invoke this skill when:**

- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments

**Do NOT invoke when:**

- Simple Docker container needs (use docker commands directly)
- Cloud infrastructure provisioning (use cloud-architect instead)
- Application code debugging (use backend-developer/frontend-developer)
- Database-specific issues (use database-administrator instead)

## Decision Framework

### Deployment Strategy Selection

```
├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
```

### Resource Configuration Matrix

| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---------------|-------------|-----------|----------------|--------------|
| **Web API** | 100m-500m | 1000m | 256Mi-512Mi | 1Gi |
| **Worker** | 500m-1000m | 2000m | 512Mi-1Gi | 2Gi |
| **Database** | 1000m-2000m | 4000m | 2Gi-4Gi | 8Gi |
| **Cache** | 100m-250m | 500m | 1Gi-4Gi | 8Gi |
| **Batch Job** | 500m-2000m | 4000m | 1Gi-4Gi | 8Gi |

### Node Pool Strategy

| Use Case | Instance Type | Scaling | Cost |
|----------|--------------|---------|------|
| **System pods** | t3.large (3 nodes) | Fixed | Low |
| **Applications** | m5.xlarge | Auto 3-20 | Medium |
| **Batch/Spot** | m5.large-2xlarge | Auto 0-50 | Very Low |
| **GPU workloads** | p3.2xlarge | Manual | High |

### Red Flags → Escalate

**STOP and escalate if:**

- Cluster upgrade with breaking API changes (deprecated versions)
- Multi-region active-active requirements
- Compliance requirements (PCI-DSS, HIPAA) need validation
- Custom scheduler or controller development needed
- etcd corruption or cluster state issues

## Quality Checklist

### Cluster Configuration

- [ ] Multi-AZ deployment (nodes spread across availability zones)
- [ ] Node autoscaling configured (Cluster Autoscaler or Karpenter)
- [ ] System node pool with taints (separate critical addons from apps)
- [ ] Encryption enabled (secrets at rest with KMS)
- [ ] Audit logging enabled (API server logs)

### Security

- [ ] Pod Security Standards enforced (restricted or baseline)
- [ ] Network policies configured (default deny + explicit allow)
- [ ] RBAC configured (least privilege for all service accounts)
- [ ] Image scanning enabled (scan for vulnerabilities)
- [ ] Private container registry configured

### Resource Management

- [ ] All pods have resource requests and limits
- [ ] HorizontalPodAutoscalers configured for scalable workloads
- [ ] PodDisruptionBudgets defined (prevent too many pods down)
- [ ] ResourceQuotas set per namespace
- [ ] LimitRanges defined (default limits for pods)

### High Availability

- [ ] Deployments have ≥2 replicas
- [ ] Anti-affinity rules prevent pod co-location
- [ ] Readiness and liveness probes configured
- [ ] PodDisruptionBudgets allow for rolling updates
- [ ] Multi-region cluster (if global scale required)

### Observability

- [ ] Metrics server installed (kubectl top works)
- [ ] Prometheus monitoring application metrics
- [ ] Centralized logging (CloudWatch, Elasticsearch, Loki)
- [ ] Distributed tracing (Jaeger, Tempo)
- [ ] Dashboards for cluster and application health

### Disaster Recovery

- [ ] Velero installed for cluster backups
- [ ] Backup schedule configured (daily minimum)
- [ ] Restore tested (annual drill)
- [ ] etcd backups automated (handled by the provider on cloud-managed clusters)

## Additional Resources

- **Detailed Technical Reference**: See [REFERENCE.md](REFERENCE.md)
- **Code Examples & Patterns**: See [EXAMPLES.md](EXAMPLES.md)
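
As a quick illustration of the Resource Management and High Availability checklist items above, the manifest below is a minimal sketch of a stateless web API. It applies the "Web API" row of the Resource Configuration Matrix, runs two replicas with a soft anti-affinity rule, defines readiness and liveness probes, and adds a PodDisruptionBudget. The name `web-api`, the image reference, port 8080, and the `/readyz` and `/healthz` paths are all hypothetical placeholders, and the probe timings are illustrative rather than recommended values.

```yaml
# Sketch only. Hypothetical names throughout: web-api, the image, port 8080,
# and the /readyz and /healthz paths are placeholders, not values from this skill.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
  labels:
    app: web-api
spec:
  replicas: 2                      # checklist: Deployments have at least 2 replicas
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      affinity:
        podAntiAffinity:           # prefer placing replicas on different nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: web-api
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.0.0
          ports:
            - containerPort: 8080
          resources:               # "Web API" row of the resource matrix
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:          # gates traffic until the pod reports ready
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 10
          livenessProbe:           # restarts the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
spec:
  minAvailable: 1                  # keep one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: web-api
```

With two replicas and `minAvailable: 1`, a node drain can evict one pod at a time, so the budget protects availability without blocking rolling updates. The anti-affinity rule is the "preferred" variant here so that pods can still schedule on a small cluster; a "required" rule would be stricter but can leave replicas pending when nodes are scarce.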