---
name: cluster-admin
description: Master Kubernetes cluster administration, from initial setup through production management. Learn cluster installation, scaling, upgrades, and HA strategies.
sasmp_version: "1.3.0"
eqhm_enabled: true
bonded_agent: 01-cluster-admin
bond_type: PRIMARY_BOND
capabilities: ["Cluster lifecycle management", "Node administration", "HA configuration", "Cluster upgrades", "etcd management", "Resource quotas", "Namespace management", "Cluster autoscaling"]
input_schema:
  type: object
  properties:
    action:
      type: string
      enum: ["create", "upgrade", "scale", "backup", "restore", "diagnose"]
    cluster_type:
      type: string
      enum: ["kind", "minikube", "kubeadm", "eks", "aks", "gke"]
    target:
      type: string
output_schema:
  type: object
  properties:
    status:
      type: string
    commands:
      type: array
    recommendations:
      type: array
---

# Cluster Administration

## Executive Summary

Production-grade Kubernetes cluster administration covering the complete lifecycle from initial deployment to day-2 operations. This skill provides deep expertise in cluster architecture, high-availability configurations, upgrade strategies, and operational best practices aligned with CKA/CKS certification standards.

## Core Competencies

### 1. Cluster Architecture Mastery

**Control Plane Components**

```
┌───────────────────────────────────────────────────────────────┐
│                         CONTROL PLANE                         │
├─────────────┬─────────────┬──────────────┬────────────────────┤
│ API Server  │ Scheduler   │ Controller   │ etcd               │
│             │             │ Manager      │                    │
│ - AuthN     │ - Pod       │ - ReplicaSet │ - Cluster state    │
│ - AuthZ     │   placement │ - Endpoints  │ - 3+ nodes for HA  │
│ - Admission │ - Node      │ - Namespace  │ - Regular backups  │
│   control   │   affinity  │ - ServiceAcc │ - Encryption       │
└─────────────┴─────────────┴──────────────┴────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────────────────────┐
│                         WORKER NODES                          │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ kubelet         │ kube-proxy      │ Container Runtime           │
│ - Pod lifecycle │ - iptables/ipvs │ - containerd (recommended)  │
│ - Node status   │ - Service VIPs  │ - CRI-O                     │
│ - Volume mount  │ - Load balance  │ - gVisor (sandboxed)        │
└─────────────────┴─────────────────┴─────────────────────────────┘
```

**Production Cluster Bootstrap (kubeadm)**

```bash
# Initialize the first control plane node of an HA cluster
# (advertise this node's routable IP -- 0.0.0.0 is not a valid advertise address)
sudo kubeadm init \
  --control-plane-endpoint "k8s-api.example.com:6443" \
  --upload-certs \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=<node-ip> \
  --apiserver-cert-extra-sans=k8s-api.example.com

# Join additional control plane nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>

# Join worker nodes
kubeadm join k8s-api.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
```
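Once `kubeadm init` succeeds, the node reports `NotReady` until a CNI plugin is installed, and the printed join token expires after 24 hours by default. A minimal post-bootstrap sketch, assuming Flannel as the CNI (chosen because its default pod CIDR matches the `10.244.0.0/16` used above; the manifest URL is Flannel's published release artifact):

```bash
# Configure kubectl for the admin user (kubeadm's default paths)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install a CNI so nodes can reach Ready; Flannel defaults to 10.244.0.0/16
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Confirm the control plane is healthy and the node transitions to Ready
kubectl get nodes
kubectl get pods -n kube-system

# Regenerate a join command after the original bootstrap token expires
kubeadm token create --print-join-command
```

Any CNI configured for the same pod CIDR (Calico, Cilium, and so on) works equally well here; only the manifest location and CIDR defaults differ.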
### 2. Node Management

**Node Lifecycle Operations**

```bash
# View node details with resource usage
kubectl get nodes -o wide
kubectl top nodes

# Label nodes for workload placement
kubectl label nodes worker-01 node-type=compute tier=production
kubectl label nodes worker-02 node-type=gpu accelerator=nvidia-a100

# Taint nodes for dedicated workloads
kubectl taint nodes worker-gpu dedicated=gpu:NoSchedule

# Cordon node (prevent new pods)
kubectl cordon worker-03

# Drain node safely (for maintenance)
kubectl drain worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s

# Return node to service
kubectl uncordon worker-03
```

**Node Problem Detector Configuration**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      tolerations:
      - operator: Exists
        effect: NoSchedule
```

### 3. High Availability Configuration

**HA Architecture Pattern**

```
                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (HAProxy/NLB)  │
                    └────────┬────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ Control Plane │    │ Control Plane │    │ Control Plane │
│    Node 1     │    │    Node 2     │    │    Node 3     │
├───────────────┤    ├───────────────┤    ├───────────────┤
│ API Server    │    │ API Server    │    │ API Server    │
│ Scheduler     │    │ Scheduler     │    │ Scheduler     │
│ Controller    │    │ Controller    │    │ Controller    │
│ etcd          │◄──►│ etcd          │◄──►│ etcd          │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        └────────────────────┴────────────────────┘
                             │
                    ┌────────┴────────┐
                    │  Worker Nodes   │
                    │  (N instances)  │
                    └─────────────────┘
```

**etcd Backup & Restore**

```bash
# Backup etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-*.db --write-out=table

# Restore etcd (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-*.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.0.10:2380 \
  --initial-advertise-peer-urls=https://10.0.0.10:2380

# Automated backup CronJob (the manifest below is a minimal reconstruction;
# the etcd image tag and /backup host path are assumptions -- adapt as needed)
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - operator: Exists
          containers:
          - name: etcd-backup
            image: registry.k8s.io/etcd:3.5.9-0
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /backup
EOF
```

### 4. Troubleshooting & Diagnostics

**Cluster Health Checks**

```bash
# Control plane component logs
kubectl logs -n kube-system kube-scheduler-<node-name>
kubectl logs -n kube-system kube-controller-manager-<node-name>

# etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Node diagnostics
kubectl describe node <node-name>
kubectl get node <node-name> -o yaml | grep -A 10 conditions
ssh <node> "journalctl -u kubelet --since '1 hour ago'"

# Certificate expiration check
kubeadm certs check-expiration

# Resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory
```
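etcd retains every keyspace revision, so a busy control plane degrades once the history grows unchecked: compaction discards old revisions, and defragmentation returns the freed space to the filesystem (the challenges table below points at the same remedy). A sketch of a manual compact-and-defrag pass, assuming `jq` is installed on the control plane node and reusing the TLS flags from the backup commands above; the `jq` field path matches etcd v3.5 JSON output and may differ in other versions:

```bash
#!/usr/bin/env bash
# Compact etcd history to the current revision, then defragment.
export ETCDCTL_API=3
flags=(--endpoints=https://127.0.0.1:2379
       --cacert=/etc/kubernetes/pki/etcd/ca.crt
       --cert=/etc/kubernetes/pki/etcd/server.crt
       --key=/etc/kubernetes/pki/etcd/server.key)

# Current revision of this member (field path assumes etcd v3.5 output)
rev=$(etcdctl "${flags[@]}" endpoint status --write-out=json \
        | jq -r '.[0].Status.header.revision')

# Discard revision history older than the current revision
etcdctl "${flags[@]}" compaction "$rev"

# Rewrite the database to reclaim space; this briefly blocks the member,
# so in an HA cluster defragment one member at a time
etcdctl "${flags[@]}" defrag

# Clear and verify any NOSPACE alarm raised while the DB was over quota
etcdctl "${flags[@]}" alarm disarm
etcdctl "${flags[@]}" alarm list
```

For steady-state clusters, setting `--auto-compaction-retention` on etcd itself makes the compaction step automatic; defragmentation still has to be scheduled explicitly.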
## Common Challenges & Solutions

| Challenge | Solution |
|-----------|----------|
| etcd performance degradation | Use SSD storage; tune compaction and defragmentation |
| Certificate expiration | Set up cert-manager; renew with `kubeadm certs renew` |
| Node resource exhaustion | Configure eviction thresholds and resource quotas |
| Control plane overload | Add control plane nodes; tune API rate limits |
| Upgrade failures | Always back up etcd first; use staged rollouts |
| kubelet not starting | Check the containerd socket and certificates |
| API server latency | Enable API Priority and Fairness; scale API servers |
| Cluster state drift | GitOps workflows, regular audits, policy enforcement |

## Success Criteria

| Metric | Target |
|--------|--------|
| Cluster uptime | 99.9% |
| API server latency (p99) | < 200 ms |
| etcd backup success | 100% |
| Node ready status | 100% |
| Upgrade success rate | 100% |
| Certificate validity remaining | > 30 days |
| Control plane pods healthy | 100% |

## Resources

- [Official Kubernetes Documentation](https://kubernetes.io/docs/)
- [Kubernetes Cluster Administration](https://kubernetes.io/docs/tasks/administer-cluster/)
- [kubeadm Reference](https://kubernetes.io/docs/reference/setup-tools/kubeadm/)
- [etcd Operations Guide](https://etcd.io/docs/v3.5/op-guide/)