# Kubernetes x JobSet: How Co-evolution Makes AI Job Restarts 10× Faster In the fast-moving world of AI infrastructure, a powerful synergy is emerging: the Kubernetes community develops core capabilities, while downstream projects such as [JobSet](https://github.com/kubernetes-sigs/jobset), [Ray](https://github.com/ray-project/ray), and [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) adopt these features to deliver dramatic efficiency gains. We call this **co-evolution**—the entire ecosystem moving forward together. Kubernetes has recently introduced a growing set of AI-related capabilities. However, to fully unlock their potential for AI workloads, other projects must adapt to them. Today, we explore a representative example: **JobSet achieves a 92% restart speed improvement by leveraging Kubernetes in-place container restarts.** ## The Problem: Slow JobSet Restarts When a distributed training job running on [JobSet](https://github.com/kubernetes-sigs/jobset) needs to restart—due to transient failures, configuration updates, or checkpoint recovery—the traditional approach involves: 1. **Deleting all Pods in the JobSet** 2. **Waiting for Pod termination** to complete 3. **Re-scheduling all Pods** via the Kubernetes scheduler 4. **Waiting for Pods to start** (including image pulls, init containers, etc.) In a large-scale cluster with 5,000 nodes, this process takes about **2 minutes and 10 seconds**. For AI/ML workloads where fast recovery is critical, this overhead is significant. ## The Solution: In-Place Container Restarts Kubernetes has introduced capabilities that allow containers to restart without recreating the Pod: ### KEP-5307: Container Restart Policy (Kubernetes 1.34) [KEP-5307](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5307-container-restart-policy/README.md) introduces fine-grained control over restart behavior for individual containers within a Pod. This enables: - Specifying restart policies per container (not just per Pod) - Triggering container restarts without affecting the entire Pod - Preserving Pod identity, IP, and volumes during restarts ### KEP-5532: Restart All Containers on Container Exit (Kubernetes 1.35) [KEP-5532](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5532-restart-all-containers-on-container-exits/README.md) extends this capability to coordinated restarts: - Restarting all containers in a Pod when a specific container exits - Restarting init containers and sidecars as part of the Pod lifecycle - Enabling Pod-level restart coordination without Pod recreation ## Real-World Results: JobSet In-Place Restarts The JobSet team developed an [in-place restart prototype](https://github.com/kubernetes-sigs/jobset/compare/main...GiuseppeTT:jobset:in-place-restart-prototype) that demonstrates dramatic performance improvements: | Metric | Traditional Restart | In-Place Restart | Improvement | | --- | --- | --- | --- | | Restart time | 2 min 10 sec | 10 sec | **92% faster** | | Test scale | 5,000 nodes | 5,000 nodes | – | | Scheduling overhead | High | None | Eliminated | | Pod recreation | Required | Not required | Avoided | For detailed design information, see the [JobSet in-place restart design document](https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465). ## Why This Matters for AI Workloads ### 1. Distributed Training Recovery Large-scale distributed training jobs (PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy) are especially sensitive to restart latency: - **Checkpoint recovery**: After a failure, all workers must restart from the latest checkpoint. In-place restarts make worker recovery **12× faster**. - **Gradient synchronization**: Training can only proceed when all workers are running. Faster restarts mean less wasted GPU time. - **Cost savings**: On expensive GPU clusters ($2–10 per GPU-hour), saving 2 minutes per restart quickly adds up. ### 2. Job Dependencies Many AI pipelines have complex job dependencies. When a job restarts: - **Downstream jobs** wait for upstream completion - **Gang scheduling constraints** require all workers to be present - **Network connections** must be preserved for collective operations In-place restarts preserve Pod identity and network connections, minimizing disruption to the overall pipeline. ### 3. Resource Efficiency Traditional restarts involve: - **Scheduler load**: Finding nodes for potentially thousands of Pods - **API server load**: Creating and deleting Pod objects - **Node preparation**: Image pulls, volume mounts, init containers In-place restarts eliminate all of this overhead, reserving resources for actual workloads. ## How It Works ### Before: Traditional Restart Flow ```text Trigger job restart ↓ Delete all Pods → wait for termination (30s+) ↓ Create new Pods → wait for scheduling (30s+) ↓ Pull images (if needed) → start containers (60s+) ↓ Total: ~2 min 10 sec ```` ### After: In-Place Restart Flow ```text Trigger job restart ↓ Send container exit signal → containers restart in place (10s) ↓ Total: ~10 sec ``` Key differences: 1. **No Pod deletion**: Pod objects are preserved, maintaining identity 2. **No re-scheduling**: Pods remain on their current nodes 3. **No image pulls**: Images are already cached on the node 4. **Immediate restart**: Container processes restart directly ## Implementation Considerations ### When to Use In-Place Restarts * **Transient failures**: Container crashes, OOM kills, network timeouts * **Configuration updates**: Restarting to pick up new environment variables * **Checkpoint recovery**: Resuming training from saved state * **Rolling restarts**: Gracefully restarting workers in sequence ### When Traditional Restarts Are Required * **Node failures**: Pods must move to healthy nodes * **Resource changes**: Pods need more or fewer resources (consider VPA) * **Image updates**: A new container image is required * **Topology changes**: Pods need different placement ### Integrating with JobSet JobSet can leverage in-place restarts as follows: ```yaml apiVersion: jobset.x-k8s.io/v1alpha2 kind: JobSet metadata: name: distributed-training spec: replicatedJobs: - name: workers replicas: 8 template: spec: template: spec: restartPolicy: Always # Enable in-place restarts containers: - name: trainer image: pytorch/pytorch:latest ``` ## The Broader Co-evolution Pattern This JobSet improvement is a classic example of co-evolution in cloud-native AI: | Kubernetes Capability | Project Adoption | Benefit | | ---------------------- | ------------------- | -------------------------- | | In-place restart | JobSet | 92% faster recovery | | Gang scheduling (1.35) | Kueue, LWS | All-or-nothing placement | | DRA (1.34 GA) | NVIDIA GPU Operator | Flexible device allocation | | Workload API (1.35) | Volcano, YuniKorn | Native workload support | As Kubernetes continues to add AI-friendly features, we expect more projects to adopt them, creating a virtuous cycle of improvement. ## Getting Started ### Prerequisites * Kubernetes 1.34+ (for KEP-5307) * Kubernetes 1.35+ (for KEP-5532 Pod-level restarts) * A JobSet version that supports in-place restarts (check the latest release) ### Enable Feature Gates ```bash # Enable KEP-5307 (Container Restart Policy, 1.34+) on kubelet --feature-gates=ContainerRestartPolicy=true # Enable KEP-5532 (Restart All Containers, 1.35+) on kubelet --feature-gates=RestartAllContainersOnContainerExits=true ``` ### Test In-Place Restarts 1. Deploy a JobSet with `restartPolicy: Always` 2. Trigger a container restart (e.g., `kubectl exec ... -- kill -TERM 1`) 3. Observe the restart time compared to Pod recreation ## Future Roadmap In-place restart capabilities continue to evolve: * **KEP-5307 graduation**: Moving toward Beta/GA * **KEP-5532 enhancements**: More robust Pod-level restart control * **JobSet integration**: Native support for in-place restart policies * **Observability**: Better visibility into restart events * **Kueue integration**: Workload-aware restart handling ## Conclusion The JobSet in-place restart optimization showcases the power of co-evolution in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities, projects can achieve significant performance gains: * **92% faster restarts** (2 min 10 sec → 10 sec) * **Zero scheduling overhead** * **Preserved Pod identity and networking** * **Reduced API server load** This is just one example of how the Kubernetes community and downstream projects collaborate to improve AI workload efficiency. As more AI-related features land in Kubernetes, we can expect JobSet, Ray, LWS, and others to deliver even more optimizations. The future of AI infrastructure is co-evolution—and it’s already happening. ## References ### KEPs and Documentation * [KEP-5307: Container Restart Policy](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5307-container-restart-policy/README.md) * [KEP-5532: Restart All Containers on Container Exit](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/5532-restart-all-containers-on-container-exits/README.md) * [KEP-1287: In-Place Pod Vertical Scaling](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/1287-in-place-update-pod-resources/README.md) * [JobSet In-Place Restart Design Doc](https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465) * [JobSet In-Place Restart Prototype](https://github.com/kubernetes-sigs/jobset/compare/main...GiuseppeTT:jobset:in-place-restart-prototype) ### Related Projects * [JobSet](https://github.com/kubernetes-sigs/jobset) – Kubernetes SIG Apps * [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws) – Kubernetes SIG Apps * [Kueue](https://github.com/kubernetes-sigs/kueue) – Kubernetes SIG Scheduling * [Volcano](https://github.com/volcano-sh/volcano) – CNCF Incubating