---
layout: center
highlighter: shiki
css: unocss
colorSchema: dark
transition: fade-out
title: Taming Dependency Chaos for LLM in K8S
exportFilename: KubeCon HK 2025.06 - Taming Dependency Chaos for LLM in K8S
lineNumbers: false
drawings:
  persist: false
mdc: true
clicks: 0
preload: false
glowSeed: 229
routerMode: hash
---

Taming Dependency Chaos for LLM in K8S

DaoCloud Fanshi Zhang, Kebe Liu, Peter Pan
---
layout: intro
class: px-24
glowSeed: 205
---
Kubernetes
---
layout: intro
class: px-35
glowSeed: 205
---
Peter Pan
Software Engineering VP
panpan0000
Kebe Liu
Senior Software Engineer
kebe7jun
Fanshi Zhang
Senior Software Engineer
nekomeowww
---
class: py-10
glowSeed: 100
---

# Challenges Across LLM Lifecycle

From environment setup to production deployment
Dependency Hell
Dependency install overhead
Python/Node.js installs frequently fail after long waits
CUDA version drift
Incompatible versions across environments
Dependency lifecycle consistency
From development to training to inference
Tool fragmentation
pip / uv / conda / nix / pixi
Data Preparation
Unattended dataset/model preparation
Time-consuming & error-prone processes
Disparate sources
HuggingFace / S3 / NFS / Web
Data Governance
Sharing artifacts
Across teams and Kubernetes namespaces
Version control & Reproducibility
Tracking model & environment versions
LLM projects face unique infrastructure challenges beyond traditional ML
---
layout: center
class: text-center
---

# "It Works On My Machine"™

The ML Engineer's Lament
How many times have you seen this?
```bash
$ python train.py
ImportError: libcudart.so.11.0: cannot open shared object file
$ pip install torch --index-url https://download.pytorch.org/whl/cu118
RuntimeError: CUDA error: no kernel image is available for execution
$ ldd $(which python3) | grep 'not found'
libstdc++.so.6 => not found
```
---
class: py-10
glowSeed: 175
---

# Development vs Training: The Environment Gap

Bridging the divide between model development and production training
The Common Pattern
Development
Preparing new model training datasets
Training
Fine-tuning workloads with the transformers library
Inference
Inference with vLLM and transformers
pip install -r requirements.txt
python train.py
Dependency drift
Repeated downloads
No lockfile tracking
Inconsistent versions
The Dataset Solution
Single Environment, Multiple Contexts
Define once, use everywhere
Tracked dependencies with lockfiles
Automatic dependency resolution
Automatic Tool Integration
Jupyter
VSCode
No configuration needed: just click and use!
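The idea in CRD form, roughly: declare the environment once as a Dataset and mount it from every context. A minimal sketch with an illustrative name and a trimmed field set (the full CRD appears later in this deck):

```yaml
# Minimal "define once, use everywhere" sketch; illustrative field subset
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: team-pytorch-env   # hypothetical name
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      pipRequirementsTxt: |-
        torch==2.1.0
        transformers==4.36.0
```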
---
clicks: 3
---

# When Python Meets C++

Dependency Hell Emerges

The Perfect Storm: When Python Code Meets C++ Underpinnings

ML libraries are just thin Python wrappers around massive C++ and CUDA codebases

---
class: py-10
glowSeed: 123
---

# The Silent Saboteurs

Hidden issues that break ML pipelines
ABI Incompatibility
CUDA Version Conflicts
System Library Conflicts
Package Inconsistencies
Real World Impact
Hours wasted reinstalling CUDA
Inconsistent model results
Broken production deployments
---
class: py-10
clicks: 4
glowSeed: 180
---

# The Hidden Iceberg: What pip Can't See

The deceptive simplicity of Python dependencies
What you think you're installing:
torch==2.1.0 transformers==4.36.0 accelerate==0.25.0
But actually...
CUDA 11.8
gcc 9.4.0
cmake 3.22.1
libnccl2
libcudnn8
cuda-cupti-dev
libstdc++.so.6
libopenblas.so
libpython3.10.so
libcublas.so
...and dozens more
Python Package Managers
Handle Python dependencies well
Blind to underlying C++ libraries
Cannot handle compiler compatibility
The Reality
Modern ML libraries are just thin Python wrappers around massive C++ and CUDA codebases
Dependency Complexity
PyTorch source: 1.8M+ lines of C++
Python wrapper: ~100K lines of Python
Binary size: 1.7GB+ with CUDA
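One way to surface the hidden part of the iceberg is to pin the native layer alongside the Python layer, for example in a conda environment.yml. A sketch, assuming the nvidia and conda-forge channels; the exact pins are illustrative:

```yaml
# environment.yml: pin the C++/CUDA layer explicitly, not just the Python layer
name: torch-cu118
channels:
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - cuda-version=11.8   # CUDA runtime the wheels were built against
  - cudnn               # cuDNN, normally invisible to pip
  - nccl                # multi-GPU communication library
  - libstdcxx-ng        # C++ runtime providing the GLIBCXX symbols
  - pip
  - pip:
      - torch==2.1.0
      - transformers==4.36.0
```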
---
layout: center
---

# CUDA Conundrum: The Version Wars
Version 11.6: Legacy Model
Required by older frameworks

Version 11.8: PyTorch's Choice
Optimized for current models

Version 12.1: System Default
Newest features, compatibility issues

CUDA Complexity

  • Driver vs Runtime version mismatch
  • cuDNN compatibility matrix
  • NCCL version requirements

The Silent Killer

Often fails with cryptic errors or, worse, silent numerical errors in your models

---
class: py-10
clicks: 6
glow: right
---

# Compiler Chaos: When gcc Versions Wage War

The battlefield of binary compatibility
GCC Version Matrix
(table: PyTorch releases mapped to the gcc versions each was built with)
Binary Incompatibility
ImportError: /lib64/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found
undefined symbol: _ZN3c10...
C++ ABI Changes
String Implementation
// GCC 4.x (copy-on-write): a single pointer into a shared, ref-counted buffer
struct string { char* _M_p; /* length & refcount live in a hidden _Rep header */ };
// GCC 5.x+ (small-string optimization): pointer + length + in-object buffer
struct string { char* _M_p; size_t _M_string_length; union { char _M_local_buf[16]; size_t _M_allocated_capacity; }; };
Memory Layout Mismatch
The Developer Experience: "It works differently everywhere!"
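If you compile extensions yourself, the toolchain can be pinned the same way the libraries are, so every machine links against the same libstdc++. A conda sketch; the gxx_linux-64 version pin is illustrative:

```yaml
# environment.yml: pin the compiler so extensions build against a known ABI
name: ext-build
channels:
  - conda-forge
dependencies:
  - python=3.10
  - gxx_linux-64=11.*   # C++ compiler matched to the environment's libstdc++
  - cmake
  - ninja
```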
---
class: py-10
clicks: 7
glowSeed: 350
---

# Why Reusable Environments Matter

From hours of frustration to seconds of mounting
Without Reusable Environments
4-6 Hours
Per developer, per environment setup
Manual CUDA installation
System library conflicts
Disk space duplication
With Reusable Environments
30 Seconds
Just mount the shared environment
Pre-built environments
Consistent across team
Efficient storage usage
---
class: py-10
clicks: 5
glow: left
---

# The Usual Suspects: Tools We've Tried

What works, what doesn't, and why
pip & uv
Fast for Python packages
Blind to C++/CUDA deps
No system library management
Version conflicts common
Docker
Reproducible environments
Massive image sizes (5-10GB)
Slow build times (30+ min)
Resource intensive
Nix
Complete reproducibility
PhD-level learning curve
Complex configuration
K8s integration challenges
What We Need
Python package management
C++/CUDA awareness
Storage efficiency
Fast setup times
K8s native
Team consistency
---
glowSeed: 12129
---

# Introducing Datasets

Python + C++ Harmony in K8S

One solution to rule them all

Python + C++ + CUDA harmony in Kubernetes

---

# Dataset CRD

One CRD to Rule Them All
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      packageManager: CONDA
      pythonVersion: 3.11.9
      condaEnvironmentYml: |-
        channels: ['nvidia', 'conda-forge']
        dependencies:
          - 'cuda'
          - 'cuda-libraries-dev'
          - 'cuda-nvcc'
          - 'cuda-nvtx'
          - 'cuda-cupti'
      pipRequirementsTxt: |-
        transformers==4.35.0
        torch
        torchaudio
        torchvision
```
Key Features
  • Multi-source support (conda, huggingface)
  • Self-contained enterprise model hub
  • Pre-loaded datasets and models
  • Install once, use everywhere
  • Secure credential management
---
class: py-10
glowSeed: 125
---

# Datasets vs Docker: Flexibility Matters

Why writable persistent environments win for data science
Docker Approach
# Need to add a dependency? Rebuild the entire image
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# Immutable after build - can't easily modify
30+ minutes to rebuild for one new package
Read-only runtime limits dynamic ML tools
One container = one environment
Dataset CRD Approach
# Mount pre-built environments as needed
volumes:
  - name: pytorch-env
    persistentVolumeClaim:
      claimName: pytorch-2.1-env
  # Need another env? Just mount another PVC
  - name: pytorch-nightly-env
    persistentVolumeClaim:
      claimName: pytorch-nightly-env
Add packages on-the-fly in seconds
Writeable PVCs support all ML workflows
Switch multiple environments simultaneously
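Put together, a workload consumes such an environment as an ordinary mount. A hedged sketch; the Pod name, image, and mount path are illustrative, and the claim name mirrors the snippet above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job               # hypothetical name
spec:
  containers:
    - name: trainer
      image: python:3.11-slim   # any base image; the env comes from the mount
      command: ["/envs/pytorch/bin/python", "train.py"]
      volumeMounts:
        - name: pytorch-env
          mountPath: /envs/pytorch   # illustrative mount path
  volumes:
    - name: pytorch-env
      persistentVolumeClaim:
        claimName: pytorch-2.1-env   # pre-built environment PVC from above
```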
---
class: py-10
glowSeed: 182
clicks: 4
---

# How does it work?

Looking under the hood
Controller Architecture
(diagram: the controller's reconcile steps, from Dataset spec to a mounted, ready-to-use PVC)
---
class: py-4
glowSeed: 310
---
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      packageManager: CONDA
      pythonVersion: 3.11.9
      condaEnvironmentYml: |-
        channels: ['nvidia', 'conda-forge']
        dependencies:
          - 'cuda'
          - 'cuda-libraries-dev'
          - 'cuda-nvcc'
          - 'cuda-nvtx'
          - 'cuda-cupti'
      pipRequirementsTxt: |-
        transformers==4.35.0
        torch
        torchaudio
        torchvision
```
Python Environment Management
From dependency chaos to environment harmony
Conda
Full environment control
CUDA integration
C++ binary packages
pip
Familiar requirements.txt
PyPI packages
Private indexes
Pixi
Fast parallel installs
Rust-powered speed
Lockfile support
Mamba
10x faster than conda
Parallel downloads
Conda-compatible
---
class: py-10
glowSeed: 150
---

# Intelligent Dependency Approach
Optimizing the unbearable heaviness of builds
1: Fetching
Source packages & archives
Mirror for Conda & pip
Auto merge config & requirements.txt
2: Install & Build
Compiled binaries & wheels
Existing cache used
No duplicated installation
3: Persist & Activate
Environment configs
Auto discovery for Notebooks
Auto activate
Traditional Approach
CUDA setup: 45-60 min
PyTorch install: 20-30 min
With Datasets
First setup: 10-15 min
Subsequent use: seconds
---
class: py-4
glowSeed: 275
---
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: qwen3-32b
spec:
  dataSyncRound: 1
  secretRef: dataset-hf-qwen3-32b-secret
  source:
    options:
      endpoint: https://hf-mirror.com
      repoType: MODEL
    type: HUGGING_FACE
    uri: huggingface://Qwen/Qwen3-32B
  volumeClaimTemplate:
    metadata: {}
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: '0'
      storageClassName: juicefs-no-share-sc
    status: {}
```
HuggingFace & ModelScope
Models, datasets, all in one
Smart Filtering
Include/exclude patterns
Skip redundant files
options:
  exclude: "*.bin"
Advanced Features
Mirroring Support
Configurable endpoints
Regional mirrors
endpoint: https://hf-mirror.com
Token Authentication
Secure token management
Private repo access
secretRef: hf-token-secret
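The referenced secret itself is a plain Kubernetes Secret; a sketch, where the key name `token` is an assumption about what the controller expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxxxxx   # placeholder HuggingFace access token
```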
---
class: py-10
glowSeed: 215
---

# How Many Sources?

Flexible multi-source data integration
ML Model Repositories
HuggingFace
Models, datasets, spaces
ModelScope
Alibaba AI models
Environment & Packages
Conda
Environment management
Pixi
Environment management
PyPI / pip
Python packages
Storage & Version Control
Git Repositories
Code and configurations
S3-compatible
Cloud storage
Local Volumes
On-prem storage
Unified Data Access Layer
One consistent API to access all your AI assets
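In the spec, that unified layer is a single `uri` field with per-source schemes. A sketch: only the conda://, huggingface://, and dataset:// forms appear in this deck; the git and S3 forms below are illustrative assumptions:

```yaml
# One field, many sources (commented values are alternatives)
source:
  uri: conda://python?version=3.11.9                # environment definition
# uri: huggingface://Qwen/Qwen3-32B                 # model weights
# uri: dataset://ml-platform/llama3-70b-foundation  # reference to a shared Dataset
# uri: git://github.com/org/repo.git                # assumption: code & configs
# uri: s3://bucket/path                             # assumption: object storage
```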
---
class: py-10
glowSeed: 185
---

# The Real Problem: Collaboration at Scale

When every team becomes an island
The Isolation Problem
Team A
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB
Team B
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB
Same model, same env
Downloaded twice, stored twice!
The Sharing Solution
Shared Dataset
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB (once!)
Team A → uses reference
Team B → uses reference
Enterprise Impact
10 teams × 160GB model = 1.6TB duplicated → 160GB shared
---

# Cross-Namespace Dataset Sharing
```yaml {7-9}
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: llama3-foundation-ref
  namespace: nlp-team
spec:
  source:
    type: REFERENCE
    uri: dataset://ml-platform/llama3-70b-foundation
```

```yaml {10-11}
---
apiVersion: v1
kind: Pod
metadata:
  name: fine-tuning-job
spec:
  ...
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: llama3-foundation-ref # Auto-created PVC
```
How It Works
  • Reference points to shared dataset
  • Controller auto-creates local PVC & PV
  • No data duplication
  • Instant access to models
---
class: py-10
clicks: 3
glow: bottom
---

# Enterprise Model Hub in Minutes

From fragmented assets to unified ecosystem
Model Management
Version control
Metadata extraction
Ready Before You Are
Deployment Timeline
Traditional: Setup (30m) → Download Weights (6h) → Test running (30m)
With Datasets: Mount (30s) ⚡️
Time Saved: 95%
From isolated silos to unified model ecosystem
Enabling seamless collaboration across data science teams
---
class: py-10
clicks: 3
glowSeed: 338
---

# Pixi Integration: The Next Evolution

Supercharging environment creation
Conda
Minutes to hours for environment setup
Sequential dependency resolution
Incompatible ABI issues
$ conda create -n myenv python=3.12
$ conda activate myenv
$ conda install cuda -c nvidia
$ pip install torch
# Waiting... a long time... up to hours
Average setup time: 60+ minutes
Pixi.sh
Seconds to minutes for complete setup
Parallel processing of dependencies
Precompiled binaries with correct ABI
$ pixi init
$ pixi project channel add nvidia
$ pixi add cuda python=3.12
$ pixi add --pypi torch
# Ready in seconds
Average setup time: 4-5 minutes
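Wiring Pixi into the same Dataset CRD could look like this. A hedged sketch assuming the controller accepts `packageManager: PIXI`; the deck names Pixi as a supported manager, but the exact field value here is an assumption:

```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env-pixi      # hypothetical name
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.12
    options:
      packageManager: PIXI    # assumption: Pixi selected instead of CONDA
      pythonVersion: "3.12"
      pipRequirementsTxt: |-
        torch
```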
---
class: py-10
clicks: 2
glowSeed: 338
---

# Pixi Integration: How Fast?

100× difference in environment setup times!
Setup Time Comparison (smaller is better)
Conda: 45+ min (~100× slower), sequential installation
Pixi: ~30 sec (fastest), parallel processing
---
class: py-10
glowSeed: 250
---

# Metrics in Summary

The measurable benefits of Datasets
Setup time
5-10× faster
With shared environments
From hours to minutes
Storage saved
90%
Using JuiceFS dedup
10GB → 1GB typical savings
Time saved
75%
No more environment setup
Instant environment activation
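The dedup figure comes from the storage layer underneath the PVCs. With the JuiceFS CSI driver, the backing StorageClass is ordinary Kubernetes config; a sketch with illustrative names (`juicefs-no-share-sc` earlier in this deck is such a class):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc              # illustrative name
provisioner: csi.juicefs.com    # JuiceFS CSI driver
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-secret      # illustrative
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
reclaimPolicy: Retain
```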
---
class: py-10
---

# Let's Build It Together

Already open sourced
BaizeAI/dataset
---
class: py-10
---
Thank you
Slides open sourced at
Slides built on top of Slidev