---
layout: center
highlighter: shiki
css: unocss
colorSchema: dark
transition: fade-out
title: Taming Dependency Chaos for LLM in K8S
exportFilename: KubeCon HK 2025.06 - Taming Dependency Chaos for LLM in K8S
lineNumbers: false
drawings:
  persist: false
mdc: true
clicks: 0
preload: false
glowSeed: 229
routerMode: hash
---

Taming Dependency Chaos for LLM in K8S

DaoCloud Fanshi Zhang, Kebe Liu, Peter Pan
---
layout: intro
class: px-24
glowSeed: 205
---
Kubernetes
---
layout: intro
class: px-35
glowSeed: 205
---
Peter Pan
Software Engineering VP
panpan0000
Kebe Liu
Senior Software Engineer
kebe7jun
Fanshi Zhang
Senior Software Engineer
nekomeowww
---
class: py-10
glowSeed: 100
---

# Challenges Across LLM Lifecycle

From environment setup to production deployment
Dependency Hell
Dependency install overhead
Python/Node.js installs frequently fail after long waits
CUDA version drift
Incompatible versions across environments
Dependency lifecycle consistency
From development to training to inference
Tool fragmentation
pip / uv / conda / nix / pixi
Data Preparation
Unattended dataset/model preparation
Time-consuming & error-prone processes
Disparate sources
HuggingFace / S3 / NFS / Web
Data Governance
Sharing artifacts
Across teams and Kubernetes namespaces
Version control & Reproducibility
Tracking model & environment versions
LLM projects face unique infrastructure challenges beyond traditional ML
---
layout: center
class: text-center
---

# "It Works On My Machine"™

The ML Engineer's Lament
How many times have you seen this?
```bash
$ python train.py
ImportError: libcudart.so.11.0: cannot open shared object file
$ pip install torch --index-url https://download.pytorch.org/whl/cu118
RuntimeError: CUDA error: no kernel image is available for execution
$ ldd $(which python3) | grep 'not found'
libstdc++.so.6 => not found
```
---
class: py-10
glowSeed: 175
---

# Development vs Training: The Environment Gap

Bridging the divide between model development and production training
The Common Pattern
Development
Preparing new model training datasets
Training
Fine-tuning workloads with the transformers library
Inference
Inference with vLLM and transformers
pip install -r requirements.txt
python train.py
Dependency drift
Repeated downloads
No lockfile tracking
Inconsistent versions
The Dataset Solution
Single Environment, Multiple Contexts
Define once, use everywhere
Tracked dependencies with lockfiles
Automatic dependency resolution
Automatic Tool Integration
Jupyter
VSCode
No configuration needed: just click and use!
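The idea in CRD form, roughly: declare the environment once as a Dataset and mount it from every context. A minimal sketch with an illustrative name and a trimmed field set (the full CRD appears later in this deck):

```yaml
# Minimal "define once, use everywhere" sketch; illustrative field subset
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: team-pytorch-env   # hypothetical name
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      pipRequirementsTxt: |-
        torch==2.1.0
        transformers==4.36.0
```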
---
clicks: 3
---

# When Python Meets C++

Dependency Hell Emerges

The Perfect Storm: When Python Code Meets C++ Underpinnings

ML libraries are just thin Python wrappers around massive C++ and CUDA codebases

---
class: py-10
glowSeed: 123
---

# The Silent Saboteurs

Hidden issues that break ML pipelines
ABI Incompatibility
CUDA Version Conflicts
System Library Conflicts
Package Inconsistencies
Real World Impact
Hours wasted reinstalling CUDA
Inconsistent model results
Broken production deployments
---
class: py-10
clicks: 4
glowSeed: 180
---

# The Hidden Iceberg: What pip Can't See

The deceptive simplicity of Python dependencies
What you think you're installing:
torch==2.1.0 transformers==4.36.0 accelerate==0.25.0
But actually...
CUDA 11.8
gcc 9.4.0
cmake 3.22.1
libnccl2
libcudnn8
cuda-cupti-dev
libstdc++.so.6
libopenblas.so
libpython3.10.so
libcublas.so
...and dozens more
Python Package Managers
Handle Python dependencies well
Blind to underlying C++ libraries
Cannot handle compiler compatibility
The Reality
Modern ML libraries are just thin Python wrappers around massive C++ and CUDA codebases
Dependency Complexity
PyTorch source: 1.8M+ lines of C++
Python wrapper: ~100K lines of Python
Binary size: 1.7GB+ with CUDA
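One way to surface the hidden part of the iceberg is to pin the native layer alongside the Python layer, for example in a conda environment.yml. A sketch, assuming the nvidia and conda-forge channels; the exact pins are illustrative:

```yaml
# environment.yml: pin the C++/CUDA layer explicitly, not just the Python layer
name: torch-cu118
channels:
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - cuda-version=11.8   # CUDA runtime the wheels were built against
  - cudnn               # cuDNN, normally invisible to pip
  - nccl                # multi-GPU communication library
  - libstdcxx-ng        # C++ runtime providing the GLIBCXX symbols
  - pip
  - pip:
      - torch==2.1.0
      - transformers==4.36.0
```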
---
layout: center
---

# CUDA Conundrum: The Version Wars
Version 11.6: Legacy Model
Required by older frameworks

Version 11.8: PyTorch's Choice
Optimized for current models

Version 12.1: System Default
Newest features, compatibility issues

CUDA Complexity

  • Driver vs Runtime version mismatch
  • cuDNN compatibility matrix
  • NCCL version requirements

The Silent Killer

Often fails with cryptic errors or, worse, silent numerical errors in your models

---
class: py-10
clicks: 6
glow: right
---

# Compiler Chaos: When gcc Versions Wage War

The battlefield of binary compatibility
GCC Version Matrix
(table: PyTorch releases mapped to the gcc versions each was built with)
Binary Incompatibility
ImportError: /lib64/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found
undefined symbol: _ZN3c10...
C++ ABI Changes
String Implementation
// GCC 4.x (copy-on-write): a single pointer into a shared, ref-counted buffer
struct string { char* _M_p; /* length & refcount live in a hidden _Rep header */ };
// GCC 5.x+ (small-string optimization): pointer + length + in-object buffer
struct string { char* _M_p; size_t _M_string_length; union { char _M_local_buf[16]; size_t _M_allocated_capacity; }; };
Memory Layout Mismatch
The Developer Experience: "It works differently everywhere!"
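If you compile extensions yourself, the toolchain can be pinned the same way the libraries are, so every machine links against the same libstdc++. A conda sketch; the gxx_linux-64 version pin is illustrative:

```yaml
# environment.yml: pin the compiler so extensions build against a known ABI
name: ext-build
channels:
  - conda-forge
dependencies:
  - python=3.10
  - gxx_linux-64=11.*   # C++ compiler matched to the environment's libstdc++
  - cmake
  - ninja
```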
---
class: py-10
clicks: 7
glowSeed: 350
---

# Why Reusable Environments Matter

From hours of frustration to seconds of mounting
Without Reusable Environments
4-6 Hours
Per developer, per environment setup
Manual CUDA installation
System library conflicts
Disk space duplication
With Reusable Environments
30 Seconds
Just mount the shared environment
Pre-built environments
Consistent across team
Efficient storage usage
---
class: py-10
clicks: 5
glow: left
---

# The Usual Suspects: Tools We've Tried

What works, what doesn't, and why
pip & uv
Fast for Python packages
Blind to C++/CUDA deps
No system library management
Version conflicts common
Docker
Reproducible environments
Massive image sizes (5-10GB)
Slow build times (30+ min)
Resource intensive
Nix
Complete reproducibility
PhD-level learning curve
Complex configuration
K8s integration challenges
What We Need
Python package management
C++/CUDA awareness
Storage efficiency
Fast setup times
K8s native
Team consistency
---
glowSeed: 12129
---

# Introducing Datasets

Python + C++ Harmony in K8S

One solution to rule them all

Python + C++ + CUDA harmony in Kubernetes

---

# Dataset CRD

One CRD to Rule Them All
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      packageManager: CONDA
      pythonVersion: 3.11.9
      condaEnvironmentYml: |-
        channels: ['nvidia', 'conda-forge']
        dependencies:
          - 'cuda'
          - 'cuda-libraries-dev'
          - 'cuda-nvcc'
          - 'cuda-nvtx'
          - 'cuda-cupti'
      pipRequirementsTxt: |-
        transformers==4.35.0
        torch
        torchaudio
        torchvision
```
Key Features
  • Multi-source support (conda, huggingface)
  • Self-contained enterprise model hub
  • Pre-loaded datasets and models
  • Install once, use everywhere
  • Secure credential management
---
class: py-10
glowSeed: 125
---

# Datasets vs Docker: Flexibility Matters

Why writable persistent environments win for data science
Docker Approach
# Need to add a dependency? Rebuild the entire image
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# Immutable after build - can't easily modify
30+ minutes to rebuild for one new package
Read-only runtime limits dynamic ML tools
One container = one environment
Dataset CRD Approach
# Mount pre-built environments as needed
volumes:
  - name: pytorch-env
    persistentVolumeClaim:
      claimName: pytorch-2.1-env
  # Need another env? Just mount another PVC
  - name: pytorch-nightly-env
    persistentVolumeClaim:
      claimName: pytorch-nightly-env
Add packages on-the-fly in seconds
Writeable PVCs support all ML workflows
Switch multiple environments simultaneously
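Put together, a workload consumes such an environment as an ordinary mount. A hedged sketch; the Pod name, image, and mount path are illustrative, and the claim name mirrors the snippet above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job               # hypothetical name
spec:
  containers:
    - name: trainer
      image: python:3.11-slim   # any base image; the env comes from the mount
      command: ["/envs/pytorch/bin/python", "train.py"]
      volumeMounts:
        - name: pytorch-env
          mountPath: /envs/pytorch   # illustrative mount path
  volumes:
    - name: pytorch-env
      persistentVolumeClaim:
        claimName: pytorch-2.1-env   # pre-built environment PVC from above
```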
---
class: py-10
glowSeed: 182
clicks: 4
---

# How does it work?

Looking under the hood
Controller Architecture
(diagram: the controller's reconcile steps, from Dataset spec to a mounted, ready-to-use PVC)
---
class: py-4
glowSeed: 310
---
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.11.9
    options:
      packageManager: CONDA
      pythonVersion: 3.11.9
      condaEnvironmentYml: |-
        channels: ['nvidia', 'conda-forge']
        dependencies:
          - 'cuda'
          - 'cuda-libraries-dev'
          - 'cuda-nvcc'
          - 'cuda-nvtx'
          - 'cuda-cupti'
      pipRequirementsTxt: |-
        transformers==4.35.0
        torch
        torchaudio
        torchvision
```
Python Environment Management
From dependency chaos to environment harmony
Conda
Full environment control
CUDA integration
C++ binary packages
pip
Familiar requirements.txt
PyPI packages
Private indexes
Pixi
Fast parallel installs
Rust-powered speed
Lockfile support
Mamba
10x faster than conda
Parallel downloads
Conda-compatible
---
class: py-10
glowSeed: 150
---

# Intelligent Dependency Approach
Optimizing the unbearable heaviness of builds
1: Fetching
Source packages & archives
Mirror for Conda & pip
Auto merge config & requirements.txt
2: Install & Build
Compiled binaries & wheels
Existing cache used
No duplicated installation
3: Persist & Activate
Environment configs
Auto discovery for Notebooks
Auto activate
Traditional Approach
CUDA setup: 45-60 min
PyTorch install: 20-30 min
With Datasets
First setup: 10-15 min
Subsequent use: seconds
---
class: py-4
glowSeed: 275
---
```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: qwen3-32b
spec:
  dataSyncRound: 1
  secretRef: dataset-hf-qwen3-32b-secret
  source:
    options:
      endpoint: https://hf-mirror.com
      repoType: MODEL
    type: HUGGING_FACE
    uri: huggingface://Qwen/Qwen3-32B
  volumeClaimTemplate:
    metadata: {}
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: '0'
      storageClassName: juicefs-no-share-sc
    status: {}
```
HuggingFace & ModelScope
Models, datasets, all in one
Smart Filtering
Include/exclude patterns
Skip redundant files
options:
  exclude: "*.bin"
Advanced Features
Mirroring Support
Configurable endpoints
Regional mirrors
endpoint: https://hf-mirror.com
Token Authentication
Secure token management
Private repo access
secretRef: hf-token-secret
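The referenced secret itself is a plain Kubernetes Secret; a sketch, where the key name `token` is an assumption about what the controller expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxxxxx   # placeholder HuggingFace access token
```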
---
class: py-10
glowSeed: 215
---

# How Many Sources?

Flexible multi-source data integration
ML Model Repositories
HuggingFace
Models, datasets, spaces
ModelScope
Alibaba AI models
Environment & Packages
Conda
Environment management
Pixi
Environment management
PyPI / pip
Python packages
Storage & Version Control
Git Repositories
Code and configurations
S3-compatible
Cloud storage
Local Volumes
On-prem storage
Unified Data Access Layer
One consistent API to access all your AI assets
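In the spec, that unified layer is a single `uri` field with per-source schemes. A sketch: only the conda://, huggingface://, and dataset:// forms appear in this deck; the git and S3 forms below are illustrative assumptions:

```yaml
# One field, many sources (commented values are alternatives)
source:
  uri: conda://python?version=3.11.9                # environment definition
# uri: huggingface://Qwen/Qwen3-32B                 # model weights
# uri: dataset://ml-platform/llama3-70b-foundation  # reference to a shared Dataset
# uri: git://github.com/org/repo.git                # assumption: code & configs
# uri: s3://bucket/path                             # assumption: object storage
```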
---
class: py-10
glowSeed: 185
---

# The Real Problem: Collaboration at Scale

When every team becomes an island
The Isolation Problem
Team A
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB
Team B
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB
Same model, same env
Downloaded twice, stored twice!
The Sharing Solution
Shared Dataset
llama-3-70b-instruct
PyTorch 2.1 + CUDA 11.8
Storage: 160GB (once!)
Team A → uses reference
Team B → uses reference
Enterprise Impact
10 teams × 160GB model = 1.6TB duplicated → 160GB shared
---

# Cross-Namespace Dataset Sharing
```yaml {7-9}
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: llama3-foundation-ref
  namespace: nlp-team
spec:
  source:
    type: REFERENCE
    uri: dataset://ml-platform/llama3-70b-foundation
```

```yaml {10-11}
---
apiVersion: v1
kind: Pod
metadata:
  name: fine-tuning-job
spec:
  ...
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: llama3-foundation-ref # Auto-created PVC
```
How It Works
  • Reference points to shared dataset
  • Controller auto-creates local PVC & PV
  • No data duplication
  • Instant access to models
---
class: py-10
clicks: 3
glow: bottom
---

# Enterprise Model Hub in Minutes

From fragmented assets to unified ecosystem
Model Management
Version control
Metadata extraction
Ready Before You Are
Deployment Timeline
Traditional: Setup (30m) → Download Weights (6h) → Test running (30m)
With Datasets: Mount (30s) ⚡️
Time Saved: 95%
From isolated silos to unified model ecosystem
Enabling seamless collaboration across data science teams
---
class: py-10
clicks: 3
glowSeed: 338
---

# Pixi Integration: The Next Evolution

Supercharging environment creation
Conda
Minutes to hours for environment setup
Sequential dependency resolution
Incompatible ABI issues
$ conda create -n myenv python=3.12
$ conda activate myenv
$ conda install cuda -c nvidia
$ pip install torch
# Waiting... a long time... up to hours
Average setup time: 60+ minutes
Pixi.sh
Seconds to minutes for complete setup
Parallel processing of dependencies
Precompiled binaries with correct ABI
$ pixi init
$ pixi project channel add nvidia
$ pixi add cuda python=3.12
$ pixi add --pypi torch
# Ready in seconds
Average setup time: 4-5 minutes
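Wiring Pixi into the same Dataset CRD could look like this. A hedged sketch assuming the controller accepts `packageManager: PIXI`; the deck names Pixi as a supported manager, but the exact field value here is an assumption:

```yaml
apiVersion: dataset.baizeai.io/v1alpha1
kind: Dataset
metadata:
  name: pytorch-env-pixi      # hypothetical name
spec:
  source:
    type: CONDA
    uri: conda://python?version=3.12
    options:
      packageManager: PIXI    # assumption: Pixi selected instead of CONDA
      pythonVersion: "3.12"
      pipRequirementsTxt: |-
        torch
```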
---
class: py-10
clicks: 2
glowSeed: 338
---

# Pixi Integration: How Fast?

100× difference in environment setup times!
Setup Time Comparison (smaller is better)
Conda: 45+ min (~100× slower), sequential installation
Pixi: ~30 sec (fastest), parallel processing
---
class: py-10
glowSeed: 250
---

# Metrics in Summary

The measurable benefits of Datasets
Setup time
5-10× faster
With shared environments
From hours to minutes
Storage saved
90%
Using JuiceFS dedup
10GB → 1GB typical savings
Time saved
75%
No more environment setup
Instant environment activation
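The dedup figure comes from the storage layer underneath the PVCs. With the JuiceFS CSI driver, the backing StorageClass is ordinary Kubernetes config; a sketch with illustrative names (`juicefs-no-share-sc` earlier in this deck is such a class):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: juicefs-sc              # illustrative name
provisioner: csi.juicefs.com    # JuiceFS CSI driver
parameters:
  csi.storage.k8s.io/provisioner-secret-name: juicefs-secret      # illustrative
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
reclaimPolicy: Retain
```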
---
class: py-10
---

# Let's Build It Together

Already open sourced
BaizeAI/dataset
---
class: py-10
---
Thank you
Slides open sourced at
Slides built on top of Slidev