---
title: PD Disaggregation
---

# PD Disaggregation

Prefill-Decode (PD) disaggregation separates the two phases of LLM inference onto specialized workers, optimizing Time to First Token (TTFT) and Time Per Output Token (TPOT) independently.

---

## Overview

### :material-lightning-bolt: Optimized TTFT

Dedicated prefill workers process prompts with maximum throughput.

### :material-speedometer: Optimized TPOT

Dedicated decode workers generate tokens with minimal latency.

### :material-arrow-expand-all: Independent Scaling

Scale prefill and decode workers based on their specific resource needs.

### :material-memory: KV Cache Transfer

Automatic coordination of KV cache transfer between worker types.

---

## Why Disaggregate?

Traditional LLM inference has two distinct phases with different characteristics:

| Phase | Compute Pattern | Bottleneck | Optimization |
|-------|-----------------|------------|--------------|
| **Prefill** | Compute-bound, parallel | GPU compute | Batch similar-length prompts |
| **Decode** | Memory-bound, sequential | Memory bandwidth | Maximize batch size |

Running both phases on the same worker creates inefficiencies:

- Prefill batches are delayed waiting for decode slots
- Decode batches are small due to prefill memory pressure
- Neither phase is optimally configured

**PD disaggregation solves this** by dedicating workers to each phase.

---

## Supported Runtimes

SMG supports PD disaggregation with two inference backends:

| Runtime | Protocol | Dispatch | KV Transfer | Best For |
|---------|----------|----------|-------------|----------|
| **vLLM** | gRPC | Sequential | NIXL or Mooncake | High-performance with RDMA/TCP networking |
| **SGLang** | HTTP | Parallel | Bootstrap-based coordination | Production deployments with SGLang |

### vLLM KV Transfer Backends

vLLM supports two backends for KV cache transfer:

| Backend | Transport | Configuration | Best For |
|---------|-----------|---------------|----------|
| **NIXL** | RDMA | `VLLM_NIXL_SIDE_CHANNEL_PORT` env var | High-bandwidth RDMA networks |
| **Mooncake** | TCP/RDMA | `VLLM_MOONCAKE_BOOTSTRAP_PORT` env var per prefill worker | Flexible deployment, TCP fallback |

---

## How It Works

### SGLang PD (Parallel Dispatch)

![SGLang PD Parallel Dispatch Sequence](../../assets/images/pd-sglang.svg)

SGLang uses **parallel dispatch** with bootstrap-based coordination:

1. SMG sends the request to both prefill and decode workers simultaneously
2. Metadata (bootstrap host/port) enables workers to coordinate
3. Prefill completes and transfers KV cache to decode
4. Decode streams tokens back to client

### vLLM PD (Sequential Dispatch)

![vLLM PD Sequential Dispatch Sequence](../../assets/images/pd-vllm.svg)

vLLM uses **sequential dispatch** with NIXL or Mooncake KV transfer:

1. SMG sends the request to a prefill worker with `max_tokens=1` (tagged with `do_remote_decode=true` for NIXL)
2. Prefill computes the KV cache and returns its KV handoff params (output tokens are discarded)
3. SMG sends the original request to a decode worker with the relayed `kv_transfer_params`
4. The decode worker pulls the KV cache from prefill (NIXL RDMA / Mooncake)
5. Decode streams tokens back to the client

### Request Flow

1. **Request arrives** at SMG gateway
2. **Find P/D pair**: Select a prefill worker and decode worker
3. **Prefill phase**: Prefill worker processes the prompt
4. **KV transfer**: KV cache is transferred to decode worker
5. **Decode phase**: Decode worker generates tokens
6. **Stream response**: Tokens are streamed back to client

---

## Configuration

### SGLang PD Setup

SGLang workers use HTTP and require a bootstrap port for coordination:

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8000 \
  --decode http://decode2:8000
```

### vLLM PD Setup

vLLM workers use gRPC, with NIXL or Mooncake for KV transfer. With NIXL, no bootstrap port is needed; with Mooncake, append each prefill worker's bootstrap port to its `--prefill` entry (see the Mooncake example under Complete Example):

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill1:50051 \
  --prefill grpc://prefill2:50052 \
  --decode grpc://decode1:50053 \
  --decode grpc://decode2:50054 \
  --model-path /path/to/model
```

!!! note "Model Path Required"
    The `--model-path` parameter is required for vLLM PD mode to load the tokenizer for request processing.

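
As a rough illustration of the two `--prefill` forms used on this page (a URL alone, or a URL followed by a bootstrap port), the sketch below parses such entries and reports bootstrap ports that are used more than once. `PrefillSpec`, `parse_prefill`, and `duplicate_ports` are hypothetical names for this example and are not part of SMG; the accepted schemes are assumed from the `http://` and `grpc://` URLs shown here.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class PrefillSpec:
    """One --prefill entry: worker URL plus optional bootstrap port (hypothetical type)."""
    url: str
    bootstrap_port: Optional[int] = None


def parse_prefill(values: list[str]) -> PrefillSpec:
    """Parse the tokens after one --prefill flag, e.g. ["http://prefill1:8000", "9001"]."""
    if not 1 <= len(values) <= 2:
        raise ValueError(f"expected a URL and an optional port, got {values!r}")
    # Assumption: only the schemes that appear in this page's examples are valid.
    if urlparse(values[0]).scheme not in ("http", "grpc"):
        raise ValueError(f"unsupported scheme in {values[0]!r}")
    port = int(values[1]) if len(values) == 2 else None
    return PrefillSpec(values[0], port)


def duplicate_ports(specs: list[PrefillSpec]) -> set[int]:
    """Return every bootstrap port that appears on more than one prefill entry."""
    seen: set[int] = set()
    dupes: set[int] = set()
    for spec in specs:
        if spec.bootstrap_port is None:
            continue  # entries without a bootstrap port (e.g. vLLM + NIXL) are skipped
        if spec.bootstrap_port in seen:
            dupes.add(spec.bootstrap_port)
        seen.add(spec.bootstrap_port)
    return dupes
```

A check like `duplicate_ports` reflects the Mooncake requirement that each prefill worker use a unique bootstrap port; whether SMG performs such validation itself is not stated here.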
### Parameters

| Parameter | Description |
|-----------|-------------|
| `--pd-disaggregation` | Enable PD disaggregated mode |
| `--prefill` | Prefill worker URL, optionally followed by a bootstrap port (SGLang, or vLLM with Mooncake); repeat per worker |
| `--decode` | Decode worker URL; repeat per worker |
| `--prefill-policy` | Routing policy for prefill workers |
| `--decode-policy` | Routing policy for decode workers |

### Per-Phase Policies

Configure different routing policies for each phase:

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 \
  --prefill http://prefill2:8000 \
  --decode http://decode1:8000 \
  --decode http://decode2:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

### Supported Policies

Both prefill and decode support these policies:

| Policy | Prefill Use Case | Decode Use Case |
|--------|------------------|-----------------|
| `cache_aware` | Maximize prompt cache hits | Less beneficial |
| `power_of_two` | Balance prefill load | Balance decode load |
| `round_robin` | Even distribution | Even distribution |
| `random` | Simple distribution | Simple distribution |

**Recommended**: `cache_aware` for prefill, `power_of_two` for decode.

---

## Kubernetes Service Discovery

Use label selectors to automatically discover prefill and decode workers.

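
Kubernetes equality-based label selectors match a pod only when every `key=value` pair in the selector is present in the pod's labels. The sketch below illustrates that matching rule for sorting pods into prefill and decode roles. `parse_selector`, `matches`, and `classify` are hypothetical helpers for illustration and do not reflect SMG's internal discovery code; the label values are the ones used in the examples on this page.

```python
from typing import Optional


def parse_selector(tokens: list[str]) -> dict[str, str]:
    """Turn tokens such as ["app=sglang", "role=prefill"] into a label dictionary."""
    selector: dict[str, str] = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        selector[key] = value
    return selector


def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """A pod matches when it carries every label in the selector with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def classify(labels: dict[str, str],
             prefill_selector: dict[str, str],
             decode_selector: dict[str, str]) -> Optional[str]:
    """Return "prefill", "decode", or None for a pod with the given labels."""
    if matches(prefill_selector, labels):
        return "prefill"
    if matches(decode_selector, labels):
        return "decode"
    return None


prefill_sel = parse_selector(["app=sglang", "role=prefill"])
decode_sel = parse_selector(["app=sglang", "role=decode"])
pod_labels = {"app": "sglang", "role": "decode", "pod-template-hash": "abc123"}
print(classify(pod_labels, prefill_sel, decode_sel))
```

Extra labels on the pod do not prevent a match; only the pairs named in the selector are compared.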
### Configuration

```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector app=sglang role=prefill \
  --decode-selector app=sglang role=decode \
  --service-discovery-namespace inference
```

### Worker Deployments

```yaml
# Prefill workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-prefill
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
      role: prefill
  template:
    metadata:
      labels:
        app: sglang
        role: prefill
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --prefill-only
          resources:
            limits:
              nvidia.com/gpu: 4
---
# Decode workers
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sglang-decode
  namespace: inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: sglang
      role: decode
  template:
    metadata:
      labels:
        app: sglang
        role: decode
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:latest
          args:
            - --model-path=meta-llama/Llama-3.1-70B-Instruct
            - --port=8000
            - --decode-only
          resources:
            limits:
              nvidia.com/gpu: 2
```

---

## P/D Pair Selection

SMG maintains awareness of which prefill and decode workers can communicate.

### Pairing Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| **Any-to-Any** | Any prefill can send to any decode | Network with uniform latency |
| **Affinity** | Prefer co-located pairs | Reduce KV transfer latency |
| **Load-Based** | Select least loaded pair | Maximize throughput |

### KV Cache Transfer

The KV cache is transferred between workers using the backend's native mechanism:

| Backend | Transfer Method | Coordination |
|---------|-----------------|--------------|
| vLLM + NIXL | RDMA | `kv_transfer_params` relayed by SMG |
| vLLM + Mooncake | TCP/RDMA | P2P handshake via the prefill worker's bootstrap host/port |
| SGLang | NCCL/Gloo over network | Bootstrap metadata (host/port/room) |

**vLLM**: SMG uses the standard proxy pattern: it sends the request to prefill with `max_tokens=1` to trigger KV cache computation, then relays the engine's KV-transfer metadata with the decode request:

- **NIXL**: SMG tags the prefill request with `do_remote_decode=true`; the engine holds its KV blocks and returns handoff params (e.g. `remote_engine_id`, `remote_request_id`, `remote_block_ids`, `remote_host`/`remote_port`, `tp_size`). SMG forwards these verbatim with the decode request, and the decode worker pulls the blocks over RDMA
- **Mooncake**: SMG injects the prefill worker's bootstrap host/port; workers coordinate via P2P handshake (no external metadata server required)

**SGLang**: SMG injects bootstrap metadata (`DisaggregatedParams`) into requests, enabling workers to coordinate KV transfer through a shared "room".

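
The vLLM NIXL flow above can be sketched as two request-building steps. This is a simplified illustration, not SMG's implementation: requests are plain dictionaries, `fake_prefill_engine` is a stand-in that returns made-up handoff values, and all function names are hypothetical. Only the `max_tokens=1` override, the `do_remote_decode` tag, and the verbatim relay of `kv_transfer_params` come from the description above.

```python
import copy


def build_prefill_request(original: dict) -> dict:
    """Copy the client request, cap it at one token, and tag it for remote decode."""
    request = copy.deepcopy(original)
    request["max_tokens"] = 1  # prefill only needs to populate the KV cache
    request["kv_transfer_params"] = {"do_remote_decode": True}
    return request


def build_decode_request(original: dict, prefill_response: dict) -> dict:
    """Copy the client request and attach the prefill engine's handoff params unchanged."""
    request = copy.deepcopy(original)
    request["kv_transfer_params"] = prefill_response["kv_transfer_params"]
    return request


def fake_prefill_engine(request: dict) -> dict:
    """Stand-in for a prefill worker; the returned values are invented for this example."""
    assert request["kv_transfer_params"]["do_remote_decode"] is True
    return {
        "text": "ignored",  # the single prefill output token is discarded by the router
        "kv_transfer_params": {
            "remote_engine_id": "engine-0",
            "remote_request_id": "req-123",
            "remote_block_ids": [1, 2, 3],
            "remote_host": "prefill-0",
            "remote_port": 5600,
        },
    }


client_request = {"prompt": "Explain PD disaggregation.", "max_tokens": 128}
prefill_response = fake_prefill_engine(build_prefill_request(client_request))
decode_request = build_decode_request(client_request, prefill_response)
print(decode_request["max_tokens"], decode_request["kv_transfer_params"]["remote_host"])
```

The decode request keeps the client's original `max_tokens`; only the prefill copy is capped, so the client request itself is never mutated.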
---

## Sizing Guidelines

### Prefill Workers

Prefill is **compute-bound**:

- More GPUs per worker = faster prefill
- Fewer workers with more GPUs is often better
- Size for your longest prompts

| Prompt Length | Recommended GPUs |
|---------------|------------------|
| < 4K tokens | 1-2 GPUs |
| 4K - 16K tokens | 2-4 GPUs |
| 16K - 64K tokens | 4-8 GPUs |
| > 64K tokens | 8+ GPUs |

### Decode Workers

Decode is **memory-bandwidth-bound**:

- More workers = higher throughput
- Smaller workers can batch more requests
- Size for your target concurrency

| Concurrent Users | Recommended Setup |
|------------------|-------------------|
| < 50 | 2 decode workers |
| 50 - 200 | 4 decode workers |
| 200 - 500 | 8 decode workers |
| > 500 | 16+ decode workers |

### Ratio Guidelines

| Workload Type | Prefill:Decode Ratio |
|---------------|----------------------|
| Short prompts, long outputs | 1:4 |
| Balanced prompts/outputs | 1:2 |
| Long prompts, short outputs | 1:1 or 2:1 |
| RAG with large context | 2:1 |

---

## Monitoring

### Metrics

| Metric | Description |
|--------|-------------|
| `smg_worker_requests_active` | Active requests per worker (label: `worker`) |
| `smg_worker_selection_total` | Worker selection count (labels: `worker_type`, `connection_mode`, `model`, `policy`) |
| `smg_router_request_duration_seconds` | End-to-end request duration |
| `smg_router_ttft_seconds` | Time to first token (gRPC/vLLM only) |

### Key Performance Indicators

| KPI | Target | Indicates |
|-----|--------|-----------|
| TTFT | < 500ms | Prefill performance |
| TPOT | < 50ms | Decode performance |
| KV transfer time | < 100ms | Network performance |

### PromQL Queries

#### Active Requests by Worker

```promql
# Active requests per worker
smg_worker_requests_active
```

#### Time to First Token (gRPC/vLLM only)

```promql
# Average TTFT
# gRPC/vLLM streaming path only; not available for SGLang HTTP PD
rate(smg_router_ttft_seconds_sum[5m]) / rate(smg_router_ttft_seconds_count[5m])
```

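
The `rate(..._sum) / rate(..._count)` pattern divides the growth of the histogram's sum by the growth of its count over the same window, which gives the mean observation in that window. A small worked sketch of that arithmetic follows, ignoring counter resets and Prometheus's extrapolation at window edges. The sample values and the `mean_over_window` helper are made up for illustration.

```python
from typing import Optional


def mean_over_window(sum_start: float, sum_end: float,
                     count_start: int, count_end: int) -> Optional[float]:
    """Mean observation between two scrapes of a histogram's _sum and _count series."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no requests finished in the window, so the mean is undefined
    return (sum_end - sum_start) / delta_count


# Hypothetical samples taken 5 minutes apart:
# _sum grew from 120 s to 180 s and _count grew from 1000 to 1200 requests.
average_ttft = mean_over_window(120.0, 180.0, 1000, 1200)
print(f"average TTFT: {average_ttft * 1000:.0f} ms")  # 60 s / 200 requests -> 300 ms
```

Because both rates are taken over the same window, the window length cancels out of the ratio.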
---

## Troubleshooting

| Symptom | Cause | Solution |
|---------|-------|----------|
| High TTFT | Prefill workers overloaded | Add prefill workers or GPUs |
| High TPOT | Decode workers overloaded | Add decode workers |
| KV transfer timeout | Network congestion | Check network bandwidth |
| Uneven load | Poor pairing | Adjust routing policy |
| Decode queue buildup | Prefill too fast | Balance P:D ratio |

### Debug Logging

```bash
RUST_LOG=smg::pd=debug smg --pd-disaggregation ...
```

### Verify Configuration

```bash
# Check discovered workers
curl http://smg:3001/workers | jq

# Check worker roles
curl http://smg:3001/workers | jq '.[] | {url, role}'
```

---

## Complete Example

### SGLang PD (Kubernetes Service Discovery)

```bash
smg \
  --service-discovery \
  --pd-disaggregation \
  --prefill-selector app=sglang role=prefill \
  --decode-selector app=sglang role=decode \
  --prefill-policy cache_aware \
  --decode-policy power_of_two \
  --cb-failure-threshold 3 \
  --health-check-interval-secs 10 \
  --host 0.0.0.0 \
  --port 8000
```

### SGLang PD (Static Workers)

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill-0:8000 9001 \
  --prefill http://prefill-1:8000 9002 \
  --decode http://decode-0:8000 \
  --decode http://decode-1:8000 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

### vLLM PD (Static Workers with NIXL)

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 \
  --prefill grpc://prefill-1:50052 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```

### vLLM PD (Static Workers with Mooncake)

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill-0:50051 8998 \
  --prefill grpc://prefill-1:50052 8999 \
  --decode grpc://decode-0:50053 \
  --decode grpc://decode-1:50054 \
  --model-path /path/to/model \
  --prefill-policy cache_aware \
  --decode-policy round_robin
```

Note: For Mooncake, each prefill worker needs a unique bootstrap
port (8998, 8999, etc.) passed to SMG. Workers should be started with Mooncake configuration:

```bash
# Launch workers using helper script (auto-assigns unique bootstrap ports)
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```

### Launching vLLM PD Workers

#### Option 1: NIXL Backend (RDMA)

```bash
# Prefill worker (kv_producer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer)
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

#### Option 2: Mooncake Backend (TCP/RDMA)

Mooncake requires each prefill worker to have a unique bootstrap port for P2P coordination:

```bash
# Prefill worker 1 (kv_producer) - bootstrap port 8998
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Prefill worker 2 (kv_producer) - bootstrap port 8999 (must be unique!)
VLLM_MOONCAKE_BOOTSTRAP_PORT=8999 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker (kv_consumer) - no bootstrap port needed
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50053 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

Mooncake environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| `VLLM_MOONCAKE_BOOTSTRAP_PORT` | Bootstrap port for prefill workers (must be unique per worker) | Required for prefill |
| `MOONCAKE_PROTOCOL` | Transport protocol: `tcp` or `rdma` | `tcp` |
| `MOONCAKE_DEVICE` | RDMA device name (for RDMA protocol) | `""` |

#### Helper Script

Use the provided helper script to launch workers with either backend:

```bash
# NIXL backend (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake backend
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake with custom bootstrap port
KV_BACKEND=mooncake MOONCAKE_BOOTSTRAP_PORT=9000 \
  ./scripts/launch-pd-workers.sh vllm /path/to/model
```

---

## What's Next?

### :material-kubernetes: Service Discovery

Automatic worker discovery in Kubernetes.

[Service Discovery →](../architecture/service-discovery.md)

### :material-cached: Cache-Aware Routing

Optimize prefill with cache-aware routing.

[Cache-Aware Routing →](cache-aware.md)

### :material-scale-balance: Load Balancing

All available routing policies.

[Load Balancing →](load-balancing.md)