---
title: PD Disaggregation
---

# PD Disaggregation

Prefill-Decode (PD) disaggregation separates the two phases of LLM inference, prompt processing (prefill) and token generation (decode), onto specialized workers. This optimizes Time to First Token (TTFT) and throughput independently.

#### Before you begin

- Completed the [Getting Started](index.md) guide
- At least one prefill worker and one decode worker
- For vLLM PD: workers started with gRPC entrypoint and KV transfer backend


---

## Why Disaggregate?

| Phase | Compute Pattern | Bottleneck |
|-------|-----------------|------------|
| **Prefill** | Compute-bound, parallel | GPU compute |
| **Decode** | Memory-bound, sequential | Memory bandwidth |

Running both on the same worker creates contention: prefill batches wait for decode slots, and decode batches stay small due to memory pressure. Dedicating workers to each phase removes this conflict.

---

## SGLang PD

SMG sends the request to both prefill and decode workers simultaneously, and they coordinate KV cache transfer through a bootstrap mechanism.

### Start SGLang Workers

```bash
# Prefill worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --prefill-only

# Decode worker
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --port 8001 \
  --decode-only
```

### Start SMG

Each prefill worker needs a bootstrap port for coordination:

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill:8000 9001 \
  --decode http://decode:8001 \
  --host 0.0.0.0 \
  --port 30000
```

### Multiple Workers

```bash
smg \
  --pd-disaggregation \
  --prefill http://prefill1:8000 9001 \
  --prefill http://prefill2:8000 9002 \
  --decode http://decode1:8001 \
  --decode http://decode2:8001 \
  --prefill-policy cache_aware \
  --decode-policy power_of_two
```

---

## vLLM PD

SMG sends to prefill first with `max_tokens=1`, then sends the original request to decode, relaying KV-transfer metadata between the two legs:

- **NIXL**: SMG tags the prefill request with `do_remote_decode=true`, harvests the `kv_transfer_params` the prefill engine returns (engine id, request id, block ids, side-channel address, TP size), and forwards them verbatim with the decode request so decode pulls the KV cache over NIXL.
- **Mooncake**: SMG injects the prefill worker's bootstrap host/port into the decode request.

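The NIXL relay described above can be sketched in a few lines of Python. This is an illustration only, not SMG's implementation: it models requests as plain dictionaries rather than the gRPC messages SMG actually exchanges, the helper names `build_prefill_request` and `build_decode_request` are hypothetical, and the keys inside the sample `kv_transfer_params` are placeholders standing in for the fields listed above.

```python
import copy
from typing import Any


def build_prefill_request(request: dict[str, Any]) -> dict[str, Any]:
    """Leg 1: ask prefill for one token and flag that decode happens remotely."""
    prefill = copy.deepcopy(request)
    prefill["max_tokens"] = 1
    prefill["kv_transfer_params"] = {"do_remote_decode": True}
    return prefill


def build_decode_request(
    request: dict[str, Any], prefill_response: dict[str, Any]
) -> dict[str, Any]:
    """Leg 2: the original request plus prefill's kv_transfer_params, unmodified."""
    params = prefill_response.get("kv_transfer_params")
    if not params:
        # Mirrors the router-side failure mode: nothing to hand to decode.
        raise ValueError("prefill returned no kv_transfer_params")
    decode = copy.deepcopy(request)
    decode["kv_transfer_params"] = copy.deepcopy(params)
    return decode


# Example request; the field layout is an assumption for illustration.
original = {"model": "/path/to/model", "prompt": "Hello!", "max_tokens": 64}
prefill_request = build_prefill_request(original)

# Hypothetical prefill response; key names are placeholders, not vLLM's schema.
prefill_response = {
    "kv_transfer_params": {
        "engine_id": "prefill-0",
        "request_id": "req-1",
        "block_ids": [0, 1, 2],
        "side_channel": "prefill:5600",
        "tp_size": 1,
    }
}
decode_request = build_decode_request(original, prefill_response)
```

The original request is never mutated, so the decode leg keeps the caller's `max_tokens`. The Mooncake path differs: there the decode request carries the prefill worker's bootstrap host/port rather than parameters returned by prefill.
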

### Start vLLM Workers with NIXL

```bash
# Prefill worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode worker
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
```

`VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per worker on the same host (with data parallelism each rank uses `port + dp_rank`). When prefill and decode run on different machines, also set `VLLM_NIXL_SIDE_CHANNEL_HOST` to an address reachable from the decode worker, because prefill embeds this host/port in the handoff params that decode uses to fetch the KV cache.

To verify KV transfer is active, send a request and look for `Transfer plan:` in the decode worker log (vLLM >= 0.20). If the router logs `prefill returned no kv_transfer_params`, upgrade the servicer (smg-grpc-servicer >= 0.5.4, smg-grpc-proto >= 0.4.9) or check the `--kv-transfer-config` on the workers.

### Start SMG

vLLM workers use `grpc://` URLs and require `--model-path` for tokenizer loading:

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model \
  --host 0.0.0.0 \
  --port 30000
```

### Alternative: Mooncake Backend

Mooncake supports TCP transport (no RDMA required).
Each prefill worker needs a unique bootstrap port:

```bash
# Prefill worker
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50051 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

# Decode worker
python -m vllm.entrypoints.grpc_server \
  --model /path/to/model \
  --port 50052 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

Then pass the same bootstrap port to SMG after the prefill URL:

```bash
smg \
  --pd-disaggregation \
  --prefill grpc://prefill:50051 8998 \
  --decode grpc://decode:50052 \
  --model-path /path/to/model
```

### Helper Script

Use the provided script to launch workers with either backend:

```bash
# NIXL (default)
./scripts/launch-pd-workers.sh vllm /path/to/model

# Mooncake
KV_BACKEND=mooncake ./scripts/launch-pd-workers.sh vllm /path/to/model
```

---

## Verify

```bash
# Check workers and their roles
curl http://localhost:30000/workers | jq

# Send a request
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

---

## SGLang vs vLLM PD at a Glance

| | SGLang PD | vLLM PD |
|---|-----------|---------|
| **Protocol** | HTTP | gRPC |
| **Dispatch** | Both workers receive request simultaneously | Prefill first, then decode |
| **KV Transfer** | Bootstrap-based coordination | NIXL (RDMA) or Mooncake (TCP/RDMA) |
| **SMG flags** | `--prefill http://...` + bootstrap port | `--prefill grpc://...` + `--model-path` |

---

## Next Steps

For sizing guidelines, per-phase routing policies, Kubernetes service discovery, and monitoring, see the full [PD Disaggregation Concepts](../concepts/routing/pd-disaggregation.md) page.