--- title: Load Balancing --- # Load Balancing SMG provides multiple load balancing policies to distribute requests across workers. Choosing the right policy depends on your workload characteristics. --- ## Overview
### :material-cached: Cache-Aware **Production default.** Maintains radix tree mirroring backend KV cache for optimal prefix routing with load balancing fallback.
### :material-tray-full: Bucket Request-length-based routing with adaptive boundaries. Designed for PD disaggregation workloads.
### :material-scale-balance: Power of Two Load-aware selection without global state. Samples two workers, routes to the lighter one.
### :material-link-variant: Consistent Hashing Header-based routing with minimal redistribution on scaling. Ideal for session affinity.
--- ## Policy Comparison | Policy | Load Aware | Cache Affinity | Session Affinity | Complexity | Best For | |--------|:----------:|:--------------:|:----------------:|:----------:|----------| | `cache_aware` | :material-check: | :material-check: | :material-close: | O(prefix) | **Production LLM** | | `bucket` | :material-check: | :material-close: | :material-close: | O(n) | PD disaggregation | | `power_of_two` | :material-check: | :material-close: | :material-close: | O(1) | Load balancing | | `consistent_hashing` | :material-close: | :material-close: | :material-check: | O(log n) | Session affinity | | `prefix_hash` | :material-check: | Partial | :material-close: | O(log n) | Lightweight caching | | `manual` | :material-close: | :material-close: | :material-check: | O(1) | Stateful chat | | `round_robin` | :material-close: | :material-close: | :material-close: | O(1) | Even distribution | | `random` | :material-close: | :material-close: | :material-close: | O(1) | Testing | --- ## Cache-Aware The **recommended policy** for production LLM inference. Maintains a multi-tenant radix tree that mirrors backend KV cache state, enabling perfect cache prediction with integrated load balancing. ```bash smg --policy cache_aware --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Maximizes KV cache hits (60-90% hit rate) - Reduces TTFT by 70-75% - Integrated load balancing fallback - 100% accurate prefix matching
#### :material-close-circle: Limitations - Higher memory usage (radix tree per worker) - O(prefix) selection time - Requires tokenization
**Use when:** Production workloads with repeated prefixes—multi-turn conversations, RAG applications, batch processing with templates. [**Learn more about Cache-Aware Routing →**](cache-aware.md) --- ## Bucket Routes requests based on request text length using adaptive boundaries. Periodically adjusts boundaries based on observed load distribution. ```bash smg --policy bucket --worker-urls http://w1:8000 http://w2:8000 http://w3:8000 ```
#### :material-check-circle: Advantages - Request-length awareness - Adaptive boundary adjustment - Falls back to load balancing when imbalanced
#### :material-close-circle: Limitations - O(n) complexity - No cache locality - Requires understanding of length distribution
**Use when:** PD disaggregation where prefill workers handle different request sizes, or workloads with bimodal request length distribution. --- ## Power of Two Choices Samples two random workers and selects the one with lower load. Provides good load distribution with minimal coordination overhead—a proven algorithm from distributed systems research. ```bash smg --policy power_of_two --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Load-aware without global state - O(1) selection time - Exponentially better than random
#### :material-close-circle: Limitations - No cache locality - Requires load metrics from workers - May not find optimal worker
**Use when:** Heterogeneous workers with varying response times, or when cache locality doesn't matter. --- ## Consistent Hashing Provides header-based consistent routing using a hash ring. Minimizes redistribution when workers scale—only ~1/N keys move when adding/removing workers. ```bash smg --policy consistent_hashing --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Minimal redistribution on scaling - Automatic failover to next healthy worker - O(log n) lookup time
#### :material-close-circle: Limitations - No load awareness - No cache locality - Requires routing key header
### Routing Headers | Header | Description | |--------|-------------| | `X-SMG-Target-Worker` | Direct routing by worker index (0-based) | | `X-SMG-Routing-Key` | Consistent hash routing for session affinity | **Priority order:** `X-SMG-Target-Worker` → `X-SMG-Routing-Key` → Implicit keys (`Authorization`, `X-Forwarded-For`, `Cookie`) → Random fallback **Use when:** Session affinity needed, user-to-worker pinning, or consistent routing for stateful applications. --- ## Prefix Hash A lightweight alternative to full cache-aware routing. Routes requests based on a hash of the first N tokens, using consistent hashing with load factor override. ```bash smg --policy prefix_hash --prefix-token-count 256 --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Predictable O(log n) performance - Lower memory than cache_aware - Groups similar prefixes together
#### :material-close-circle: Limitations - Prefix grouping, not exact matching - Less precise than cache_aware - Load factor can cause redistribution
### Comparison with Cache-Aware | Aspect | prefix_hash | cache_aware | |--------|-------------|-------------| | Lookup | O(log n) | O(prefix_len) | | Memory | O(workers × virtual_nodes) | O(total_tokens) | | Precision | Prefix grouping | Exact matching | **Use when:** Need some cache locality with predictable performance and lower memory footprint. --- ## Manual Provides sticky session routing with explicit routing key mapping. Unlike consistent hashing, sessions stay with their assigned worker even when new workers are added. ```bash smg --policy manual --assignment-mode min_load --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Strong session stickiness - Automatic failover with recovery - TTL-based eviction prevents memory growth
#### :material-close-circle: Limitations - No load balancing for existing sessions - Requires `X-SMG-Routing-Key` header - Memory grows with active sessions
### Assignment Modes | Mode | Description | |------|-------------| | `random` | Randomly select from healthy workers | | `min_load` | Select worker with fewest active requests | | `min_group` | Select worker with fewest routing keys assigned | **Use when:** Stateful chat sessions where context is stored on workers, or when session continuity is critical. --- ## Round Robin Rotates through workers sequentially, guaranteeing even distribution over time. Skips unhealthy workers automatically. ```bash smg --policy round_robin --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Guaranteed even distribution - Predictable routing pattern - Minimal state (counter only)
#### :material-close-circle: Limitations - No load awareness - No cache locality - Ignores request characteristics
**Use when:** All workers have equal capacity and you want predictable, even distribution. --- ## Random The simplest policy—each healthy worker has equal probability of selection. Zero state overhead. ```bash smg --policy random --worker-urls http://w1:8000 http://w2:8000 ```
#### :material-check-circle: Advantages - Zero state overhead - O(1) selection time - Completely stateless
#### :material-close-circle: Limitations - No load awareness - No cache locality - Can create hot spots
**Use when:** Testing environments or completely homogeneous workloads where simplicity is preferred. --- ## Choosing a Policy ### Decision Guide | Requirement | Recommended Policy | |-------------|-------------------| | Production LLM inference | `cache_aware` | | Session affinity (sticky sessions) | `manual` or `consistent_hashing` | | PD disaggregation | `bucket` | | Load balancing without cache | `power_of_two` | | Lightweight cache locality | `prefix_hash` | | Even distribution | `round_robin` | | Testing/development | `random` | ### Scenario Guide
#### :material-message-text: Conversational AI **Recommended:** `cache_aware` Maximizes KV cache reuse for multi-turn conversations with shared system prompts.
#### :material-file-search: RAG Applications **Recommended:** `cache_aware` Exploits common document prefixes for faster Time to First Token.
#### :material-account-group: Multi-Tenant Platform **Recommended:** `consistent_hashing` or `manual` User-to-worker affinity for tenant isolation or stateful sessions.
#### :material-server-network: PD Disaggregation **Recommended:** `bucket` (prefill) + `power_of_two` (decode) Length-based routing for prefill, load-based for decode workers.
--- ## What's Next?
### :material-cached: Cache-Aware Routing Deep dive into the radix tree architecture and routing algorithm. [Cache-Aware Routing →](cache-aware.md)
### :material-shield: Circuit Breakers How SMG handles worker failures gracefully. [Circuit Breakers →](../reliability/circuit-breakers.md)