---
title: Load Balancing
---
# Load Balancing
SMG provides multiple load balancing policies to distribute requests across workers. Set the policy with `--policy`:
```bash
smg --worker-urls http://w1:8000 http://w2:8000 --policy cache_aware
```
## Before You Begin
- Completed the [Getting Started](index.md) guide
- Two or more workers running
---
## Policy Comparison
| Policy | Load Aware | Cache Affinity | Session Affinity | Best For |
|--------|:----------:|:--------------:|:----------------:|----------|
| `cache_aware` | Yes | Yes | — | **Production LLM** |
| `bucket` | Yes | — | — | PD disaggregation |
| `power_of_two` | Yes | — | — | General load balancing |
| `consistent_hashing` | — | — | Yes | Session affinity |
| `prefix_hash` | Yes | Partial | — | Lightweight caching |
| `manual` | — | — | Yes | Stateful chat |
| `round_robin` | — | — | — | Even distribution |
| `random` | — | — | — | Testing |
---
## Cache-Aware (Recommended)
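The production default. It maintains a radix tree that mirrors backend KV cache state, routes each request to the worker with the best prefix match, and falls back to load balancing when workers become imbalanced. It maximizes KV cache hits (60-90% hit rate) and reduces time to first token (TTFT) by 70-75%.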
```bash
smg \
--policy cache_aware \
--worker-urls http://w1:8000 http://w2:8000 \
--cache-threshold 0.3 \
--balance-abs-threshold 64 \
--balance-rel-threshold 1.5
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--cache-threshold` | `0.3` | Minimum prefix match ratio (0.0–1.0) required to route to the highest-match worker. When the best match ratio is at or below this threshold, the request goes to the least-loaded healthy worker |
| `--balance-abs-threshold` | `64` | Absolute load difference threshold — triggers load balancing when exceeded |
| `--balance-rel-threshold` | `1.5` | Relative load ratio threshold — triggers load balancing when max_load > min_load × ratio |
| `--eviction-interval` | `120` | Seconds between LRU eviction cycles for the radix trees |
| `--max-tree-size` | `67108864` | Maximum nodes per radix tree. Excess nodes are evicted during maintenance cycles |
Best for multi-turn conversations, RAG applications, and batch processing with shared templates.
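
The way the three thresholds interact can be sketched in a few lines. This is a minimal illustration, not SMG's implementation: the `Worker` type and `pick_cache_aware` function are hypothetical, `match_ratio` stands in for the result of the radix tree lookup, and the sketch assumes that both balance thresholds must be exceeded before load balancing takes over.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    url: str
    load: int           # in-flight requests
    match_ratio: float  # share of the prompt already cached on this worker (0.0-1.0)
    healthy: bool = True


def pick_cache_aware(workers, cache_threshold=0.3, abs_threshold=64, rel_threshold=1.5):
    """Hypothetical decision function mirroring the parameter table above."""
    candidates = [w for w in workers if w.healthy]
    if not candidates:
        raise RuntimeError("no healthy workers")

    least_loaded = min(candidates, key=lambda w: w.load)
    max_load = max(w.load for w in candidates)
    min_load = least_loaded.load

    # Assumption: the pool counts as imbalanced only when BOTH thresholds are exceeded.
    imbalanced = (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold
    if imbalanced:
        return least_loaded

    # Balanced pool: prefer the best prefix match if it clears the cache threshold.
    best = max(candidates, key=lambda w: w.match_ratio)
    if best.match_ratio > cache_threshold:
        return best
    return least_loaded
```

With the defaults, a worker holding 80% of the prompt in cache wins as long as the pool is balanced; once it is 64 requests ahead of the idlest worker and more than 1.5 times as loaded, new requests go to the idlest worker instead.
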
---
## Power of Two Choices
Samples two random workers and routes to the one with lower load. Good load distribution with minimal overhead.
```bash
smg --policy power_of_two --worker-urls http://w1:8000 http://w2:8000
```
Best for heterogeneous workers with varying response times.
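
The selection rule is small enough to show in full. The sketch below is illustrative only: `pick_power_of_two` is a hypothetical name, and `loads` is assumed to be a mapping from worker URL to its current in-flight request count.

```python
import random


def pick_power_of_two(loads, rng=random):
    """Sample two distinct workers and return the less loaded one.

    loads: dict mapping worker URL -> in-flight request count (assumed shape).
    """
    urls = list(loads)
    if not urls:
        raise RuntimeError("no workers")
    if len(urls) == 1:
        return urls[0]
    a, b = rng.sample(urls, 2)
    return a if loads[a] <= loads[b] else b
```

Because the busiest worker is always compared against a less loaded one, it is never selected while its load is strictly the highest, which is what keeps the distribution even without scanning every worker.
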
---
## Consistent Hashing
Header-based routing that keeps redistribution minimal when workers are added or removed. Requests are routed on the `X-SMG-Routing-Key` header or, when it is absent, on implicit keys (`Authorization`, `X-Forwarded-For`, `Cookie`).
```bash
smg --policy consistent_hashing --worker-urls http://w1:8000 http://w2:8000
```
### Routing Headers
| Header | Description |
|--------|-------------|
| `X-SMG-Target-Worker` | Direct routing by worker index (0-based) |
| `X-SMG-Routing-Key` | Consistent hash routing for session affinity |
**Priority:** `X-SMG-Target-Worker` > `X-SMG-Routing-Key` > Implicit keys > Random fallback
Best for session affinity and user-to-worker pinning.
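
The priority order and the "minimal redistribution" property can be illustrated with a small hash ring. This is a sketch under stated assumptions, not SMG's code: `HashRing` and `resolve` are hypothetical names, the virtual-node count and SHA-256 hash are arbitrary choices, and the implicit headers are assumed to be checked in the order listed above.

```python
import bisect
import hashlib
import random

# Assumption: implicit keys are tried in this order.
IMPLICIT_KEY_HEADERS = ("authorization", "x-forwarded-for", "cookie")


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Each worker owns `vnodes` points on a ring; a key maps to the next point clockwise."""

    def __init__(self, workers, vnodes=100):
        points = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._hashes = [h for h, _ in points]
        self._owners = [w for _, w in points]

    def lookup(self, key: str) -> str:
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[idx]


def resolve(headers, workers, ring):
    """Apply the documented priority: target worker > routing key > implicit keys > random."""
    h = {name.lower(): value for name, value in headers.items()}
    target = h.get("x-smg-target-worker")
    if target is not None:
        return workers[int(target)]  # 0-based worker index
    key = h.get("x-smg-routing-key")
    if key is None:
        key = next((h[name] for name in IMPLICIT_KEY_HEADERS if name in h), None)
    if key is None:
        return random.choice(workers)
    return ring.lookup(key)
```

Adding a worker to the ring only inserts new points, so a key either stays on its previous worker or moves to the new one; it never shuffles between existing workers.
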
---
## Prefix Hash
A lightweight alternative to full cache-aware routing. Routes based on a hash of the first N tokens using consistent hashing with bounded load balancing.
```bash
smg \
--policy prefix_hash \
--worker-urls http://w1:8000 http://w2:8000 \
--prefix-token-count 256 \
--prefix-hash-load-factor 1.25
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--prefix-token-count` | `256` | Number of prefix tokens to hash. A longer prefix gives more precise routing; a shorter prefix groups more requests onto the same worker |
| `--prefix-hash-load-factor` | `1.25` | Load threshold ratio — if a worker's load exceeds avg_load × factor, walk the hash ring to find a less loaded worker |
Uses less memory than `cache_aware` and has a predictable O(log n) lookup cost.
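
A rough sketch of consistent hashing with a bounded load, as described by the two parameters above. The class `PrefixHashRouter`, the SHA-256 hash, the virtual-node count, and the shape of `loads` (worker URL to in-flight requests) are all assumptions made for illustration; SMG's internals may differ.

```python
import bisect
import hashlib


def _hash(data: str) -> int:
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")


class PrefixHashRouter:
    def __init__(self, workers, prefix_token_count=256, load_factor=1.25, vnodes=64):
        self.prefix_token_count = prefix_token_count
        self.load_factor = load_factor
        points = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._hashes = [h for h, _ in points]
        self._owners = [w for _, w in points]

    def _home(self, token_ids) -> int:
        # Only the first N tokens contribute to the hash; the binary search is O(log n).
        prefix = token_ids[: self.prefix_token_count]
        key = _hash(",".join(map(str, prefix)))
        return bisect.bisect_right(self._hashes, key) % len(self._hashes)

    def pick(self, token_ids, loads):
        limit = self.load_factor * sum(loads.values()) / len(loads)
        start = self._home(token_ids)
        # Walk the ring until a worker at or under the load limit is found.
        for step in range(len(self._owners)):
            worker = self._owners[(start + step) % len(self._owners)]
            if loads[worker] <= limit:
                return worker
        return self._owners[start]  # fallback if every worker exceeds the limit
```

Requests that share their first `prefix_token_count` tokens land on the same worker until that worker's load exceeds `avg_load × factor`, at which point they spill over to the next eligible worker on the ring.
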
---
## Bucket
Routes requests based on text length with adaptive boundaries. Periodically adjusts boundaries based on observed load distribution.
```bash
smg \
--policy bucket \
--worker-urls http://w1:8000 http://w2:8000 http://w3:8000 \
--balance-abs-threshold 64 \
--balance-rel-threshold 1.5
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--balance-abs-threshold` | `64` | Absolute load difference threshold for load balancing |
| `--balance-rel-threshold` | `1.5` | Relative load ratio threshold for balancing decisions |
Best for prefill-decode (PD) disaggregation, where prefill workers handle different request sizes.
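
The idea of length buckets with moving boundaries can be sketched as follows. This page does not specify how SMG adjusts its boundaries, so the quantile-based `rebalance` below is a stand-in chosen for illustration, and `BucketRouter` is a hypothetical name.

```python
import bisect


class BucketRouter:
    """Bucket i serves texts shorter than boundaries[i]; the last bucket is unbounded."""

    def __init__(self, workers, boundaries):
        if len(boundaries) != len(workers) - 1:
            raise ValueError("need one boundary fewer than workers")
        self.workers = list(workers)
        self.boundaries = sorted(boundaries)
        self.observed = []  # text lengths seen since the last rebalance

    def route(self, text: str) -> str:
        self.observed.append(len(text))
        return self.workers[bisect.bisect_right(self.boundaries, len(text))]

    def rebalance(self) -> None:
        # Stand-in policy: move boundaries to equal-count quantiles of recent lengths.
        n = len(self.workers)
        if len(self.observed) < n:
            return
        lengths = sorted(self.observed)
        self.boundaries = [lengths[len(lengths) * i // n] for i in range(1, n)]
        self.observed.clear()
```

Calling `rebalance()` on a timer shifts the boundaries toward the lengths that are actually arriving, so one bucket does not absorb most of the traffic.
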
---
## Manual
Sticky session routing with an explicit mapping from routing key to worker. Sessions stay with their assigned worker even when new workers are added. Requires the `X-SMG-Routing-Key` header.
```bash
smg \
--policy manual \
--worker-urls http://w1:8000 http://w2:8000 \
--assignment-mode min_load \
--max-idle-secs 14400 \
--eviction-interval 120
```
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--assignment-mode` | `random` | Strategy for assigning new routing keys: `random`, `min_load` (fewest active requests), or `min_group` (fewest routing keys) |
| `--max-idle-secs` | `14400` | Maximum idle time (seconds) before a routing entry is evicted. Default is 4 hours |
| `--eviction-interval` | `120` | Seconds between TTL eviction cycles |
Best for stateful chat where context is stored on workers.
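
The behaviour of the three parameters can be modelled with a dictionary of assignments. This is a simplified sketch, not SMG's implementation: `ManualRouter` and its methods are hypothetical, the `active` counters are assumed to be maintained by the caller, and worker removal is left out.

```python
import random
import time


class ManualRouter:
    def __init__(self, workers, assignment_mode="random", max_idle_secs=14400,
                 clock=time.monotonic):
        self.workers = list(workers)
        self.mode = assignment_mode
        self.max_idle_secs = max_idle_secs
        self.clock = clock
        self.assignments = {}                       # routing key -> (worker, last_used)
        self.active = {w: 0 for w in self.workers}  # in-flight requests per worker

    def add_worker(self, worker):
        self.workers.append(worker)
        self.active[worker] = 0

    def _assign(self):
        if self.mode == "min_load":    # fewest active requests
            return min(self.workers, key=lambda w: self.active[w])
        if self.mode == "min_group":   # fewest routing keys
            counts = {w: 0 for w in self.workers}
            for worker, _ in self.assignments.values():
                counts[worker] += 1
            return min(self.workers, key=lambda w: counts[w])
        return random.choice(self.workers)

    def route(self, routing_key):
        entry = self.assignments.get(routing_key)
        worker = entry[0] if entry else self._assign()
        self.assignments[routing_key] = (worker, self.clock())  # refresh idle timer
        return worker

    def evict_idle(self):
        """Drop entries idle for longer than max_idle_secs; returns how many were removed."""
        now = self.clock()
        stale = [k for k, (_, used) in self.assignments.items()
                 if now - used > self.max_idle_secs]
        for k in stale:
            del self.assignments[k]
        return len(stale)
```

Existing keys are looked up before any assignment strategy runs, which is why adding a worker never moves a live session; only new or evicted keys can land on the new worker.
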
---
## Round Robin
Rotates through workers sequentially. Skips unhealthy workers automatically.
```bash
smg --policy round_robin --worker-urls http://w1:8000 http://w2:8000
```
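
For reference, rotation with health skipping looks roughly like this. The `RoundRobin` class and the `is_healthy` callback are hypothetical names used only for this sketch.

```python
class RoundRobin:
    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0

    def pick(self, is_healthy):
        """Return the next healthy worker in rotation; is_healthy is a caller-supplied check."""
        for _ in range(len(self.workers)):
            worker = self.workers[self._next]
            self._next = (self._next + 1) % len(self.workers)
            if is_healthy(worker):
                return worker
        raise RuntimeError("no healthy workers")
```

When a worker is unhealthy, its turn passes to the next worker in the rotation, so the remaining workers still receive requests in a fixed order.
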
---
## Random
Each healthy worker has equal probability of selection. Zero state overhead.
```bash
smg --policy random --worker-urls http://w1:8000 http://w2:8000
```
---
## Choosing a Policy
| Requirement | Recommended Policy |
|-------------|-------------------|
| Production LLM inference | `cache_aware` |
| Session affinity (sticky sessions) | `manual` or `consistent_hashing` |
| PD disaggregation | `bucket` |
| Load balancing without cache | `power_of_two` |
| Lightweight cache locality | `prefix_hash` |
| Even distribution | `round_robin` |
| Testing/development | `random` |
---
## Next Steps
- [Load Balancing Concepts](../concepts/routing/load-balancing.md) — Detailed policy architecture, advantages/limitations, scenario guides
- [Cache-Aware Routing Concepts](../concepts/routing/cache-aware.md) — Radix tree architecture and routing algorithm deep dive
- [Tokenizer Caching](tokenizer-caching.md) — Reduce tokenization overhead with two-level caching