--- title: Cache-Aware Routing --- # Cache-Aware Routing Cache-aware routing is SMG's most sophisticated load balancing policy. It maintains a multi-tenant radix tree that mirrors backend KV cache state, enabling perfect cache prediction and optimal worker selection. --- ## Overview
### :material-tree: Radix Tree Mirroring SMG maintains an exact replica of each backend's KV cache structure, enabling 100% accurate prefix match calculations.
### :material-sync: Backend Synchronization Uses the same tokens, page sizes, and eviction policies as backend schedulers (vLLM, TensorRT-LLM, TokenSpeed, SGLang).
### :material-scale-balance: Intelligent Load Balancing Automatically switches between cache affinity and load-based routing based on real-time worker state.
### :material-lightning-bolt: Dramatic TTFT Reduction Achieves 70-75% reduction in Time to First Token through optimal KV cache reuse.
--- ## Why Cache-Aware Routing? LLM inference workers maintain a **KV cache** of previously computed attention states. When a new request shares a prefix with a previous request, the worker can skip recomputing that portion—dramatically reducing latency.
### :material-message-text: Multi-Turn Conversations Same system prompt across turns. Growing context benefits from cached prefixes.
### :material-file-document-multiple: RAG Applications Common document snippets. Shared retrieval prefixes across queries.
### :material-tray-full: Batch Processing Similar prompts in sequence. Template-based generation with variable suffixes.
### :material-account-group: Shared System Prompts Multiple users with same instructions. Amortized prefill cost across sessions.
**The Challenge:** If requests with shared prefixes go to different workers, each worker must recompute the same prefix independently—wasting GPU cycles. **Cache-Aware Solution:** Route requests to workers that already have the relevant prefix cached, maximizing KV cache hits and minimizing redundant computation. --- ## Multi-Tenant Radix Tree Architecture
![Multi-Tenant Radix Tree Architecture](../../assets/images/radix-tree.svg)
SMG maintains a **multi-tenant radix tree** that mirrors the KV cache state on each backend worker. This enables true cache-aware routing with 100% prefix match accuracy.
### :material-sync: 100% Backend Synchronization The gateway's radix tree uses the **exact same parameters** as backend schedulers: - **Same tokens**: Pre-tokenized input matches backend representation - **Same page size**: Aligned to kernel page boundaries (e.g., 16 tokens for FlashInfer) - **Same eviction policy**: LRU, LFU, FIFO, MRU, FILO, or Priority
### :material-tree: Two Tree Implementations | Tree Type | Mode | Input | Use Case | |-----------|------|-------|----------| | **StringTree** | HTTP | Raw text | OpenAI-compatible endpoints | | **TokenTree** | gRPC | Token IDs | Direct worker communication |
**Why This Matters:** - **Perfect Cache Prediction**: Since the gateway tree mirrors backend behavior exactly, prefix match calculations are 100% accurate - **Kernel-Aware**: Honors page boundaries used by inference kernels (FlashInfer, Mamba, etc.) - **Eviction Parity**: Tracks the same eviction policy as vLLM/TensorRT-LLM/TokenSpeed/SGLang schedulers --- ## Routing Algorithm ### Two-State Decision Model Cache-aware routing operates in two states based on system load:
### :material-cached: Balanced State (Cache-Affinity Mode) When workers are evenly loaded: 1. Search radix tree for longest matching prefix 2. If match ratio > `cache-threshold` → route to matched worker 3. If match ratio ≤ threshold → route to least-loaded healthy worker **Goal**: Maximize KV cache hit rate
### :material-scale-balance: Imbalanced State (Load-Balance Mode) When load difference exceeds thresholds: 1. Calculate load difference: `max_load - min_load` 2. Calculate load ratio: `max_load / min_load` 3. If both exceed thresholds → route to least loaded worker **Goal**: Prevent worker overload
### Imbalance Detection The system is considered imbalanced when **both** conditions are met: ``` (max_load - min_load) > balance_abs_threshold AND max_load > balance_rel_threshold × min_load ``` | Parameter | Default | Description | |-----------|---------|-------------| | `balance_abs_threshold` | 64 | Absolute load difference threshold | | `balance_rel_threshold` | 1.5 | Relative load ratio threshold (1.5 = 50% difference) | ### Selection Flow
![Cache-Aware Routing Selection Flow](../../assets/images/cache-routing-flow.svg)
--- ## Performance Impact ### Benchmarks In typical conversational workloads: | Metric | Without Cache-Aware | With Cache-Aware | Improvement | |--------|---------------------|------------------|-------------| | **TTFT (p50)** | 150ms | 45ms | **70%** | | **TTFT (p99)** | 800ms | 200ms | **75%** | | **Cache hit rate** | 0% | 65% | - | | **GPU utilization** | 85% | 70% | **18% reduction** | !!! note "Results vary based on workload" Workloads with diverse, unique prompts will see smaller improvements. Workloads with repeated prefixes (chatbots, RAG) see the largest gains. ### When Cache-Aware Helps Most
#### :material-check-circle: High Impact - Multi-turn conversations with shared system prompts - RAG applications with common document prefixes - Batch processing with template-based prompts - Multiple users with identical instructions
#### :material-close-circle: Lower Impact - Completely random, unique prompts - Single-turn diverse queries - Very short prompts (< 50 tokens) - Highly variable system prompts
--- ## Tuning Guidelines ### Cache Threshold The `cache-threshold` parameter controls how aggressively routing favors cache affinity: | Value | Behavior | Use Case | |-------|----------|----------| | **Low (0.1-0.2)** | More cache hits, potential hot spots | Conversational workloads | | **Default (0.3)** | Balanced approach | Most deployments | | **High (0.5-0.8)** | Fewer cache hits, better load balance | Diverse workloads | !!! tip "Recommendation" Start with the default (0.3) and increase if you observe hot spots or uneven load distribution. ### Balance Thresholds | Scenario | Recommendation | |----------|---------------| | **Homogeneous workers** | Higher thresholds (favor caching) | | **Heterogeneous workers** | Lower thresholds (favor load balance) | | **Bursty traffic** | Lower thresholds (prevent queuing) | | **Steady traffic** | Higher thresholds (maximize cache hits) | ### Memory Management The radix tree grows with unique prefixes. Configure based on your memory budget: | Memory Budget | `max-tree-size` | Unique Prefixes | |---------------|-----------------|-----------------| | 1 GB | 16,777,216 | ~16M nodes | | 4 GB | 67,108,864 | ~64M nodes (default) | | 16 GB | 268,435,456 | ~256M nodes | **Eviction**: SMG periodically evicts stale entries using LRU policy. Configure with `--eviction-interval` (default: 120 seconds). --- ## Monitoring ### Key Metrics | Metric | Description | Alert Threshold | |--------|-------------|-----------------| | `smg_worker_requests_active` | Active requests per worker | Imbalance >50% | | `smg_worker_selection_total` | Worker selection count | - | | `smg_router_request_duration_seconds` | End-to-end request latency | - | | `smg_router_ttft_seconds` | Time to first token (gRPC only) | - | ### Useful PromQL Queries
#### Load Distribution ```promql stddev(smg_worker_requests_active) / avg(smg_worker_requests_active) ```
#### Time to First Token (gRPC only) ```promql rate(smg_router_ttft_seconds_sum[5m]) / rate(smg_router_ttft_seconds_count[5m]) ```
### Alert Thresholds | Metric | Warning | Critical | Action | |--------|---------|----------|--------| | Load imbalance | >30% | >50% | Lower balance thresholds | | TTFT p99 | >500ms | >1s | Review workload patterns and cache policy | --- ## Multi-Gateway Considerations When running multiple SMG instances behind a load balancer: - Each instance maintains its **own radix tree** - Cache efficiency drops by **10-20%** compared to single instance - Consider **session affinity** at the load balancer to improve cache hits ### Load Balancer Configuration === "NGINX" ```nginx upstream smg { hash $http_x_user_id consistent; # Affinity by user server smg-1:30000; server smg-2:30000; } ``` === "HAProxy" ```haproxy backend smg balance hdr(X-User-ID) hash-type consistent server smg-1 smg-1:30000 check server smg-2 smg-2:30000 check ``` === "Kubernetes" ```yaml apiVersion: networking.k8s.io/v1 kind: Ingress metadata: annotations: nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_user_id" ``` --- ## Event-Driven Block Size By default, SMG uses the `--block-size` CLI flag (default: 16) to align the radix tree with backend KV cache page boundaries. However, when workers report KV cache events, the cache-aware policy can **learn the actual block size** directly from those events. This is especially useful when different backends use different page sizes—rather than requiring a single global flag, each worker's block size is discovered automatically from its event stream. The learned block size takes precedence over the CLI default, ensuring routing accuracy without manual configuration per backend. | Source | Priority | Use Case | |--------|----------|----------| | **Worker KV events** | Highest | Automatically learned per worker | | **`--block-size` flag** | Fallback | Global default when events are unavailable | !!! note Event-driven block size detection requires backends that emit KV cache events (e.g., SGLang with event reporting enabled). If a worker does not report events, the `--block-size` CLI value is used as the fallback. --- ## Mesh State Synchronization When running in mesh HA mode (`--enable-mesh`), cache-aware routing state is **synchronized across all router nodes**. Each node's radix tree receives updates from the mesh, ensuring that every gateway instance has a consistent view of backend KV cache state. This eliminates the cache efficiency penalty described in [Multi-Gateway Considerations](#multi-gateway-considerations) and enables accurate cache-aware decisions across the entire cluster. For mesh setup and configuration details, see [High Availability](../architecture/high-availability.md). !!! tip When using cache-aware routing with mesh HA, KV cache state is automatically synchronized across all nodes. No additional configuration is needed beyond enabling mesh. --- ## Comparison with Other Policies | Aspect | cache_aware | prefix_hash | consistent_hashing | |--------|-------------|-------------|-------------------| | **Memory** | O(tokens) | O(workers) | O(workers) | | **Lookup** | O(prefix_len) | O(log n) | O(log n) | | **Precision** | Exact matching | Prefix grouping | Key-based | | **Cache locality** | Excellent | Good | Key-dependent | | **Load balancing** | Integrated | Load factor | None | --- ## What's Next?
### :material-scale-balance: Load Balancing Compare all available routing policies. [Load Balancing →](load-balancing.md)
### :material-chart-box: Metrics Reference Complete list of routing metrics. [Metrics Reference →](../../reference/metrics.md)