---
title: KV Events Cache-Aware Routing
---
# KV Events Cache-Aware Routing
This guide walks through wiring an **SGLang worker emitting KV cache events** to **SMG running the cache-aware policy in event-driven mode**, so the gateway routes each request to the worker whose KV cache already holds the longest prefix.
#### Before you begin
- Completed the [Getting Started](index.md) guide
- Read [Cache-Aware Routing](../concepts/routing/cache-aware.md) for the routing concepts
- A machine that can run an SGLang worker (GPU + CUDA-capable Python environment)
- `smg-grpc-servicer[sglang]` installed alongside SGLang
---
## Why event-driven?
Cache-aware routing has three internal flavours. The one this guide configures is the most accurate of the three because it routes against the worker's **actual** KV cache state rather than an approximation.
| Flavour | Tree | Input | Worker connection | Triggered when |
|---|---|---|---|---|
| **Event-driven** | `PositionalIndexer` (event-built) | Token IDs | gRPC | Worker emits KV events |
| Approximate token tree | `TokenTree` (prefix observed at routing time) | Token IDs | gRPC | Worker is gRPC but emits no events |
| Approximate string tree | `Tree` (prefix observed at routing time) | Raw text | HTTP | Worker is HTTP |
Selection is automatic and per-worker: enabling events on one worker upgrades that worker's routing path; the others keep using the approximate tree.
---
## How the pieces fit together
```
┌────────────┐ ┌────────────────────────┐ ┌──────────────────┐
│ client │ ──▶ │ smg gateway │ ──▶ │ smg-grpc-servicer│
│ │ │ ─ cache_aware policy │ │ + sglang scheduler│
│ │ │ ─ KvEventMonitor │ ◀── │ ZMQ PUB ─ KV evt │
└────────────┘ └────────────────────────┘ └──────────────────┘
gRPC ZMQ (in-process)
SubscribeKvEvents
```
1. SGLang's scheduler publishes block-stored / block-removed events on a ZMQ `PUB` socket configured by `--kv-events-config`.
2. `smg-grpc-servicer` (running in the same process, launched via `--grpc-mode`) subscribes to that ZMQ socket and re-publishes the events as a gRPC server-streaming RPC (`SubscribeKvEvents`).
3. SMG's `KvEventMonitor` opens one gRPC subscription per worker, feeds the events into a per-model `PositionalIndexer`, and the `cache_aware` policy queries that indexer at routing time.
---
## Step 1 — Launch the SGLang worker
Install the SGLang extra of the servicer, then launch the SGLang server with both `--grpc-mode` and `--kv-events-config`:
```bash
pip install "smg-grpc-servicer[sglang]"
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 50051 \
--grpc-mode \
--page-size 16 \
--kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:5557","topic":"kv-events"}'
```
What each flag does:
| Flag | Why |
|---|---|
| `--grpc-mode` | Hands the request loop off to `smg-grpc-servicer`'s gRPC `SglangScheduler` service instead of SGLang's default HTTP server. Required for SMG to talk to this worker in gRPC mode. |
| `--page-size 16` | The KV cache block size, in tokens. Mirror this in SMG's worker config so the gateway can align its overlap scoring to the right page boundaries (see [Block size alignment](#block-size-alignment)). |
| `--kv-events-config` | A JSON object parsed by SGLang's `KVEventsConfig.from_cli`. Setting `publisher: "zmq"` is what actually turns on event publishing — the default `publisher: "null"` is a no-op. |
### `--kv-events-config` field reference
All fields and defaults match SGLang's `KVEventsConfig` (see `python/sglang/srt/disaggregation/kv_events.py` upstream):
| Field | Default | Notes |
|---|---|---|
| `publisher` | `"null"` | Set to `"zmq"` to enable. Any other value disables event bridging in the servicer. |
| `endpoint` | `"tcp://*:5557"` | ZMQ `PUB` socket address. The publisher **binds** when the endpoint contains `*`, `::`, or starts with `ipc://` / `inproc://`; otherwise it connects. |
| `topic` | `""` | ZMQ topic prefix. Match this on the subscriber side; SMG accepts any topic, so the value here matters only if you wire other subscribers in parallel. |
| `replay_endpoint` | `null` | Optional REQ/REP socket for replaying missed events. SMG does not currently use replay. |
| `buffer_steps` | `10000` | Size of the in-publisher replay buffer (events). |
| `hwm` | `100000` | ZMQ high-water mark. Once N events are queued and the consumer hasn't drained them, new events drop. |
| `max_queue_size` | `100000` | Internal queue between SGLang and the ZMQ thread. |
For data-parallel deployments, the actual TCP port becomes `endpoint_port + dp_rank` (rank 0 keeps the configured port).
---
## Step 2 — Launch SMG
Point SMG at the gRPC worker and select `cache_aware`:
```bash
smg \
--worker-urls grpc://worker-1:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--policy cache_aware \
--block-size 16 \
--host 0.0.0.0 \
--port 30000
```
The flags that matter for event-driven routing:
| Flag | Why |
|---|---|
| `grpc://...` worker URL | Event subscription only runs over gRPC; HTTP workers are skipped silently. |
| `--policy cache_aware` | The only policy that consults the `PositionalIndexer`. |
| `--block-size 16` | Fallback block size used until the first event arrives. After events start flowing, SMG **learns** the worker's true block size from the event payload and uses the learned value automatically. |
`--model-path` is still required for tokenization at the gateway, the same as any gRPC-worker deployment ([gRPC Workers](grpc-workers.md)).
### Block size alignment
The cache-aware policy chunks an incoming request's token IDs into blocks of `block_size` tokens to look them up in the `PositionalIndexer`. If the block size does not match what SGLang actually wrote to its cache, **the lookup misses every block** and the policy silently falls back to load-only routing.
Order of precedence inside SMG:
1. **Event-learned block size** (highest priority — discovered per-model from the event stream).
2. **Per-worker `kv_block_size`** in the worker spec, if you load workers from a config file.
3. **`--block-size` CLI flag** (router-wide default).
In practice: keep `--page-size` (SGLang) and `--block-size` (SMG) numerically equal, and let SMG correct itself once events arrive.
### Worker config file
If you load workers from a config file rather than CLI, pin the block size per worker so event-driven routing works on the very first request:
```yaml
workers:
- url: grpc://worker-1:50051
connection_mode: grpc
kv_block_size: 16
- url: grpc://worker-2:50052
connection_mode: grpc
kv_block_size: 16
```
---
## Step 3 — Send a request
The API surface is unchanged:
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "user", "content": "Hello, who are you?"}
]
}'
```
Send the same prompt twice. On the second call the request should land on the worker that already serves the first call's prefix.
---
## Verifying event delivery
The gateway logs three events that prove the path is live.
**1. Subscription started.** When SMG registers a gRPC worker, `KvEventMonitor::on_worker_added` logs:
```
INFO Starting KV event subscription worker_url=grpc://worker-1:50051 model_id=meta-llama/Llama-3.1-8B-Instruct
```
If you do not see this line for a worker, that worker is either HTTP or the subscription task crashed before the first connect — check the worker logs.
**2. Backend block size learned.** Once the first event arrives, SMG records the backend's actual block size:
```
DEBUG Learned block_size=16 model_id=meta-llama/Llama-3.1-8B-Instruct
```
**3. Routing decision uses the indexer.** With `RUST_LOG=model_gateway::policies::cache_aware=debug`, a routed request prints the overlap count and the chosen worker.
If events never arrive, the policy keeps working — it falls back to the approximate `TokenTree` for that worker — so cache hits will still happen, just less accurately.
---
## Tuning
| Knob | Where | Effect |
|---|---|---|
| `--cache-threshold` | SMG | Minimum prefix overlap ratio before cache affinity overrides load. Default 0.5. Lower for more aggressive cache stickiness. |
| `--balance-abs-threshold` / `--balance-rel-threshold` | SMG | Imbalance triggers. When workers diverge in load past both thresholds, the policy switches to shortest-queue regardless of cache. |
| `hwm` | SGLang `--kv-events-config` | Raise if you see SGLang logs reporting dropped events under bursty load. |
| `buffer_steps` | SGLang `--kv-events-config` | Raise if SMG ever reports gap-detected reconnects on its KV event stream. |
---
## Caveats
- **gRPC only.** Event-driven routing requires a gRPC worker — `smg-grpc-servicer` is the bridge that turns SGLang's in-process ZMQ feed into a gRPC server-streaming surface SMG can subscribe to. HTTP workers fall back to the approximate string tree automatically.
- **Per-worker block size assumed homogeneous within a model.** If you mix workers serving the same model with different `--page-size` values, the policy uses whichever block size the most recent event reported. Keep page sizes homogeneous within a model.
- **`mesh` mode synchronizes the approximate trees, not events.** When multiple SMG instances cluster via `--enable-mesh`, the event-driven indexer is local to each gateway. Each gateway independently subscribes to each worker.
- **No replay on reconnect today.** SMG reconnects with exponential backoff on stream drops, but does not currently consume SGLang's `replay_endpoint`. A drop window may briefly degrade routing to load-only until events resume.
---
## Reference
- Policy implementation: `model_gateway/src/policies/cache_aware.rs`
- Event subscription manager: `model_gateway/src/worker/kv_event_monitor.rs`
- KV event proto: `crates/grpc_client/proto/common.proto` (messages `KvEventBatch`, `KvCacheEvent`, `KvBlocksStored`, `KvBlocksRemoved`)
- Servicer bridge: `grpc_servicer/smg_grpc_servicer/sglang/servicer.py` (`SubscribeKvEvents`)
- SGLang upstream config: `python/sglang/srt/disaggregation/kv_events.py` (class `KVEventsConfig`)