---
title: Tokenizer Caching
---
# Tokenizer Caching
SMG provides a two-level tokenizer cache that reduces tokenization overhead for repeated content. In typical production workloads, this achieves 60-90% cache hit rates.
#### Before you begin
- Completed the [Getting Started](index.md) guide
- Using gRPC workers (tokenization happens at the gateway)
- `--model-path` configured so SMG can load the tokenizer
---
## How It Works
| Cache Level | Strategy | Best For |
|-------------|----------|----------|
| **L0** (Exact Match) | Hash-based O(1) lookup for identical strings | Repeated system prompts, batch inference |
| **L1** (Prefix Match) | Boundary-aligned prefix matching, tokenizes only the suffix | Multi-turn conversations, growing contexts |
On a multi-turn conversation, L1 avoids re-tokenizing the entire history — only new messages are tokenized.
---
## Enable Caching
Both cache levels are disabled by default. Enable them with CLI flags:
### L0 Only (Exact Match)
Best for workloads with many identical prompts (system prompts, batch processing):
```bash
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 10000
```
### L0 + L1 (Exact + Prefix Match)
Best for multi-turn chat applications:
```bash
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 20000 \
--tokenizer-cache-enable-l1 \
--tokenizer-cache-l1-max-memory 104857600
```
---
## Configuration Reference
### L0 Cache
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--tokenizer-cache-enable-l0` | `false` | Enable exact match cache |
| `--tokenizer-cache-l0-max-entries` | `10000` | Maximum number of cached entries |
Each entry uses ~2.2 KB of memory.
### L1 Cache
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--tokenizer-cache-enable-l1` | `false` | Enable prefix match cache |
| `--tokenizer-cache-l1-max-memory` | `52428800` (50 MB) | Maximum memory in bytes |
---
## Memory Planning
### L0 Sizing
| Entries | Memory | Recommended For |
|---------|--------|-----------------|
| 1,000 | ~2.2 MB | Development, testing |
| 10,000 | ~22 MB | Standard production |
| 25,000 | ~55 MB | High-repetition workloads |
| 50,000 | ~110 MB | Large-scale deployments |
Set L0 entries to 1-2x the number of unique system prompt variants in your workload.
### L1 Sizing
| Memory | Recommended For |
|--------|-----------------|
| 25 MB | Memory-constrained environments |
| 50 MB | Standard deployments (default) |
| 100 MB | Multi-turn conversation heavy |
| 200 MB | Long context applications |
Estimate ~1 KB per active conversation context for L1 sizing.
---
## Recommended Configurations
=== "High-Throughput Chat"
For workloads with repeated system prompts:
```bash
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 50000
```
=== "Multi-Turn Conversations"
For chat applications with growing conversation history:
```bash
smg \
--worker-urls grpc://worker:50051 \
--model-path Qwen/Qwen2.5-7B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 20000 \
--tokenizer-cache-enable-l1 \
--tokenizer-cache-l1-max-memory 104857600
```
=== "Memory-Constrained"
Moderate benefit with minimal memory:
```bash
smg \
--worker-urls grpc://worker:50051 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tokenizer-cache-enable-l0 \
--tokenizer-cache-l0-max-entries 5000
```
---
## Next Steps
- [Tokenizer Caching Concepts](../concepts/performance/tokenizer-caching.md) — Cache architecture, special token boundaries, monitoring metrics, PromQL queries
- [gRPC Workers](grpc-workers.md) — Enable gateway-level tokenization with gRPC mode
- [Load Balancing](load-balancing.md) — Choose a routing policy (cache-aware routing uses tokenizer results)