---
title: "🗜️ Prompt Compression Guide — OmniRoute"
version: 3.8.40
lastUpdated: 2026-06-28
---

# 🗜️ Prompt Compression Guide — OmniRoute

> Save 15-95% on eligible context automatically. For a quick overview, see the [README Compression section](../README.md#%EF%B8%8F-prompt-compression--save-15-95-eligible-tokens-automatically).

## Overview

OmniRoute implements a modular prompt compression pipeline that runs **proactively** before requests hit upstream providers. This means your token savings happen transparently — no changes needed to your workflow.

```
Client Request
  → Compression Strategy Selector
      → Combo override? → Use combo setting
      → Auto-trigger threshold? → Use auto mode
      → Default mode? → Use global setting
      → Off? → Skip compression
  → Selected Compression Mode
      → Off: No compression
      → Lite: Safe whitespace/formatting cleanup (~15%)
      → Standard: Caveman-speak filler removal (~30%)
      → Aggressive: History aging + summarization (~50%)
      → Ultra: Heuristic pruning + code-block thinning (~75%)
      → RTK: Command-aware terminal/tool-output filtering (60-90% upstream range)
      → Stacked: Ordered multi-engine pipeline, usually RTK then Caveman (78-95% eligible range)
  → Compressed Request → Provider
```

---

## Compression Modes

### Off

No compression applied. All messages pass through unchanged.

### Lite Mode (~15% savings, <1ms latency)

The safest mode — zero semantic change, only formatting cleanup:

| Technique                | Description                                       |
| ------------------------ | ------------------------------------------------- |
| `collapseWhitespace`     | Merge consecutive blank lines and trailing spaces |
| `dedupSystemPrompt`      | Remove duplicate system messages                  |
| `compressToolResults`    | Compress verbose tool/function outputs            |
| `removeRedundantContent` | Strip repeated instructions                       |
| `replaceImageUrls`       | Shorten base64 image data URIs                    |

**Best for:** Always-on usage, safety-critical workflows.

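
As a rough illustration of the kind of transformation Lite mode performs, the sketch below strips trailing whitespace and collapses runs of blank lines. The function name `collapseWhitespaceSketch` is hypothetical and this is not OmniRoute's actual `collapseWhitespace` implementation. In particular, it does not skip fenced code blocks, which the real pipeline is documented to protect.

```typescript
// Hypothetical sketch only; not OmniRoute's actual collapseWhitespace code.
// Assumes "\n" line endings and does not special-case fenced code blocks.
function collapseWhitespaceSketch(text: string): string {
  return text
    .replace(/[ \t]+$/gm, "") // strip trailing spaces and tabs on every line
    .replace(/\n{3,}/g, "\n\n"); // collapse runs of blank lines into one blank line
}

const before = "first line   \n\n\n\nsecond line\t\n";
console.log(JSON.stringify(collapseWhitespaceSketch(before)));
```

Because both rules only touch whitespace, the words and their order are unchanged, which is why this class of cleanup is considered safe for always-on use.
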
### Standard Mode (~30% savings)

Inspired by [Caveman](https://github.com/JuliusBrussee/caveman) — removes filler words and verbose phrasing while preserving meaning:

- Removes filler words ("please", "I think", "basically", "actually")
- Condenses verbose phrases ("in order to" → "to", "as a result of" → "because")
- Strips polite hedging ("Would you mind...", "If you could possibly...")
- 30+ regex rules tuned for coding prompts

**Best for:** Daily coding workflows, cost-conscious teams.

### Aggressive Mode (~50% savings)

Smart history management for long sessions:

- **Message Aging** — older messages get progressively compressed
- **Tool Result Summarization** — long tool outputs replaced with summaries
- **Structural Integrity Guards** — ensures `tool_use` + `tool_result` pairs stay consistent
- **Context Window Awareness** — respects per-model token limits

**Best for:** Extended debugging sessions, large codebases.

### Ultra Mode (~75% savings)

Maximum compression for token-critical scenarios:

- **Heuristic Pruning** — removes messages below relevance threshold
- **Code Block Thinning** — compresses repetitive code examples
- **Binary Search Truncation** — finds optimal cut point for context window
- All Aggressive mode features included

**Best for:** When you're hitting context limits repeatedly.

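
The **Binary Search Truncation** step listed under Ultra mode can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not OmniRoute's implementation: `findCutPoint`, `estimateTokens`, and the `Message` type are hypothetical names, the 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and the search assumes that dropping more of the oldest messages never increases the token count.

```typescript
type Message = { role: string; content: string };

// Assumption: roughly 4 characters per token. Real tokenizers differ.
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Returns the smallest index `cut` such that messages.slice(cut) fits the budget.
// Everything before `cut` (the oldest messages) would be dropped.
function findCutPoint(
  messages: Message[],
  budget: number,
  countTokens: (kept: Message[]) => number = estimateTokens,
): number {
  let lo = 0;
  let hi = messages.length; // slice(hi) is empty, so it always fits
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (countTokens(messages.slice(mid)) <= budget) {
      hi = mid; // the suffix fits: try keeping more history
    } else {
      lo = mid + 1; // too large: drop more of the oldest messages
    }
  }
  return lo;
}

const history: Message[] = [
  { role: "user", content: "x".repeat(40) }, // ~10 tokens
  { role: "assistant", content: "y".repeat(20) }, // ~5 tokens
  { role: "user", content: "z".repeat(8) }, // ~2 tokens
];
console.log(findCutPoint(history, 7)); // index of the first message that is kept
```

The search calls `countTokens` only O(log n) times, which matters when counting tokens is expensive. A production version would also need to respect the structural guards described above, for example never cutting between a `tool_use` message and its `tool_result`.
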
### RTK Mode (60-90% upstream range)

RTK mode is optimized for verbose tool outputs that appear in coding-agent sessions:

- Detects command/output classes such as `git status`, `git diff`, `git log`, test runners, TypeScript/Vite/Webpack builds, ESLint/Biome/Prettier, npm audit/installs, Docker logs, infra output, and generic shell output
- Applies JSON filter packs from `open-sse/services/compression/engines/rtk/filters/`
- Ships 49 built-in filters with inline verify samples
- Removes ANSI control sequences, progress bars, repeated lines, and non-actionable noise
- Preserves failures, errors, warnings, changed files, summaries, and the tail of long output
- Supports trust-gated project filters, global filters, and optional redacted raw-output recovery

**Best for:** Agent sessions with shell, build, test, git, grep, and file-output transcripts.

### Stacked Mode (78-95% eligible range)

Stacked mode runs multiple compression engines in a deterministic order. The default pipeline is:

```txt
RTK -> Caveman
```

That order keeps terminal/tool output compact first, then applies Caveman semantic condensation to the remaining natural-language prompt. Stacked pipelines can be configured globally or through compression combos assigned to routing combos.

**Best for:** Mixed context with large tool logs plus human instructions or assistant summaries.

---

## Upstream Savings Math

OmniRoute documents compression savings from two sources: upstream project benchmarks and OmniRoute's own engine composition.


| Source  | Upstream README number used here                                                                                      |
| ------- | --------------------------------------------------------------------------------------------------------------------- |
| Caveman | `~75%` fewer output tokens, `65%` benchmark average output savings, `22-87%` range, and `~46%` input compression tool |
| RTK     | `60-90%` command-output savings; sample session `~118,000 -> ~23,900` tokens, or `79.7%` saved (`~80%`)               |

For overlapping tool/context payloads, the default OmniRoute combo stacks the engines:

```txt
RTK -> Caveman
```

The combined savings are multiplicative, not additive:

```txt
combined = 1 - (1 - RTK savings) * (1 - Caveman input savings)
average  = 1 - (1 - 0.80) * (1 - 0.46) = 89.2%
range    = 1 - (1 - 0.60..0.90) * (1 - 0.46) = 78.4-94.6%
```

That `78-95%` number applies when both RTK and Caveman can reduce the same input/context payload. Caveman response output mode is separate: when enabled, use Caveman's own output savings (`65%` average, `~75%` headline, `22-87%` range). Total billing savings depend on your prompt/output mix.

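
The multiplicative formula above can be checked with a few lines of TypeScript. The helper names `combinedSavings` and `pct` are illustrative only and are not part of any OmniRoute API; the inputs are the upstream figures quoted in the table.

```typescript
// Combined savings when two engines reduce the same payload in sequence.
// Both arguments are fractions in [0, 1]. Names are illustrative only.
function combinedSavings(first: number, second: number): number {
  return 1 - (1 - first) * (1 - second);
}

const cavemanInput = 0.46; // Caveman's ~46% input-compression figure
const pct = (x: number): string => (x * 100).toFixed(1) + "%";

console.log(pct(combinedSavings(0.8, cavemanInput))); // average case: 89.2%
console.log(pct(combinedSavings(0.6, cavemanInput))); // low end of the RTK range: 78.4%
console.log(pct(combinedSavings(0.9, cavemanInput))); // high end of the RTK range: 94.6%
```
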

---

## Token Savings Visualization

```
Without compression:  47K tokens sent to LLM
With Lite:            40K tokens sent (15% saved — safe, always-on)
With Standard:        33K tokens sent (30% saved — caveman-speak rules)
With Aggressive:      24K tokens sent (50% saved — aging + summarization)
With Ultra:           12K tokens sent (75% saved — heuristic pruning)
With RTK:             19K-5K tokens sent (60-90% saved on command/tool output)
With Stacked:         10K-2.5K tokens sent (78-95% eligible RTK+Caveman range)
```

---

## Configuration

### Dashboard

Navigate to `Dashboard → Context & Cache`:

- **Caveman** — mode selection, language packs, preview, and global defaults
- **RTK** — command-filter preview, RTK safety settings, and filter catalog
- **Compression Combos** — named engine pipelines assigned to routing combos
- **Auto-Trigger Threshold** — automatically engage compression when token count exceeds threshold

### Per-Combo Override

In `Dashboard → Context & Cache → Compression Combos`, assign a compression combo to a routing combo:

```txt
Combo: "free-forever"
  Compression Combo: "coding-agent-stack"
  Pipeline: RTK -> Caveman
  Targets:
    1. if/kimi-k2-thinking
    2. qw/qwen3-coder-plus
```

This lets you use stacked compression on free/coding providers while keeping lite mode on paid subscriptions.

### Per-request override

Send the `x-omniroute-compression` request header to override the compression plan for a single request. It has the highest precedence — it beats the routing-combo override, the active profile, auto-trigger, and the panel Default. Unknown values are ignored (the request is never rejected) and the global master switch still gates everything: when compression is off globally, the header cannot turn it on.

Values:

| Value           | Effect                                                               |
| --------------- | -------------------------------------------------------------------- |
| `off`           | No compression for this request.                                     |
| `default`       | The panel-derived Default profile (ignores the active profile).      |
| `engine:<name>` | A single engine when enabled, e.g. `engine:rtk`.                     |
| `<combo>`       | A named combo, matched by name (case-insensitive) first, then by id. |

The applied plan is echoed back in the `X-OmniRoute-Compression: <plan>; source=<source>` response header, where `<source>` is one of `request-header`, `routing-override`, `active-profile`, `auto-trigger`, `default`, or `off`.

### API

```bash
# Get compression settings
curl http://localhost:20128/api/settings/compression

# Update compression settings
curl -X PUT http://localhost:20128/api/settings/compression \
  -H "Content-Type: application/json" \
  -d '{"defaultMode":"stacked","autoTriggerMode":"stacked","autoTriggerTokens":32000}'

# Preview a specific RTK/stacked payload
curl -X POST http://localhost:20128/api/compression/preview \
  -H "Content-Type: application/json" \
  -d '{"mode":"rtk","messages":[{"role":"tool","content":"npm test output here"}]}'

# List RTK filter packs
curl http://localhost:20128/api/context/rtk/filters

# Test RTK directly with optional command metadata
curl -X POST http://localhost:20128/api/context/rtk/test \
  -H "Content-Type: application/json" \
  -d '{"command":"npm test","text":"FAIL tests/example.test.ts\nError: boom"}'
```

---

## What Gets Protected

The compression engine **always preserves:**

- ✅ Code blocks (fenced and inline)
- ✅ URLs and file paths
- ✅ JSON structures and structured data
- ✅ Identifiers and protected technical tokens
- ✅ Mathematical expressions
- ✅ Tool/function call definitions
- ✅ System prompts (in lite mode)

RTK raw-output recovery redacts common API keys, bearer tokens, Slack tokens, AWS access keys, passwords, tokens, and secrets before anything is persisted.

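
To make the redaction step more concrete, here is a minimal sketch of pattern-based secret scrubbing. The patterns, the placeholder strings, and the `redactSecrets` name are all assumptions for illustration; OmniRoute's actual redaction rules are not shown in this guide and may differ. The AWS value used in the example is the well-known documentation sample key, not a real credential.

```typescript
// Hypothetical patterns for illustration; OmniRoute's real rules may differ.
const SECRET_PATTERNS: Array<[RegExp, string]> = [
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [REDACTED]"], // bearer tokens
  [/\bAKIA[0-9A-Z]{16}\b/g, "[REDACTED_AWS_KEY]"], // AWS access key IDs
  [/\bxox[baprs]-[A-Za-z0-9-]+/g, "[REDACTED_SLACK_TOKEN]"], // Slack tokens
  [/\b(password|secret|token)=\S+/gi, "$1=[REDACTED]"], // key=value secrets
];

// Applies every pattern in order and returns the scrubbed text.
function redactSecrets(text: string): string {
  return SECRET_PATTERNS.reduce((out, [re, repl]) => out.replace(re, repl), text);
}

const raw = [
  "curl -H 'Authorization: Bearer abc.DEF-123' https://example.test",
  "export AWS_ACCESS_KEY_ID AKIAIOSFODNN7EXAMPLE",
  "login password=hunter2 ok",
].join("\n");

console.log(redactSecrets(raw));
```
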

---

## Compression Stats

Every compressed request includes stats in the server logs:

```json
{
  "originalTokens": 47200,
  "compressedTokens": 40120,
  "savingsPercent": 15.0,
  "techniquesUsed": ["collapseWhitespace", "dedupSystemPrompt"],
  "mode": "lite",
  "engine": "caveman",
  "compressionComboId": "coding-agent-stack",
  "durationMs": 0.8,
  "rtkRawOutputPointers": []
}
```

---

## Phase Roadmap

| Phase   | Modes                                                                | Status     |
| ------- | -------------------------------------------------------------------- | ---------- |
| Phase 1 | Off, Lite                                                            | ✅ Shipped |
| Phase 2 | Standard, Aggressive, Ultra                                          | ✅ Shipped |
| Phase 3 | RTK, Stacked, Compression Combos                                     | ✅ Shipped |
| Phase 4 | Output Styles, SLM-tier Ultra, adaptive context-budget, eval harness | ✅ Shipped |

---

## Acknowledgments

Standard mode compression rules are inspired by **[Caveman](https://github.com/JuliusBrussee/caveman)** by **[JuliusBrussee](https://github.com/JuliusBrussee)** (⭐ 51K+) — the viral "why use many token when few token do trick" project. Caveman reports `~75%` fewer output tokens, `65%` benchmark average output savings, a `22-87%` output range, and a `~46%` input-compression tool.

RTK mode is inspired by **[RTK - Rust Token Killer](https://github.com/rtk-ai/rtk)** by **[RTK AI](https://github.com/rtk-ai)** — the high-performance command-output compression project for terminal, build, test, git, and tool-output filtering. RTK reports `60-90%` savings, with its README sample session showing `~80%` saved.

---

## Advanced Compression Systems

Beyond the 7 standard modes, OmniRoute includes several advanced compression systems that work automatically based on context.

### Cache-Aware Compression

Some providers (such as Anthropic) support **prompt caching**, which lets them cache parts of the prompt to reduce costs and latency. When caching is enabled, aggressive compression can actually **hurt** performance because it changes the cached tokens, invalidating the cache.
The `cachingAware.ts` module solves this by **detecting caching context** and **adjusting the compression strategy** accordingly.

#### How it works

1. **Detect caching context** — Scans the request body for `cache_control` markers
2. **Identify caching providers** — Checks if the target provider supports caching
3. **Adjust strategy** — Downgrades `aggressive`/`ultra` to `standard` for caching providers
4. **Skip system prompt** — System prompts are usually cached, so don't compress them
5. **Use deterministic transformations** — Only use transformations that produce consistent output

#### Code example

```ts
import {
  detectCachingContext,
  getCacheAwareStrategy,
} from "@omniroute/open-sse/services/compression/cachingAware";

const body = {
  model: "anthropic/claude-sonnet-4.5",
  messages: [{ role: "user", content: "Hello" }],
  cache_control: { type: "ephemeral" }, // ← Cache marker
};

const ctx = detectCachingContext(body, { provider: "anthropic" });
// → { hasCacheControl: true, provider: "anthropic", isCachingProvider: true }

const strategy = getCacheAwareStrategy("aggressive", ctx);
// → { strategy: "standard", skipSystemPrompt: true, deterministicOnly: true }
```

#### When to use

Cache-aware compression is **always on** — no configuration needed. It only kicks in when:

- The request has `cache_control` markers
- The target provider supports prompt caching (Anthropic, OpenAI, etc.)

### Progressive Aging

Long conversations accumulate many message turns, but older turns become less relevant.
The `progressiveAging.ts` module **degrades messages by turn distance** from the most recent turn:

- **Recent turns (0-3)**: Kept verbatim (full detail)
- **Medium turns (4-8)**: Lite compression (whitespace, formatting cleanup)
- **Old turns (9-20)**: Caveman compression (filler removal, summarization)
- **Very old turns (21+)**: Heavily summarized or dropped

#### Code example

```ts
import { applyAging } from "@omniroute/open-sse/services/compression/progressiveAging";

const messages = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "What is 2+2?" },
  { role: "assistant", content: "4" },
  // ... 50 more turns ...
];

const { messages: aged, saved } = applyAging(messages, {
  verbatim: 3, // Turn distance 0-3: kept verbatim
  light: 8, // Turn distance 4-8: lite compression
  moderate: 20, // Turn distance 9-20: caveman compression
  // Turn distance 21+: heavy summarization
});
// saved = number of tokens saved
```

#### When to use

Progressive aging is **always on** for `aggressive` and `ultra` modes. It's particularly effective for:

- Long-running coding sessions
- Multi-day conversations
- Agentic workflows with many tool calls

### Caveman Output Mode

The `outputMode.ts` module injects **system prompt instructions** to make the model itself produce compressed, terse output (a "caveman" style).

#### How it works

Instead of compressing the input, this mode adds a system prompt like:

> "Reply in minimal words. Skip pleasantries. Use short sentences."

This works particularly well for:

- Code generation (terser output = fewer tokens)
- Quick Q&A (no need for elaborate explanations)
- Batch processing (maximize throughput)

#### When to use

Caveman output mode is **opt-in** — set it via the combo config:

```json
{
  "strategy": "auto",
  "config": {
    "auto": {
      "outputMode": "caveman"
    }
  }
}
```

### Tool Result Compression

The `toolResultCompressor.ts` module provides **5 specialized compression strategies** for tool results (function calls, agent outputs, search results, etc.):

1. **Search result compression** — Removes redundant results, keeps top-N
2. **File read compression** — Truncates large files, preserves headers/imports
3. **Code execution compression** — Keeps only essential stdout/stderr
4. **Database query compression** — Limits rows, removes verbose metadata
5. **API response compression** — Strips null fields, condenses arrays

#### When to use

Tool result compression is **always on** when tool calls are present. No configuration needed.

### Stacked Pipeline

The stacked mode runs **multiple engines in sequence**, usually RTK first (60-90% savings on tool output), then Caveman on the remaining text. The **78-95%** eligible range quoted in this guide is derived with Caveman's `~46%` input-compression figure (see Upstream Savings Math). The example below uses the more conservative ~30% Standard-mode figure for the Caveman step, which is why it lands at 86%.

#### How it works

```
Input (1000 tokens)
  → RTK (command-aware filter) → 200 tokens
  → Caveman (filler removal) → 140 tokens
  → Output (140 tokens, 86% savings)
```

#### When to use

Use stacked mode for:

- Tool-heavy workflows (agentic coding, research)
- Cost-sensitive batch processing
- When you need maximum token savings

Configure via combo:

```json
{
  "strategy": "auto",
  "config": {
    "auto": {
      "modePack": "stacked"
    }
  }
}
```

---

## Compression Combo Overrides

You can override the global compression mode **per combo** to fine-tune behavior for different use cases:

```json
{
  "id": "coding-combo",
  "strategy": "priority",
  "config": {
    "auto": {
      "weights": { "taskFit": 0.5 },
      "modePack": "quality-first"
    }
  },
  "compressionOverride": {
    "mode": "aggressive",
    "stackedPipelines": ["rtk", "caveman"],
    "preserveToolDefinitions": true
  }
}
```

This is useful for:

- **Coding combos**: Use `aggressive` mode for long sessions
- **Quick Q&A combos**: Use `lite` mode for fast responses
- **Tool-heavy combos**: Use `stacked` mode for max savings
- **Production combos**: Use `cache-aware` mode for caching providers

---

## See Also

- [Environment Config](../reference/ENVIRONMENT.md) — Compression environment variables
- [Architecture Guide](../architecture/ARCHITECTURE.md) — Compression
pipeline internals
- [User Guide](../guides/USER_GUIDE.md) — Getting started with compression
- [RTK Compression](./RTK_COMPRESSION.md) — RTK filters, trust model, verify gate, raw-output recovery
- [Compression Engines](./COMPRESSION_ENGINES.md) — Caveman, RTK, stacked, APIs, MCP, dashboard
- [Compression Rules Format](./COMPRESSION_RULES_FORMAT.md) — JSON rule-pack format
- [Compression Language Packs](./COMPRESSION_LANGUAGE_PACKS.md) — Language-specific Caveman rules