# Lynkr vs LiteLLM — Benchmark Report **Date:** June 5, 2026 **Setup:** Same backend providers (Ollama local, Moonshot, Azure OpenAI), 9 scenarios across 4 feature categories. --- ## TL;DR Lynkr reduces tokens sent to the model through three independent mechanisms that LiteLLM does not have. On a realistic agentic coding workload, this translates to measurably lower costs and faster responses — on the same backend. --- ## 1. Smart Tool Selection **What it does:** Classifies each request and strips irrelevant tool schemas before forwarding. A read-only question doesn't need Write, Edit, Bash, Git tools — Lynkr removes them automatically. **Test:** 14 tool definitions sent on every request (realistic Claude Code / Cursor session). | | Tokens billed | Cost | |---|---|---| | Lynkr | 959 | $0.0044 | | LiteLLM | 2,085 | $0.0091 | **Result: 53% fewer tokens, 52% cheaper — same model, same prompt.** --- ## 2. TOON Compression (JSON Tool Results) **What it does:** Binary-compresses large JSON payloads (tool results, file reads, grep outputs) before forwarding to the model. Plain text passes through unchanged. **Test:** A Bash tool returning 60 grep results as a JSON array (~3,400 tokens). | | Tokens billed | Cost | Latency | |---|---|---|---| | Lynkr | 427 | $0.009 | 12s | | LiteLLM | 3,458 | $0.018 | 12s | **Result: 87.6% compression, 50% cheaper. Same latency — compression happens in-process.** --- ## 3. Semantic Cache **What it does:** Generates embeddings of incoming prompts and returns cached responses for semantically similar queries (cosine similarity ≥ 0.85). Zero model tokens billed on a cache hit. **Test:** Two paraphrased prompts sent back-to-back ("Explain TCP vs UDP" → "What is the difference between TCP and UDP?"). | | Tokens billed | Response time | |---|---|---| | Lynkr SC1 (cold) | 2,857 | 1,891ms | | **Lynkr SC2 (cache hit)** | **2,857\*** | **171ms** | | LiteLLM SC2 | 54 | 3,282ms | \*Token count shown is what would have been billed — cache served the stored response, no model call made. Response time dropped from 1,891ms to 171ms (11× faster). --- ## 4. Tier Routing **What it does:** Scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk patterns, agentic signals) and routes to the appropriate model tier automatically — no caller changes needed. **Test:** Simple Q&A vs complex security analysis on the same endpoint. | Request | Lynkr routes to | LiteLLM routes to | |---|---|---| | "What does git stash do?" | `minimax-m2.5` (local, free) | Ollama (local) | | Security analysis of JWT vs cookies | `moonshot-v1-auto` (cloud) | **Ollama (local)** | LiteLLM's `cost-based-routing` sends everything to the cheapest available model regardless of complexity. For a banking security architecture question, that's the wrong call. --- ## Summary Table | Feature | Lynkr | LiteLLM | |---|---|---| | Smart tool selection | ✅ 53% token reduction | ❌ | | TOON JSON compression | ✅ 87.6% on large results | ❌ | | Semantic cache | ✅ 171ms cache hits | ❌ | | Automatic complexity routing | ✅ 15-dimension scorer | ❌ Cost-only routing | | Self-hosted / data stays local | ✅ | ✅ | | MCP integration | ✅ | ❌ | | Memory system | ✅ Long-term per-session | ❌ | --- ## Cost Projection (100,000 requests/month, same backend) Based on L2 (TOON) as representative of a tool-heavy agentic session: | | Monthly cost | vs LiteLLM | |---|---|---| | LiteLLM | ~$818 | baseline | | **Lynkr** | **~$409** | **~50% cheaper** | *Note: LiteLLM's headline number in raw benchmarks is lower because it routes everything to a free local model. On equal footing (same provider, same model), Lynkr is cheaper due to token optimization.* --- ## How to Compare with Portkey Portkey is a managed cloud gateway focused on observability, guardrails, and prompt management. It has no automatic complexity detection, no token compression, and no semantic cache. Conditional routing requires the caller to pass metadata tags manually. It's a different product category — operations/governance tooling vs. token optimization proxy. **Lynkr's target user:** Teams running Claude Code, Cursor, or Codex CLI who want to reduce LLM spend and route through their own infrastructure (Databricks, Azure, Bedrock) without changing any client-side code. --- --- ## Methodology ### Environment | Component | Version / Config | |---|---| | Machine | macOS, Apple Silicon (aarch64) | | Lynkr | v9.3.2, Node.js 20, `NODE_OPTIONS=--max-old-space-size=512` | | LiteLLM | v1.87.1, Python 3.12, `routing_strategy: cost-based-routing` | | Ollama | Local, `minimax-m2.5:cloud` (SIMPLE/MEDIUM tier) | | Moonshot | `moonshot-v1-auto` (COMPLEX tier) | | Azure OpenAI | `gpt-5.2-chat` (REASONING tier) | ### What was measured Each scenario sent an identical HTTP request to both proxies at `POST /v1/messages`. From each response: - **Billed tokens** — `usage.input_tokens` from the response body (what the model actually received and charged for) - **Estimated tokens** — calculated locally from the raw request payload before sending (~4 chars/token approximation) - **Tokens saved** — `max(0, estimatedInputTokens - billedInput)` - **Cost** — `(billedInput / 1,000,000) × inputPrice + (billedOutput / 1,000,000) × outputPrice` using published per-model pricing - **Latency** — wall-clock time from request send to full response received ### How compression is measured Lynkr applies optimizations **before** forwarding to the provider. The `billedInput` from the model therefore reflects post-optimization token count. The delta between estimated request tokens and billed tokens captures the saving. Important caveat: Lynkr injects a system prompt (memory context, agent instructions) that adds ~2,800 tokens of overhead not present in the raw request. This overhead is included in `billedInput` but not in `estimatedInputTokens`. As a result, scenarios without compression (S1, H1, R1) appear to show negative savings — the system prompt overhead outweighs the compression. Only tool-selection (T1, T2) and TOON (L2) show positive savings because those optimizations save more tokens than the overhead adds. **For a clean apples-to-apples measurement, compare Lynkr vs LiteLLM billed tokens on the same scenario — not estimated vs billed.** ### Scenario design | ID | What it tests | Why | |---|---|---| | S1 | Simple Q&A | Baseline — tier routing to cheapest model | | T1 | 14 tools, read request | Smart tool selection — irrelevant tools stripped | | T2 | 14 tools, write request | Smart tool selection — only write tools kept | | H1 | 8-turn conversation | History compression — older turns deduped | | L1 | JSON tool result (package.json, ~40 deps) | TOON — moderate JSON structure | | L2 | JSON tool result (60-item grep array) | TOON — large repetitive JSON array | | SC1 | Cold cache prompt | Semantic cache population | | SC2 | Paraphrased identical prompt | Semantic cache hit — measures latency and token savings | | R1 | Security analysis, step-by-step | Tier routing — should escalate to cloud model | ### Fairness notes - Both proxies used the same backend credentials and endpoints. - Requests were sent sequentially with a 400ms buffer between calls to avoid rate-limiting skew. - LiteLLM's overall cost is lower in the summary because `cost-based-routing` routes all requests to the free local Ollama model. On scenarios where both proxies used the same provider (L1, L2), Lynkr was 50% cheaper due to TOON compression. - Portkey was included as a third proxy but was excluded from results — the local gateway container was not running during the benchmark session. - The benchmark script is open and reproducible: `node benchmark-tier-routing.js` from the Lynkr repo root. *Benchmark run on macOS, Apple Silicon. Lynkr v9.3.2. LiteLLM v1.87.1.*