---
name: llm-caching
description: Optimize LLM costs and latency through KV caching and prompt caching. Use when (1) structuring prompts for cache hits, (2) configuring API cache_control for Anthropic/Cohere/OpenAI/Gemini, (3) setting up self-hosted inference with vLLM/SGLang/Ollama, (4) building agentic workflows with prefix reuse, (5) designing batch processing pipelines, or (6) understanding cache pricing and tradeoffs.
---

# LLM Caching

Maximize KV cache reuse to reduce costs and latency.

## Core Concept

LLMs compute Key (K) and Value (V) vectors for each token during inference. These encode the model's "understanding" of context. Caching avoids recomputing them.

```
Level 1: KV Cache (inference)   - Within one generation, reuse previous tokens' K,V
Level 2: Prompt Cache (API)     - Across requests, persist KV state server-side
Level 3: Prefix Sharing (batch) - Across users/requests, share common prefixes
```

## The Golden Rule

**Static content first, variable content last.**

```
[System prompt]       <- cacheable, same every request
[Tool definitions]    <- cacheable
[Few-shot examples]   <- cacheable (same order!)
[Reference documents] <- cacheable if stable
[User message]        <- variable, at the end
```

Cache hits require the **prefix** (beginning) to match exactly. Any difference breaks caching for everything after it.

## Prompt Structure Template

```
┌─────────────────────────────────────┐
│ 1. System instructions (static)     │ <- cache_control
├─────────────────────────────────────┤
│ 2. Tool definitions (static)        │ <- cache_control
├─────────────────────────────────────┤
│ 3. Few-shot examples (static)       │ <- cache_control
├─────────────────────────────────────┤
│ 4. Documents/context (semi-static)  │ <- cache_control if reused
├─────────────────────────────────────┤
│ 5. Conversation history (growing)   │ <- cache after N turns
├─────────────────────────────────────┤
│ 6. Current user message (variable)  │ <- no caching
└─────────────────────────────────────┘
```
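As a concrete instance of this template, here is a minimal sketch using the Anthropic Messages API via the `anthropic` Python SDK (see [references/claude.md](references/claude.md) for full options). The model id, system prompt, and `get_page` tool are illustrative placeholders, not part of this skill:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: a real system prompt must exceed the minimum cacheable
# size (e.g., 1024 tokens) or the cache_control marker is ignored.
STATIC_SYSTEM_PROMPT = "You are a web research assistant. ..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    # Static blocks of the template, marked as cache breakpoints. A breakpoint
    # caches everything from the start of the prompt up to that block.
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Tool definitions sit at the very front of the prompt prefix, so keep
    # their order stable; a marker on the last tool caches the whole list.
    tools=[
        {
            "name": "get_page",  # hypothetical tool for illustration
            "description": "Fetch a web page by URL.",
            "input_schema": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The variable user message comes last, after all cached content.
    messages=[{"role": "user", "content": "Summarize https://example.com"}],
)

# usage reports cache_creation_input_tokens on the first request and
# cache_read_input_tokens on later requests that share the same prefix.
print(response.usage)
```

On a second request with an identical tools + system prefix, `cache_read_input_tokens` should roughly equal the prefix length, which is how you verify the structure above is actually hitting the cache.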
## Anti-Patterns

| Anti-Pattern | Why It Breaks Caching |
|--------------|-----------------------|
| Variable content early | Prefix changes every request |
| Randomizing few-shot order | Different order = different prefix |
| Timestamps in system prompt | Changes every request |
| User ID in prefix | Per-user cache = no sharing |
| Prompts below the minimum threshold | Too small to cache (1024–2048 tokens for Claude, depending on model) |
| Shuffling tool definitions | Tool order is part of the prefix |

## Cost Impact

| Operation | Typical Pricing | Notes |
|-----------|-----------------|-------|
| Cache write | ~1.25x input | One-time, stores KV state |
| Cache read | ~0.1x input | 90% savings on cache hit |
| No caching | 1x input | Full recomputation every time |

**Example:** 50k-token system prompt, 100 requests, at $3/1M input tokens

- Without cache: 50k × 100 × $3/1M = $15.00
- With cache: 50k × $3.75/1M (one write) + 50k × 99 × $0.30/1M (reads) = $1.67 (**89% savings**)

A small calculator that reproduces this arithmetic appears after the cookbook table below.

## Provider References

- **Anthropic Claude** (recommended): [references/claude.md](references/claude.md)
- **Cohere**: [references/cohere.md](references/cohere.md)
- **Self-hosted (vLLM, SGLang, Ollama, HuggingFace)**: [references/self-hosted.md](references/self-hosted.md)
- **OpenAI**: [references/openai.md](references/openai.md)
- **Google Gemini**: [references/gemini.md](references/gemini.md)

## Cookbooks

Practical examples: [references/cookbooks.md](references/cookbooks.md)

| Pattern | Key Insight |
|---------|-------------|
| Web scraping agent | Same tools + system prompt, different URLs |
| RAG pipeline | Cache document chunks, vary queries |
| Multi-turn chat | Growing prefix, cache conversation history (see the sketch below) |
| Batch processing | Same prompt template, different inputs |
| Agentic tool use | Cache tool definitions + examples |
| Multi-tenant SaaS | Shared base prompt, tenant-specific suffix |
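To sanity-check the Cost Impact arithmetic, here is a small self-contained calculator. The 1.25x/0.1x multipliers and the $3/1M base rate mirror the table above; `caching_cost` is a hypothetical helper name, and you should substitute your provider's actual rates:

```python
def caching_cost(
    prompt_tokens: int,
    requests: int,
    base_per_mtok: float = 3.00,  # $ per 1M input tokens
    write_mult: float = 1.25,     # cache write premium
    read_mult: float = 0.10,      # cache read discount
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in dollars for a shared prefix."""
    per_tok = base_per_mtok / 1_000_000
    without = prompt_tokens * requests * per_tok
    # One cache write on the first request, cache reads on the remaining ones.
    with_cache = prompt_tokens * per_tok * (write_mult + read_mult * (requests - 1))
    return without, with_cache

without, with_cache = caching_cost(50_000, 100)
print(f"without: ${without:.2f}  with: ${with_cache:.2f}  "
      f"savings: {1 - with_cache / without:.0%}")
# -> without: $15.00  with: $1.67  savings: 89%
```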
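And a minimal sketch of the multi-turn chat pattern from the table, assuming the Anthropic SDK's cache_control semantics (a breakpoint caches the full prefix up to that block, with a limit of four markers per request): keep history in a stable order and move a single marker to the newest user turn, so each request re-reads the system prompt and all earlier turns from cache. `chat_turn` and the model id are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM = [{
    "type": "text",
    "text": "...long static instructions (above the minimum cacheable size)...",
    "cache_control": {"type": "ephemeral"},  # breakpoint 1: static prefix
}]
history: list[dict] = []

def chat_turn(user_text: str) -> str:
    # Keep one moving breakpoint: strip markers from older turns so the
    # request stays under the per-request cache_control limit.
    for msg in history:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    # Breakpoint 2 on the newest user block: everything before it
    # (system + all earlier turns) is read back from the cache.
    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=512,
        system=SYSTEM,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```

Each turn pays the cache-read rate for the whole growing prefix plus the write rate for the newest turn only, which is what makes long conversations cheap relative to full recomputation.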