---
name: prompt-caching
description: Provider-native prompt caching for Claude and OpenAI. Use when optimizing LLM costs with cache breakpoints, caching system prompts, or reducing token costs for repeated prefixes.
tags: [llm, caching, cost-optimization, anthropic]
context: fork
agent: llm-integrator
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# Prompt Caching

Cache LLM prompt prefixes for 90% token savings.

## Supported Models (2026)

| Provider | Models |
|----------|--------|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-4o, gpt-4o-mini, o1, o1-mini (automatic caching) |

## Claude Prompt Caching

```python
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
    1. System prompt (cached)
    2. Few-shot examples (cached)
    ─────── CACHE BREAKPOINT ───────
    3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"

    content_parts = []

    # Breakpoint 1: System prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl}
    })

    # Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl}
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content
    })

    return [{"role": "user", "content": content_parts}]
```
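A usage sketch for the helper above, assuming the official `anthropic` Python SDK. The model id, `max_tokens`, and the prompt variables (`LONG_SYSTEM_PROMPT`, `FEW_SHOT_EXAMPLES`, `user_input`) are placeholders, not values prescribed by this skill:

```python
# Illustrative usage sketch: model id, max_tokens, and prompt variables
# are placeholders -- substitute your own values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = build_cached_messages(
    system_prompt=LONG_SYSTEM_PROMPT,     # stable prefix, ideally >= 1,024 tokens
    few_shot_examples=FEW_SHOT_EXAMPLES,  # stable, sits behind the second breakpoint
    user_content=user_input,              # variable, never cached
    use_extended_cache=False,             # 5m TTL
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",     # placeholder model id
    max_tokens=1024,
    messages=messages,
)
print(response.usage.cache_read_input_tokens)  # > 0 on a cache hit
```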
## Cache Pricing (2026)

```
┌─────────────────────────────────────────────────────────────┐
│ Cache Cost Multipliers (relative to base input price)        │
├─────────────────────────────────────────────────────────────┤
│ 5-minute cache write:  1.25x base input price                │
│ 1-hour cache write:    2.00x base input price                │
│ Cache read:            0.10x base input price (90% off!)     │
└─────────────────────────────────────────────────────────────┘

Example: Claude Sonnet 4 @ $3/MTok input

Without Prompt Caching:
  System prompt:      2,000 tokens @ $3/MTok = $0.006
  Few-shot examples:  5,000 tokens @ $3/MTok = $0.015
  User content:      10,000 tokens @ $3/MTok = $0.030
  ───────────────────────────────────────────────────
  Total:             17,000 tokens            = $0.051

With 5m Caching (first request = cache write):
  Cached prefix:      7,000 tokens @ $3.75/MTok = $0.02625  (1.25x)
  User content:      10,000 tokens @ $3/MTok    = $0.03000
  Total first req:                              = $0.05625

With 5m Caching (subsequent = cache read):
  Cached prefix:      7,000 tokens @ $0.30/MTok = $0.0021  (0.1x)
  User content:      10,000 tokens @ $3/MTok    = $0.0300
  Total cached req:                             = $0.0321

Savings: 37% per cached request, break-even after 2 requests
```

## Extended Cache (1-hour TTL)

Use 1-hour cache when:

- Prompt reused > 10 times per hour
- System prompts are highly stable
- Token count > 10k (maximize savings)

```python
# Extended cache: 2x write cost but persists 12x longer
"cache_control": {"type": "ephemeral", "ttl": "1h"}

# Break-even vs. no caching: the 1h write adds 1.0x over the base input
# price; each read saves 0.9x, so ~2 reads recoup the premium.
# Prefer 1h over 5m when reuse gaps exceed 5 minutes, since each expired
# 5m cache costs another 1.25x write.
```

## OpenAI Automatic Caching

```python
# OpenAI caches prefixes (>= 1,024 tokens) automatically
# No cache_control markers needed
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # Cached
        {"role": "user", "content": user_content}      # Not cached
    ]
)

# Check cache usage in response
cache_tokens = response.usage.prompt_tokens_details.cached_tokens
```

## Cache Processing Order

```
The cache prefix covers the prompt in this order:
1. Tools (cached first)
2. System messages (cached second)
3. User messages (cached last)

⚠️ Extended thinking changes invalidate message caches
   (but NOT system/tools caches)
```

## Monitoring Cache Effectiveness

```python
# Track these fields in API response
response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens  # New cache
cache_read = response.usage.cache_read_input_tokens         # Cache hit
regular = response.usage.input_tokens                       # Not cached

# Calculate cache hit rate
if cache_created + cache_read > 0:
    hit_rate = cache_read / (cache_created + cache_read)
    print(f"Cache hit rate: {hit_rate:.1%}")
```
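To turn those usage counters into dollars, a small helper can apply the multipliers from the pricing table above. This is an illustrative sketch: the function name and the `base_price_per_mtok` argument are ours, not part of any SDK.

```python
# Illustrative cost estimator based on the multipliers documented above
# (1.25x / 2.00x cache write, 0.10x cache read).
def estimate_input_cost(
    usage,                         # response.usage from the Anthropic SDK
    base_price_per_mtok: float,    # e.g. 3.0 for Claude Sonnet 4 input
    ttl: str = "5m",
) -> float:
    """Estimate the input-side cost of one request, in dollars."""
    write_multiplier = 2.00 if ttl == "1h" else 1.25
    per_token = base_price_per_mtok / 1_000_000
    return (
        usage.cache_creation_input_tokens * per_token * write_multiplier
        + usage.cache_read_input_tokens * per_token * 0.10
        + usage.input_tokens * per_token
    )

# e.g. estimate_input_cost(response.usage, base_price_per_mtok=3.0)
```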
{"role": "system", "content": LONG_SYSTEM_PROMPT} ] ``` ## Key Decisions | Decision | Recommendation | |----------|----------------| | Min prefix size | 1,024 tokens (Claude) | | Breakpoint count | 2-4 per request | | Content order | Stable prefix first | | Default TTL | 5m for most cases | | Extended TTL | 1h if >10 reads/hour | ## Common Mistakes - Variable content before cached prefix - Too many breakpoints (overhead) - Prefix too short (min 1024 tokens) - Not checking `cache_read_input_tokens` - Using 1h TTL for infrequent calls (wastes 2x write) ## Related Skills - `semantic-caching` - Redis similarity caching - `cache-cost-tracking` - Cost monitoring - `llm-streaming` - Streaming with caching ## Capability Details ### anthropic-caching **Keywords:** anthropic, claude, cache_control, ephemeral **Solves:** - Use Anthropic prompt caching - Set cache breakpoints - Reduce API costs ### openai-caching **Keywords:** openai, gpt, cached_tokens, automatic **Solves:** - Use OpenAI prompt caching - Structure prompts for cache hits - Monitor cache effectiveness ### wrapper-template **Keywords:** wrapper, template, implementation, python **Solves:** - Prompt cache wrapper template - Python implementation - Drop-in caching layer