---
name: llm-cost-optimization
description: Reduce LLM API costs without sacrificing quality. Covers prompt caching (Anthropic), local response caching, prompt compression, debouncing triggers, and cost analysis. Use when building LLM-powered features, analyzing API costs, optimizing prompts, or implementing caching strategies.
---

# LLM Cost Optimization

Practical techniques to reduce LLM API costs by 35-65%.

## Quick Reference

| Technique | Savings | When to Use | Reference |
|-----------|---------|-------------|-----------|
| Prompt Caching | 25-45% | Same system prompt, frequent calls | [caching.md](references/caching.md) |
| Response Cache | 100% | Repeated identical requests | [caching.md](references/caching.md) |
| Prompt Compression | 10-20% | Long system prompts | [prompts.md](references/prompts.md) |
| Debouncing | 50%+ | Duplicate triggers | [triggers.md](references/triggers.md) |

## The 80/20 of LLM Costs

For short user inputs, **system prompts dominate costs**:

| Text Length | Input Tokens | System Prompt % |
|-------------|--------------|-----------------|
| Short (~100 chars) | ~250 | **80-87%** |
| Medium (~500 chars) | ~450 | **44%** |
| Long (~2000 chars) | ~900 | **22%** |

**Optimization priority:**

1. Cache system prompts (biggest impact)
2. Cache identical requests (free repeats)
3. Debounce triggers (prevent waste)
4. Compress prompts (last resort)

## Cost Estimation (Claude Haiku 3.5)

| Text Length | Est. Cost |
|-------------|-----------|
| Short (~100 chars) | ~$0.0004 |
| Medium (~500 chars) | ~$0.0008 |
| Long (~2000 chars) | ~$0.002 |

**Benchmark:** 1000 translations ≈ $0.80 before optimization (arithmetic reproduced under Fuller Sketches at the end of this file)

## Implementation Checklist

### Before Building

- [ ] Add logging to every AI trigger point
- [ ] Verify triggers fire exactly once per user action
- [ ] Check for Pressed/Released event duplicates

### Caching Strategy

- [ ] Enable Anthropic Prompt Caching for system prompts
- [ ] Implement local response cache (hash-based)
- [ ] Include model name in cache key
- [ ] Set reasonable cache limits (e.g., 500 entries LRU)

### Prompt Design

- [ ] Measure current token count
- [ ] Identify critical rules (security, output format)
- [ ] Test quality after compression
- [ ] Document WHY for each rule kept

## Common Mistakes

| Mistake | Impact | Fix |
|---------|--------|-----|
| Trigger fires twice | 2x cost | Check event.state |
| No prompt caching | Full price every call | Use cache_control |
| Aggressive prompt compression | Quality drops | Keep critical rules |
| Cache key missing model | Wrong results | Include model in key |

## Quick Wins

Fuller, compilable versions of each fragment below appear under Fuller Sketches at the end of this file.

### 1. Check for Duplicate Triggers

```rust
// Before ANY optimization, verify this
log::info!("AI trigger fired: {:?}", event);
if event.state != ShortcutState::Pressed {
    return; // Ignore Released events
}
```

### 2. Enable Prompt Caching (Anthropic)

```rust
// SystemBlock / CacheControl are hand-rolled serde types
// (full sketch at the end of this file)
let system = vec![SystemBlock {
    block_type: "text".to_string(),
    text: system_prompt,
    cache_control: CacheControl { cache_type: "ephemeral".to_string() },
}];
```

### 3. Add Response Cache

```rust
// get_cached / save_to_cache wrap a model-aware cache
// (full sketch at the end of this file)

// Check cache before API call
if let Some(cached) = get_cached(&text, &model) {
    return Ok(cached); // Free!
}

// Save after API call
save_to_cache(&text, &result, &model)?;
```

## Anti-Patterns

- **TOON format for plain text** - Only helps with structured data
- **Caching without model key** - Haiku and Sonnet return different results
- **Prompt compression first** - Optimize triggers and caching before touching prompts
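## Fuller Sketches

The Quick Wins above are deliberately terse. The sketches below are self-contained, compilable versions; type names, helper functions, constants, and prices are illustrative assumptions, not any particular crate's API.

### Debounce Duplicate Triggers

A minimal time-based debounce to back up the `Pressed`/`Released` check in Quick Win 1: drop any trigger that arrives within a short window of the last accepted one. The 300 ms window and the global `Mutex` are assumptions for the sketch.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Minimum gap between accepted triggers. 300 ms is an assumption;
/// tune it to how fast a user can legitimately re-trigger the feature.
const DEBOUNCE: Duration = Duration::from_millis(300);

/// Timestamp of the last accepted trigger (None until the first one).
static LAST_FIRED: Mutex<Option<Instant>> = Mutex::new(None);

/// Returns true if this trigger should proceed to the API call.
fn should_fire() -> bool {
    let mut last = LAST_FIRED.lock().unwrap();
    let now = Instant::now();
    match *last {
        // Duplicate inside the window: drop it, saving a full API call
        Some(t) if now.duration_since(t) < DEBOUNCE => false,
        _ => {
            *last = Some(now);
            true
        }
    }
}

fn main() {
    assert!(should_fire());  // first trigger passes
    assert!(!should_fire()); // immediate duplicate is dropped
    std::thread::sleep(DEBOUNCE);
    assert!(should_fire()); // after the window, triggers pass again
}
```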
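### Prompt Caching Request Shape (Anthropic)

A sketch of the hand-rolled serde types behind Quick Win 2, serializing the `system` value into the Messages API wire format, where the block type and the cache-control type are both literally named `"type"`. `SystemBlock` and `CacheControl` are our own structs, not an official SDK.

```rust
// Cargo.toml: serde = { version = "1", features = ["derive"] }, serde_json = "1"
use serde::Serialize;

/// One block of the `system` array in the Anthropic Messages API.
/// Fields are renamed so the JSON matches the wire format.
#[derive(Serialize)]
struct SystemBlock {
    #[serde(rename = "type")]
    block_type: String, // always "text"
    text: String,
    cache_control: CacheControl,
}

#[derive(Serialize)]
struct CacheControl {
    #[serde(rename = "type")]
    cache_type: String, // "ephemeral"
}

fn main() {
    let system_prompt = "You are a translator. ...".to_string();
    let system = vec![SystemBlock {
        block_type: "text".to_string(),
        text: system_prompt,
        cache_control: CacheControl { cache_type: "ephemeral".to_string() },
    }];
    // Yields: [{"type":"text","text":"...","cache_control":{"type":"ephemeral"}}]
    println!("{}", serde_json::to_string_pretty(&system).unwrap());
}
```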
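### Hash-Based Response Cache

A std-only sketch of what Quick Win 3's hypothetical `get_cached`/`save_to_cache` helpers could wrap: key on a hash of (model, text), evict the oldest entry past 500. A real implementation might use the `lru` crate instead of a `VecDeque`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};

const MAX_ENTRIES: usize = 500; // cap from the checklist above

/// Hash-keyed response cache with crude LRU eviction.
struct ResponseCache {
    map: HashMap<u64, String>,
    order: VecDeque<u64>, // front = oldest
}

impl ResponseCache {
    fn new() -> Self {
        Self { map: HashMap::new(), order: VecDeque::new() }
    }

    /// The model name MUST be part of the key: Haiku and Sonnet
    /// return different answers for the same input.
    fn key(text: &str, model: &str) -> u64 {
        let mut h = DefaultHasher::new();
        (model, text).hash(&mut h);
        h.finish()
    }

    fn get(&mut self, text: &str, model: &str) -> Option<String> {
        let k = Self::key(text, model);
        if let Some(v) = self.map.get(&k) {
            // Refresh recency on hit
            self.order.retain(|&x| x != k);
            self.order.push_back(k);
            return Some(v.clone());
        }
        None
    }

    fn put(&mut self, text: &str, model: &str, response: String) {
        let k = Self::key(text, model);
        // Existing keys keep their old position (good enough for a sketch)
        if self.map.insert(k, response).is_none() {
            self.order.push_back(k);
        }
        while self.map.len() > MAX_ENTRIES {
            if let Some(old) = self.order.pop_front() {
                self.map.remove(&old);
            }
        }
    }
}

fn main() {
    let mut cache = ResponseCache::new();
    cache.put("Bonjour", "claude-3-5-haiku", "Hello".to_string());
    assert_eq!(cache.get("Bonjour", "claude-3-5-haiku"), Some("Hello".into()));
    assert_eq!(cache.get("Bonjour", "claude-3-5-sonnet"), None); // different model: miss
}
```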
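### Cost Arithmetic

The per-call estimates from the table above, reproduced in code. Prices are Claude 3.5 Haiku list prices in USD per million tokens as of writing (check Anthropic's current pricing page), and the ~110-token reply is an assumed value that makes the medium row come out to ~$0.0008.

```rust
/// Claude 3.5 Haiku list prices, USD per million tokens -- verify before relying on these.
const INPUT_PER_MTOK: f64 = 0.80;
const OUTPUT_PER_MTOK: f64 = 4.00;

fn estimate_cost(input_tokens: u64, output_tokens: u64) -> f64 {
    input_tokens as f64 / 1e6 * INPUT_PER_MTOK
        + output_tokens as f64 / 1e6 * OUTPUT_PER_MTOK
}

fn main() {
    // Medium input (~450 tokens) with an assumed ~110-token reply:
    // 450/1e6 * $0.80 + 110/1e6 * $4.00 = ~$0.0008 per call
    let per_call = estimate_cost(450, 110);
    println!("per call:   ${per_call:.4}");
    println!("1000 calls: ${:.2}", per_call * 1000.0); // ≈ the $0.80 benchmark
}
```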