---
name: llm-app-patterns
type: reference
description: "Provides architectural patterns for LLM-powered applications and AI assistants, including prompt engineering, RAG, agent loops, conversation management, and evaluation. Use when building AI-based features, chatbots, or complex AI system architectures."
paths: ["**/*.py", "**/*.ts", "**/openai*", "**/anthropic*", "**/langchain*", "**/chatbot*", "**/assistant*"]
effort: 3
allowed-tools: Read, Glob, Grep, Write, Edit, Bash
user-invocable: true
when_to_use: "When designing LLM applications, building AI assistants/chatbots, implementing RAG pipelines, or setting up agent architectures."
---

# LLM Application & AI Assistant Patterns

## Architecture decision matrix

| Pattern | Use when | Cost |
|---|---|---|
| Simple RAG | FAQ, docs Q&A | Low |
| Hybrid RAG (semantic + BM25) | Mixed query types | Medium |
| Function calling | Structured tool use | Low |
| ReAct agent | Multi-step reasoning | Medium |
| Plan-and-execute | Complex decomposable tasks | High |
| Multi-agent | Research, critique-refine | Very high |

## RAG: critical config numbers

```python
CHUNK_CONFIG = {
    "chunk_size": 512,       # tokens; sweet spot for most docs
    "chunk_overlap": 50,     # prevents context loss at chunk boundaries
    "separators": ["\n\n", "\n", ". ", " "],
}
# Hybrid search alpha: 1.0 = semantic only, 0.0 = BM25 only, 0.5 = balanced
```

## RAG: retrieval strategies

```python
# Basic: semantic search
results = vector_db.similarity_search(embed(query), top_k=5)

# Better: hybrid (semantic + keyword, merged via reciprocal rank fusion;
# rrf_merge is sketched below)
def hybrid_search(query, alpha=0.5):
    return rrf_merge(vector_db.search(query), bm25_search(query), alpha)

# Best for recall: multi-query (3 variations, then deduplicate the flattened hits)
queries = llm.generate_variations(query, n=3)
results = deduplicate([doc for q in queries for doc in semantic_search(q)])
```
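The `rrf_merge` helper above is referenced but never defined. Here is a minimal sketch, assuming each retriever returns a list of document IDs ordered best-first, with `k=60` as the conventional RRF smoothing constant and `alpha` weighting the two lists per the config section:

```python
# Sketch of rrf_merge (hypothetical helper, not a library function).
def rrf_merge(semantic_results, keyword_results, alpha=0.5, k=60):
    scores = {}
    # alpha weights the semantic list, (1 - alpha) the BM25 list, matching
    # the alpha convention from the config section above.
    for weight, ranked in ((alpha, semantic_results), (1 - alpha, keyword_results)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion like this sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.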
## RAG: generation prompt template

```python
RAG_PROMPT = """Answer based ONLY on the context below. If the context is insufficient, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
```

## Agent: function calling loop

```python
def run_agent(question):
    messages = [{"role": "user", "content": question}]
    while True:
        response = llm.chat(messages=messages, tools=TOOLS, tool_choice="auto")
        if not response.tool_calls:
            return response.content
        # Append the assistant turn with its tool calls BEFORE the tool
        # results, or the next request will be rejected as malformed.
        messages.append({"role": "assistant", "tool_calls": response.tool_calls})
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
```

## Production: caching (only temperature=0 responses)

```python
import hashlib
import json

def get_or_generate(prompt, model, **kwargs):
    # Only cache deterministic calls; sampled output should never be reused.
    deterministic = kwargs.get("temperature", 1.0) == 0
    if deterministic:
        raw = f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}"
        key = hashlib.sha256(raw.encode()).hexdigest()
        if cached := redis.get(key):
            return cached
    response = llm.generate(prompt, model=model, **kwargs)
    if deterministic:
        redis.setex(key, 3600, response)  # 1-hour TTL
    return response
```

## Production: retry + fallback

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5))
def call_llm(prompt):
    return llm.generate(prompt)

# Fallback chain: try the primary model, then each fallback in order
def call_with_fallback(prompt, primary, fallbacks):
    for model in [primary] + fallbacks:
        try:
            return llm.generate(prompt, model=model)
        except (RateLimitError, APIError):
            continue
    raise RuntimeError("All models in the fallback chain failed")
```

## LLMOps: key metrics

```
Latency : p50, p99 response time
Quality : satisfaction (thumbs up/down), task completion %, hallucination rate
Cost    : cost_per_request, tokens_per_request, cache_hit_rate
Health  : error_rate, timeout_rate, retry_rate
```

## Embedding model selection

| Model | Dims | Cost | Use |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Most cases |
| text-embedding-3-large | 3072 | $0.13/1M tokens | High accuracy |
| bge-large (local) | 1024 | Free | Self-hosted |
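For the two OpenAI models in the table, a minimal sketch of the `embed` helper used in the retrieval examples, assuming the official `openai` v1 Python client and an `OPENAI_API_KEY` in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    # One request embeds the whole batch; vectors come back in input order.
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```

For bge-large, a local sentence-transformers model would replace the API call; the helper's interface can stay the same.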