---
name: RAG Implementer
description: Implement retrieval-augmented generation systems. Use when building knowledge-intensive applications, document search, Q&A systems, or when you need to ground LLM responses in external data. Covers embedding strategy, vector stores, retrieval pipelines, and evaluation.
version: 1.0.0
---

# RAG Implementer

Build production-ready retrieval-augmented generation systems.

## Core Principle

**RAG = Retrieval + Context Assembly + Generation**

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.

---

## ⚠️ Prerequisites & Cost Reality Check

### STOP: Have You Validated the Need for RAG?

**Before implementing RAG, confirm:**

- [ ] **Problem validated** - Completed `product-strategist` Phase 1 (problem discovery)
- [ ] **Users need AI search** - Tested with simpler alternatives (see below)
- [ ] **ROI justified** - Calculated cost vs benefit of RAG vs alternatives

### Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

**1. FAQ Page / Documentation (1 day, $0)**
- Create well-organized FAQ or docs
- Add search with Cmd+F
- **Works for:** <50 common questions, static content
- **Test:** Do users find answers? If yes, stop here.

**2. Simple Keyword Search (2-3 days, $0-20/month)**
- Use Algolia, Typesense, or PostgreSQL full-text search
- Good enough for 80% of use cases
- **Works for:** <100k documents, keyword matching sufficient
- **Test:** Do users get relevant results? If yes, stop here.

**3. Manual Curation (Concierge MVP) (1 week, $0)**
- Manually answer user questions
- Build FAQ from common questions
- **Works for:** <100 users, validating if users want AI
- **Test:** Do users value your answers enough to pay? If yes, consider RAG.

**4. Simple Semantic Search (1 week, $30-50/month)**
- Use OpenAI embeddings + Postgres pgvector (see the sketch below)
- Skip complex retrieval, re-ranking, etc.
- **Works for:** <50k documents, basic semantic search
- **Test:** Are embeddings better than keyword search? If no, stop here.
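
If option 4 looks like the right stopping point, the sketch below shows roughly what it involves. It assumes Postgres with the pgvector extension plus the `psycopg2` and `openai` Python packages; the `docs` table, connection string, and model choice are illustrative placeholders, not prescribed by this skill.

```python
# Minimal semantic search: OpenAI embeddings + Postgres pgvector (a sketch).
# Assumed schema (illustrative):
#   CREATE EXTENSION vector;
#   CREATE TABLE docs (id serial PRIMARY KEY, content text, embedding vector(1536));
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Embed one string; text-embedding-3-small returns 1536-dimensional vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def search(query: str, top_k: int = 5):
    """Return (id, content, cosine_distance) rows for the top_k closest chunks."""
    # pgvector accepts vectors as '[x1,x2,...]' literals; <=> is cosine distance.
    vec = "[" + ",".join(str(x) for x in embed(query)) + "]"
    with psycopg2.connect("dbname=rag") as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, content, embedding <=> %s::vector AS distance "
            "FROM docs ORDER BY distance LIMIT %s",
            (vec, top_k),
        )
        return cur.fetchall()
```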

### Cost Reality Check

**Naive RAG (Prototype):**
- **Time:** 1-2 weeks
- **Cost:** $50-150/month (vector DB + embeddings + API calls)
- **When:** Prototype, <10k documents, proof of concept

**Advanced RAG (Production):**
- **Time:** 3-4 weeks
- **Cost:** $200-500/month (hybrid search, re-ranking, monitoring)
- **When:** Production, 10k-1M documents, validated demand

**Modular RAG (Enterprise):**
- **Time:** 6-8 weeks
- **Cost:** $500-2000+/month (multiple KBs, specialized modules)
- **When:** Enterprise, 1M+ documents, mission-critical

### Decision Tree: Do You Really Need RAG?

```
Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️
```

### Validation Checklist

Only proceed with RAG implementation if:

- [ ] Tested simpler alternatives (FAQ, keyword search, manual curation)
- [ ] Users confirmed they need AI-powered search (not just you think they do)
- [ ] Calculated ROI: cost of RAG < value users get
- [ ] Have >50k documents OR complex semantic search requirements
- [ ] Budget: $200-500/month for infrastructure
- [ ] Time: 3-4 weeks for production implementation

**If any checkbox is unchecked:** Go back to `product-strategist` or `mvp-builder` skills to validate first.

**See also:** `PLAYBOOKS/validation-first-development.md` for step-by-step validation process.

---

## 8-Phase RAG Implementation

### Phase 1: Knowledge Base Design

**Goal**: Create well-structured knowledge foundation

**Actions**:
- Map data sources (internal: docs, databases, APIs / external: web, feeds)
- Filter noise, select authoritative content (prevent "data dump fallacy")
- Define chunking strategy: semantic chunking based on structure
- Add metadata: tags, timestamps, source identifiers, categories

**Validation**:
- [ ] All data sources catalogued and prioritized
- [ ] Data quality assessed (accuracy, completeness, freshness)
- [ ] Chunking strategy tested with sample documents
- [ ] Metadata schema validated for search effectiveness

**Common Chunking Strategies**:
- Fixed-size: 500-1000 tokens, 50-100 token overlap (see the sketch below)
- Semantic: By paragraph, section headers, or topic boundaries
- Recursive: Split by structure (markdown headers, code blocks)
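
A minimal sketch of the fixed-size strategy, using whitespace-separated words as a stand-in for real tokenizer tokens (swap in a tokenizer such as tiktoken for production counts); the `Chunk` fields mirror the metadata listed above and are illustrative:

```python
# Fixed-size chunking with overlap (a sketch). Metadata travels with each chunk
# so it can be stored alongside the embedding and used for filtering later.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str      # document identifier
    position: int    # index of the chunk within its source document
    tags: list[str]

def chunk_document(text: str, source: str, tags: list[str],
                   chunk_size: int = 800, overlap: int = 80) -> list[Chunk]:
    """Split `text` into overlapping fixed-size chunks with metadata attached."""
    tokens = text.split()  # placeholder for real tokenizer tokens
    chunks, start, position = [], 0, 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(Chunk(" ".join(window), source, position, tags))
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens
        position += 1
    return chunks
```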

---

### Phase 2: Embedding Strategy

**Goal**: Choose optimal embedding approach for semantic understanding

**Actions**:
- Select embedding model: `text-embedding-3-large` (3072 dimensions) for general use, domain-specific models for specialized content
- Plan multi-modal needs (text, code, images, tables)
- Decide on fine-tuning: use domain data if general embeddings underperform
- Establish similarity benchmarks

**Validation**:
- [ ] Embedding model benchmarked on domain data
- [ ] Retrieval accuracy tested with known query-document pairs
- [ ] Storage and compute costs validated

**Model Selection**:
- General: OpenAI `text-embedding-3-large`, `text-embedding-3-small`
- Code: `code-search-babbage-code-001` or StarEncoder
- Multilingual: `multilingual-e5-large`

---

### Phase 3: Vector Store Architecture

**Goal**: Implement scalable vector database

**Actions**:
- Choose vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
- Configure index: HNSW for speed, IVF for scale
- Plan scalability: data growth and query volume
- Implement backup, recovery, security

**Validation**:
- [ ] Vector store benchmarked under expected load
- [ ] Index optimized for retrieval speed and accuracy
- [ ] Backup and recovery tested
- [ ] Security controls implemented

**Vector DB Decision**:
- Managed cloud → Pinecone
- Self-hosted, feature-rich → Weaviate
- Lightweight, local → Chroma
- Cost-conscious → pgvector (Postgres extension)
- High-performance → Qdrant

---

### Phase 4: Retrieval Pipeline

**Goal**: Build sophisticated retrieval beyond simple similarity search

**Actions**:
- Implement hybrid retrieval: semantic search + keyword (BM25), as sketched below
- Add query enhancement: expansion, reformulation, multi-query
- Apply contextual filtering: metadata, temporal constraints, relevance ranking
- Design for query types: factual (precision), analytical (breadth), creative (diversity)
- Handle edge cases: no relevant results found

**Advanced Techniques**:
- **Re-ranking**: Use cross-encoder after initial retrieval (e.g., `cross-encoder/ms-marco-MiniLM-L-12-v2`)
- **Query routing**: Route different query types to specialized strategies
- **Ensemble methods**: Combine multiple retrieval approaches
- **Adaptive retrieval**: Adjust top-k based on query complexity

**Validation**:
- [ ] Retrieval accuracy tested across diverse query types
- [ ] Hybrid retrieval outperforms single-method baselines
- [ ] Query latency meets requirements (<500ms ideal)
- [ ] Edge cases and fallbacks tested
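
One common way to merge the semantic and keyword rankings is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns an ordered list of document IDs over the same corpus; the IDs in the usage example are made up.

```python
# Hybrid retrieval via reciprocal rank fusion (RRF) -- a sketch.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) over every ranking it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative usage: IDs from a vector search and a BM25 search over the same corpus.
semantic_ids = ["doc_12", "doc_07", "doc_33", "doc_02"]
keyword_ids = ["doc_07", "doc_02", "doc_51", "doc_12"]
print(reciprocal_rank_fusion([semantic_ids, keyword_ids])[:3])
# -> ['doc_07', 'doc_12', 'doc_02']: documents surfaced by both retrievers float to the top
```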

---

### Phase 5: Context Assembly

**Goal**: Transform retrieved chunks into optimal LLM context

**Actions**:
- Rank and select: prioritize by relevance score, recency, source authority
- Synthesize: merge related chunks, avoid redundancy
- Compress: use LLMLingua or similar for token optimization
- Mitigate "lost in the middle": place critical info at start/end
- Adapt dynamically: adjust context based on conversation history

**Context Engineering Integration**:
- Blend RAG results with system instructions and user prompts
- Maintain conversation coherence across multi-turn interactions
- Implement context persistence for follow-up queries
- Balance context size vs. information density

**Validation**:
- [ ] Context relevance validated against human judgments
- [ ] Token optimization maintains accuracy
- [ ] Multi-turn conversations maintain coherence
- [ ] Assembly latency <200ms

---

### Phase 6: Evaluation & Metrics

**Goal**: Measure RAG system performance comprehensively

**Retrieval Quality** (see the sketch below):
- **Precision@K**: Fraction of top-K results that are relevant
- **Recall@K**: Fraction of all relevant docs that appear in the top-K results
- **MRR (Mean Reciprocal Rank)**: Average reciprocal rank of the first relevant result
- **NDCG**: Ranking quality with graded relevance

**Generation Quality**:
- **Faithfulness**: Generated content accuracy vs. sources
- **Answer Relevance**: Response relevance to query
- **Context Utilization**: How effectively the LLM uses retrieved info
- **Hallucination Rate**: Frequency of unsupported claims

**System Performance**:
- **End-to-End Latency**: Query to answer (<3 seconds target)
- **Retrieval Latency**: Time to retrieve and rank (<500ms)
- **Token Efficiency**: Information density per token
- **Cost Per Query**: Combined retrieval + generation costs

**Validation**:
- [ ] Baseline metrics established
- [ ] A/B testing framework for config comparisons
- [ ] Automated evaluation pipeline deployed
- [ ] Human evaluation protocols for ground truth
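
A minimal sketch of the retrieval-quality metrics, computed from retrieved document IDs against a labeled set of relevant IDs per query; the labeled evaluation dataset itself is assumed to exist.

```python
# Retrieval-quality metrics: Precision@K, Recall@K, and MRR (a sketch).

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k if k else 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """results holds (retrieved IDs, relevant IDs) per query; average 1/rank of the first hit."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0
```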

---

### Phase 7: Production Deployment

**Goal**: Deploy with enterprise-grade reliability and security

**Deployment**:
- Containerize with Docker/Kubernetes
- Implement load balancing across RAG instances
- Add caching for frequent queries
- Graceful degradation: fall back to the base model on component failure

**Security**:
- Role-based access controls for knowledge base
- Data masking and PII protection
- Audit logging for compliance
- Prompt injection defense

**Monitoring**:
- Real-time metrics dashboard (latency, cost, accuracy)
- Query analysis for patterns and failure modes
- Cost tracking and optimization alerts
- Performance profiling for bottlenecks

**Validation**:
- [ ] Production handles expected traffic
- [ ] Security prevents unauthorized access
- [ ] Monitoring provides actionable insights
- [ ] Incident response procedures tested

---

### Phase 8: Continuous Improvement

**Goal**: Establish processes for ongoing enhancement

**Data Pipeline**:
- Automated knowledge base updates (real-time or scheduled)
- Quality monitoring: detect data drift and degradation
- Source diversification: add new data sources
- Feedback integration: user corrections and preferences

**Model Evolution**:
- Evaluate and migrate to improved embeddings
- Fine-tune on domain data regularly
- Upgrade architecture: Naive → Advanced → Modular RAG
- Expand multi-modal support (images, audio, video)

**Optimization**:
- Analyze query patterns, optimize for common needs
- Improve cache hit rates
- Tune vector indices regularly
- Balance performance vs. costs

**Validation**:
- [ ] Automated improvement pipelines functioning
- [ ] Performance trends show improvement
- [ ] User satisfaction increasing
- [ ] System adapts to changing needs

## Key RAG Principles

### 1. Relevance Over Volume
- Quality curation > massive datasets
- Remove outdated/low-quality content continuously
- Prioritize most relevant info to prevent "lost in the middle"

### 2. Semantic Understanding
- Use embeddings for true semantic matching, not just keywords
- Recognize query intent (factual, analytical, creative)
- Adapt retrieval strategy based on context

### 3. Multi-Modal Intelligence
- Handle text, images, code, tables, structured data
- Enable cross-modal retrieval (text query → image results)
- Preserve document structure and formatting

### 4. Temporal Awareness
- Prioritize recent info for time-sensitive topics
- Maintain historical access when relevant
- Integrate real-time data feeds for dynamic domains

### 5. Transparency & Trust
- Always provide source citations
- Indicate confidence levels
- Explain why specific information was selected

## Standard RAG Response Format

```json
{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}
```

## Critical Success Rules

**Non-Negotiable**:
1. ✅ Source attribution for every response
2. ✅ Validate generated content against sources (prevent hallucination)
3. ✅ Filter sensitive data before retrieval
4. ✅ Respond within latency thresholds (<3 seconds)
5. ✅ Monitor and optimize costs continuously
6. ✅ Comply with security policies
7. ✅ Graceful degradation on failures
8. ✅ Comprehensive testing before production

**Quality Gates**:
- Before Production: >85% accuracy on evaluation dataset
- Ongoing: User satisfaction >4.0/5.0
- Performance: 95th percentile latency <5 seconds
- Reliability: 99.5% uptime
- Cost: Within 10% of budget

## Advanced Patterns

### Modular RAG Architecture
- **Search Module**: Query understanding and reformulation
- **Memory Module**: Long-term conversation persistence
- **Routing Module**: Query routing to specialized knowledge bases
- **Predict Module**: Anticipatory pre-loading based on context

### Hybrid RAG + Fine-tuning
- RAG for dynamic, frequently changing knowledge
- Fine-tuning for domain-specific reasoning patterns
- Combine strengths for maximum effectiveness

## Related Resources

**Related Skills**:
- `multi-agent-architect` - For complex RAG orchestration
- `knowledge-graph-builder` - For structured knowledge integration
- `performance-optimizer` - For RAG system optimization

**Related Patterns**:
- `META/DECISION-FRAMEWORK.md` - Vector DB and embedding selection
- `STANDARDS/architecture-patterns/rag-pattern.md` - RAG architecture details (when created)

**Related Playbooks**:
- `PLAYBOOKS/deploy-rag-system.md` - RAG deployment procedure (when created)