--- name: rag-architecture description: Retrieval-Augmented Generation (RAG) system design patterns, chunking strategies, embedding models, retrieval techniques, and context assembly. Use when designing RAG pipelines, improving retrieval quality, or building knowledge-grounded LLM applications. allowed-tools: Read, Glob, Grep --- # RAG Architecture ## When to Use This Skill Use this skill when: - Designing RAG pipelines for LLM applications - Choosing chunking and embedding strategies - Optimizing retrieval quality and relevance - Building knowledge-grounded AI systems - Implementing hybrid search (dense + sparse) - Designing multi-stage retrieval pipelines **Keywords:** RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval ## RAG Architecture Overview ```text ┌─────────────────────────────────────────────────────────────────────┐ │ RAG Pipeline │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │ │ Ingestion │ │ Indexing │ │ Vector Store │ │ │ │ Pipeline │───▶│ Pipeline │───▶│ (Embeddings) │ │ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │ │ │ │ │ │ Documents Chunks + Indexed │ │ Embeddings Vectors │ │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │ │ Query │ │ Retrieval │ │ Context Assembly │ │ │ │ Processing │───▶│ Engine │───▶│ + Generation │ │ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │ │ │ │ │ │ User Query Top-K Chunks LLM Response │ │ │ └─────────────────────────────────────────────────────────────────────┘ ``` ## Document Ingestion Pipeline ### Document Processing Steps ```text Raw Documents │ ▼ ┌─────────────┐ │ Extract │ ← PDF, HTML, DOCX, Markdown │ Content │ └─────────────┘ │ ▼ ┌─────────────┐ │ Clean & │ ← Remove boilerplate, normalize │ Normalize │ └─────────────┘ │ ▼ ┌─────────────┐ │ Chunk │ ← Split into retrievable units │ Documents │ └─────────────┘ │ ▼ ┌─────────────┐ │ Generate │ ← Create vector representations │ Embeddings │ └─────────────┘ │ ▼ ┌─────────────┐ │ Store │ ← Persist vectors + metadata │ in Index │ └─────────────┘ ``` ## Chunking Strategies ### Strategy Comparison | Strategy | Description | Best For | Chunk Size | | -------- | ----------- | -------- | ---------- | | **Fixed-size** | Split by token/character count | Simple documents | 256-512 tokens | | **Sentence-based** | Split at sentence boundaries | Narrative text | Variable | | **Paragraph-based** | Split at paragraph boundaries | Structured docs | Variable | | **Semantic** | Split by topic/meaning | Long documents | Variable | | **Recursive** | Hierarchical splitting | Mixed content | Configurable | | **Document-specific** | Custom per doc type | Specialized (code, tables) | Variable | ### Chunking Decision Tree ```text What type of content? ├── Code │ └── AST-based or function-level chunking ├── Tables/Structured │ └── Keep tables intact, chunk surrounding text ├── Long narrative │ └── Semantic or recursive chunking ├── Short documents (<1 page) │ └── Whole document as chunk └── Mixed content └── Recursive with type-specific handlers ``` ### Chunk Overlap ```text Without Overlap: [Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"] ↑ Information lost at boundary With Overlap (20%): [Chunk 1: "The quick brown fox"] [Chunk 2: "brown fox jumps over"] ↑ Context preserved across boundaries ``` **Recommended overlap:** 10-20% of chunk size ### Chunk Size Trade-offs ```text Smaller Chunks (128-256 tokens) Larger Chunks (512-1024 tokens) ├── More precise retrieval ├── More context per chunk ├── Less context per chunk ├── May include irrelevant content ├── More chunks to search ├── Fewer chunks to search ├── Better for factoid Q&A ├── Better for summarization └── Higher retrieval recall └── Higher retrieval precision ``` ## Embedding Models ### Model Comparison | Model | Dimensions | Context | Strengths | | ----- | ---------- | ------- | --------- | | **OpenAI text-embedding-3-large** | 3072 | 8K | High quality, expensive | | **OpenAI text-embedding-3-small** | 1536 | 8K | Good quality/cost ratio | | **Cohere embed-v3** | 1024 | 512 | Multilingual, fast | | **BGE-large** | 1024 | 512 | Open source, competitive | | **E5-large-v2** | 1024 | 512 | Open source, instruction-tuned | | **GTE-large** | 1024 | 512 | Alibaba, good for Chinese | | **Sentence-BERT** | 768 | 512 | Classic, well-understood | ### Embedding Selection ```text Need best quality, cost OK? ├── Yes → OpenAI text-embedding-3-large └── No └── Need self-hosted/open source? ├── Yes → BGE-large or E5-large-v2 └── No └── Need multilingual? ├── Yes → Cohere embed-v3 └── No → OpenAI text-embedding-3-small ``` ### Embedding Optimization | Technique | Description | When to Use | | --------- | ----------- | ----------- | | **Matryoshka embeddings** | Truncatable to smaller dims | Memory-constrained | | **Quantized embeddings** | INT8/binary embeddings | Large-scale search | | **Instruction-tuned** | Prefix with task instruction | Specialized retrieval | | **Fine-tuned embeddings** | Domain-specific training | Specialized domains | ## Retrieval Strategies ### Dense Retrieval (Semantic Search) ```text Query: "How to deploy containers" │ ▼ ┌─────────┐ │ Embed │ │ Query │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ Vector Similarity Search │ │ (Cosine, Dot Product, L2) │ └─────────────────────────────────┘ │ ▼ Top-K semantically similar chunks ``` ### Sparse Retrieval (BM25/TF-IDF) ```text Query: "Kubernetes pod deployment YAML" │ ▼ ┌─────────┐ │Tokenize │ │ + Score │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ BM25 Ranking │ │ (Term frequency × IDF) │ └─────────────────────────────────┘ │ ▼ Top-K lexically matching chunks ``` ### Hybrid Search (Best of Both) ```text Query ──┬──▶ Dense Search ──┬──▶ Fusion ──▶ Final Ranking │ │ │ └──▶ Sparse Search ─┘ │ │ Fusion Methods: ▼ • RRF (Reciprocal Rank Fusion) • Linear combination • Learned reranking ``` ### Reciprocal Rank Fusion (RRF) ```text RRF Score = Σ 1 / (k + rank_i) Where: - k = constant (typically 60) - rank_i = rank in each retrieval result Example: Doc A: Dense rank=1, Sparse rank=5 RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318 Doc B: Dense rank=3, Sparse rank=1 RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323 Result: Doc B ranks higher (better combined relevance) ``` ## Multi-Stage Retrieval ### Two-Stage Pipeline ```text ┌─────────────────────────────────────────────────────────┐ │ Stage 1: Recall (Fast, High Recall) │ │ • ANN search (HNSW, IVF) │ │ • Retrieve top-100 candidates │ │ • Latency: 10-50ms │ └─────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────┐ │ Stage 2: Rerank (Slow, High Precision) │ │ • Cross-encoder or LLM reranking │ │ • Score top-100 → return top-10 │ │ • Latency: 100-500ms │ └─────────────────────────────────────────────────────────┘ ``` ### Reranking Options | Reranker | Latency | Quality | Cost | | -------- | ------- | ------- | ---- | | **Cross-encoder (local)** | Medium | High | Compute | | **Cohere Rerank** | Fast | High | API cost | | **LLM-based rerank** | Slow | Highest | High API cost | | **BGE-reranker** | Fast | Good | Compute | ## Context Assembly ### Context Window Management ```text Context Budget: 128K tokens ├── System prompt: 500 tokens (fixed) ├── Conversation history: 4K tokens (sliding window) ├── Retrieved context: 8K tokens (dynamic) └── Generation buffer: ~115K tokens (available) Strategy: Maximize retrieved context quality within budget ``` ### Context Assembly Strategies | Strategy | Description | When to Use | | -------- | ----------- | ----------- | | **Simple concatenation** | Join top-K chunks | Small context, simple Q&A | | **Relevance-ordered** | Most relevant first | General retrieval | | **Chronological** | Time-ordered | Temporal queries | | **Hierarchical** | Summary + details | Long-form generation | | **Interleaved** | Mix sources | Multi-source queries | ### Lost-in-the-Middle Problem ```text LLM Attention Pattern: ┌─────────────────────────────────────────────────────────┐ │ Beginning Middle End │ │ ████ ░░░░ ████ │ │ High attention Low attention High attention │ └─────────────────────────────────────────────────────────┘ Mitigation: 1. Put most relevant at beginning AND end 2. Use shorter context windows when possible 3. Use hierarchical summarization 4. Fine-tune for long-context attention ``` ## Advanced RAG Patterns ### Query Transformation ```text Original Query: "Tell me about the project" │ ┌─────────────────┼─────────────────┐ ▼ ▼ ▼ ┌─────────┐ ┌──────────┐ ┌──────────┐ │ HyDE │ │ Query │ │ Sub-query│ │ (Hypo │ │ Expansion│ │ Decomp. │ │ Doc) │ │ │ │ │ └─────────┘ └──────────┘ └──────────┘ │ │ │ ▼ ▼ ▼ Hypothetical "project, "What is the answer to goals, project scope?" embed timeline, "What are the deliverables" deliverables?" ``` ### HyDE (Hypothetical Document Embeddings) ```text Query: "How does photosynthesis work?" │ ▼ ┌───────────────┐ │ LLM generates │ │ hypothetical │ │ answer │ └───────────────┘ │ ▼ "Photosynthesis is the process by which plants convert sunlight into energy..." │ ▼ ┌───────────────┐ │ Embed hypo │ │ document │ └───────────────┘ │ ▼ Search with hypothetical embedding (Better matches actual documents) ``` ### Self-RAG (Retrieval-Augmented LM with Self-Reflection) ```text ┌─────────────────────────────────────────────────────────┐ │ 1. Generate initial response │ │ 2. Decide: Need more retrieval? (critique token) │ │ ├── Yes → Retrieve more, regenerate │ │ └── No → Check factuality (isRel, isSup tokens) │ │ 3. Verify claims against sources │ │ 4. Regenerate if needed │ │ 5. Return verified response │ └─────────────────────────────────────────────────────────┘ ``` ### Agentic RAG ```text Query: "Compare Q3 revenue across regions" │ ▼ ┌───────────────┐ │ Query Agent │ │ (Plan steps) │ └───────────────┘ │ ┌───────────┼───────────┐ ▼ ▼ ▼ ┌───────┐ ┌───────┐ ┌───────┐ │Search │ │Search │ │Search │ │ EMEA │ │ APAC │ │ AMER │ │ docs │ │ docs │ │ docs │ └───────┘ └───────┘ └───────┘ │ │ │ └───────────┼───────────┘ ▼ ┌───────────────┐ │ Synthesize │ │ Comparison │ └───────────────┘ ``` ## Evaluation Metrics ### Retrieval Metrics | Metric | Description | Target | | ------ | ----------- | ------ | | **Recall@K** | % relevant docs in top-K | >80% | | **Precision@K** | % of top-K that are relevant | >60% | | **MRR (Mean Reciprocal Rank)** | 1/rank of first relevant | >0.5 | | **NDCG** | Graded relevance ranking | >0.7 | ### End-to-End Metrics | Metric | Description | Target | | ------ | ----------- | ------ | | **Answer correctness** | Is the answer factually correct? | >90% | | **Faithfulness** | Is the answer grounded in context? | >95% | | **Answer relevance** | Does it answer the question? | >90% | | **Context relevance** | Is retrieved context relevant? | >80% | ### Evaluation Framework ```text ┌─────────────────────────────────────────────────────────┐ │ RAG Evaluation Pipeline │ ├─────────────────────────────────────────────────────────┤ │ 1. Query Set: Representative questions │ │ 2. Ground Truth: Expected answers + source docs │ │ 3. Metrics: │ │ • Retrieval: Recall@K, MRR, NDCG │ │ • Generation: Correctness, Faithfulness │ │ 4. A/B Testing: Compare configurations │ │ 5. Error Analysis: Identify failure patterns │ └─────────────────────────────────────────────────────────┘ ``` ## Common Failure Modes | Failure Mode | Cause | Mitigation | | ------------ | ----- | ---------- | | **Retrieval miss** | Query-doc mismatch | Hybrid search, query expansion | | **Wrong chunk** | Poor chunking | Better segmentation, overlap | | **Hallucination** | Poor grounding | Faithfulness training, citations | | **Lost context** | Long-context issues | Hierarchical, summarization | | **Stale data** | Outdated index | Incremental updates, TTL | ## Scaling Considerations ### Index Scaling | Scale | Approach | | ----- | -------- | | <1M docs | Single node, exact search | | 1-10M docs | Single node, HNSW | | 10-100M docs | Distributed, sharded | | >100M docs | Distributed + aggressive filtering | ### Latency Budget ```text Typical RAG Pipeline Latency: Query embedding: 10-50ms Vector search: 20-100ms Reranking: 100-300ms LLM generation: 500-2000ms ──────────────────────────── Total: 630-2450ms Target p95: <3 seconds for interactive use ``` ## Related Skills - `llm-serving-patterns` - LLM inference infrastructure - `vector-databases` - Vector store selection and optimization - `ml-system-design` - End-to-end ML pipeline design - `estimation-techniques` - Capacity planning for RAG systems ## Version History - v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design --- ## Last Updated **Date:** 2025-12-26