---
name: rag-retrieval
description: Retrieval-Augmented Generation patterns for grounded LLM responses. Use when building RAG pipelines, constructing context from retrieved documents, adding citations, or implementing hybrid search.
tags: [rag, retrieval, llm, context, grounding]
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# RAG Retrieval

Combine vector search with LLM generation for accurate, grounded responses.

## Basic RAG Pattern

```python
async def rag_query(question: str, top_k: int = 5) -> str:
    """Basic RAG: retrieve then generate."""
    # 1. Retrieve relevant documents
    docs = await vector_db.search(question, limit=top_k)

    # 2. Construct context
    context = "\n\n".join([
        f"[{i+1}] {doc.text}"
        for i, doc in enumerate(docs)
    ])

    # 3. Generate with context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If not in context, say 'I don't have that information.'"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
```

## RAG with Citations

```python
async def rag_with_citations(question: str) -> dict:
    """RAG with inline citations [1], [2], etc."""
    docs = await vector_db.search(question, limit=5)

    context = "\n\n".join([
        f"[{i+1}] {doc.text}\nSource: {doc.metadata['source']}"
        for i, doc in enumerate(docs)
    ])

    response = await llm.chat([
        {"role": "system", "content":
            "Answer with inline citations like [1], [2]. "
            "End with a Sources section."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return {
        "answer": response.content,
        "sources": [doc.metadata['source'] for doc in docs]
    }
```

## Hybrid Search (Semantic + Keyword)

```python
def reciprocal_rank_fusion(
    semantic_results: list,
    keyword_results: list,
    k: int = 60
) -> list:
    """Combine semantic and keyword search with RRF."""
    scores = {}

    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)

    # Sort by combined score
    ranked_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_doc(id) for id in ranked_ids]
```

## Context Window Management

```python
def fit_context(docs: list, max_tokens: int = 6000) -> list:
    """Truncate context to fit token budget."""
    total_tokens = 0
    selected = []

    for doc in docs:
        doc_tokens = count_tokens(doc.text)
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens

    return selected
```

**Guidelines:**
- Keep context under 75% of model limit
- Reserve tokens for system prompt + response
- Prioritize highest-relevance documents

## Context Sufficiency Check (2026 Best Practice)

```python
from pydantic import BaseModel

class SufficiencyCheck(BaseModel):
    """Pre-generation context validation."""
    is_sufficient: bool
    confidence: float  # 0.0-1.0
    missing_info: str | None = None

async def rag_with_sufficiency(question: str, top_k: int = 5) -> str:
    """RAG with hallucination prevention via sufficiency check.

    Based on Google Research ICLR 2025: Adding a sufficiency check
    before generation reduces hallucinations from insufficient context.
    """
    docs = await vector_db.search(question, limit=top_k)
    context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(docs)])

    # Pre-generation sufficiency check (prevents hallucination)
    check = await llm.with_structured_output(SufficiencyCheck).ainvoke(
        f"""Does this context contain sufficient information to answer the question?

Question: {question}

Context:
{context}

Evaluate:
- is_sufficient: Can the question be fully answered from context?
- confidence: How confident are you? (0.0-1.0)
- missing_info: What's missing if not sufficient?"""
    )

    # Abstain if context insufficient (high-confidence)
    if not check.is_sufficient and check.confidence > 0.7:
        return f"I don't have enough information to answer this question. Missing: {check.missing_info}"

    # Low confidence → retrieve more context
    if not check.is_sufficient and check.confidence <= 0.7:
        more_docs = await vector_db.search(question, limit=top_k * 2)
        context = "\n\n".join([f"[{i+1}] {doc.text}" for i, doc in enumerate(more_docs)])

    # Generate only with sufficient context
    response = await llm.chat([
        {"role": "system", "content":
            "Answer using ONLY the provided context. "
            "If information is missing, say so rather than guessing."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])

    return response.content
```

**Why this matters (Google Research 2025):**
- RAG paradoxically increases hallucinations when context is insufficient
- Additional context increases model confidence → more likely to hallucinate
- Sufficiency check allows abstention when information is missing

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Top-k | 3-10 documents |
| Temperature | 0.1-0.3 (factual) |
| Context budget | 4K-8K tokens |
| Hybrid ratio | 50/50 semantic/keyword |

## Common Mistakes

- No citation tracking (unverifiable answers)
- Context too large (dilutes relevance)
- Temperature too high (hallucinations)
- Single retrieval method (misses keyword matches)

## Advanced Patterns

See `references/advanced-rag.md` for:
- **HyDE Integration**: Hypothetical document embeddings for vocabulary mismatch
- **Agentic RAG**: Multi-step retrieval with tool use
- **Self-RAG**: LLM decides when to retrieve and validates outputs
- **Corrective RAG**: Evaluate retrieval quality and correct if needed
- **Pipeline Composition**: Combine HyDE + Hybrid + Rerank

## Related Skills

- `embeddings` - Creating vectors for retrieval
- `hyde-retrieval` - Hypothetical document embeddings
- `query-decomposition` - Multi-concept query handling
- `reranking-patterns` - Cross-encoder and LLM reranking
- `contextual-retrieval` - Anthropic's context-prepending technique
- `langgraph-functional` - Building agentic RAG workflows

## Capability Details

### retrieval-patterns
**Keywords:** retrieval, context, chunks, relevance
**Solves:**
- Retrieve relevant context for LLM
- Implement RAG pipeline
- Optimize retrieval quality

### hybrid-search
**Keywords:** hybrid, bm25, vector, fusion
**Solves:**
- Combine keyword and semantic search
- Implement reciprocal rank fusion
- Balance precision and recall

### chatbot-example
**Keywords:** chatbot, rag, example, typescript
**Solves:**
- Build RAG chatbot example
- TypeScript implementation
- End-to-end RAG pipeline

### pipeline-template
**Keywords:** pipeline, template, implementation, starter
**Solves:**
- RAG pipeline starter template
- Production-ready code
- Copy-paste implementation