---
name: agent:rag
description: RAG Pipeline Design - guides through chunking, embedding, vector store selection, retrieval tuning, and RAG alternatives
argument-hint: ["description or path"]
---

# RAG Pipeline Design

Guides the user through designing a Retrieval-Augmented Generation (RAG) pipeline. Based on "Principles of Building AI Agents" (Bhagwat & Gienow, 2025), Part V: RAG (Chapters 17-20).

## When to use

Use this skill when the user needs to:

- Design a RAG pipeline for an agent
- Choose a vector database
- Configure chunking, embedding, and retrieval
- Evaluate whether RAG is even needed (vs. alternatives)
- Tune an existing RAG pipeline for better quality

## Instructions

### Step 1: Do You Actually Need RAG?

Before building a pipeline, apply the principle: **Start simple, check quality, get complex.**

Use `AskUserQuestion` to assess:

```markdown
## RAG Decision Tree

### Step 1: How large is your corpus?
- **< 200 pages** → Try full context loading first (Gemini 2M, Claude 200K)
- **200-10,000 pages** → Consider agentic RAG (tools that query data) OR traditional RAG
- **> 10,000 pages** → Traditional RAG pipeline is likely needed

### Step 2: What is the query pattern?
- **Factual lookup** ("What is X?") → RAG works well
- **Analytical** ("Compare X and Y across documents") → Agentic RAG may be better
- **Conversational** ("Tell me about...") → Either works

### Step 3: How structured is the data?
- **Highly structured** (tables, databases) → Use tools/APIs, not RAG
- **Semi-structured** (markdown, HTML) → RAG with format-specific chunking
- **Unstructured** (PDFs, free text) → Traditional RAG
```

**Recommended progression:**

1. First, load the entire corpus into a large context window
2. Second, write functions that query the dataset, and give them to the agent as tools
3. Only if 1 and 2 fail on quality, build a RAG pipeline

If the user decides RAG is needed, proceed. Otherwise, recommend the simpler alternative.
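The decision tree above can be captured as a rough triage function. This is a sketch: the page thresholds mirror the tree but are illustrative heuristics, not hard limits, and `structured` collapses the "highly structured data" branch into a single flag:

```python
def recommend_approach(corpus_pages: int, structured: bool = False) -> str:
    """Rough triage mirroring the RAG decision tree (thresholds are illustrative)."""
    if structured:
        return "tools"           # highly structured data: query via tools/APIs, not RAG
    if corpus_pages < 200:
        return "full-context"    # load the whole corpus into a large context window
    if corpus_pages <= 10_000:
        return "agentic-rag"     # expose query functions to the agent as tools
    return "rag-pipeline"        # corpus too large: build a traditional RAG pipeline
```

For example, `recommend_approach(150)` suggests trying full-context loading before any pipeline work.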
### Step 2: Chunking Strategy

Design how documents are split into retrievable pieces:

```markdown
## Chunking Strategy

### Method

| Strategy | Best For | Description |
|----------|----------|-------------|
| Recursive | General text | Splits by paragraph, then sentence, then character |
| Token-aware | LLM optimization | Splits by token count, respects model limits |
| Format-specific | Markdown/HTML/JSON | Uses document structure (headers, tags, keys) |
| Semantic | High quality needs | Uses LLM to identify natural topic boundaries |

**Selected:** [Strategy]

### Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Chunk size | [256-1024 tokens] | Balance: smaller = more precise, larger = more context |
| Overlap | [50-200 tokens] | Prevents losing context at chunk boundaries |
| Metadata | [title, source, date, section, page] | Enables filtered retrieval |

### Document-Specific Rules

| Document Type | Chunking Rule |
|---------------|---------------|
| [Markdown docs] | Split on ## headers, keep header as metadata |
| [PDFs] | Page-based with overlap, extract title/section |
| [Code files] | Function/class-level chunks |
| [Chat logs] | Message groups of [N] turns |
```

### Step 3: Embedding Configuration

Choose how chunks become vectors:

```markdown
## Embedding

### Model Selection

| Model | Dimensions | Quality | Cost | Speed |
|-------|------------|---------|------|-------|
| OpenAI text-embedding-3-large | 3072 | High | $0.13/M tokens | Fast |
| OpenAI text-embedding-3-small | 1536 | Good | $0.02/M tokens | Fast |
| Voyage voyage-3 | 1024 | High | $0.06/M tokens | Fast |
| Cohere embed-v3 | 1024 | High | $0.10/M tokens | Fast |
| Local (e5-large, BGE) | 1024 | Good | Free (compute) | Varies |

**Selected:** [Model]

### Indexing

| Parameter | Value |
|-----------|-------|
| Dimensions | [From model] |
| Similarity metric | Cosine (most common) |
| Index type | HNSW (default, good balance of speed/accuracy) |
```

### Step 4: Vector Database Selection

Apply the principle: **Prevent infra sprawl — vector DB choice is mostly commoditized.**

Use `AskUserQuestion`:

```markdown
## Vector Database

### Decision Matrix

| Option | When to Choose | Pros | Cons |
|--------|----------------|------|------|
| **pgvector** (Postgres extension) | Already using Postgres | No new infra, familiar SQL, metadata filtering | May need tuning at scale |
| **Pinecone** (managed) | New project, want simplicity | Fully managed, fast, scalable | Additional service + cost |
| **Chroma** (open-source) | Local dev, small scale | Free, easy setup | Self-host in production |
| **Cloud-native** (Cloudflare, DataStax) | Already on that cloud | Integrated billing, low latency | Vendor lock-in |

**Selected:** [Database]
**Rationale:** [Why]
```

### Step 5: Retrieval Configuration

Design how the agent queries the vector store:

```markdown
## Retrieval

### Query Strategy

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| topK | [3-10] | Number of chunks to retrieve |
| similarityThreshold | [0.7-0.9] | Min relevance to include |
| reranking | [Yes/No] | Post-retrieval quality boost |

### Hybrid Queries

Combine vector similarity with metadata filters:

| Filter | Type | Example |
|--------|------|---------|
| Date range | Metadata | Only docs from last 30 days |
| Category | Metadata | Only "technical" documents |
| Source | Metadata | Only from "docs.example.com" |
| User access | Metadata | Only docs user has permission to see |

### Reranking (Optional)

- **When to use:** Quality matters more than latency
- **How:** Retrieve topK * 3 candidates, rerank with a cross-encoder, return topK
- **Models:** Cohere Rerank, bge-reranker, cross-encoder/ms-marco
- **Cost:** More expensive per query, but runs only on candidates (not full corpus)

### Query Transformation (Optional)

- **HyDE:** Generate a hypothetical answer, use it as the search query
- **Multi-query:** Generate multiple query variations, merge results
- **Step-back:** Abstract the query to a higher level, then search
```

### Step 6: Pipeline Architecture

Bring it all together:

```markdown
## RAG Pipeline

### Ingestion Pipeline

1. **Load** documents from [source]
2. **Chunk** using [strategy] with [size] tokens, [overlap] overlap
3. **Enrich** metadata: source, date, category, section
4. **Embed** using [model]
5. **Upsert** into [vector DB]
6. **Schedule:** [On change / Nightly / Manual]

### Query Pipeline

1. **Receive** user query
2. **Transform** query (optional: HyDE, multi-query)
3. **Embed** query using [same model as ingestion]
4. **Search** vector DB: topK=[N], filters=[metadata filters]
5. **Rerank** results (optional)
6. **Inject** top chunks into the LLM context
7. **Generate** response with source attribution
```

### Architecture Diagram

```mermaid
graph LR
    subgraph Ingestion
        Docs[Documents] --> Chunk[Chunker]
        Chunk --> Embed[Embedder]
        Embed --> Store[(Vector DB)]
    end
    subgraph Query
        User[User Query] --> QEmbed[Query Embedder]
        QEmbed --> Search[Similarity Search]
        Store --> Search
        Search --> Rerank[Reranker]
        Rerank --> LLM[LLM + Context]
        LLM --> Response[Response]
    end
```

### Step 7: Quality Checklist

```markdown
## RAG Quality Checklist

### Retrieval Quality
- [ ] Relevant documents consistently in top-K results
- [ ] Metadata filters working correctly
- [ ] No duplicate chunks in results
- [ ] Chunk size balances precision vs. context

### Generation Quality
- [ ] Responses are grounded in retrieved documents
- [ ] Source attribution is accurate
- [ ] Agent says "I don't know" when no relevant chunks found
- [ ] No hallucination beyond retrieved context

### Operational
- [ ] Ingestion pipeline runs on schedule
- [ ] New documents are available within [SLA]
- [ ] Vector DB latency < [target]ms
- [ ] Embedding costs within budget
```

### Step 8: Summarize and Offer Next Steps

Present all findings to the user as a structured summary in the conversation (including the pipeline diagram).
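The "relevant documents consistently in top-K results" item from the quality checklist can be spot-checked with a small hit-rate script. This is a sketch: `search` is a hypothetical stand-in for your vector-store query, and `labeled_queries` is a small set of query → expected-chunk pairs you curate by hand:

```python
def hit_rate(search, labeled_queries: dict, k: int = 5) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results.

    `search(query, top_k)` stands in for your vector-store query and must
    return a list of chunk IDs, best match first.
    """
    hits = 0
    for query, expected_id in labeled_queries.items():
        if expected_id in search(query, top_k=k):
            hits += 1
    return hits / len(labeled_queries)
```

Run it before and after changing chunk size, topK, or reranking to see whether retrieval quality actually moved.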
Do NOT write to `.specs/` — this skill works directly.

Use `AskUserQuestion` to offer:

1. **Implement pipeline** — scaffold ingestion and query code
2. **Skip RAG** — if the decision tree said RAG isn't needed, help with the alternative (full context or agentic tools)
3. **Comprehensive design** — run `agent:design` to cover all areas with a spec

## Arguments

- `$ARGUMENTS` (`$0`) - Optional description of the knowledge domain or path to existing RAG code

Examples:

- `agent:rag documentation search` — design RAG for a docs search agent
- `agent:rag src/rag/` — review and tune existing RAG pipeline
- `agent:rag` — start fresh
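If the user chooses to implement, the Step 6 ingestion/query shape can be scaffolded from a sketch like the one below. Everything here is a simplifying assumption: the chunker is character-based (a real pipeline would count tokens), the store is an in-memory list standing in for a vector DB, and `embed` is a caller-supplied stub standing in for a real embedding model:

```python
import math


def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-window chunking with overlap (character-based for simplicity)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MiniRag:
    """In-memory stand-in for the Step 6 ingestion + query pipeline."""

    def __init__(self, embed):
        self.embed = embed   # callable: str -> list[float]
        self.store = []      # rows of (vector, chunk_text, metadata)

    def ingest(self, doc: str, metadata: dict) -> None:
        """Load -> chunk -> embed -> upsert, per the ingestion pipeline."""
        for piece in chunk(doc):
            self.store.append((self.embed(piece), piece, metadata))

    def query(self, question: str, top_k: int = 3) -> list[tuple[str, dict]]:
        """Embed the query and return the top-k chunks by cosine similarity."""
        qv = self.embed(question)
        ranked = sorted(self.store, key=lambda row: cosine(qv, row[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:top_k]]
```

A real implementation swaps `embed` for an embedding API call and `self.store` for upserts/queries against the chosen vector DB, and adds the optional transform and rerank stages, but the data flow is the same.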