---
name: ai-searching-docs
description: Build AI that searches your documents and answers questions. Use when building a knowledge base, help center Q&A, chatting with documents, answering questions from a database, search-and-answer over internal docs, customer support bot, or FAQ system. Also use when embedding search loses critical context, retrieval returns irrelevant results, the right document is buried deep in search results, RAG pipeline tutorial, semantic search over documents, or vector database search quality.
---

# Build AI-Powered Document Search

Guide the user through building an AI that searches documents and answers questions accurately. Uses DSPy's RAG (retrieval-augmented generation) pattern — retrieve relevant passages, then generate an answer grounded in them.

## Step 0: Load your data

If you have documents in files, databases, or SaaS tools, use LangChain's document loaders to get them into a standard format before building your search pipeline.

### LangChain document loaders

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    WebBaseLoader,
    DirectoryLoader,
    NotionDBLoader,
    JSONLoader,
)

# PDF files
docs = PyPDFLoader("report.pdf").load()

# All text files in a directory
docs = DirectoryLoader("./docs/", glob="**/*.txt", loader_cls=TextLoader).load()

# Web pages
docs = WebBaseLoader("https://example.com/help").load()

# CSV
docs = CSVLoader("data.csv", source_column="url").load()

# JSON
docs = JSONLoader("data.json", jq_schema=".records[]", content_key="text").load()
```

### Text splitting

Split loaded documents into chunks sized for embedding and retrieval:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# Each chunk has .page_content (text) and .metadata (source info)
```

| Splitter | Best for |
|----------|----------|
| `RecursiveCharacterTextSplitter` | General-purpose (recommended default) |
| `MarkdownHeaderTextSplitter` | Markdown docs — splits by heading |
| `TokenTextSplitter` | When you need strict token budgets |

### Vector store setup with LangChain

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # or HuggingFaceEmbeddings, etc.
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Use as a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("How do refunds work?")
```

Other stores follow the same pattern: `Pinecone.from_documents(...)`, `FAISS.from_documents(...)`.

Once your data is loaded and chunked, wire it into a DSPy retriever (Step 3 below) or use ChromaDB directly. For the full LangChain/LangGraph API, see the [LangChain docs](https://python.langchain.com/docs/integrations/document_loaders/).
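If you plan to use DSPy's built-in retriever (Step 3) rather than a LangChain vector store, all you need from the loaded documents is their text. A minimal bridging sketch, assuming the `chunks` list produced by the splitter above — `corpus_texts` and `sources` are names introduced here for illustration:

```python
# Pull plain strings (and provenance) out of the LangChain Document objects.
corpus_texts = [chunk.page_content for chunk in chunks]
sources = [chunk.metadata.get("source", "unknown") for chunk in chunks]
```

`corpus_texts` is the kind of list `dspy.Embeddings(corpus=...)` expects in Step 3; keep `sources` around if you want answers to cite where each passage came from.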
## Step 1: Understand the setup

Ask the user:

1. **What documents are you searching?** (PDFs, web pages, database, help articles, etc.)
2. **What kind of questions will users ask?** (factual lookups, how-to questions, multi-step research?)
3. **Do you have a search backend already?** (Elasticsearch, Pinecone, ChromaDB, pgvector, etc.)
4. **Do questions need info from multiple documents?** (simple lookup vs. combining info)

## Step 2: Build the search-and-answer pipeline

### Basic: search then answer

```python
import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

class AnswerFromDocs(dspy.Signature):
    """Answer the question based on the given context."""
    context: list[str] = dspy.InputField(desc="Relevant passages from the knowledge base")
    question: str = dspy.InputField(desc="User's question")
    answer: str = dspy.OutputField(desc="Answer grounded in the context")

class DocSearch(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)
```

### Configure the search backend

DSPy supports multiple search backends. Set up via `dspy.configure`:

```python
# ColBERTv2 (hosted)
colbert = dspy.ColBERTv2(url="http://your-server:port/endpoint")
dspy.configure(lm=lm, rm=colbert)

# Or wrap your own search (Elasticsearch, Pinecone, pgvector, etc.)
class MySearchBackend(dspy.Retrieve):
    def forward(self, query, k=None):
        k = k or self.k
        # Your search logic here
        results = your_search_function(query, top_k=k)
        return dspy.Prediction(passages=[r["text"] for r in results])
```

## Step 3: Set up a vector store

If you do not have a search backend yet, set one up. For prototyping, use `dspy.Embeddings` (built-in, no external DB needed) or ChromaDB:

### DSPy built-in retriever (simplest option)

```python
embedder = dspy.Embedder("openai/text-embedding-3-small")  # or any supported model
retriever = dspy.Embeddings(corpus=corpus_texts, embedder=embedder, k=5)
# Uses FAISS for large corpora (>20K docs), brute-force for smaller ones
# Use retriever("query") to search — returns dspy.Prediction(passages=..., indices=...)

dspy.configure(lm=lm, rm=retriever)
```
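To see the pieces working together before adding a real vector store, you can call this retriever directly and feed its passages to the answer module. A minimal sketch, assuming `corpus_texts` from Step 0, the `AnswerFromDocs` signature from Step 2, and that the retriever returns `.passages` as described above:

```python
answer = dspy.ChainOfThought(AnswerFromDocs)

question = "How do refunds work?"
passages = retriever(question).passages                # retrieve relevant chunks
result = answer(context=passages, question=question)  # generate a grounded answer
print(result.answer)
```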
### ChromaDB setup

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")
```

### Load and chunk documents

Split documents into passages before adding them to the vector store. Sentence-based chunking works well for most use cases:

```python
import re

def chunk_text(text, max_sentences=5):
    """Split text into chunks of N sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks

# Load and chunk your documents
for doc in documents:
    chunks = chunk_text(doc["text"])
    collection.add(
        documents=chunks,
        ids=[f"{doc['id']}_chunk_{i}" for i in range(len(chunks))],
        metadatas=[{"source": doc["source"]}] * len(chunks),
    )
```

### Custom embeddings

ChromaDB uses its default embedding function, but you can swap in others:

```python
# SentenceTransformers (local, free)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# OpenAI embeddings (API, paid)
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
ef = OpenAIEmbeddingFunction(api_key="...", model_name="text-embedding-3-small")

collection = client.get_or_create_collection("my_docs", embedding_function=ef)
```

### Chunking strategies

| Strategy | How it works | Best for |
|----------|-------------|----------|
| Sentence-based | Split on sentence boundaries | Articles, docs, help pages |
| Fixed-size | Split every N characters | Long unstructured text |
| Paragraph | Split on double newlines | Well-structured documents |
| Overlap | Fixed-size with N-character overlap between chunks | When context at chunk boundaries matters |
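The sentence-based helper above covers the first row; for the fixed-size and overlap rows, a sketch along these lines works (the default sizes are starting points to tune, not recommendations from this guide):

```python
def chunk_fixed(text, chunk_size=1000, overlap=200):
    """Fixed-size character chunks, with `overlap` characters shared between neighbors."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks
```

Set `overlap=0` for plain fixed-size chunks.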
### Wire it up as a DSPy retriever

```python
class ChromaRetriever(dspy.Retrieve):
    def __init__(self, collection, k=3):
        super().__init__(k=k)
        self.collection = collection

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query(query_texts=[query], n_results=k)
        return dspy.Prediction(passages=results["documents"][0])

# Use it
retriever = ChromaRetriever(collection)
dspy.configure(lm=lm, rm=retriever)
```

## Step 3b: Connect an existing vector store

If you already have a vector store, wire it up as a DSPy retriever. Each follows the same pattern — subclass `dspy.Retrieve` and implement `forward()`:

### Provider comparison

| Store | Type | Best for | Setup |
|-------|------|----------|-------|
| ChromaDB | Embedded | Prototyping, small datasets | `pip install chromadb` |
| Pinecone | Cloud | Managed, serverless, scales to billions | `pip install pinecone` |
| Qdrant | Self-hosted or cloud | Open-source, filtering, hybrid search | `pip install qdrant-client` |
| Weaviate | Self-hosted or cloud | Multi-modal, GraphQL, hybrid search | `pip install weaviate-client` |
| pgvector | Postgres extension | Teams already using Postgres | `pip install pgvector sqlalchemy` |

### Pinecone retriever

```python
from pinecone import Pinecone

class PineconeRetriever(dspy.Retrieve):
    def __init__(self, index_name, api_key, embed_fn, k=3):
        super().__init__(k=k)
        pc = Pinecone(api_key=api_key)
        self.index = pc.Index(index_name)
        self.embed_fn = embed_fn  # function: str -> list[float]

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.index.query(vector=vector, top_k=k, include_metadata=True)
        passages = [m["metadata"]["text"] for m in results["matches"]]
        return dspy.Prediction(passages=passages)
```

### Qdrant retriever

```python
from qdrant_client import QdrantClient

class QdrantRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, embed_fn, k=3):
        super().__init__(k=k)
        self.client = QdrantClient(url=url)
        self.collection = collection_name
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.client.query_points(
            collection_name=self.collection,
            query=vector,
            limit=k,
        )
        passages = [p.payload["text"] for p in results.points]
        return dspy.Prediction(passages=passages)
```

### Weaviate retriever

```python
import weaviate

class WeaviateRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, k=3):
        super().__init__(k=k)
        self.client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud(cluster_url=url, ...)
        self.collection = self.client.collections.get(collection_name)

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query.near_text(query=query, limit=k)
        passages = [o.properties["text"] for o in results.objects]
        return dspy.Prediction(passages=passages)
```

### pgvector retriever (PostgreSQL)

```python
from sqlalchemy import create_engine, text

class PgvectorRetriever(dspy.Retrieve):
    def __init__(self, engine, table, embed_fn, k=3):
        super().__init__(k=k)
        self.engine = engine
        self.table = table
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        sql = text(f"""
            SELECT content FROM {self.table}
            ORDER BY embedding <=> :vec
            LIMIT :k
        """)
        with self.engine.connect() as conn:
            rows = conn.execute(sql, {"vec": str(vector), "k": k}).fetchall()
        passages = [row[0] for row in rows]
        return dspy.Prediction(passages=passages)
```

All of these work as drop-in replacements for `dspy.Retrieve`:

```python
retriever = PineconeRetriever("my-index", api_key="...", embed_fn=embed)
dspy.configure(lm=lm, rm=retriever)  # Or pass directly to your module
```
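The retrievers above take an `embed_fn` that turns a query string into a vector. A hedged sketch using the `dspy.Embedder` from Step 3 — any embedding client that returns one vector per input string works the same way; the batching and float conversion here are assumptions of this sketch, not requirements of the retrievers:

```python
embedder = dspy.Embedder("openai/text-embedding-3-small")

def embed(text: str) -> list[float]:
    # Embedder takes a batch of strings and returns one vector per input; take the first.
    return [float(x) for x in embedder([text])[0]]
```

This `embed` is the function the `PineconeRetriever("my-index", api_key="...", embed_fn=embed)` example above expects.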
missing information") class MultiStepSearch(dspy.Module): def __init__(self, num_passages=3, num_searches=2): self.retrieve = dspy.Retrieve(k=num_passages) self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(num_searches)] self.answer = dspy.ChainOfThought(AnswerFromDocs) def forward(self, question): context = [] for hop in self.generate_query: query = hop(context=context, question=question).query passages = self.retrieve(query).passages context = deduplicate(context + passages) return self.answer(context=context, question=question) def deduplicate(passages): seen = set() result = [] for p in passages: if p not in seen: seen.add(p) result.append(p) return result ``` ## Step 5: Test the quality ```python def search_metric(example, prediction, trace=None): # Exact match (simple) return prediction.answer == example.answer # Or use an AI judge for open-ended answers class JudgeAnswer(dspy.Signature): """Is the predicted answer correct given the expected answer?""" question: str = dspy.InputField() gold_answer: str = dspy.InputField() predicted_answer: str = dspy.InputField() is_correct: bool = dspy.OutputField() def judge_metric(example, prediction, trace=None): judge = dspy.Predict(JudgeAnswer) result = judge( question=example.question, gold_answer=example.answer, predicted_answer=prediction.answer, ) return result.is_correct ``` ## Step 6: Improve accuracy ```python optimizer = dspy.BootstrapFewShot(metric=search_metric, max_bootstrapped_demos=4) optimized = optimizer.compile(DocSearch(), trainset=trainset) # Typical improvement: 45-60% exact match -> 65-80% after optimization # For further gains, upgrade to MIPROv2: # optimizer = dspy.MIPROv2(metric=search_metric, auto="medium") ``` ## When NOT to use RAG - **Data fits in context** — if all your documents fit within the LM context window (under ~100K tokens), pass them directly as context instead of building a retrieval pipeline. RAG adds complexity for no benefit. - **Questions are always about the same document** — if every query targets one known document, skip retrieval and just include it. - **You need exact keyword search** — if users search by ID, SKU, or exact phrases, a database query or full-text search (Elasticsearch, Postgres `tsvector`) outperforms embedding search. Use RAG only when queries are semantic. - **Real-time data** — if the answer changes every minute (stock prices, live dashboards), a retrieval index will be stale. Query the source directly. ## Key patterns - **Prefer `ChainOfThought`** for the answer step — reasoning typically helps ground answers in the documents. Use `Predict` if latency matters more than accuracy - **Include context in the signature** so the AI knows to use the retrieved passages - **Multi-step search for complex questions** — if one search is not enough, chain search queries - **Use `dspy.Refine`** to ensure answers actually cite the documents by scoring citation presence in a reward function - **Separate search from answer generation** — optimize each independently - **Consider joint prompt + retrieval optimization** — the GEPA paper (arxiv 2507.19457) shows a RAG adapter that jointly optimizes prompts and retrieval strategy for multiplicative gains. 
## Step 6: Improve accuracy

```python
optimizer = dspy.BootstrapFewShot(metric=search_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(DocSearch(), trainset=trainset)
# Typical improvement: 45-60% exact match -> 65-80% after optimization

# For further gains, upgrade to MIPROv2:
# optimizer = dspy.MIPROv2(metric=search_metric, auto="medium")
```

## When NOT to use RAG

- **Data fits in context** — if all your documents fit within the LM context window (under ~100K tokens), pass them directly as context instead of building a retrieval pipeline. RAG adds complexity for no benefit.
- **Questions are always about the same document** — if every query targets one known document, skip retrieval and just include it.
- **You need exact keyword search** — if users search by ID, SKU, or exact phrases, a database query or full-text search (Elasticsearch, Postgres `tsvector`) outperforms embedding search. Use RAG only when queries are semantic.
- **Real-time data** — if the answer changes every minute (stock prices, live dashboards), a retrieval index will be stale. Query the source directly.

## Key patterns

- **Prefer `ChainOfThought`** for the answer step — reasoning typically helps ground answers in the documents. Use `Predict` if latency matters more than accuracy.
- **Include context in the signature** so the AI knows to use the retrieved passages.
- **Multi-step search for complex questions** — if one search is not enough, chain search queries.
- **Use `dspy.Refine`** to ensure answers actually cite the documents by scoring citation presence in a reward function.
- **Separate search from answer generation** — optimize each independently.
- **Consider joint prompt + retrieval optimization** — the GEPA paper (arXiv 2507.19457) shows a RAG adapter that jointly optimizes prompts and retrieval strategy for multiplicative gains. See `/dspy-gepa` for details.
- **Consider `dspy.Embeddings`** as a built-in retriever — it handles embedding, FAISS indexing, and search in one class without needing a separate vector store (see reference.md for API details).

## Gotchas

- **Chunk size matters more than retriever choice** — most RAG failures trace to bad chunking, not bad embeddings. Start with 512-token chunks and a 50-token overlap, then tune from there.
- **Do not skip the reranking step** — embedding similarity retrieves candidates; a dedicated reranker (or an LM-based one) filters them. Without reranking, irrelevant passages dilute the context.
- **k=3 is not always right** — the default `k` (number of retrieved passages) is a critical hyperparameter. Too few and you miss relevant context; too many and you overwhelm the LM. Tune it against your dev set.
- **Test with questions that require combining information** — single-hop retrieval fails when the answer spans multiple chunks. Use `dspy.ChainOfThought` with multi-step retrieval for these cases.
- **Embedding models and chunk sizes must match at index and query time** — if you re-chunk or switch embedding models, you must rebuild the vector index. Stale indexes silently return bad results.

## Cross-references

> Install any skill: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill `

- Need to summarize docs instead of answering questions? Use `/ai-summarizing`
- Put your document search behind a REST API — see `/ai-serving-apis`
- Building a chatbot on top of doc search? Use `/ai-building-chatbots`
- Measure and improve your AI — see `/ai-improving-accuracy`
- Define input/output contracts for your signatures — see `/dspy-signatures`
- Add reasoning to your answer step — see `/dspy-chain-of-thought`
- **Install `/ai-do` if you do not have it** — it routes any AI problem to the right skill and is the fastest way to work: `npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do`

## Additional resources

- For DSPy retrieval API details (Embeddings, Retrieve, ColBERTv2, Embedder), see [reference.md](reference.md)
- For worked examples, see [examples.md](examples.md)