---
name: ollama-rag
description: Build RAG systems with Ollama local + cloud models. Latest cloud models include DeepSeek-V3.2 (GPT-5 level), Qwen3-Coder-480B (1M context), MiniMax-M2. Use for document Q&A, knowledge bases, and agentic RAG. Covers LangChain, LlamaIndex, ChromaDB, and embedding models.
---

# Ollama RAG Guide

Build RAG systems with Ollama - run locally or use cloud for massive models.

## Ollama Cloud Models (Dec 2025)

Access via `ollama signin` (v0.12+). No local storage needed, privacy preserved.

| Model | Params | Context | Best For |
|-------|--------|---------|----------|
| `deepseek-v3.2:cloud` | 671B | 160K | **GPT-5 level**, reasoning |
| `deepseek-v3.1:671b-cloud` | 671B | 160K | Thinking + non-thinking hybrid |
| `qwen3-coder:480b-cloud` | 480B | **256K-1M** | Agentic coding, repo-scale |
| `minimax-m2:cloud` | 230B (10B active) | 128K | #1 open-source, tools |
| `gpt-oss:120b-cloud` | 120B | 128K | OpenAI open weights |
| `glm-4.6:cloud` | - | - | Code generation |

```bash
# Sign in to access cloud
ollama signin

# Run cloud models
ollama run deepseek-v3.2:cloud
ollama run qwen3-coder:480b-cloud
ollama run minimax-m2:cloud
```

## Local Models (Dec 2025)

### Reasoning Models

| Model | Params | Context | Best For |
|-------|--------|---------|----------|
| `nemotron-3-nano` | 30B (3.6B active) | **1M tokens** | Agents, long docs, code |
| `deepseek-r1` | 7B-671B | 128K | Reasoning, math, code |
| `qwq` | 32B | 32K | Logic, analysis |
| `llama4` | 109B/400B | 128K | General, multimodal |

### Fast/Efficient Models

| Model | Size | RAM | Speed |
|-------|------|-----|-------|
| `llama3.2:3b` | 2GB | 8GB | Very fast |
| `mistral-small-3.1` | 24B | 16GB | Fast |
| `gemma3` | 4B-27B | 8-32GB | Balanced |

### Embedding Models

| Model | Dims | Context | MTEB Score |
|-------|------|---------|------------|
| `snowflake-arctic-embed2` | 1024 | 8K | **67.5** |
| `mxbai-embed-large` | 1024 | 512 | 64.68 |
| `nomic-embed-text` | 768 | 8K | 53.01 |

**Recommendation**: `snowflake-arctic-embed2` for accuracy, `nomic-embed-text` for speed.
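To confirm an embedding model is pulled and that its vector size matches the table above, a minimal sketch using the `ollama` Python client (assumes `pip install ollama` and that the models have already been pulled):

```python
import ollama

# Embed a sample string with each model and print the vector dimensions.
# Expected: 1024 for snowflake-arctic-embed2, 768 for nomic-embed-text.
for model in ["snowflake-arctic-embed2", "nomic-embed-text"]:
    vec = ollama.embed(model=model, input="retrieval-augmented generation")["embeddings"][0]
    print(f"{model}: {len(vec)} dims")
```

Keep the same embedding model for indexing and querying; vectors from different models are not comparable.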
## Quick Start

### Cloud (No Local Resources)

```bash
ollama signin
ollama run deepseek-v3.2:cloud      # GPT-5 level
ollama run qwen3-coder:480b-cloud   # 1M context for huge repos
```

### Local

```bash
ollama pull nemotron-3-nano           # 1M context, 24GB VRAM
ollama pull snowflake-arctic-embed2

# Or for lower RAM (8GB)
ollama pull llama3.2:3b
ollama pull nomic-embed-text
```

## Stack Options

### Option A: LangChain + ChromaDB (Most Common)

```python
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load and split
loader = PyPDFLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

# Embed and store
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query - LOCAL
llm = OllamaLLM(model="nemotron-3-nano")
# Or CLOUD (GPT-5 level, no local resources)
llm = OllamaLLM(model="deepseek-v3.2:cloud")

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa.invoke("What is the main topic?")
```

### Option B: LlamaIndex (Better Accuracy)

```python
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure
Settings.llm = Ollama(model="nemotron-3-nano", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="snowflake-arctic-embed2")

# Load and index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
```

### Option C: Direct Ollama API (Minimal Dependencies)

```python
import ollama
import chromadb

# Embed
def embed(text):
    return ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]

# Store in ChromaDB
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["1"], documents=["text"], embeddings=[embed("text")])

# Retrieve and generate
results = collection.query(query_embeddings=[embed("query")], n_results=3)
context = "\n".join(results["documents"][0])
response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: ..."}]
)
```

## Vector Database Options

| Database | Install | Best For |
|----------|---------|----------|
| ChromaDB | `pip install chromadb` | Simple, embedded |
| FAISS | `pip install faiss-cpu` | Fast similarity |
| Qdrant | `pip install qdrant-client` | Production scale |
| Weaviate | Docker | Full-featured |

## Nemotron 3 Nano Deep Dive

**Why Nemotron for RAG:**

- 1M token context = entire codebases, long documents
- Hybrid Mamba-Transformer = 4x faster inference
- MoE (3.6B active params) = runs on 24GB VRAM
- Apache 2.0 license = commercial use OK

```python
# For very long documents
llm = OllamaLLM(
    model="nemotron-3-nano",
    num_ctx=131072,    # 128K context, increase as needed
    temperature=0.1,   # Lower for factual RAG
)
```
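If you use the direct API from Option C instead of LangChain, the same context-window and temperature settings are passed through the `options` dict of the `ollama` client. A minimal sketch, assuming `nemotron-3-nano` is already pulled (the context string is a placeholder):

```python
import ollama

# Same long-context, low-temperature setup as above, via the raw client.
response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": "Context:\n...\n\nQuestion: ..."}],
    options={"num_ctx": 131072, "temperature": 0.1},  # raise num_ctx as needed
)
print(response["message"]["content"])
```

Larger `num_ctx` values increase VRAM use, so grow the window only as far as your hardware (see the table below) allows.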
## Hardware Requirements

| Model | RAM | GPU VRAM |
|-------|-----|----------|
| 3B models | 8GB | 4GB |
| 7-8B models | 16GB | 8GB |
| 30B models | 32GB | 24GB |
| 70B+ models | 64GB+ | 48GB+ |

## References

- [Model selection guide](references/model-selection.md)
- [Ollama Library](https://ollama.com/library)
- [Nemotron on Ollama](https://ollama.com/library/nemotron-3-nano)