---
name: ollama-local
description: Local LLM inference with Ollama. Use when setting up local models for development, CI pipelines, or cost reduction. Covers model selection, LangChain integration, and performance tuning.
tags: [llm, ollama, local, self-hosted]
context: fork
agent: llm-integrator
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# Ollama Local Inference

Run LLMs locally for cost savings, privacy, and offline development.

## Quick Start

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull models
ollama pull deepseek-r1:70b     # Reasoning (GPT-4 level)
ollama pull qwen2.5-coder:32b   # Coding
ollama pull nomic-embed-text    # Embeddings

# Start server
ollama serve
```

## Recommended Models (M4 Max 256GB)

| Task | Model | Size | Notes |
|------|-------|------|-------|
| Reasoning | `deepseek-r1:70b` | ~42GB | GPT-4 level |
| Coding | `qwen2.5-coder:32b` | ~35GB | 73.7% Aider benchmark |
| Embeddings | `nomic-embed-text` | ~0.5GB | 768 dims, fast |
| General | `llama3.3:70b` | ~40GB | Good all-around |

## LangChain Integration

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,    # Context window
    keep_alive="5m",  # Keep model loaded
)

# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate
response = await llm.ainvoke("Explain async/await")
vector = await embeddings.aembed_query("search text")
```

## Tool Calling with Ollama

```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the document database."""
    return f"Found results for: {query}"

# Bind tools
llm_with_tools = llm.bind_tools([search_docs])
response = await llm_with_tools.ainvoke("Search for Python patterns")
```

## Structured Output

```python
from pydantic import BaseModel, Field

class CodeAnalysis(BaseModel):
    language: str = Field(description="Programming language")
    complexity: int = Field(ge=1, le=10)
    issues: list[str] = Field(description="Found issues")

structured_llm = llm.with_structured_output(CodeAnalysis)
result = await structured_llm.ainvoke("Analyze this code: ...")
# result is a typed CodeAnalysis object
```

## Provider Factory Pattern

```python
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def get_llm_provider(task_type: str = "general"):
    """Auto-switch between Ollama and cloud APIs."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        models = {
            "reasoning": "deepseek-r1:70b",
            "coding": "qwen2.5-coder:32b",
            "general": "llama3.3:70b",
        }
        return ChatOllama(
            model=models.get(task_type, "llama3.3:70b"),
            keep_alive="5m",
        )
    else:
        # Fall back to cloud API
        return ChatOpenAI(model="gpt-4o")

# Usage
llm = get_llm_provider(task_type="coding")
```

## Environment Configuration

```bash
# .env.local
OLLAMA_ENABLED=true
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL_REASONING=deepseek-r1:70b
OLLAMA_MODEL_CODING=qwen2.5-coder:32b
OLLAMA_MODEL_EMBED=nomic-embed-text

# Performance tuning (Apple Silicon)
OLLAMA_MAX_LOADED_MODELS=3   # Keep 3 models in memory
OLLAMA_KEEP_ALIVE=5m         # 5 minute keep-alive
```

## CI Integration

```yaml
# GitHub Actions (self-hosted runner)
jobs:
  test:
    runs-on: self-hosted  # M4 Max runner
    env:
      OLLAMA_ENABLED: "true"
    steps:
      - name: Pre-warm models
        run: |
          curl -s http://localhost:11434/api/embeddings \
            -d '{"model":"nomic-embed-text","prompt":"warmup"}' > /dev/null
      - name: Run tests
        run: pytest tests/
```

## Cost Comparison

| Provider | Monthly Cost | Latency |
|----------|-------------|---------|
| Cloud APIs | ~$675/month | 200-500ms |
| Ollama Local | ~$50 (electricity) | 50-200ms |
| **Savings** | **93%** | **2-3x faster** |

## Best Practices

- **DO** use `keep_alive="5m"` in CI (avoids cold starts)
- **DO** pre-warm models before the first call (see the sketch below)
- **DO** set `num_ctx=32768` on Apple Silicon
- **DO** use the provider factory for cloud/local switching
- **DON'T** use `keep_alive=-1` (wastes memory)
- **DON'T** skip pre-warming in CI (30-60s cold start)
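Pre-warming is just an HTTP call that forces Ollama to load the weights before the first real request. A minimal sketch, assuming the host and model names from the sections above (`prewarm` is a hypothetical helper; per the Ollama REST API, a `/api/generate` request with no prompt loads a model without generating tokens):

```python
import requests

OLLAMA_HOST = "http://localhost:11434"

def prewarm(chat_model: str = "deepseek-r1:70b",
            embed_model: str = "nomic-embed-text") -> None:
    """Load models into memory ahead of the first real request."""
    # Generate request with no prompt: loads the chat model only,
    # no tokens are produced.
    requests.post(f"{OLLAMA_HOST}/api/generate",
                  json={"model": chat_model}, timeout=300)
    # A throwaway embedding call does the same for embedding models.
    requests.post(f"{OLLAMA_HOST}/api/embeddings",
                  json={"model": embed_model, "prompt": "warmup"},
                  timeout=300)

prewarm()
```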
## Troubleshooting

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# List installed models
ollama list

# Check model memory usage
ollama ps

# Pull a specific quantization
ollama pull deepseek-r1:70b-q4_K_M
```

## Related Skills

- `embeddings` - Embedding patterns (works with nomic-embed-text)
- `llm-evaluation` - Testing with local models
- `cost-optimization` - Broader cost strategies

## Capability Details

### setup

**Keywords:** setup, install, configure, ollama

**Solves:**
- Set up Ollama locally
- Configure for development
- Install models

### model-selection

**Keywords:** model, llama, mistral, qwen, selection

**Solves:**
- Choose an appropriate model
- Compare model capabilities
- Balance speed vs quality

### provider-template

**Keywords:** provider, template, python, implementation

**Solves:**
- Ollama provider template
- Python implementation
- Drop-in LLM provider
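For the provider-template capability, a minimal sketch of such a drop-in provider, wired to the variables from the Environment Configuration section (`ollama_from_env` is a hypothetical helper, and the `OLLAMA_MODEL_*` names are this document's convention, not something Ollama itself reads):

```python
import os

from langchain_ollama import ChatOllama

def ollama_from_env(task_type: str = "general") -> ChatOllama:
    """Drop-in Ollama provider: model choice driven by env vars."""
    models = {
        "reasoning": os.getenv("OLLAMA_MODEL_REASONING", "deepseek-r1:70b"),
        "coding": os.getenv("OLLAMA_MODEL_CODING", "qwen2.5-coder:32b"),
        "general": "llama3.3:70b",
    }
    return ChatOllama(
        model=models.get(task_type, models["general"]),
        base_url=os.getenv("OLLAMA_HOST", "http://localhost:11434"),
        keep_alive=os.getenv("OLLAMA_KEEP_ALIVE", "5m"),
    )

# Usage: same call sites as the cloud provider it replaces
llm = ollama_from_env(task_type="coding")
```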