---
name: extracting-pdf-text
description: Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.
---

# Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

## Quick Decision Guide

| PDF Type | Best Approach | Script |
|----------|--------------|--------|
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

## Recommended Workflow

1. **Try PyMuPDF first** - fastest, handles most text-based PDFs well
2. **If tables are mangled** - switch to pdfplumber
3. **If scanned/image-based** - use Mistral OCR API (best accuracy) or local OCR (free but slower)

## Local Extraction (No API Required)

### PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses `pymupdf4llm` which formats text for RAG systems.

### pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.

### Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```

Requires: `pytesseract`, `pdf2image`, and Tesseract installed (`brew install tesseract` on macOS).

## API-Based Extraction

### Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

**Pricing**: ~1000 pages per dollar (very cost-effective)

```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```

Features:
- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents

For detailed API options and other services, see [references/api-services.md](references/api-services.md).

## Output Format Recommendations

For LLM consumption, markdown is preferred:
- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies

For detailed comparisons of local tools, see [references/local-tools.md](references/local-tools.md).