---
name: extracting-mistral-ocr
description: >-
  Extracts text, tables, and images from PDFs (including scanned PDFs) using the Mistral OCR API.
  Use when user asks to OCR a PDF/image, extract text from a PDF, parse a scanned document,
  convert a PDF to Markdown, or extract structured fields from a document.
compatibility: >-
  Requires network access and a MISTRAL_API_KEY environment variable. Expects Python 3.9+ and the mistralai package.
allowed-tools: "Read,Write,Bash(python:*)"
metadata:
  author: generated-by-chatgpt
  version: 0.1.0
  api: mistral
  default-model: mistral-ocr-latest
---

# Mistral OCR PDF extraction

## Quick start (default)

Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:

```bash
python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr
```

Output directory layout:

- `combined.md` (all pages concatenated)
- `pages/page-000.md` (per-page markdown)
- `raw_response.json` (full OCR response)
- `images/` (decoded embedded images, if requested)
- `tables/` (separate tables, if requested)
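
The layout above can be reproduced from a saved response with a few lines of Python. This is a minimal sketch, not the bundled script's actual implementation: the helper name `write_ocr_outputs` is hypothetical, and it assumes the response is a dict with a `pages` list whose entries carry `index` and `markdown` keys.

```python
import json
from pathlib import Path


def write_ocr_outputs(response: dict, out_dir: str) -> list:
    """Write combined.md, pages/page-NNN.md, and raw_response.json.

    Assumes `response` looks like {"pages": [{"index": 0, "markdown": "..."}]}.
    Returns the per-page file names in the order they were written.
    """
    out = Path(out_dir)
    pages_dir = out / "pages"
    pages_dir.mkdir(parents=True, exist_ok=True)

    written = []
    chunks = []
    for position, page in enumerate(response.get("pages", [])):
        # Fall back to list position if the page carries no explicit index.
        index = page.get("index", position)
        markdown = page.get("markdown", "")
        page_path = pages_dir / f"page-{index:03d}.md"
        page_path.write_text(markdown, encoding="utf-8")
        written.append(page_path.name)
        chunks.append(markdown)

    # Pages are joined with a blank line so Markdown blocks do not run together.
    (out / "combined.md").write_text("\n\n".join(chunks), encoding="utf-8")
    (out / "raw_response.json").write_text(
        json.dumps(response, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return written
```

Zero-padding the page index keeps the files in page order under a plain lexical sort.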

## Workflow

1. **Pick input mode**
   - **Local PDF** (most common): upload via Files API, then OCR via `file_id`.
   - **Public URL**: OCR directly via `document_url`.

2. **Choose output fidelity** (defaults are safe for RAG)
   - Keep `table_format=inline` unless the user explicitly wants tables split out.
   - Set `--include-image-base64` when the user needs figures/diagrams extracted.
   - Use `--extract-header` / `--extract-footer` if header/footer noise hurts downstream search.

3. **Run OCR**
   - Use `scripts/mistral_ocr_extract.py` to produce a deterministic on-disk artefact set.

4. **(Optional) Structured extraction from the whole document**
   - If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
   - The OCR API can return a document-level `document_annotation` in addition to page markdown.

   Example:

   ```bash
   python {baseDir}/scripts/mistral_ocr_extract.py \
     --input invoice.pdf \
     --out out/invoice \
     --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
     --annotation-format json_object
   ```
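
After a run with an annotation prompt, the extracted fields can be read back from `raw_response.json` and checked for completeness. The sketch below is illustrative only: `parse_annotation` and `REQUIRED_FIELDS` are hypothetical names, and it assumes `document_annotation` is stored either as a JSON string or as an already-decoded object.

```python
import json

# Field names taken from the invoice prompt in the example above.
REQUIRED_FIELDS = (
    "supplier_name",
    "invoice_number",
    "invoice_date",
    "currency",
    "total_amount",
)


def parse_annotation(raw_response: dict, required=REQUIRED_FIELDS):
    """Return (fields, missing) for the document-level annotation.

    Assumes the annotation lives under the "document_annotation" key and is
    either a JSON string or a dict.
    """
    annotation = raw_response.get("document_annotation")
    if annotation is None:
        raise KeyError("document_annotation")
    if isinstance(annotation, str):
        annotation = json.loads(annotation)
    # Report absent fields instead of failing, so the caller decides what to do.
    missing = [name for name in required if name not in annotation]
    return annotation, missing
```

Returning the missing fields rather than raising lets the caller decide whether to retry with a sharper prompt or flag the document for review.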

## Decision rules

- **If the PDF is local and not publicly accessible**, upload it (the script does this automatically).
- **If the PDF URL is private or requires authentication**, do not pass it as `document_url`; upload instead.
- **If table fidelity is critical**, prefer `table_format=html` so downstream code can read tables with an HTML parser instead of brittle regex.
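
To illustrate why HTML tables are easier to consume than regex-matched Markdown, here is a minimal sketch that turns table markup into rows of cell text using only the standard library. `TableParser` and `parse_tables` are hypothetical helpers, not part of the bundled script, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def parse_tables(html: str):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

For example, `parse_tables("<table><tr><th>Item</th></tr><tr><td>Bolt</td></tr></table>")` yields one table with a header row and one data row.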

## Common failure modes

- **Missing `MISTRAL_API_KEY`**: set it in the environment before running.
- **URL OCR fails**: the URL is probably not publicly accessible; upload the file instead.
- **Large files**: upload supports large files, but very large PDFs may need page selection (`--pages`) or batch processing.
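
The first two failure modes can be caught before any network call. The following is a rough sketch of such a check, assuming inputs are either `http(s)` URLs or local paths; `preflight` is a hypothetical helper and is not something the bundled script is known to provide.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def preflight(source: str, env=None) -> list:
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []

    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")

    scheme = urlparse(source).scheme.lower()
    if scheme not in ("http", "https"):
        # Anything that is not an http(s) URL is treated as a local path.
        if not Path(source).is_file():
            problems.append(f"input file not found: {source}")
    # Whether a URL is publicly reachable cannot be verified offline;
    # if URL OCR fails, fall back to uploading the file.
    return problems
```

Passing `env` explicitly keeps the check easy to test without touching the real environment.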

## References

- API + parameters: `references/mistral_ocr_api.md`
- Output mapping rules (placeholders to extracted images/tables): `references/output_mapping.md`
- Example annotation prompts for common document types: `references/annotation_prompts.md`