---
name: markitdown
description: Convert files and Office documents into clean Markdown when you need LLM-friendly, token-efficient text (e.g., for summarization, search, RAG ingestion, or dataset preparation).
license: MIT
author: aipoch
---

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- Converting research papers or reports (PDF/DOCX/EPUB/HTML) into Markdown for LLM summarization, Q&A, or RAG indexing.
- Extracting tables and structured content from spreadsheets (XLSX/CSV) into Markdown for analysis or documentation.
- Turning slide decks (PPTX) into Markdown notes, including speaker notes and (optionally) AI-generated image descriptions.
- Processing images or scanned documents with OCR to obtain searchable, editable Markdown text.
- Transcribing audio (WAV/MP3) or pulling YouTube transcripts into Markdown for meeting notes, content analysis, or knowledge bases.

## Key Features

- Converts many formats to structured Markdown (PDF, DOCX, PPTX, XLSX, images, audio, HTML, CSV, JSON, XML, ZIP, EPUB, YouTube URLs, etc.).
- Produces token-efficient output suitable for LLM pipelines (summarization, chunking, embedding).
- OCR support for images/scans (when OCR dependencies are installed).
- Audio transcription support (when transcription dependencies are installed).
- Optional AI-enhanced image/slide descriptions via an OpenAI-compatible client (e.g., OpenRouter).
- Plugin system to extend format support and custom behaviors.
- Stream-based conversion API for large files.
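
Because the output is plain Markdown, heading boundaries are a convenient place to split it for the chunking and embedding step mentioned above. The following is a minimal sketch of that idea, assuming ATX-style (`#`) headings; `chunk_by_heading` is a hypothetical helper written for illustration and is not part of markitdown.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at ATX headings.

    Text before the first heading is returned with an empty heading.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks: List[Tuple[str, str]] = []
    title = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be passed to whatever summarizer or embedding step the pipeline uses; keeping the heading alongside the body preserves some context for retrieval.
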
## Dependencies

- Python: `>=3.10`
- Package:
  - `markitdown[all]` (installs all optional format handlers)

Optional system dependencies (feature-dependent):

- Tesseract OCR: `tesseract-ocr` (for image/scanned-text OCR)

Optional external services (feature-dependent):

- Azure Document Intelligence endpoint (for enhanced PDF extraction)
- OpenAI-compatible LLM endpoint (e.g., OpenRouter) for AI image descriptions

## Example Usage

### Install

```bash
pip install 'markitdown[all]'
```

### CLI: Convert a PDF to Markdown

```bash
markitdown document.pdf -o output.md
```

### Python: Convert multiple formats (PDF/XLSX/PPTX/DOCX) and save outputs

```python
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()

files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx",
]

for path in files:
    result = md.convert(path)
    # Write the Markdown next to the source file, e.g. document.pdf -> document.md
    out = Path(path).with_suffix(".md")
    out.write_text(result.text_content, encoding="utf-8")
    print(f"Converted {path} -> {out}")
```

### Python: Stream conversion (useful for large files)

```python
from markitdown import MarkItDown

md = MarkItDown()

# Open in binary mode; the extension hint tells MarkItDown which converter to use.
with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

with open("large_file.md", "w", encoding="utf-8") as out:
    out.write(result.text_content)
```

### Python: AI-enhanced image/slide descriptions (OpenAI-compatible, e.g., OpenRouter)

```python
from markitdown import MarkItDown
from openai import OpenAI

# Any OpenAI-compatible endpoint works; OpenRouter is shown here.
client = OpenAI(
    api_key="YOUR_OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1",
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",
    llm_prompt="Describe this image in detail for scientific documentation.",
)

result = md.convert("presentation.pptx")
print(result.text_content)
```

## Implementation Details

- **Conversion entry points**
  - `MarkItDown().convert(path)` converts a file by path/URL and returns an object whose primary payload is `result.text_content` (Markdown).
  - `MarkItDown().convert_stream(stream, file_extension=".pdf")` converts from a binary stream; use this for large files or when data is not on disk.
- **Format handling**
  - Format support is provided by optional extras (e.g., `pdf`, `docx`, `pptx`, `xlsx`, `audio-transcription`, `youtube-transcription`) or `all`.
  - ZIP inputs are typically processed by iterating through contained files and converting each supported entry.
- **OCR**
  - For images/scanned documents, OCR is enabled when OCR tooling is available (commonly Tesseract). Ensure the OS-level OCR binary is installed and accessible in `PATH`.
- **AI image descriptions**
  - When `llm_client`, `llm_model`, and `llm_prompt` are provided, MarkItDown can request model-generated descriptions for images (including slide images), then inject those descriptions into the Markdown output.
  - Any OpenAI-compatible client can be used (e.g., OpenRouter) by setting `base_url` and `api_key`.
- **Enhanced PDF extraction (Azure Document Intelligence)**
  - When configured with a Document Intelligence endpoint, PDF extraction can be improved for complex layouts (tables, multi-column text, scanned PDFs), producing more faithful Markdown structure.
- **Plugins**
  - Plugins can be listed and enabled from the CLI (e.g., `--list-plugins`, `--use-plugins`) to extend conversion behavior or add new format handlers.
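
The ZIP behavior described under "Format handling" (iterate over the archive, convert each supported entry) can be sketched with the standard library alone. This is an illustration of the pattern, not markitdown's actual implementation: `convert_zip`, the `SUPPORTED` allow-list, and the `convert_entry` callback are all hypothetical names introduced for this example.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import BinaryIO, Callable, Dict

# Hypothetical allow-list for this sketch; not markitdown's real converter registry.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".txt"}


def convert_zip(
    zip_stream: BinaryIO,
    convert_entry: Callable[[BinaryIO, str], str],
) -> Dict[str, str]:
    """Convert each supported archive member; return {member name: Markdown}.

    `convert_entry` receives a binary stream and the lowercased file
    extension, and returns the Markdown text for that entry.
    """
    results: Dict[str, str] = {}
    with zipfile.ZipFile(zip_stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content
            ext = PurePosixPath(info.filename).suffix.lower()
            if ext not in SUPPORTED:
                continue  # skip entries with no matching converter
            data = io.BytesIO(archive.read(info))
            results[info.filename] = convert_entry(data, ext)
    return results
```

To drive this with MarkItDown, `convert_entry` could wrap the stream call shown earlier, for example `lambda stream, ext: md.convert_stream(stream, file_extension=ext).text_content`. Passing the converter in as a callback keeps the traversal logic independent of which extras are installed.
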