---
name: extractor
description: >
  Extract content from any document using the Preset-First Agentic Pipeline.
  Auto-detects format and document type (scientific papers, requirements specs, etc.).
  Supports PDF, DOCX, HTML, XML, PPTX, XLSX, EPUB, Markdown, images.
  Use when user says "extract this", "convert to markdown", "process pdf", or provides a document.
allowed-tools: Bash, Read
triggers:
  - extract this
  - extract document
  - extract pdf
  - extract text
  - convert to markdown
  - convert to text
  - parse this file
  - process document
  - process pdf
  - get sections from
  - extract sections
  - run extractor
  - pdf to markdown
  - docx to markdown
  - document to json
metadata:
  short-description: Preset-First document extraction (PDF/DOCX/HTML/XML)
  project-path: /home/graham/workspace/experiments/extractor
---

# Extractor

Self-correcting agentic document extraction using a **Preset-First Methodology**. Auto-detects document type and applies calibrated extraction settings.

## Quick Start

```bash
# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr
```

## Extraction Modes

| Mode         | Flag         | Description                          |
| ------------ | ------------ | ------------------------------------ |
| **Auto**     | (default)    | Profile detector picks best settings |
| **Fast**     | `--fast`     | PyMuPDF only, no ML/LLM (fastest)    |
| **Accurate** | `--accurate` | Full pipeline with LLM enhancements  |
| **Offline**  | `--offline`  | Deterministic, no network calls      |

```bash
# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline
```

## Collaboration Flow

For PDFs without `--preset`, the skill runs an intelligent collaboration flow:

1. **Profile Detection**: Analyzes document (layout, tables, formulas, requirements)
2. **High Confidence Match**: If confidence >= 8, auto-extracts with detected preset
3. **Low Confidence / Unknown**:
   - **Interactive (TTY)**: Prompts user to select preset
   - **Non-interactive**: Uses auto mode with warning

```bash
# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only
# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive
```

## Preset Selection

The pipeline auto-detects document type via `s00_profile_detector`:

| Preset                | Detected When                                           | Confidence Points                   |
| --------------------- | ------------------------------------------------------- | ----------------------------------- |
| **arxiv**             | Academic papers (2-column, math, "Abstract/References") | +5 filename, +4 sections, +3 layout |
| **requirements_spec** | Engineering specs (REQ-xxx, "Shall", nested sections)   | +5 filename, +4 REQ pattern         |
| **auto**              | Unknown documents                                       | Fallback when confidence < 8        |

```bash
# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
```

## Output Options

```bash
# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only
```

## Supported Formats

Cross-format parity measured against HTML reference (2026-01-17):

| Format       | Method                   | Parity    | Notes                                 |
| ------------ | ------------------------ | --------- | ------------------------------------- |
| **Markdown** | Direct parse             | 100%      | Perfect structural match              |
| **DOCX**     | Native XML (python-docx) | 100%      | Perfect structural match              |
| **HTML**     | BeautifulSoup            | Reference | Baseline for comparison               |
| **XML**      | defusedxml               | 90%       | Structure preserved, markdown differs |
| **PDF**      | 14-stage pipeline        | 87%       | Varies by document complexity         |
| **RST**      | docutils                 | 85%       | Section structure varies              |
| **EPUB**     | ebooklib                 | 82%       | Chapter structure varies              |
| **PPTX**     | python-pptx              | 81%       | Slide-based structure                 |
| **XLSX**     | openpyxl                 | 16%       | Expected (spreadsheet format)         |
| **Images**   | OCR/VLM                  | 16%       | Requires VLM for text extraction      |

## Pipeline Stages

The full pipeline runs 14+ stages:

```
00_profile_detector        Detect document type, select preset
01_annotation_processor    Strip PDF annotations
02_marker_extractor        Extract blocks (text, tables, figures)
03_suspicious_headers      Verify header classifications with VLM
04_section_builder         Build document sections
05_table_extractor         Extract and describe tables
06_figure_extractor        Extract and describe figures
07_duckdb_ingest           Assemble into queryable DB
08_extract_requirements    Mine requirements (if detected)
08b_lean4_theorem_prover   Formal proofs (scientific only)
09_section_summarizer      Generate section summaries
10_markdown_exporter       Export to Markdown
14_report_generator        Generate extraction report
```

## Output Structure

```json
{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {
    "sections": 12,
    "tables": 5,
    "figures": 8
  }
}
```

## Batch Processing

```bash
# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results
```

## Agent-Friendly Flags

| Flag                  | Purpose                                                  |
| --------------------- | -------------------------------------------------------- |
| `--profile-only`      | Return profile JSON without extraction                   |
| `--no-interactive`    | Skip prompts, use auto mode                              |
| `--preset`            | Force preset (skip detection)                            |
| `--fast`              | No LLM, quick extraction                                 |
| `--toc-check`         | Check TOC integrity against extracted sections           |
| `--auto-ocr`          | OCR scanned PDFs with OCRmyPDF (lazy-loads docker image) |
| `--no-auto-ocr`       | Disable OCRmyPDF preprocessing for scanned PDFs          |
| `--skip-scanned`      | Skip scanned PDFs and write a skip manifest              |
| `--ocr-lang`          | OCR language(s), e.g. `eng` or `eng+deu`                 |
| `--ocr-deskew`        | Deskew scanned pages during OCR                          |
| `--ocr-force`         | Force OCR even if text exists                            |
| `--ocr-timeout`       | OCR timeout in seconds                                   |
| `--continue-on-error` | Continue pipeline on step failures (batch-friendly)      |

## TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

```bash
# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check
```

Output:

```json
{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}
```

Status levels:

- **EXCELLENT**: >= 90% match
- **GOOD**: >= 70% match
- **FAIR**: >= 50% match
- **POOR**: < 50% match

## Environment

Requires the extractor project with its virtual environment:

- **Project**: `/home/graham/workspace/experiments/extractor`
- **Venv**: `.venv/bin/python`
- **Dependencies**: `scillm`, `fetcher` (local paths)

Set `EXTRACTOR_ROOT` to override the project location.

## Sanity Check

```bash
# Verify skill works across all formats
.pi/skills/extractor/sanity.sh
```

Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG

## LLM Requirements

For accurate mode (VLM/table descriptions):

- `CHUTES_API_BASE` - Chutes API endpoint
- `CHUTES_API_KEY` - API key
- `CHUTES_VLM_MODEL` - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
- `CHUTES_TEXT_MODEL` - Text model (default: moonshotai/Kimi-K2-Instruct-0905)

For Lean4 proving (arxiv preset):

- `lean_runner` container running
- `OPENROUTER_API_KEY` set
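
The variables listed above can be checked before a run. The following is a minimal preflight sketch, not part of the skill itself: `missing_vars` and `check_accurate_env` are hypothetical helper names, and it assumes that only the two variables without a listed default (`CHUTES_API_BASE`, `CHUTES_API_KEY`) must be set for accurate mode.

```bash
# Hypothetical helper, not part of the extractor skill.
# Prints the name of every listed variable that is unset or empty.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup via eval keeps this compatible with plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Returns 0 when the accurate-mode variables are present, 1 otherwise.
# Assumption: the two model variables fall back to their documented defaults,
# so only the endpoint and the key are checked here.
check_accurate_env() {
  missing="$(missing_vars CHUTES_API_BASE CHUTES_API_KEY)"
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing for --accurate:" $missing >&2
    return 1
  fi
  return 0
}
```

One possible use is `check_accurate_env && .pi/skills/extractor/run.sh paper.pdf --accurate`, so a missing key is reported before the pipeline starts. The same `missing_vars` call could be pointed at `OPENROUTER_API_KEY` ahead of a Lean4 run with the arxiv preset.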