---
name: convert-doc
description: |
  Smart Document Pipeline - Context-Efficient Document Handling

  Converts PDF, Word, PowerPoint, Excel to clean markdown.
  Auto-summarizes large documents. Caches for reuse.

  TRIGGERS: convert, konvertera, pdf to markdown, docx to md, pptx

  WHY:
  - PDFs can consume 30+ MB context when read directly
  - Converted markdown is 95-99% smaller
  - Summary versions are 99.9% smaller
  - Cached versions are reused across sessions
---

# Smart Document Pipeline

## Quick Reference

```bash
# Convert document (auto-caches, auto-summarizes if >100KB)
python ~/.claude/lib/document-converter.py "/path/to/file.pdf"

# Force regenerate
python ~/.claude/lib/document-converter.py "/path/to/file.pdf" --force

# List cached documents
python ~/.claude/lib/document-converter.py --list

# Cleanup old cache (>1 week)
python ~/.claude/lib/document-converter.py --cleanup
```

## Supported Formats

| Format | Extension | Tool | Notes |
|--------|-----------|------|-------|
| PDF | .pdf | PyMuPDF | Text extraction, page-by-page |
| Word | .docx, .doc | pandoc/python-docx | Full markdown |
| PowerPoint | .pptx, .ppt | python-pptx | Slide-by-slide with notes |
| Excel | .xlsx, .xls | openpyxl | Tables as markdown |
| RTF | .rtf | pandoc | Rich text |

## Output Structure

```json
{
  "cache_path": "/path/to/cached/file.md",
  "summary_path": "/path/to/cached/file_summary.md",
  "from_cache": false,
  "original_size": 26744198,
  "converted_size": 129844,
  "summary_size": 30638,
  "savings_percent": 99.5,
  "recommendation": "summary"
}
```

`summary_path` is present only when the converted document is larger than 100KB. `recommendation` is either `"summary"` or `"full"`.

## Auto-Summary

Documents >100KB automatically get a summary version:

| Version | Purpose | Size Target |
|---------|---------|-------------|
| Full | Complete content | As converted |
| Summary | Quick overview | ~30KB |

The summary preserves:
- All headers and structure
- First portion of each section
- Metadata and source reference

## Automatic Integration

The `smart-read-interceptor` hook automatically triggers when you read:
- PDF, Word, PowerPoint, Excel files
- Any file >200KB

It will suggest:
1. **Use summary** - If summary exists (best for overview)
2. **Use cache** - If full cached version exists
3. **Convert first** - If no cache exists
4. **Delegate** - For very large files, use a subagent

## Subagent Delegation Pattern

For very large documents, delegate to an isolated context:

```
Task(
  subagent_type="Explore",
  prompt="Read and summarize key points from: /path/to/large-file.pdf.
          Focus on: [specific topics].
          Max 500 words summary."
)
```

This keeps the large content OUT of main context.

## Cache Location

```
~/.claude/cache/documents/
├── filename_hash.md           # Full converted version
├── filename_hash_summary.md   # Summary (if >100KB)
└── ...
```

Cache expires after 1 week. Run `--cleanup` to remove old files.

## Real-World Results

| Document | Original | Converted | Summary | Savings |
|----------|----------|-----------|---------|---------|
| Google AI Guide (PDF) | 26.7 MB | 127 KB | 30 KB | 99.9% |
| Debatt (Word) | 206 KB | 5.4 KB | - | 97% |
| Övning (PowerPoint) | 7.2 MB | 3.1 KB | - | 99.96% |

## Workflow Examples

### Reading a PDF for research

```
1. User asks to analyze a PDF
2. Hook detects: "📄 DOCUMENT FILE: .PDF"
3. Convert: python ~/.claude/lib/document-converter.py "file.pdf"
4. Read the summary for overview
5. Read specific sections from full version if needed
```

### Processing multiple documents

```
1. Convert all documents first (batch):
   for f in *.pdf; do python ~/.claude/lib/document-converter.py "$f"; done

2. Read summaries in main context
3. Delegate deep analysis to subagents
```
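
### Choosing which version to read

The result object shown under Output Structure can drive the choice between the summary and the full version. The sketch below is illustrative only: `pick_path` and `savings_percent` are hypothetical helpers, not part of `document-converter.py`, and it assumes the result has already been parsed into a Python dict with the field names shown in that section.

```python
import json


def pick_path(result: dict) -> str:
    """Return the markdown file to read, following the `recommendation` field."""
    summary = result.get("summary_path")
    # Fall back to the full version when no summary was generated.
    if result.get("recommendation") == "summary" and summary:
        return summary
    return result["cache_path"]


def savings_percent(original_size: int, reduced_size: int) -> float:
    """Percentage of bytes saved by reading the reduced file instead of the original."""
    return round(100 * (1 - reduced_size / original_size), 1)


# Example result, using the values from the Output Structure section.
raw = """
{
  "cache_path": "/path/to/cached/file.md",
  "summary_path": "/path/to/cached/file_summary.md",
  "from_cache": false,
  "original_size": 26744198,
  "converted_size": 129844,
  "summary_size": 30638,
  "savings_percent": 99.5,
  "recommendation": "summary"
}
"""
result = json.loads(raw)

print(pick_path(result))
print(savings_percent(result["original_size"], result["converted_size"]))
print(savings_percent(result["original_size"], result["summary_size"]))
```

With these numbers the full conversion saves about 99.5% and the summary about 99.9%, which matches the Google AI Guide row in Real-World Results.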