--- name: paddleocr-vl-1.5 description: > Advanced document parsing with PaddleOCR-VL 1.5 vision-language model. Returns complete document structure including text, tables, formulas, charts, and layout information. Claude extracts relevant content based on user needs. --- # PaddleOCR-VL 1.5 Document Parsing Skill ## When to Use This Skill ✅ **Use PaddleOCR-VL 1.5 for**: - Documents with tables (invoices, financial reports, spreadsheets) - Documents with mathematical formulas (academic papers, scientific documents) - Documents with charts and diagrams - Multi-column layouts (newspapers, magazines, brochures) - Complex document structures requiring layout analysis - Any document requiring structured understanding ❌ **Use PP-OCRv5 instead for**: - Simple text-only extraction - Quick OCR tasks where speed is critical - Screenshots or simple images with clear text ## How to Use This Skill **⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔** 1. **ONLY use PaddleOCR-VL 1.5 API** - Execute the script `python scripts/paddleocr-vl-1.5/vl_caller.py` 2. **NEVER use Claude's built-in vision** - Do NOT parse documents yourself 3. **NEVER offer alternatives** - Do NOT suggest "I can try to analyze it" or similar 4. **IF API fails** - Display the error message and STOP immediately 5. **NO fallback methods** - Do NOT attempt document parsing any other way If the script execution fails (API not configured, network error, etc.): - Show the error message to the user - Do NOT offer to help using your vision capabilities - Do NOT ask "Would you like me to try parsing it?" - Simply stop and wait for user to fix the configuration ### Basic Workflow 1. **Execute document parsing**: ```bash python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "URL provided by user" ``` Or for local files: ```bash python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "file path" ``` **Save result to file** (recommended): ```bash python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "URL" --output result.json --pretty ``` - The script will display: `Result saved to: /absolute/path/to/result.json` - This message appears on stderr, the JSON is saved to the file - **Tell the user the file path** shown in the message 2. **The script returns COMPLETE JSON** with all document content: - Headers, footers, page numbers - Main text content - Tables with structure - Formulas (with LaTeX) - Figures and charts - Footnotes and references - Layout and reading order 3. **Extract what the user needs** from the complete data based on their request. ### IMPORTANT: Complete Content Display **CRITICAL**: You must display the COMPLETE extracted content to the user based on their needs. - The script returns ALL document content in a structured format - **Display the full content requested by the user**, do NOT truncate or summarize - If user asks for "all text", show the entire `result.full_text` - If user asks for "tables", show ALL tables in the document - If user asks for "main content", filter out headers/footers but show ALL body text **What this means**: - ✅ **DO**: Display complete text, all tables, all formulas as requested - ✅ **DO**: Present content in reading order using `reading_order` array - ❌ **DON'T**: Truncate with "..." unless content is excessively long (>10,000 chars) - ❌ **DON'T**: Summarize or provide excerpts when user asks for full content - ❌ **DON'T**: Say "Here's a preview" when user expects complete output **Example - Correct**: ``` User: "Extract all the text from this document" Claude: I've parsed the complete document. Here's all the extracted text: [Display entire result.full_text or concatenated regions in reading order] Document Statistics: - Total regions: 25 - Text blocks: 15 - Tables: 3 - Formulas: 2 Quality: Excellent (confidence: 0.92) ``` **Example - Incorrect** ❌: ``` User: "Extract all the text" Claude: "I found a document with multiple sections. Here's the beginning: 'Introduction...' (content truncated for brevity)" ``` ### Understanding the JSON Response The script returns a complete JSON structure: ```json { "ok": true, "result": { "full_text": "Complete text with all content including headers, footers, etc.", "layout": { "regions": [ { "id": 0, "type": "header", "content": "Chapter 3: Methods", "bbox": [100, 50, 500, 100] }, { "id": 1, "type": "text", "content": "Main body text content...", "bbox": [100, 150, 500, 300] }, { "id": 2, "type": "table", "content": { "rows": 3, "cols": 2, "cells": [["Header1", "Header2"], ["Data1", "Data2"]] }, "bbox": [100, 350, 500, 550] }, { "id": 3, "type": "formula", "content": "E = mc^2", "latex": "$E = mc^2$", "bbox": [200, 600, 400, 630] }, { "id": 4, "type": "footnote", "content": "[1] Reference citation", "bbox": [100, 650, 500, 680] }, { "id": 5, "type": "footer", "content": "University Name 2024", "bbox": [100, 750, 500, 800] }, { "id": 6, "type": "page_number", "content": "25", "bbox": [250, 770, 280, 790] } ], "reading_order": [0, 1, 2, 3, 4, 5, 6] } }, "metadata": { "processing_time_ms": 3500, "total_pages": 1, "model_version": "paddleocr-vl-1.5" } } ``` ### Region Types The `type` field indicates the element category: | Type | Description | Typically Include? | |------|-------------|-------------------| | `header` | Page headers (chapter/section titles) | Exclude (usually repetitive) | | `text` | Main body text | **Include** | | `table` | Tables with structured data | **Include** | | `formula` | Mathematical formulas | **Include** | | `figure` | Images, charts, diagrams | **Include** | | `footnote` | Footnotes and references | **Include** (often important) | | `footer` | Page footers (author/institution) | Exclude (usually repetitive) | | `page_number` | Page numbers | Exclude (not content) | | `margin_note` | Margin annotations | Context-dependent | ### Content Extraction Guidelines **Based on user intent, filter the regions**: | User Says | What to Extract | How | |-----------|-----------------|-----| | "Extract main content" | text, table, formula, figure, footnote | Skip header, footer, page_number | | "Get all tables" | table only | Filter by type="table" | | "Extract formulas" | formula only | Filter by type="formula" | | "Complete document" | Everything | Use all regions or full_text | | "Without headers/footers" | Core content | Skip header, footer types | | "Include page numbers" | Core + page_number | Keep page_number type | ### Usage Examples **Example 1: Extract Main Content** (default behavior) ```bash python scripts/paddleocr-vl-1.5/vl_caller.py \ --file-url "https://example.com/paper.pdf" \ --pretty ``` Then filter JSON to extract core content: - Include: text, table, formula, figure, footnote - Exclude: header, footer, page_number **Example 2: Extract Tables Only** ```bash python scripts/paddleocr-vl-1.5/vl_caller.py \ --file-path "./financial_report.pdf" \ --pretty ``` Then filter JSON: - Only keep regions where type="table" - Present table content in markdown format **Example 3: Complete Document with Everything** ```bash python scripts/paddleocr-vl-1.5/vl_caller.py \ --file-url "URL" \ --pretty ``` Then use `result.full_text` or present all regions in reading_order. ### First-Time Configuration **When API is not configured**: The error will show: ``` Configuration error: API not configured. Get your API at: https://aistudio.baidu.com/paddleocr/task ``` **Auto-configuration workflow**: 1. **Show the exact error message** to user (including the URL) 2. **Tell user to provide credentials**: ``` Please visit the URL above to get your VL_API_URL and VL_TOKEN. Once you have them, send them to me and I'll configure it automatically. ``` 3. **When user provides credentials** (accept any format): - `VL_API_URL=https://xxx.com/v1, VL_TOKEN=abc123...` - `Here's my API: https://xxx and token: abc123` - Copy-pasted code format - Any other reasonable format 4. **Parse credentials from user's message**: - Extract VL_API_URL value (look for URLs) - Extract VL_TOKEN value (long alphanumeric string, usually 40+ chars) 5. **Configure automatically**: ```bash python scripts/paddleocr-vl-1.5/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN" ``` 6. **If configuration succeeds**: - Inform user: "Configuration complete! Parsing document now..." - Retry the original parsing task 7. **If configuration fails**: - Show the error - Ask user to verify the credentials **IMPORTANT**: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it. ### Handling Large Files (>20MB) **Problem**: Local files larger than 20MB are rejected by default. **Solutions** (Choose based on your situation): #### Solution 1: Use URL Upload (Recommended) ⭐ Upload your file to a web server and use `--file-url`: ```bash # Instead of local file python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "large_file.pdf" # ❌ May fail # Use URL instead python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "https://your-server.com/large_file.pdf" # ✅ No size limit ``` Benefits: - No local file size limit - Faster upload (direct from API server) - Suitable for very large files (>100MB) #### Solution 2: Increase Size Limit Adjust the limit in `.env` file: ```bash # .env VL_MAX_FILE_SIZE_MB=50 # Increase from 20MB to 50MB ``` Then process as usual: ```bash python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "large_file.pdf" ``` **Note**: Your API provider may still have upload limits. Check with your VL API service. #### Solution 3: Compress/Optimize File Use the built-in optimizer to reduce file size: ```bash # Install optimization dependencies first pip install -r scripts/paddleocr-vl-1.5/requirements-optimize.txt # Optimize image (reduce quality) python scripts/paddleocr-vl-1.5/optimize_file.py input.png output.png --quality 70 # Optimize PDF (compress images within) python scripts/paddleocr-vl-1.5/optimize_file.py input.pdf output.pdf --target-size 15 # Then process optimized file python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "output.pdf" --pretty ``` The optimizer will: - Compress images (adjust JPEG quality) - Resize images if needed (maintain aspect ratio) - Compress PDF images (reduce DPI to 150) - Show before/after size comparison #### Solution 4: Process Specific Pages (PDF Only) If you only need certain pages from a large PDF, extract them first: ```bash # Using PyMuPDF (requires: pip install PyMuPDF) python -c " import fitz doc = fitz.open('large.pdf') writer = fitz.open() writer.insert_pdf(doc, from_page=0, to_page=4) # Pages 1-5 writer.save('pages_1_5.pdf') " # Then process the smaller file python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "pages_1_5.pdf" ``` #### Solution 5: Upload to Cloud Storage For extremely large files (>100MB), use cloud storage with public URLs: ```bash # Example: Upload to AWS S3, Google Drive, or similar # Get public URL: https://storage.example.com/my-document.pdf # Process via URL python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "https://storage.example.com/my-document.pdf" ``` **Comparison Table**: | Solution | Max Size | Speed | Complexity | Best For | |----------|----------|-------|------------|----------| | URL Upload | Unlimited | Fast | Low | Any large file | | Increase Limit | Configurable | Medium | Very Low | Slightly over limit | | Compress | ~70% reduction | Slow | Medium | Images/PDFs with images | | Extract Pages | As needed | Fast | Medium | Multi-page PDFs | | Cloud Storage | Unlimited | Fast | High | Very large files (>100MB) | ### Error Handling **Authentication failed (401/403)**: ``` error: Authentication failed ``` → Token is invalid, reconfigure with correct credentials **API quota exceeded (429)**: ``` error: API quota exceeded ``` → Daily API quota exhausted, inform user to wait or upgrade **Unsupported format**: ``` error: Unsupported file format ``` → File format not supported, convert to PDF/PNG/JPG ### Pseudo-Code: Content Extraction **Extract main content** (most common): ```python def extract_main_content(json_response): regions = json_response['result']['layout']['regions'] # Keep only core content types core_types = ['text', 'table', 'formula', 'figure', 'footnote'] main_regions = [r for r in regions if r['type'] in core_types] # Sort by reading order reading_order = json_response['result']['layout']['reading_order'] sorted_regions = sort_by_order(main_regions, reading_order) # Present to user for region in sorted_regions: if region['type'] == 'text': print(region['content']) elif region['type'] == 'table': print_as_markdown_table(region['content']) elif region['type'] == 'formula': print(f"Formula: {region['latex']}") ``` **Extract tables only**: ```python def extract_tables(json_response): regions = json_response['result']['layout']['regions'] tables = [r for r in regions if r['type'] == 'table'] for i, table in enumerate(tables): print(f"Table {i+1}:") print_as_markdown_table(table['content']) ``` **Extract everything**: ```python def extract_complete(json_response): # Simply use the full_text which includes everything print(json_response['result']['full_text']) # Or present all regions in order regions = json_response['result']['layout']['regions'] reading_order = json_response['result']['layout']['reading_order'] for idx in reading_order: region = regions[idx] print(f"[{region['type']}] {region['content']}") ``` ## Important Notes - **The script NEVER filters content** - It always returns complete data - **Claude decides what to present** - Based on user's specific request - **All data is always available** - Can be re-interpreted for different needs - **No information is lost** - Complete document structure preserved ## Reference Documentation For in-depth understanding of the PaddleOCR-VL 1.5 system, refer to: - `references/paddleocr-vl-1.5/vl_model_spec.md` - VL model specifications - `references/paddleocr-vl-1.5/layout_schema.md` - Layout detection schema - `references/paddleocr-vl-1.5/output_format.md` - Complete output format specification Load these reference documents into context when: - Debugging complex parsing issues - Understanding layout detection algorithm - Working with special document types - Customizing content extraction logic ## Testing the Skill To verify the skill is working properly: ```bash python scripts/paddleocr-vl-1.5/smoke_test.py ``` This tests configuration and optionally API connectivity.