--- name: paddleocr-doc-parsing description: > Advanced document parsing with PaddleOCR. Returns complete document structure including text, tables, formulas, charts, and layout information. Claude extracts relevant content based on user needs. --- # PaddleOCR Document Parsing Skill ## When to Use This Skill ✅ **Use Document Parsing for**: - Documents with tables (invoices, financial reports, spreadsheets) - Documents with mathematical formulas (academic papers, scientific documents) - Documents with charts and diagrams - Multi-column layouts (newspapers, magazines, brochures) - Complex document structures requiring layout analysis - Any document requiring structured understanding ❌ **Use Text Recognition instead for**: - Simple text-only extraction - Quick OCR tasks where speed is critical - Screenshots or simple images with clear text ## How to Use This Skill **⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔** 1. **ONLY use PaddleOCR Document Parsing API** - Execute the script `python scripts/vl_caller.py` 2. **NEVER use Claude's built-in vision** - Do NOT parse documents yourself 3. **NEVER offer alternatives** - Do NOT suggest "I can try to analyze it" or similar 4. **IF API fails** - Display the error message and STOP immediately 5. **NO fallback methods** - Do NOT attempt document parsing any other way If the script execution fails (API not configured, network error, etc.): - Show the error message to the user - Do NOT offer to help using your vision capabilities - Do NOT ask "Would you like me to try parsing it?" - Simply stop and wait for user to fix the configuration ### Basic Workflow 1. **Execute document parsing**: ```bash python scripts/vl_caller.py --file-url "URL provided by user" ``` Or for local files: ```bash python scripts/vl_caller.py --file-path "file path" ``` **Save result to file** (recommended): ```bash python scripts/vl_caller.py --file-url "URL" --output result.json --pretty ``` - The script will display: `Result saved to: /absolute/path/to/result.json` - This message appears on stderr, the JSON is saved to the file - **Tell the user the file path** shown in the message 2. **The script returns COMPLETE JSON** with all document content: - Headers, footers, page numbers - Main text content - Tables with structure - Formulas (with LaTeX) - Figures and charts - Footnotes and references - Seals and stamps - Layout and reading order 3. **Extract what the user needs** from the complete data based on their request. ### IMPORTANT: Complete Content Display **CRITICAL**: You must display the COMPLETE extracted content to the user based on their needs. - The script returns ALL document content in a structured format - **Display the full content requested by the user**, do NOT truncate or summarize - If user asks for "all text", show the entire `text` field - If user asks for "tables", show ALL tables in the document - If user asks for "main content", filter out headers/footers but show ALL body text **What this means**: - ✅ **DO**: Display complete text, all tables, all formulas as requested - ✅ **DO**: Present content in reading order using `reading_order` array - ❌ **DON'T**: Truncate with "..." unless content is excessively long (>10,000 chars) - ❌ **DON'T**: Summarize or provide excerpts when user asks for full content - ❌ **DON'T**: Say "Here's a preview" when user expects complete output **Example - Correct**: ``` User: "Extract all the text from this document" Claude: I've parsed the complete document. Here's all the extracted text: [Display entire text field or concatenated regions in reading order] Document Statistics: - Total regions: 25 - Text blocks: 15 - Tables: 3 - Formulas: 2 Quality: Excellent (confidence: 0.92) ``` **Example - Incorrect** ❌: ``` User: "Extract all the text" Claude: "I found a document with multiple sections. Here's the beginning: 'Introduction...' (content truncated for brevity)" ``` ### Understanding the JSON Response The script returns a JSON envelope wrapping the raw API result: ```json { "ok": true, "text": "Full markdown/HTML text extracted from all pages", "result": [ { "prunedResult": { "parsing_res_list": [ {"block_label": "text", "block_content": "Paragraph text content here...", "block_bbox": [100, 200, 500, 230], "block_id": 3}, {"block_label": "table", "block_content": "...
", "block_bbox": [50, 300, 900, 600], "block_id": 5}, {"block_label": "seal", "block_content": "", "block_bbox": [400, 50, 600, 180], "block_id": 2} ] }, "markdown": { "text": "Full page content in markdown/HTML format", "images": {"imgs/filename.jpg": "https://..."} } } ], "error": null } ``` **Key fields**: - `text` — extracted markdown text from all pages (use this for quick text display) - `result` — raw API result array (one object per page), for detailed block-level access - `result[n].prunedResult.parsing_res_list` — array of detected content blocks - `result[n].markdown.text` — full page content in markdown/HTML ### Block Labels The `block_label` field indicates the content type: | Label | Description | |-------|-------------| | `text` | Regular text content | | `table` | Table (content is HTML ``) | | `image` | Embedded image | | `seal` | Seal or stamp | | `figure_title` | Figure/chart title or caption | | `vision_footnote` | Footnote detected by vision model | | `aside_text` | Side/margin text | ### Content Extraction Guidelines **Based on user intent, filter the blocks**: | User Says | What to Extract | How | |-----------|-----------------|-----| | "Extract all text" | Everything | Use `text` field directly | | "Get all tables" | table blocks only | Filter `parsing_res_list` by `block_label == "table"` | | "Show main content" | text + table blocks | Filter out `aside_text`, `image` | | "Complete document" | Everything | Use `text` field or iterate all blocks | ### Usage Examples **Example 1: Extract Main Content** (default behavior) ```bash python scripts/vl_caller.py \ --file-url "https://example.com/paper.pdf" \ --pretty ``` Then filter JSON to extract core content: - Include: text, table, formula, figure, footnote - Exclude: header, footer, page_number **Example 2: Extract Tables Only** ```bash python scripts/vl_caller.py \ --file-path "./financial_report.pdf" \ --pretty ``` Then filter JSON: - Only keep regions where type="table" - Present table content in markdown format **Example 3: Complete Document with Everything** ```bash python scripts/vl_caller.py \ --file-url "URL" \ --pretty ``` Then use the `text` field or present all regions in reading_order. ### First-Time Configuration **When API is not configured**: The error will show: ``` Configuration error: API not configured. Get your API at: https://paddleocr.com ``` **Auto-configuration workflow**: 1. **Show the exact error message** to user (including the URL) 2. **Tell user to provide credentials**: ``` Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN. Once you have them, send them to me and I'll configure it automatically. ``` 3. **When user provides credentials** (accept any format): - `PADDLEOCR_DOC_PARSING_API_URL=https://xxx.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...` - `Here's my API: https://xxx and token: abc123` - Copy-pasted code format - Any other reasonable format 4. **Parse credentials from user's message**: - Extract PADDLEOCR_DOC_PARSING_API_URL value (look for URLs) - Extract PADDLEOCR_ACCESS_TOKEN value (long alphanumeric string, usually 40+ chars) 5. **Configure automatically**: ```bash python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN" ``` 6. **If configuration succeeds**: - Inform user: "Configuration complete! Parsing document now..." - Retry the original parsing task 7. **If configuration fails**: - Show the error - Ask user to verify the credentials **IMPORTANT**: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it. ### Handling Large Files There is no file size limit for the API. For PDFs, the maximum is 100 pages per request. **Tips for large files**: #### Use URL for Large Local Files (Recommended) For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead: ```bash python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf" ``` #### Process Specific Pages (PDF Only) If you only need certain pages from a large PDF, extract them first: ```bash # Using pypdfium2 (requires: pip install pypdfium2) python -c " import pypdfium2 as pdfium doc = pdfium.PdfDocument('large.pdf') # Extract pages 0-4 (first 5 pages) new_doc = pdfium.PdfDocument.new() for i in range(min(5, len(doc))): new_doc.import_pages(doc, [i]) new_doc.save('pages_1_5.pdf') " # Then process the smaller file python scripts/vl_caller.py --file-path "pages_1_5.pdf" ``` ### Error Handling **Authentication failed (401/403)**: ``` error: Authentication failed ``` → Token is invalid, reconfigure with correct credentials **API quota exceeded (429)**: ``` error: API quota exceeded ``` → Daily API quota exhausted, inform user to wait or upgrade **Unsupported format**: ``` error: Unsupported file format ``` → File format not supported, convert to PDF/PNG/JPG ### Pseudo-Code: Content Extraction **Extract all text** (most common): ```python def extract_all_text(json_response): # Quickest: use the pre-extracted text field print(json_response['text']) ``` **Extract tables only**: ```python def extract_tables(json_response): for page in json_response['result']: blocks = page['prunedResult']['parsing_res_list'] tables = [b for b in blocks if b['block_label'] == 'table'] for i, table in enumerate(tables): print(f"Table {i+1}:") print(table['block_content']) # HTML table ``` **Iterate all blocks**: ```python def extract_by_block(json_response): for page in json_response['result']: blocks = page['prunedResult']['parsing_res_list'] for block in blocks: print(f"[{block['block_label']}] {block['block_content'][:100]}") ``` ## Important Notes - **The script NEVER filters content** - It always returns complete data - **Claude decides what to present** - Based on user's specific request - **All data is always available** - Can be re-interpreted for different needs - **No information is lost** - Complete document structure preserved ## Reference Documentation For in-depth understanding of the PaddleOCR Document Parsing system, refer to: - `references/output_schema.md` - Output format specification - `references/provider_api.md` - Provider API contract > **Note**: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL). Load these reference documents into context when: - Debugging complex parsing issues - Need to understand output format - Working with provider API details ## Testing the Skill To verify the skill is working properly: ```bash python scripts/smoke_test.py ``` This tests configuration and optionally API connectivity.