--- name: ocr-document-processor description: Extract text from images and scanned PDFs using OCR. Supports 100+ languages, table detection, structured output (markdown/JSON), and batch processing. --- # OCR Document Processor Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing. ## Core Capabilities - **Image OCR**: Extract text from PNG, JPEG, TIFF, BMP images - **PDF OCR**: Process scanned PDFs page by page - **Multi-language**: Support for 100+ languages - **Structured Output**: Plain text, Markdown, JSON, or HTML - **Table Detection**: Extract tabular data to CSV/JSON - **Batch Processing**: Process multiple documents at once - **Quality Assessment**: Confidence scoring for OCR results ## Quick Start ```python from scripts.ocr_processor import OCRProcessor # Simple text extraction processor = OCRProcessor("document.png") text = processor.extract_text() print(text) # Extract to structured format result = processor.extract_structured() print(result['text']) print(result['confidence']) print(result['blocks']) # Text blocks with positions ``` ## Core Workflow ### 1. Basic Text Extraction ```python from scripts.ocr_processor import OCRProcessor # From image processor = OCRProcessor("scan.png") text = processor.extract_text() # From PDF processor = OCRProcessor("scanned.pdf") text = processor.extract_text() # All pages # Specific pages text = processor.extract_text(pages=[1, 2, 3]) ``` ### 2. Structured Extraction ```python # Get detailed results result = processor.extract_structured() # Result contains: # - text: Full extracted text # - blocks: Text blocks with bounding boxes # - lines: Individual lines # - words: Individual words with confidence # - confidence: Overall confidence score # - language: Detected language ``` ### 3. Export Formats ```python # Export to Markdown processor.export_markdown("output.md") # Export to JSON processor.export_json("output.json") # Export to searchable PDF processor.export_searchable_pdf("searchable.pdf") # Export to HTML processor.export_html("output.html") ``` ## Language Support ```python # Specify language for better accuracy processor = OCRProcessor("german_doc.png", lang='deu') # Multiple languages processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu') # Auto-detect language processor = OCRProcessor("document.png", lang='auto') ``` ### Supported Languages (Common) | Code | Language | Code | Language | |------|----------|------|----------| | eng | English | fra | French | | deu | German | spa | Spanish | | ita | Italian | por | Portuguese | | rus | Russian | chi_sim | Chinese (Simplified) | | chi_tra | Chinese (Traditional) | jpn | Japanese | | kor | Korean | ara | Arabic | | hin | Hindi | nld | Dutch | ## Image Preprocessing Preprocessing improves OCR accuracy on low-quality images. ```python # Enable preprocessing processor = OCRProcessor("noisy_scan.png") processor.preprocess( deskew=True, # Fix rotation denoise=True, # Remove noise threshold=True, # Binarize image contrast=1.5 # Enhance contrast ) text = processor.extract_text() ``` ### Available Preprocessing Options | Option | Description | Default | |--------|-------------|---------| | `deskew` | Correct skewed/rotated images | False | | `denoise` | Remove noise and artifacts | False | | `threshold` | Convert to black/white | False | | `threshold_method` | 'otsu', 'adaptive', 'simple' | 'otsu' | | `contrast` | Contrast factor (1.0 = no change) | 1.0 | | `sharpen` | Sharpen factor (0 = none) | 0 | | `scale` | Upscale factor for small text | 1.0 | | `remove_shadows` | Remove shadow artifacts | False | ## Table Extraction ```python # Extract tables from document tables = processor.extract_tables() # Each table is a list of rows for table in tables: for row in table: print(row) # Export tables to CSV processor.export_tables_csv("tables/") # Export to JSON processor.export_tables_json("tables.json") ``` ## PDF Processing ### Multi-Page PDFs ```python # Process all pages processor = OCRProcessor("document.pdf") full_text = processor.extract_text() # Process specific pages page_3 = processor.extract_text(pages=[3]) # Get per-page results results = processor.extract_by_page() for page_num, text in results.items(): print(f"Page {page_num}: {len(text)} characters") ``` ### Create Searchable PDF ```python # Convert scanned PDF to searchable PDF processor = OCRProcessor("scanned.pdf") processor.export_searchable_pdf("searchable.pdf") ``` ## Batch Processing ```python from scripts.ocr_processor import batch_ocr # Process directory of images results = batch_ocr( input_dir="scans/", output_dir="extracted/", output_format="markdown", lang="eng", recursive=True ) print(f"Processed: {results['success']} files") print(f"Failed: {results['failed']} files") ``` ## Receipt/Document Parsing ### Receipt Extraction ```python # Parse receipt structure processor = OCRProcessor("receipt.jpg") receipt_data = processor.parse_receipt() # Returns structured data: # - vendor: Store name # - date: Transaction date # - items: List of items with prices # - subtotal: Subtotal amount # - tax: Tax amount # - total: Total amount ``` ### Business Card Parsing ```python # Extract business card info processor = OCRProcessor("card.jpg") contact = processor.parse_business_card() # Returns: # - name: Person's name # - title: Job title # - company: Company name # - email: Email addresses # - phone: Phone numbers # - address: Physical address # - website: Website URLs ``` ## Configuration ```python processor = OCRProcessor("document.png") # Configure OCR settings processor.config.update({ 'psm': 3, # Page segmentation mode 'oem': 3, # OCR engine mode 'dpi': 300, # DPI for processing 'timeout': 30, # Timeout in seconds 'min_confidence': 60, # Minimum word confidence }) ``` ### Page Segmentation Modes (PSM) | Mode | Description | |------|-------------| | 0 | Orientation and script detection only | | 1 | Automatic page segmentation with OSD | | 3 | Fully automatic page segmentation (default) | | 4 | Assume single column of text | | 6 | Assume single uniform block of text | | 7 | Treat image as single text line | | 8 | Treat image as single word | | 11 | Sparse text. Find as much text as possible | | 12 | Sparse text with OSD | ## Quality Assessment ```python # Get confidence scores result = processor.extract_structured() # Overall confidence (0-100) print(f"Confidence: {result['confidence']}%") # Per-word confidence for word in result['words']: print(f"{word['text']}: {word['confidence']}%") # Filter low-confidence words high_conf_words = [w for w in result['words'] if w['confidence'] > 80] ``` ## Output Formats ### Markdown Export ```python processor.export_markdown("output.md") ``` Output includes: - Document title (if detected) - Structured headings - Paragraphs - Tables (as Markdown tables) - Page breaks for multi-page docs ### JSON Export ```python processor.export_json("output.json") ``` Output structure: ```json { "source": "document.pdf", "pages": 5, "language": "eng", "confidence": 92.5, "text": "Full extracted text...", "blocks": [ { "type": "paragraph", "text": "Block text...", "bbox": [x, y, width, height], "confidence": 95.2 } ], "tables": [...] } ``` ### HTML Export ```python processor.export_html("output.html") ``` Creates styled HTML with: - Preserved layout approximation - Highlighted low-confidence regions - Embedded images (optional) - Print-friendly styling ## CLI Usage ```bash # Basic extraction python ocr_processor.py image.png -o output.txt # Extract to markdown python ocr_processor.py document.pdf -o output.md --format markdown # Specify language python ocr_processor.py german.png --lang deu # Batch processing python ocr_processor.py scans/ -o extracted/ --batch # With preprocessing python ocr_processor.py noisy.png --preprocess --deskew --denoise ``` ## Error Handling ```python from scripts.ocr_processor import OCRProcessor, OCRError try: processor = OCRProcessor("document.png") text = processor.extract_text() except OCRError as e: print(f"OCR failed: {e}") except FileNotFoundError: print("File not found") ``` ## Performance Tips 1. **Image Quality**: Higher resolution (300+ DPI) improves accuracy 2. **Preprocessing**: Use for low-quality scans 3. **Language**: Specifying language improves speed and accuracy 4. **PSM Mode**: Choose appropriate mode for document type 5. **Large Files**: Process PDFs page by page for memory efficiency ## Limitations - Handwritten text: Limited accuracy - Complex layouts: May lose structure - Very low quality: Preprocessing helps but has limits - Non-Latin scripts: Require specific language packs ## Dependencies ``` pytesseract>=0.3.10 Pillow>=10.0.0 PyMuPDF>=1.23.0 opencv-python>=4.8.0 numpy>=1.24.0 ``` ## System Requirements - Tesseract OCR engine must be installed - Language data files for non-English languages