# Getting Started with PDFOxide (Python) PDFOxide is the complete PDF toolkit. One library for extracting, creating, and editing PDFs with a unified API. Built on a Rust core for maximum performance. ## Installation ```bash pip install pdf_oxide ``` ## Quick Start - The Unified `Pdf` API The `Pdf` class is your main entry point for all PDF operations: ```python from pdf_oxide import Pdf # Create from Markdown pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.") pdf.save("output.pdf") ``` ## Creating PDFs ### From Markdown ```python from pdf_oxide import Pdf pdf = Pdf.from_markdown(""" # Report Title ## Introduction This is **bold** and *italic* text. - Item 1 - Item 2 - Item 3 ## Code Example ```python print("Hello, World!") ``` """) pdf.save("report.pdf") ``` ### From HTML ```python from pdf_oxide import Pdf pdf = Pdf.from_html("""

Invoice

Thank you for your purchase.

Item	Price
Widget	$10.00

""") pdf.save("invoice.pdf") ``` ### From Plain Text ```python from pdf_oxide import Pdf pdf = Pdf.from_text("Simple plain text document.\n\nWith paragraphs.") pdf.save("notes.pdf") ``` ### From Images ```python from pdf_oxide import Pdf # Single image pdf = Pdf.from_image("photo.jpg") pdf.save("photo.pdf") # Multiple images (one per page) album = Pdf.from_images(["page1.jpg", "page2.png", "page3.jpg"]) album.save("album.pdf") ``` ## Opening and Reading PDFs ```python from pdf_oxide import PdfDocument # Open existing PDF (path can be str or pathlib.Path) doc = PdfDocument("document.pdf") # Or use as a context manager with PdfDocument("document.pdf") as doc: text = doc.extract_text(0) print(f"Text: {text}") markdown = doc.to_markdown(0) print(f"Pages: {doc.page_count()}") # Extract text from page 0 text = doc.extract_text(0) print(f"Text: {text}") # Convert to Markdown markdown = doc.to_markdown(0) print(f"Markdown:\n{markdown}") # Get page count print(f"Pages: {doc.page_count()}") ``` ## Builder Pattern for Advanced Creation For full control over PDF creation, use `PdfBuilder`: ```python from pdf_oxide import PdfBuilder, PageSize pdf = (PdfBuilder() .title("Annual Report 2025") .author("Company Inc.") .subject("Financial Summary") .page_size(PageSize.A4) .margins(72.0, 72.0, 72.0, 72.0) # 1 inch margins .font_size(11.0) .from_markdown("# Annual Report\n\n...")) pdf.save("annual-report.pdf") ``` ## Encryption and Security ### Password Protection ```python from pdf_oxide import Pdf pdf = Pdf.from_markdown("# Confidential Document") # Simple password protection (AES-256) pdf.save_encrypted("secure.pdf", "user-password", "owner-password") ``` ## Text Extraction Options ### Basic Extraction ```python from pdf_oxide import PdfDocument doc = PdfDocument("paper.pdf") text = doc.extract_text(0) ``` ### With Options ```python from pdf_oxide import PdfDocument, ConversionOptions doc = PdfDocument("paper.pdf") options = ConversionOptions( detect_headings=True, detect_lists=True, embed_images=True ) markdown = doc.to_markdown(0, options) ``` ### Extract All Pages ```python from pdf_oxide import PdfDocument doc = PdfDocument("book.pdf") # Extract text from all pages all_text = doc.extract_text_all() # Convert entire document to Markdown all_markdown = doc.to_markdown_all() ``` ## Office Document Conversion Convert DOCX, XLSX, and PPTX files to PDF: ```python from pdf_oxide import OfficeConverter # Auto-detect format converter = OfficeConverter() # Convert Word document converter.convert("report.docx", "report.pdf") # Convert Excel spreadsheet converter.convert("data.xlsx", "data.pdf") # Convert PowerPoint presentation converter.convert("slides.pptx", "slides.pdf") ``` ## Working with Images ### Extract Images from PDF ```python from pdf_oxide import PdfDocument doc = PdfDocument("document.pdf") images = doc.extract_images(0) for i, img in enumerate(images): img.save(f"image_{i}.png") ``` ### Embed Images in Output ```python from pdf_oxide import PdfDocument, ConversionOptions doc = PdfDocument("paper.pdf") options = ConversionOptions(embed_images=True) # Images embedded as base64 data URIs html = doc.to_html(0, options) ``` ## OCR - Extracting Text from Scanned PDFs > For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the [OCR Guide](OCR_GUIDE.md). ### Setup The OCR feature requires heavy machine learning dependencies (ONNX Runtime) and is optional. ```bash # Recommended: Install with OCR support pip install pdf_oxide[ocr] # Or build from source with OCR maturin develop --features python,ocr ``` > **Troubleshooting:** If you see `RuntimeError: OCR feature not enabled`, it means the library was installed without OCR support. Re-install using the `[ocr]` extra above. **Quick start** — download the recommended models: ```bash ./scripts/setup_ocr_models.sh ``` #### Model Selection Guide PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions. | Combination | Detection | Recognition | English Accuracy | Total Size | |---|---|---|---|---| | **V4 det + V5 rec (recommended)** | ch_PP-OCRv4_det | en_PP-OCRv5_mobile_rec | Best | ~12.5 MB | | V4 det + V4 rec | ch_PP-OCRv4_det | en_PP-OCRv4_rec | Good | ~12.4 MB | | V5 det + V5 rec | PP-OCRv5_server_det | en_PP-OCRv5_mobile_rec | Good (different errors) | ~96 MB | | V3 det + V3 rec | en_PP-OCRv3_det | en_PP-OCRv3_rec | Fair | ~11 MB | The **V4 detection + V5 recognition** combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy. **Manual download:** ```bash # Recommended: V4 detection + V5 recognition # Detection (4.7 MB): curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx # Recognition (7.8 MB): curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx # Dictionary (must include space as last entry): curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt echo " " >> .models/en_dict.txt ``` ### Basic OCR Usage ```python from pdf_oxide import PdfDocument, OcrEngine, OcrConfig # Create OCR engine (default config works with recommended V4 det + V5 rec models) engine = OcrEngine( det_model_path=".models/det.onnx", rec_model_path=".models/rec.onnx", dict_path=".models/en_dict.txt", ) # Extract text using OCR doc = PdfDocument("scanned.pdf") text = doc.extract_text_ocr(page=0, engine=engine) print(text) ``` ### Processing Multiple Pages ```python doc = PdfDocument("scanned.pdf") for page in range(doc.page_count()): text = doc.extract_text_ocr(page=page, engine=engine) if text.strip(): print(f"--- Page {page + 1} ---") print(text) ``` ### Using PP-OCRv5 Detection If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), pass `use_v5=True` to `OcrConfig`. This preserves the original image resolution instead of downscaling to 960px, which the larger v5 detection model needs: ```python config = OcrConfig(use_v5=True) engine = OcrEngine( det_model_path="v5_det.onnx", rec_model_path="v5_rec.onnx", dict_path="v5_dict.txt", config=config, ) ``` > **Note:** The `OcrEngine` is reusable — create it once and pass it to multiple `extract_text_ocr` calls. ONNX Runtime requires `libonnxruntime.so` (v1.23+) to be available at runtime (via `LD_LIBRARY_PATH` or system install). ## Structured Extraction Beyond plain text, PDFOxide can extract structured content from pages: ```python from pdf_oxide import PdfDocument doc = PdfDocument("document.pdf") # 1. Scoped extraction from a specific area (v0.3.14) # Area: (x, y, width, height) in points header = doc.within(0, (0, 700, 612, 92)).extract_text() # 2. Text spans with font info, position, and style spans = doc.extract_spans(0) for span in spans: print(f"{span.text} — {span.font_name} {span.font_size}pt") # 3. Word-level extraction (v0.3.14) words = doc.extract_words(0) for w in words: print(f"Word: {w.text} at {w.bbox}") # Access character metadata for the word # print(w.chars[0].font_name) # Optional: override the adaptive word gap threshold (in PDF points). # Smaller values split more aggressively; useful for dense forms. words = doc.extract_words(0, word_gap_threshold=2.5) # 4. Line-level extraction (v0.3.14) lines = doc.extract_text_lines(0) for line in lines: print(f"Line: {line.text}") # Optional: override word and/or line gap thresholds (in PDF points). lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0) # 5. Inspect computed layout params before overriding params = doc.page_layout_params(0) print(f"Adaptive word gap: {params.word_gap_threshold:.1f}pt") print(f"Adaptive line gap: {params.line_gap_threshold:.1f}pt") # 6. Pre-tuned extraction profiles for different document types from pdf_oxide import ExtractionProfile profile = ExtractionProfile.form() print(f"Profile: {profile.name}, word_margin_ratio={profile.word_margin_ratio}") # Pass a profile to extraction methods to control how raw text is parsed words = doc.extract_words(0, profile=ExtractionProfile.form()) lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic()) # Combine profile with threshold overrides (profile controls span parsing, # thresholds control word/line clustering) words = doc.extract_words(0, word_gap_threshold=1.5, profile=ExtractionProfile.aggressive()) # 7. Image metadata images = doc.extract_images(0) for img in images: print(f"{img['width']}x{img['height']} {img['color_space']}") # 8. Bookmarks / table of contents outline = doc.get_outline() # None if no outline if outline: for item in outline: print(f"{item['title']} -> page {item.get('page')}") # 9. Vector paths (lines, curves, shapes) paths = doc.extract_paths(0) for path in paths: print(f"bbox={path['bbox']}, stroke={path.get('stroke_color')}") ``` ## Working with Form Fields PDFOxide can extract, read, fill, and export PDF form field data (AcroForm fields). ### List All Form Fields ```python from pdf_oxide import PdfDocument doc = PdfDocument("tax-form.pdf") fields = doc.get_form_fields() for f in fields: print(f"{f.name} ({f.field_type}) = {f.value}") ``` Each `FormField` has: - `name` — fully qualified field name (e.g. `"topmostSubform[0].Page1[0].f1_01[0]"`) - `field_type` — `"text"`, `"button"`, `"choice"`, `"signature"` - `value` — current value (`str`, `bool`, or `None`) - `flags` — field flags (read-only, required, etc.) ### Read and Set Field Values ```python doc = PdfDocument("w2.pdf") # Read a field value ssn = doc.get_form_field_value("topmostSubform[0].CopyA[0].f1_01[0]") print(f"SSN: {ssn}") # Fill fields doc.set_form_field_value("employee_name", "Jane Doe") doc.set_form_field_value("wages", "85000.00") doc.set_form_field_value("retirement_plan", True) # checkbox # Save (values are persisted via incremental save) doc.save("filled_w2.pdf") ``` ### Extract Text with Form Field Values Filled form field values appear inline in `extract_text` and `to_markdown`: ```python doc = PdfDocument("filled_w2.pdf") # Form values appear inline in extracted text text = doc.extract_text(0) print(text) # "Jane Doe" appears where the name field is # to_markdown includes form fields by default md = doc.to_markdown(0, include_form_fields=True) # Exclude form field values md_clean = doc.to_markdown(0, include_form_fields=False) ``` ### Export Form Data ```python doc = PdfDocument("filled-form.pdf") # Export as FDF doc.export_form_data("form_data.fdf") # Export as XFDF doc.export_form_data("form_data.xfdf", format="xfdf") ``` ## Performance Tips 1. **Reuse document objects** - Opening a PDF has overhead, reuse the object for multiple operations 2. **Use specific page extraction** - `extract_text(page_num)` is faster than `extract_text_all()` if you only need some pages 3. **Disable features you don't need** - Use `ConversionOptions` to skip heading detection, image extraction, etc. ```python from pdf_oxide import PdfDocument, ConversionOptions doc = PdfDocument("large.pdf") # Fast extraction - minimal processing options = ConversionOptions( detect_headings=False, detect_lists=False, embed_images=False ) text = doc.to_markdown(0, options) ``` ## Error Handling ```python from pdf_oxide import PdfDocument, PdfError try: doc = PdfDocument("document.pdf") text = doc.extract_text(0) except PdfError as e: print(f"PDF error: {e}") except FileNotFoundError: print("File not found") ``` ## Examples See the [examples/](../examples/) directory for complete working examples. ### Quick Script Examples **Extract text from all PDFs in a folder:** ```python from pdf_oxide import PdfDocument from pathlib import Path for pdf_path in Path("documents").glob("*.pdf"): # PdfDocument accepts pathlib.Path directly with PdfDocument(pdf_path) as doc: text = doc.extract_text_all() output_path = pdf_path.with_suffix(".txt") output_path.write_text(text) print(f"Extracted: {pdf_path.name}") ``` **Batch convert Markdown to PDF:** ```python from pdf_oxide import Pdf from pathlib import Path for md_path in Path("notes").glob("*.md"): content = md_path.read_text() pdf = Pdf.from_markdown(content) output_path = md_path.with_suffix(".pdf") pdf.save(str(output_path)) print(f"Created: {output_path.name}") ``` ## Next Steps - [API Reference](https://docs.rs/pdf_oxide) - Full API documentation - [PDF Creation Guide](PDF_CREATION_GUIDE.md) - Advanced creation options - [GitHub Issues](https://github.com/yfedoseev/pdf_oxide/issues) - Report bugs or request features