---
name: document-processing
description: Use when working with "PDF", "Excel", "Word", "PowerPoint", "XLSX", "DOCX", "PPTX", "spreadsheets", "presentations", "extract text", "merge documents", "convert documents", or asking about "office document manipulation"
version: 1.0.0
---
# Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
---
## Format Overview
| Format | Extension | Structure | Best For |
|--------|-----------|-----------|----------|
| **PDF** | .pdf | Binary/text | Reports, forms, archives |
| **Excel** | .xlsx | XML in ZIP | Data, calculations, models |
| **Word** | .docx | XML in ZIP | Text documents, contracts |
| **PowerPoint** | .pptx | XML in ZIP | Presentations, slides |
**Key concept**: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
---
## PDF Processing
### PDF Tools
| Task | Best Tool |
|------|-----------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |
### Common Operations
| Operation | Approach |
|-----------|----------|
| **Merge** | Loop through files, add pages to writer |
| **Split** | Create new writer per page |
| **Extract tables** | Use pdfplumber, convert to DataFrame |
| **Rotate** | Call `.rotate(degrees)` on page |
| **Encrypt** | Use writer's `.encrypt()` method |
| **OCR** | Convert to images, run pytesseract |
---
## Excel Processing
### Excel Tools
| Task | Best Tool |
|------|-----------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |
### Critical Rule: Use Formulas
| Approach | Result |
|----------|--------|
| **Wrong**: Calculate in Python, write value | Static number, breaks when data changes |
| **Right**: Write Excel formula | Dynamic, recalculates automatically |
### Financial Model Standards
| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
### Common Formula Errors
| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
---
## Word Processing
### Word Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |
### Document Structure
| File | Contains |
|------|----------|
| `word/document.xml` | Main content |
| `word/comments.xml` | Comments |
| `word/media/` | Images |
### Tracked Changes (Redlining)
| Element | XML Tag |
|---------|---------|
| Deletion | `...` |
| Insertion | `...` |
**Key concept**: For professional/legal documents, use tracked changes XML rather than replacing text directly.
---
## PowerPoint Processing
### PowerPoint Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |
### Slide Structure
| Path | Contains |
|------|----------|
| `ppt/slides/slide{N}.xml` | Slide content |
| `ppt/notesSlides/` | Speaker notes |
| `ppt/slideMasters/` | Master templates |
| `ppt/media/` | Images |
### Design Principles
| Principle | Guideline |
|-----------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |
---
## Converting Between Formats
| Conversion | Tool |
|------------|------|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Appropriate extractor |
---
## Best Practices
| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
## Common Packages
| Language | Packages |
|----------|----------|
| **Python** | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| **JavaScript** | docx, pptxgenjs |
| **CLI** | pandoc, qpdf, pdftotext, libreoffice |