---
name: pdf
description: "Comprehensive PDF manipulation, extraction, and generation with support for text extraction, form filling, merging, splitting, annotations, and creation. Use when working with .pdf files for: (1) Extracting text and tables, (2) Filling PDF forms, (3) Merging/splitting PDFs, (4) Creating PDFs programmatically, (5) Adding watermarks/annotations, (6) PDF metadata management"
---

# PDF Manipulation Skill

A guide to working with PDF files in Python, covering extraction, manipulation, creation, and advanced operations. Each section shows the common case and links to a reference file for the details.

## Core Capabilities

Extract and manipulate PDF content:
- Extract text with layout preservation
- Extract tables and parse structured data
- Fill PDF forms programmatically
- Merge multiple PDFs into a single document
- Split PDFs by pages or ranges
- Create PDFs from scratch with text, images, and graphics
- Add watermarks and annotations
- Extract and modify metadata (author, title, keywords)
- Add password protection and encryption
- Perform OCR on scanned documents
- Convert images to PDF
- Compress and optimize PDF files
- Extract images from PDFs
- Rotate and reorder pages

## Quick Start

Install required libraries:

```bash
pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow
```

For detailed installation instructions including system dependencies, see:
- [Library Installation Guide](./references/library-installation.md)

## Python Libraries Overview

- **pypdf**: Basic operations (merge, split, rotate, metadata)
- **pdfplumber**: Advanced text/table extraction with layout awareness
- **reportlab**: Create PDFs from scratch (reports, invoices, documents)
- **PyMuPDF (fitz)**: Advanced manipulation, annotations, compression
- **pdf2image**: Convert PDF pages to images (requires poppler)
- **pytesseract**: OCR for scanned documents (requires tesseract)

## Text Extraction Workflow

### Basic Extraction

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
```

### Layout-Aware Extraction

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        words = page.extract_words()  # With positioning
        print(text)
```

### Extract from Specific Region

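```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    # bbox is (x0, top, x1, bottom) in points, measured from the top-left corner
    bbox = (0, 0, page.width, 100)  # top 100 pt of the page
    header = page.crop(bbox).extract_text()
```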

For detailed text extraction methods including OCR fallback and encoding handling, see:
- [Text Extraction Reference](./references/text-extraction.md)

## Table Extraction Workflow

### Extract All Tables

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)
```
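
`extract_tables()` returns each table as a list of rows, and each row as a list of cell values that may be `None` for empty cells. The following is a minimal sketch of turning one such table into a list of dictionaries, assuming the first row is a header. `table_to_records` is a hypothetical helper, not part of pdfplumber, and the sample data is made up for illustration.

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Convert a pdfplumber-style table into dicts keyed by the header row.

    Assumes the first row holds the column names (an assumption about the
    document, not something pdfplumber guarantees).
    """
    if not table:
        return []
    # Empty cells come back as None; normalise them to "" and trim whitespace.
    clean = [[(cell or "").strip() for cell in row] for row in table]
    header, *rows = clean
    # Skip rows that are entirely blank after cleaning.
    return [dict(zip(header, row)) for row in rows if any(row)]


# Illustrative data shaped like one item from page.extract_tables()
sample = [
    ["Product", "Quantity", "Price"],
    ["Widget A", "10", None],
    [None, None, None],
    ["Widget B ", "5", "$75"],
]
records = table_to_records(sample)
```

Each record can then be passed to `csv.DictWriter` or a DataFrame constructor, depending on what the downstream code expects.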

### Advanced Table Detection

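```python
import pdfplumber

# Use ruling lines to find cell boundaries; suits tables with visible borders
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
}

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(table_settings=table_settings)
```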

For detailed table extraction strategies and data cleaning, see:
- [Table Extraction Reference](./references/table-extraction.md)

## PDF Form Operations

### Fill Form Fields

```python
import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        if widget.field_name == "name":
            widget.field_value = "John Doe"
            widget.update()
doc.save("filled.pdf")
doc.close()
```

### Extract Form Field Names

```python
import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        print(f"{widget.field_name}: {widget.field_type_string}")
doc.close()
```

For form filling, flattening, and debugging, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Merging and Splitting

### Merge PDFs

```python
from pypdf import PdfWriter

# PdfMerger was deprecated and then removed in pypdf 5; PdfWriter.append() replaces it
writer = PdfWriter()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)
writer.write("merged.pdf")
writer.close()
```

### Merge with Page Ranges

```python
from pypdf import PdfWriter

writer = PdfWriter()
writer.append("doc1.pdf", pages=(0, 3))  # First 3 pages, given as (start, stop)
writer.append("doc2.pdf")  # All pages
writer.write("compiled.pdf")
writer.close()
```

### Split into Individual Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", 'wb') as f:
        writer.write(f)
```
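
Splitting by a page range follows the same pattern. The sketch below assumes a 1-based range syntax such as `"1-3,5"`; both `parse_page_ranges` and `extract_pages` are hypothetical helpers written for this example, not pypdf functions. pypdf is imported inside `extract_pages` so the parser can be used and tested on its own.

```python
def parse_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        # A single number like "5" is treated as the range 5-5.
        end = int(end_text) if end_text else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"Range {part!r} is outside 1-{total_pages}")
        indices.extend(range(start - 1, end))
    return indices


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Write the pages selected by `spec` from `src` into a new file `dst`."""
    # Imported here so parse_page_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as f:
        writer.write(f)
```

For example, `extract_pages("document.pdf", "excerpt.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to `excerpt.pdf`.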

For merging with bookmarks and splitting by size, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Creating PDFs

### Simple Text PDF

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Hello, World!")
c.save()
```

### Styled Report

```python
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf")
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Content here", styles['BodyText']))

doc.build(story)
```

### PDF with Table

```python
from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

data = [
    ['Product', 'Quantity', 'Price'],
    ['Widget A', '10', '$50'],
    ['Widget B', '5', '$75']
]

table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
```

For complete PDF creation workflows including images, multi-column layouts, and custom fonts, see:
- [PDF Creation Reference](./references/pdf-creation.md)

For practical examples:
- [Invoice Generator](./examples/invoice-generator.md)
- [Report Automation](./examples/report-automation.md)

## Metadata and Security

### Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata
print(f"Title: {metadata.get('/Title')}")
print(f"Author: {metadata.get('/Author')}")
```

### Modify Metadata

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.add_metadata({
    '/Title': 'New Title',
    '/Author': 'John Doe'
})

with open("updated.pdf", 'wb') as f:
    writer.write(f)
```

### Add Password Protection

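```python
# `writer` is a PdfWriter that already holds the pages; call this before writer.write().
# AES-256 needs the `cryptography` (or `pycryptodome`) package installed.
writer.encrypt(
    user_password="user123",
    owner_password="owner456",
    algorithm="AES-256"
)
```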

For detailed security operations and comprehensive metadata management, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## OCR for Scanned Documents

### Basic OCR

```python
from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")
```

### Multi-Language OCR

```python
# `image` is one page from convert_from_path(); each language pack must be installed
text = pytesseract.image_to_string(image, lang='eng+fra+deu')
```

For searchable PDF creation and OCR preprocessing, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## Watermarks and Annotations

### Add Text Watermark

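```python
import fitz

doc = fitz.open("document.pdf")
for page in doc:
    # insert_textbox() only rotates in 90-degree steps and has no `opacity`
    # argument, so use insert_text() with `morph` and `fill_opacity` instead.
    pivot = fitz.Point(page.rect.width / 4, page.rect.height / 2)
    page.insert_text(
        pivot,
        "CONFIDENTIAL",
        fontsize=50,
        color=(0.7, 0.7, 0.7),
        fill_opacity=0.3,
        morph=(pivot, fitz.Matrix(45)),  # rotate 45 degrees around pivot
    )
doc.save("watermarked.pdf")
doc.close()
```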

### Add Annotations

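```python
# Assumes `import fitz` and an open `page`; coordinates are in points from the top-left
rect = fitz.Rect(50, 100, 300, 120)  # area to mark up
point = fitz.Point(50, 50)  # position of the note icon
page.add_highlight_annot(rect)  # Highlight
page.add_text_annot(point, "Note")  # Text note
page.add_underline_annot(rect)  # Underline
```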

For stamps and image watermarks, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## Page Operations

### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)
    writer.add_page(page)

with open("rotated.pdf", 'wb') as f:
    writer.write(f)
```

### Extract Images

```python
import fitz

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    for img_index, img in enumerate(page.get_images()):
        xref = img[0]
        base_image = doc.extract_image(xref)
        ext = base_image["ext"]  # actual format, e.g. "png" or "jpeg"
        with open(f"image_{page_num}_{img_index}.{ext}", "wb") as f:
            f.write(base_image["image"])
doc.close()
```

### Convert Images to PDF

```python
from PIL import Image
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf")
for img_path in ["img1.jpg", "img2.jpg"]:
    img = Image.open(img_path)
    c.setPageSize(img.size)
    c.drawImage(img_path, 0, 0, width=img.width, height=img.height)
    c.showPage()
c.save()
```
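
The loop above uses pixel dimensions directly as point dimensions. Since one point is 1/72 inch, a 3000-pixel-wide photo becomes a page roughly 41.7 inches wide. If the image's resolution is known, the page can be sized to the image's physical dimensions instead. A minimal sketch, where `page_size_in_points` is a hypothetical helper and the DPI value is something the caller has to supply:

```python
def page_size_in_points(
    width_px: int, height_px: int, dpi: float = 72.0
) -> tuple[float, float]:
    """Convert pixel dimensions to PDF points (1 point = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # inches = pixels / dpi, and points = inches * 72
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


# A 2550 x 3300 pixel scan at 300 DPI is exactly US Letter (612 x 792 pt)
letter = page_size_in_points(2550, 3300, dpi=300)
```

The result can be passed to `c.setPageSize(...)` and used as the `width` and `height` for `c.drawImage(...)`. Pillow often exposes the recorded resolution as `img.info.get("dpi")`, but not every file carries it, so treat that lookup as optional.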

For detailed page operations, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Optimization

### Compress PDF

```python
import fitz

doc = fitz.open("large.pdf")
doc.save(
    "optimized.pdf",
    garbage=4,
    deflate=True,
    clean=True
)
doc.close()
```

## Best Practices

### Memory Management

Process large PDFs one page at a time and release memory periodically:

```python
from pypdf import PdfReader
import gc

reader = PdfReader("large.pdf")
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Process text
    if i % 10 == 0:
        gc.collect()
```
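
When downstream work is done per batch rather than per page (for example, writing one output file per 10 pages), it can help to make the batching explicit. This is a sketch under that assumption; `page_batches` and `extract_text_in_batches` are hypothetical helpers, and the default batch size of 10 is arbitrary.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int = 10) -> Iterator[range]:
    """Yield consecutive ranges of zero-based page indices."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_text_in_batches(path: str, batch_size: int = 10) -> Iterator[str]:
    """Yield the joined text of one batch of pages at a time."""
    # Imported here so page_batches stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    for batch in page_batches(len(reader.pages), batch_size):
        # extract_text() may return an empty result for image-only pages.
        yield "\n".join(reader.pages[i].extract_text() or "" for i in batch)
```

Because `extract_text_in_batches` is a generator, the caller can write out or discard each batch before the next one is produced, so the text of the whole document never has to be held at once.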

### Error Handling

Always handle encryption and errors:

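```python
from pypdf import PdfReader
from pypdf.errors import PdfReadError

password = "secret"  # placeholder; supply the real password

try:
    reader = PdfReader("document.pdf")

    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password is wrong
        if not reader.decrypt(password):
            raise ValueError("Wrong password")

    for page in reader.pages:
        text = page.extract_text()
except (PdfReadError, ValueError, OSError) as e:
    print(f"Error: {e}")
```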

### OCR Fallback

Detect and handle scanned documents:

```python
import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()

if not text.strip():
    # Use OCR for scanned document
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path("document.pdf")
    text = pytesseract.image_to_string(images[0])
```
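
The check above inspects only the first page. A document can mix pages that have a text layer with pages that are scans, so a per-page decision is often more useful. A minimal sketch, assuming the page texts have already been extracted (for example with `get_text()`); `pages_needing_ocr` and the 20-character threshold are hypothetical choices, not library features:

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indices of pages whose text layer looks empty.

    A page counts as "needing OCR" when its stripped text is shorter than
    `min_chars`. The threshold is a heuristic and should be tuned per corpus.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Illustrative input: page 0 has real text, pages 1 and 2 do not
texts = ["Quarterly report for the fiscal year.", "  \n", "12"]
to_ocr = pages_needing_ocr(texts)
```

Only the pages listed in `to_ocr` then need to be rendered and passed to `pytesseract`, which avoids running OCR on pages that already have usable text.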

For comprehensive best practices, common pitfalls, and troubleshooting, see:
- [Best Practices and Common Pitfalls](./references/best-practices.md)

## Common Pitfalls

**Scanned Documents**: Text extraction returns empty for scanned PDFs. Use OCR (pytesseract).

**Table Detection**: Tables not detected correctly. Adjust table_settings strategies.

**Encrypted PDFs**: Operations fail. Check and decrypt with password first.

**Form Fields**: Can't find field names. List every field with `page.widgets()` (see Extract Form Field Names) and match on the exact `field_name`.

**Memory Issues**: Large PDFs cause crashes. Process in chunks with garbage collection.

**Encoding Issues**: Special characters corrupted. Handle with UTF-8 encoding explicitly.
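
For the encoding pitfall, two small steps cover many cases: normalizing compatibility characters that PDFs often contain (ligatures such as "ﬁ", non-breaking spaces) and writing output with an explicit encoding. The helpers below are a hypothetical sketch using only the standard library; NFKC normalization is one reasonable choice, not the only one, and it also folds some characters you may want to keep.

```python
import unicodedata
from pathlib import Path


def normalize_text(text: str) -> str:
    """Fold compatibility characters (e.g. the 'fi' ligature) to plain forms."""
    return unicodedata.normalize("NFKC", text)


def save_text(text: str, path: str) -> None:
    """Write text as UTF-8 regardless of the platform's default encoding."""
    Path(path).write_text(normalize_text(text), encoding="utf-8")


# "\ufb01" is the single-character "fi" ligature
cleaned = normalize_text("\ufb01nal report")
```

Passing `encoding="utf-8"` explicitly matters mostly on platforms whose default text encoding is not UTF-8.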

For detailed solutions and debugging strategies, see:
- [Best Practices and Common Pitfalls](./references/best-practices.md)

## Quick Reference

**Text Extraction**:
- Simple: `pypdf` - `page.extract_text()`
- Advanced: `pdfplumber` - `page.extract_text()` + `page.extract_words()`

**Table Extraction**:
- Always use: `pdfplumber` - `page.extract_tables()`

**PDF Creation**:
- Use: `reportlab` - `canvas.Canvas()` or `SimpleDocTemplate()`

**Advanced Operations**:
- Use: `PyMuPDF (fitz)` - forms, annotations, compression

**OCR**:
- Use: `pytesseract` + `pdf2image`

**Merging/Splitting**:
- Use: `pypdf` - `PdfWriter()` with `append()` to merge and `add_page()` to split

## Helper Scripts

The skill includes helper scripts for common operations:

```bash
# See scripts directory for utilities
python scripts/pdf_helper.py --help
```

## Additional Resources

**Comprehensive References**:
- [Library Installation](./references/library-installation.md) - Setup and dependencies
- [Text Extraction](./references/text-extraction.md) - All extraction methods
- [Table Extraction](./references/table-extraction.md) - Table detection strategies
- [PDF Operations](./references/pdf-operations.md) - Forms, merge, split, pages
- [PDF Creation](./references/pdf-creation.md) - Creating PDFs from scratch
- [Metadata, Security, OCR](./references/metadata-security-ocr.md) - Advanced operations
- [Best Practices](./references/best-practices.md) - Pitfalls and solutions

**Practical Examples**:
- [Invoice Generator](./examples/invoice-generator.md) - Professional invoice templates
- [Report Automation](./examples/report-automation.md) - Automated report generation

## Implementation Guidelines

When working with PDFs:

1. **Choose the right library** for your task (see Quick Reference)
2. **Handle errors** with try-except blocks
3. **Check for encryption** before processing
4. **Use OCR fallback** for scanned documents
5. **Process large files in chunks** to manage memory
6. **Validate input files** before operations
7. **Close documents** to free resources: `doc.close()`

For production use, always implement proper error handling, validate inputs, and test with various PDF types and versions.
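
Input validation (guideline 6) can start with a cheap check before any library opens the file. The sketch below is an assumption about what "validate" might mean here: it confirms the file exists, is non-empty, and carries the `%PDF-` signature. `has_pdf_header` and `validate_pdf_path` are hypothetical helpers, and passing this check does not guarantee the file is well-formed.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def has_pdf_header(head: bytes) -> bool:
    """Return True if the PDF signature appears near the start of the data."""
    # Some producers emit a few stray bytes before the signature, so search
    # the first 1024 bytes rather than requiring an exact prefix.
    return PDF_MAGIC in head[:1024]


def validate_pdf_path(path: str) -> Path:
    """Raise if `path` is missing, empty, or lacks a PDF signature."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    if p.stat().st_size == 0:
        raise ValueError(f"File is empty: {p}")
    with p.open("rb") as f:
        if not has_pdf_header(f.read(1024)):
            raise ValueError(f"Not a PDF (missing %PDF- signature): {p}")
    return p
```

A caller would typically run `validate_pdf_path("document.pdf")` first and only then hand the path to `PdfReader` or `fitz.open`, keeping the library-specific error handling for genuinely malformed files.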