---
version: "1.0.0"
evaluation: programmatic
agent: claude-code
model: claude-sonnet-4-6
snapshot: python312-uv
origin:
  url: "https://skills.sh/anthropics/skills/pdf"
  source_host: "skills.sh"
  source_title: "PDF Processing Guide"
  imported_at: "2026-05-01T00:00:00Z"
  imported_by: "skill-to-runbook-converter@1.0.0"
  attribution:
    collection_or_org: "anthropics"
    skill_name: "pdf"
    confidence: "high"
secrets: {}
---
# PDF Processing Guide — Agent Runbook
## Objective
Process PDF files using Python libraries and command-line tools to perform operations such as reading, extracting text and tables, merging, splitting, rotating pages, adding watermarks, creating new PDFs, filling forms, encrypting/decrypting, extracting images, and performing OCR on scanned documents. This runbook covers the full lifecycle of PDF manipulation: from initial environment setup and dependency installation through execution of the requested operation to final verification of outputs. Use this runbook whenever the user references a `.pdf` file or requests any PDF-related transformation. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md; for form-filling workflows, see FORMS.md.
## REQUIRED OUTPUT FILES (MANDATORY)
**You MUST write all of the following files to `/app/results`.
The task is NOT complete until every file exists and is non-empty.**
| File | Description |
|------|-------------|
| `/app/results/output.pdf` | The primary output PDF (merged, split page, rotated, watermarked, encrypted, or newly created) — omit if the task is text/table extraction only |
| `/app/results/extracted_text.txt` | Extracted text content (for text-extraction tasks) |
| `/app/results/extracted_tables.xlsx` | Extracted tables as Excel workbook (for table-extraction tasks) |
| `/app/results/summary.md` | Executive summary: operation performed, input files, output files, any warnings |
| `/app/results/validation_report.json` | Structured validation results with stages, results, and `overall_passed` |
If a task only produces a subset of the above (e.g., only a merged PDF with no text extraction), mark inapplicable files as skipped in `validation_report.json` and note them in `summary.md`.
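One way to record inapplicable files is sketched below. The `build_file_status` helper and the three status strings are assumptions made for illustration; the runbook only requires that skipped files be marked in `validation_report.json` and noted in `summary.md`.

```python
import pathlib

# File names taken from the required-outputs table above.
REQUIRED_FILES = [
    "output.pdf",
    "extracted_text.txt",
    "extracted_tables.xlsx",
    "summary.md",
    "validation_report.json",
]


def build_file_status(results_dir, applicable):
    """Return {filename: status}; status is 'present', 'missing', or 'skipped'.

    `applicable` is the set of file names the current operation is expected
    to produce (hypothetical input, chosen by the caller).
    """
    root = pathlib.Path(results_dir)
    status = {}
    for name in REQUIRED_FILES:
        path = root / name
        if name not in applicable:
            status[name] = "skipped"
        elif path.is_file() and path.stat().st_size > 0:
            status[name] = "present"
        else:
            status[name] = "missing"
    return status
```

The resulting dictionary could be stored under a key of your choosing in `validation_report.json`, with the skipped names repeated in `summary.md`.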
## Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| Results directory | `/app/results` | Output directory for all results |
| Input PDF(s) | *(required)* | One or more input PDF file paths |
| Operation | *(required)* | One of: `merge`, `split`, `extract-text`, `extract-tables`, `rotate`, `watermark`, `create`, `encrypt`, `decrypt`, `ocr`, `extract-images`, `fill-form` |
| Output filename | `output.pdf` | Name for the primary output file |
| Password | *(optional)* | Password for encrypt/decrypt operations |
| Rotation degrees | `90` | Degrees to rotate pages (for `rotate` operation) |
| Page range | *(optional)* | Page range for split/extract operations (e.g., `1-5`) |
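The page-range parameter is only described by the example `1-5`. A minimal sketch of turning such a value into zero-based page indices follows; the accepted syntax (a single page such as `3`, or an inclusive `start-end` pair) and the `parse_page_range` name are assumptions, not something the source specifies.

```python
def parse_page_range(spec, page_count):
    """Convert a 1-based inclusive range such as '1-5' (or a single page '3')
    into a list of zero-based page indices, clamped to the document length."""
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if start < 1 or end < start:
        raise ValueError(f"Invalid page range: {spec!r}")
    end = min(end, page_count)  # ignore pages past the end of the document
    return list(range(start - 1, end))
```

The returned indices can be used with `reader.pages[i]` in the split or extract steps to process only the requested pages.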
## Dependencies
| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `pypdf` | Python package | Yes | Basic PDF operations: merge, split, rotate, metadata, password protection |
| `pdfplumber` | Python package | Yes | Text and table extraction with layout preservation |
| `reportlab` | Python package | Yes | PDF creation using canvas or Platypus document templates |
| `pytesseract` | Python package | Conditional | OCR on scanned PDFs (required for `ocr` operation) |
| `pdf2image` | Python package | Conditional | Convert PDF pages to images for OCR (required for `ocr` operation) |
| `pandas` | Python package | Conditional | Advanced table export to Excel (required for `extract-tables` operation) |
| `openpyxl` | Python package | Conditional | Excel writer backend for pandas (required for `extract-tables` operation) |
| `poppler-utils` | System package | Conditional | Provides `pdftotext` and `pdfimages` CLI tools |
| `qpdf` | System package | Conditional | CLI-based merge, split, rotate, decrypt |
| `pdftk` | System package | Optional | Alternative CLI merge/split/rotate tool |
| `tesseract-ocr` | System package | Conditional | OCR engine (required for `ocr` operation) |
---
## Step 1: Environment Setup
Install all Python dependencies and verify CLI tools are available.
```bash
echo "=== Installing Python dependencies ==="
pip install pypdf pdfplumber reportlab
# Install optional dependencies based on operation
OPERATION="${OPERATION:-extract-text}"
if [[ "$OPERATION" == "ocr" ]]; then
  pip install pytesseract pdf2image
fi
if [[ "$OPERATION" == "extract-tables" ]]; then
  pip install pandas openpyxl
fi
echo "=== Checking CLI tools ==="
command -v pdftotext >/dev/null 2>&1 && echo "pdftotext: OK" || echo "pdftotext: not found (install poppler-utils)"
command -v qpdf >/dev/null 2>&1 && echo "qpdf: OK" || echo "qpdf: not found"
command -v pdftk >/dev/null 2>&1 && echo "pdftk: OK" || echo "pdftk: not found (optional)"
echo "=== Creating output directory ==="
mkdir -p /app/results
```
Verify Python imports succeed before proceeding:
```python
from pypdf import PdfReader, PdfWriter
import pdfplumber
from reportlab.lib.pagesizes import letter
print("All core dependencies imported successfully")
```
---
## Step 2: Validate Inputs
Verify that the input PDF(s) exist and are readable before running any operation.
```python
import pathlib, sys
input_files = ["/app/results/input.pdf"]  # Replace with actual input path(s)
for f in input_files:
    p = pathlib.Path(f)
    if not p.exists():
        print(f"ERROR: Input file not found: {f}")
        sys.exit(1)
    if p.stat().st_size == 0:
        print(f"ERROR: Input file is empty: {f}")
        sys.exit(1)
    print(f"OK: {f} ({p.stat().st_size} bytes)")
print("All input files validated.")
```
---
## Step 3: Execute PDF Operation
Choose the appropriate code block for the requested operation. Run the relevant section only.
### 3a: Merge PDFs
```python
from pypdf import PdfReader, PdfWriter
input_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]  # Replace with actual paths
output_path = "/app/results/output.pdf"
writer = PdfWriter()
# Append every page of every input, in the order the files are listed
for pdf_file in input_files:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open(output_path, "wb") as output:
    writer.write(output)
print(f"Merged {len(input_files)} PDFs → {output_path}")
```
### 3b: Split PDF
```python
from pypdf import PdfReader, PdfWriter
import pathlib
input_path = "/app/results/input.pdf"
output_dir = pathlib.Path("/app/results")
reader = PdfReader(input_path)
output_files = []
# Write each page to its own single-page PDF (page_1.pdf, page_2.pdf, ...)
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    out_path = output_dir / f"page_{i+1}.pdf"
    with open(out_path, "wb") as output:
        writer.write(output)
    output_files.append(str(out_path))
print(f"Split into {len(output_files)} pages: {output_files}")
```
### 3c: Extract Text
```python
import pdfplumber, pathlib
input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_text.txt"
text_parts = []
with pdfplumber.open(input_path) as pdf:
    for i, page in enumerate(pdf.pages):
        t = page.extract_text()
        if t:  # Skip pages with no text layer
            text_parts.append(f"=== Page {i+1} ===\n{t}")
full_text = "\n\n".join(text_parts)
# Write as UTF-8 so non-ASCII text does not depend on the system locale
pathlib.Path(output_path).write_text(full_text, encoding="utf-8")
print(f"Extracted {len(full_text)} characters → {output_path}")
```
For scanned PDFs, use OCR instead (see Step 3g).
### 3d: Extract Tables
```python
import pdfplumber, pandas as pd
input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_tables.xlsx"
all_tables = []
with pdfplumber.open(input_path) as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for table in tables:
            if table:
                # Treat the first row as the header and record the source page
                df = pd.DataFrame(table[1:], columns=table[0])
                df.insert(0, "source_page", i + 1)
                all_tables.append(df)
if all_tables:
    combined = pd.concat(all_tables, ignore_index=True)
    combined.to_excel(output_path, index=False)
    print(f"Extracted {len(all_tables)} table(s) with {len(combined)} rows → {output_path}")
else:
    # A zero-byte file is not a valid workbook and fails the non-empty check in Step 5,
    # so write a one-cell placeholder workbook instead.
    pd.DataFrame({"note": ["No tables found"]}).to_excel(output_path, index=False)
    print("No tables found in the PDF.")
```
### 3e: Rotate Pages
```python
from pypdf import PdfReader, PdfWriter
input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"
degrees = 90  # Replace with desired rotation
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    page.rotate(degrees)
    writer.add_page(page)
with open(output_path, "wb") as out:
    writer.write(out)
print(f"Rotated all pages by {degrees}° → {output_path}")
```
### 3f: Add Watermark
```python
from pypdf import PdfReader, PdfWriter
watermark_path = "/app/results/watermark.pdf"
input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"
# Use the first page of the watermark PDF as the overlay
watermark = PdfReader(watermark_path).pages[0]
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
with open(output_path, "wb") as out:
    writer.write(out)
print(f"Watermark applied → {output_path}")
```
### 3g: OCR Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
# Requires: apt-get install tesseract-ocr poppler-utils
import pytesseract, pathlib
from pdf2image import convert_from_path
input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_text.txt"
# Render each page to an image, then run OCR page by page
images = convert_from_path(input_path)
text_parts = []
for i, image in enumerate(images):
    page_text = pytesseract.image_to_string(image)
    text_parts.append(f"=== Page {i+1} ===\n{page_text}")
full_text = "\n\n".join(text_parts)
# Write as UTF-8 so non-ASCII text does not depend on the system locale
pathlib.Path(output_path).write_text(full_text, encoding="utf-8")
print(f"OCR complete: {len(images)} pages, {len(full_text)} chars → {output_path}")
```
### 3h: Password Protection / Encryption
```python
from pypdf import PdfReader, PdfWriter
input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"
user_password = "userpassword"  # Replace
owner_password = "ownerpassword"  # Replace
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt(user_password, owner_password)
with open(output_path, "wb") as out:
    writer.write(out)
print(f"Encrypted → {output_path}")
```
### 3i: Decrypt / Remove Password
```bash
# Using qpdf
qpdf --password=mypassword --decrypt /app/results/input.pdf /app/results/output.pdf
echo "Decrypted → /app/results/output.pdf"
```
### 3j: Extract Images
```bash
# Using pdfimages (poppler-utils)
mkdir -p /app/results/images
pdfimages -j /app/results/input.pdf /app/results/images/img
echo "Images extracted to /app/results/images/"
ls /app/results/images/
```
### 3k: Create New PDF
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
output_path = "/app/results/output.pdf"
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("Document Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Document body content goes here.", styles['Normal']))
# IMPORTANT: Never use Unicode subscript/superscript chars (₀¹²) in ReportLab.
# Use XML tags instead: <sub>2</sub> or <super>2</super> in Paragraph objects.
doc.build(story)
print(f"Created PDF → {output_path}")
```
---
## Step 4: Iterate on Errors (max 3 rounds)
If Step 3 raised an exception or produced an empty/corrupt output file:
1. Check the error message for the specific failure class
2. Apply the relevant fix from the Common Fixes table below
3. Re-run the affected Step 3 sub-section
4. Repeat up to **3 rounds total**
After 3 rounds, if the output is still invalid, record the failure in `summary.md` and `validation_report.json` with `overall_passed: false` and stop.
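The bounded retry loop described above can be sketched as follows. The `run_with_retries` name, the shape of the `fixes` mapping, and the history messages are illustrative assumptions; the source only fixes the limit of three rounds.

```python
def run_with_retries(operation, fixes, max_rounds=3):
    """Run `operation` up to `max_rounds` times, applying a matching fix after each failure.

    `fixes` maps exception types to zero-argument callables (hypothetical
    remediation steps). Returns (passed, history), one history entry per round.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        try:
            operation()
            history.append(f"round {round_number}: ok")
            return True, history
        except Exception as exc:
            history.append(f"round {round_number}: {type(exc).__name__}: {exc}")
            # Apply the first fix registered for this failure class, if any
            for exc_type, fix in fixes.items():
                if isinstance(exc, exc_type):
                    fix()
                    break
    return False, history
```

When `passed` comes back `False`, the `history` list is a natural source for the failure notes in `summary.md` and `validation_report.json`.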
### Common Fixes
| Issue | Fix |
|-------|-----|
| `FileNotFoundError` on input PDF | Verify the input path; check for typos or missing uploads |
| `PdfReadError: EOF marker not found` | The PDF may be corrupt or truncated; try with qpdf: `qpdf --check input.pdf` |
| `PasswordError` on encrypted PDF | Pass the correct password to `PdfReader("file.pdf", password="...")` |
| Empty extracted text (not scanned) | If the text was extracted with `pypdf`, retry with `pdfplumber` (Step 3c); if that is also empty, try `pdftotext` (poppler-utils) |
| Empty extracted text (scanned PDF) | Install tesseract and use OCR workflow (Step 3g) |
| ReportLab renders black boxes for subscripts | Replace Unicode sub/superscript chars with `<sub>` / `<super>` XML tags |
| `qpdf` or `pdftotext` not found | Install system packages: `apt-get install -y qpdf poppler-utils pdftk` |
| Tables extracted with None values | Rows with merged cells return `None`; filter with `if cell is not None` |
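For the last row of the table, a small cleaning pass over the rows returned by `page.extract_tables()` might look like the sketch below. The `clean_table` helper and its `fill` parameter are hypothetical; whether blank rows should be dropped depends on the document.

```python
def clean_table(table, fill=""):
    """Normalize a table (list of rows, each a list of cells).

    None cells and blank cells become `fill`; rows with no content are dropped.
    """
    cleaned = []
    for row in table:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if not any(cells):
            continue  # skip rows that contain no text at all
        cleaned.append([cell if cell else fill for cell in cells])
    return cleaned
```

Applying it before building the `DataFrame` in Step 3d keeps `None` values out of the exported workbook.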
---
## Step 5: Validate Outputs
Verify that all expected output files exist and are non-empty.
```python
import pathlib, json
results_dir = pathlib.Path("/app/results")
operation = "extract-text"  # Replace with actual operation
# Define expected outputs per operation
EXPECTED = {
    "merge": ["output.pdf"],
    "split": [],  # Dynamic; check for page_*.pdf files
    "extract-text": ["extracted_text.txt"],
    "extract-tables": ["extracted_tables.xlsx"],
    "rotate": ["output.pdf"],
    "watermark": ["output.pdf"],
    "create": ["output.pdf"],
    "encrypt": ["output.pdf"],
    "decrypt": ["output.pdf"],
    "ocr": ["extracted_text.txt"],
    "extract-images": [],  # Check images/ subdir
    "fill-form": ["output.pdf"],
}
stages = []
overall_passed = True
for fname in EXPECTED.get(operation, []):
    fpath = results_dir / fname
    passed = fpath.exists() and fpath.stat().st_size > 0
    detail = f"{fpath.stat().st_size} bytes" if fpath.exists() else "MISSING"
    stages.append({"name": f"output:{fname}", "passed": passed,
                   "message": f"{fpath} ({detail})"})
    if not passed:
        overall_passed = False
# Also report on summary.md and validation_report.json. They are written in
# Steps 6 and 7, so here they are informational and do not change overall_passed.
for fname in ["summary.md", "validation_report.json"]:
    fpath = results_dir / fname
    passed = fpath.exists() and fpath.stat().st_size > 0
    stages.append({"name": f"output:{fname}", "passed": passed,
                   "message": str(fpath)})
print(json.dumps({"stages": stages, "overall_passed": overall_passed}, indent=2))
```
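The `split` and `extract-images` entries above are marked as dynamic and have no fixed file list. A minimal sketch of counting their outputs follows; the `count_dynamic_outputs` helper is hypothetical, and it assumes the `page_*.pdf` naming from Step 3b and the `images/` subdirectory from Step 3j.

```python
import pathlib


def count_dynamic_outputs(results_dir, operation):
    """Count non-empty output files for operations without a fixed file list."""
    root = pathlib.Path(results_dir)
    if operation == "split":
        candidates = root.glob("page_*.pdf")
    elif operation == "extract-images":
        images_dir = root / "images"
        candidates = images_dir.iterdir() if images_dir.is_dir() else []
    else:
        return 0
    return sum(1 for path in candidates if path.is_file() and path.stat().st_size > 0)
```

A count of zero for either operation would be a reason to set `overall_passed` to `False`.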
---
## Step 6: Write Executive Summary
Write `/app/results/summary.md` with a concise record of the run.
```python
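import pathlib, datetime
operation = "extract-text"  # Replace with actual operation
input_files = ["/app/results/input.pdf"]  # Replace with actual input path(s)
output_files = ["/app/results/extracted_text.txt"]  # Replace with actual output path(s)
# datetime.utcnow() is deprecated since Python 3.12, so build an aware UTC timestamp
run_date = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
content = f"""# PDF Processing — Run Summary
## Operation
- **Operation**: {operation}
- **Input**: {', '.join(input_files)}
- **Output**: {', '.join(output_files)}
- **Date**: {run_date}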
## Validation
| Check | Status | Notes |
|-------|--------|-------|
| Input file exists | ✓ PASS | |
| Operation completed | ✓ PASS | |
| Output file non-empty | ✓ PASS | |
| summary.md written | ✓ PASS | |
| validation_report.json written | ✓ PASS | |
## Issues / Notes
- None
## Provenance
- Skill: pdf by anthropics/skills
- Origin: https://skills.sh/anthropics/skills/pdf
- Imported by: skill-to-runbook-converter v1.0.0
"""
pathlib.Path("/app/results/summary.md").write_text(content)
print("summary.md written")
```
---
## Step 7: Write Validation Report
Write `/app/results/validation_report.json`.
```python
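import json, pathlib, datetime
operation = "extract-text"  # Replace with actual operation
input_files = ["/app/results/input.pdf"]  # Replace with actual input path(s)
# Set each "passed" value from the actual outcome of the corresponding step
stages = [
    {"name": "setup", "passed": True, "message": "Dependencies installed"},
    {"name": "validation", "passed": True, "message": "Input files verified"},
    {"name": "execution", "passed": True, "message": "Operation completed"},
    {"name": "output_check", "passed": True, "message": "Output files non-empty"},
]
failed = sum(1 for stage in stages if not stage["passed"])
report = {
    "version": "1.0.0",
    # datetime.utcnow() is deprecated since Python 3.12, so build an aware UTC timestamp
    "run_date": datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "parameters": {
        "skill_url": "https://skills.sh/anthropics/skills/pdf",
        "operation": operation,
        "input_files": input_files,
    },
    "stages": stages,
    "results": {"pass": len(stages) - failed, "partial": 0, "fail": failed},
    "overall_passed": failed == 0,
    # Adjust to the files the operation actually produced
    "output_files": [
        "/app/results/output.pdf",
        "/app/results/summary.md",
        "/app/results/validation_report.json",
    ],
}
pathlib.Path("/app/results/validation_report.json").write_text(json.dumps(report, indent=2))
print("validation_report.json written")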
```
---
## Step 8: Final Checklist (MANDATORY — do not skip)
### Verification Script
```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="/app/results"
# Check mandatory files
for f in \
  "$RESULTS_DIR/summary.md" \
  "$RESULTS_DIR/validation_report.json"; do
  if [ ! -s "$f" ]; then
    echo "FAIL: $f is missing or empty"
  else
    echo "PASS: $f ($(wc -c < "$f") bytes)"
  fi
done
# Check operation-specific output (adjust as needed)
OPERATION="${OPERATION:-extract-text}"
case "$OPERATION" in
  merge|rotate|watermark|create|encrypt|decrypt|fill-form)
    OUT="$RESULTS_DIR/output.pdf"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  extract-text|ocr)
    OUT="$RESULTS_DIR/extracted_text.txt"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  extract-tables)
    OUT="$RESULTS_DIR/extracted_tables.xlsx"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  split)
    COUNT=$(ls "$RESULTS_DIR"/page_*.pdf 2>/dev/null | wc -l)
    [ "$COUNT" -gt 0 ] && echo "PASS: $COUNT split page files found" || echo "FAIL: no split page files found"
    ;;
  extract-images)
    COUNT=$(ls "$RESULTS_DIR"/images/ 2>/dev/null | wc -l)
    [ "$COUNT" -gt 0 ] && echo "PASS: $COUNT image files extracted" || echo "FAIL: no images extracted"
    ;;
esac
echo "=== VERIFICATION COMPLETE ==="
```
### Checklist
- [ ] Input PDF(s) existed and were readable
- [ ] Correct operation was selected and executed without unhandled exceptions
- [ ] Primary output file is non-empty and valid (not corrupt)
- [ ] `extracted_text.txt` exists for text/OCR operations
- [ ] `extracted_tables.xlsx` exists for table-extraction operations
- [ ] `summary.md` documents the operation, inputs, outputs, and any warnings
- [ ] `validation_report.json` has `overall_passed: true` (or documents why it is `false`)
- [ ] Verification script above printed PASS for every applicable line
- [ ] No credentials or sensitive data were written to output files
**If ANY item fails, return to the relevant step and fix it. Do NOT finish until all applicable items pass.**
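For the "valid (not corrupt)" item, a cheap structural check is sketched below using only the standard library. It is an assumption-laden heuristic, not full validation: it only confirms the `%PDF-` header and an `%%EOF` marker near the end of the file, and the `looks_like_pdf` name and 1024-byte window are illustrative choices. `qpdf --check` remains the more thorough option.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Return True if the file starts with '%PDF-' and has '%%EOF' near its end."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        header = handle.read(5)
        # Only the tail is read, so large files are not loaded into memory
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```

A truncated download typically fails this check because the trailing `%%EOF` marker is missing, which matches the `EOF marker not found` failure class in the Common Fixes table.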
---
## Tips
- **pypdf vs pdfplumber for text extraction**: Use `pdfplumber` for higher-fidelity text and table extraction (preserves layout). Use `pypdf` for structural operations (merge, split, rotate, encrypt).
- **Scanned PDFs**: If `page.extract_text()` returns empty or garbage, the PDF is likely scanned. Switch to the OCR workflow (Step 3g) with `pytesseract` and `pdf2image`.
- **ReportLab subscripts/superscripts**: Never use Unicode sub/superscript characters (₀¹²³) in ReportLab. Use `<sub>` and `<super>` XML tags inside `Paragraph` objects. For canvas-drawn text, manually adjust font size and baseline offset instead.
- **Large PDFs**: For PDFs with hundreds of pages, process pages in batches or use the `qpdf` CLI, which is generally more memory-efficient than pure-Python approaches.
- **Password-protected PDFs**: Pass `password=` to `PdfReader` constructor; use `qpdf --decrypt` for command-line decryption.
- **Form filling**: This runbook covers standard PDF operations. For form filling (AcroForms), follow the dedicated instructions in FORMS.md.
- **CLI tools as fallback**: If Python libraries fail on a malformed PDF, `qpdf` and `pdftotext` (poppler-utils) are often more robust and worth trying as fallbacks.
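The scanned-PDF tip can be turned into a rough heuristic over the per-page results of `page.extract_text()`. The sketch below is an assumption, not part of the source: the `likely_scanned` name, the 20-character threshold, and the "more than half of the pages" rule are arbitrary starting points, and the check does not detect garbage text.

```python
def likely_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from its extracted page texts.

    `page_texts` is a list of strings (or None) as returned by
    page.extract_text() for each page. Returns True when more than half
    of the pages contain fewer than `min_chars_per_page` characters.
    """
    if not page_texts:
        return False
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5
```

When the function returns `True`, switching to the OCR workflow in Step 3g is the natural next step.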