---
version: "1.0.0"
evaluation: programmatic
agent: claude-code
model: claude-sonnet-4-6
snapshot: python312-uv
origin:
  url: "https://skills.sh/anthropics/skills/pdf"
  source_host: "skills.sh"
  source_title: "PDF Processing Guide"
  imported_at: "2026-05-01T00:00:00Z"
  imported_by: "skill-to-runbook-converter@1.0.0"
  attribution:
    collection_or_org: "anthropics"
    skill_name: "pdf"
    confidence: "high"
secrets: {}
---

# PDF Processing Guide — Agent Runbook

## Objective

Process PDF files using Python libraries and command-line tools to perform operations such as reading, extracting text and tables, merging, splitting, rotating pages, adding watermarks, creating new PDFs, filling forms, encrypting/decrypting, extracting images, and performing OCR on scanned documents. This runbook covers the full lifecycle of PDF manipulation: from initial environment setup and dependency installation through execution of the requested operation to final verification of outputs. Use this runbook whenever the user references a `.pdf` file or requests any PDF-related transformation. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md; for form-filling workflows, see FORMS.md.

## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following files to `/app/results`.
The task is NOT complete until every file exists and is non-empty.**

| File | Description |
|------|-------------|
| `/app/results/output.pdf` | The primary output PDF (merged, split page, rotated, watermarked, encrypted, or newly created) — omit if the task is text/table extraction only |
| `/app/results/extracted_text.txt` | Extracted text content (for text-extraction tasks) |
| `/app/results/extracted_tables.xlsx` | Extracted tables as Excel workbook (for table-extraction tasks) |
| `/app/results/summary.md` | Executive summary: operation performed, input files, output files, any warnings |
| `/app/results/validation_report.json` | Structured validation results with stages, results, and `overall_passed` |

If a task only produces a subset of the above (e.g., only a merged PDF with no text extraction), mark inapplicable files as skipped in `validation_report.json` and note them in `summary.md`.

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| Results directory | `/app/results` | Output directory for all results |
| Input PDF(s) | *(required)* | One or more input PDF file paths |
| Operation | *(required)* | One of: `merge`, `split`, `extract-text`, `extract-tables`, `rotate`, `watermark`, `create`, `encrypt`, `decrypt`, `ocr`, `extract-images`, `fill-form` |
| Output filename | `output.pdf` | Name for the primary output file |
| Password | *(optional)* | Password for encrypt/decrypt operations |
| Rotation degrees | `90` | Degrees to rotate pages (for `rotate` operation) |
| Page range | *(optional)* | Page range for split/extract operations (e.g., `1-5`) |

## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `pypdf` | Python package | Yes | Basic PDF operations: merge, split, rotate, metadata, password protection |
| `pdfplumber` | Python package | Yes | Text and table extraction with layout preservation |
| `reportlab` | Python package | Yes | PDF creation using canvas or Platypus document templates |
| `pytesseract` | Python package | Conditional | OCR on scanned PDFs (required for `ocr` operation) |
| `pdf2image` | Python package | Conditional | Convert PDF pages to images for OCR (required for `ocr` operation) |
| `pandas` | Python package | Conditional | Advanced table export to Excel (required for `extract-tables` operation) |
| `openpyxl` | Python package | Conditional | Excel writer backend for pandas (required for `extract-tables` operation) |
| `poppler-utils` | System package | Conditional | Provides `pdftotext` and `pdfimages` CLI tools |
| `qpdf` | System package | Conditional | CLI-based merge, split, rotate, decrypt |
| `pdftk` | System package | Optional | Alternative CLI merge/split/rotate tool |
| `tesseract-ocr` | System package | Conditional | OCR engine (required for `ocr` operation) |

---

## Step 1: Environment Setup

Install all Python dependencies and verify CLI tools are available.

```bash
echo "=== Installing Python dependencies ==="
pip install pypdf pdfplumber reportlab

# Install optional dependencies based on operation
OPERATION="${OPERATION:-extract-text}"
if [[ "$OPERATION" == "ocr" ]]; then
  pip install pytesseract pdf2image
fi
if [[ "$OPERATION" == "extract-tables" ]]; then
  pip install pandas openpyxl
fi

echo "=== Checking CLI tools ==="
command -v pdftotext >/dev/null 2>&1 && echo "pdftotext: OK" || echo "pdftotext: not found (install poppler-utils)"
command -v qpdf >/dev/null 2>&1 && echo "qpdf: OK" || echo "qpdf: not found"
command -v pdftk >/dev/null 2>&1 && echo "pdftk: OK" || echo "pdftk: not found (optional)"

echo "=== Creating output directory ==="
mkdir -p /app/results
```

Verify Python imports succeed before proceeding:

```python
from pypdf import PdfReader, PdfWriter
import pdfplumber
from reportlab.lib.pagesizes import letter

print("All core dependencies imported successfully")
```

---

## Step 2: Validate Inputs

Verify that the input PDF(s) exist and are readable before running any
operation.

```python
import pathlib, sys

input_files = ["/app/results/input.pdf"]  # Replace with actual input path(s)
for f in input_files:
    p = pathlib.Path(f)
    if not p.exists():
        print(f"ERROR: Input file not found: {f}")
        sys.exit(1)
    if p.stat().st_size == 0:
        print(f"ERROR: Input file is empty: {f}")
        sys.exit(1)
    print(f"OK: {f} ({p.stat().st_size} bytes)")
print("All input files validated.")
```

---

## Step 3: Execute PDF Operation

Choose the appropriate code block for the requested operation. Run the relevant section only.

### 3a: Merge PDFs

```python
from pypdf import PdfWriter, PdfReader
import pathlib

input_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]  # Replace with actual paths
output_path = "/app/results/output.pdf"

writer = PdfWriter()
for pdf_file in input_files:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open(output_path, "wb") as output:
    writer.write(output)
print(f"Merged {len(input_files)} PDFs → {output_path}")
```

### 3b: Split PDF

```python
from pypdf import PdfReader, PdfWriter
import pathlib

input_path = "/app/results/input.pdf"
output_dir = pathlib.Path("/app/results")

reader = PdfReader(input_path)
output_files = []
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    out_path = output_dir / f"page_{i+1}.pdf"
    with open(out_path, "wb") as output:
        writer.write(output)
    output_files.append(str(out_path))
print(f"Split into {len(output_files)} pages: {output_files}")
```

### 3c: Extract Text

```python
import pdfplumber, pathlib

input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_text.txt"

text_parts = []
with pdfplumber.open(input_path) as pdf:
    for i, page in enumerate(pdf.pages):
        t = page.extract_text()
        if t:
            text_parts.append(f"=== Page {i+1} ===\n{t}")

full_text = "\n\n".join(text_parts)
pathlib.Path(output_path).write_text(full_text)
print(f"Extracted {len(full_text)} characters → {output_path}")
```

For scanned PDFs, use OCR instead (see Step 3g).
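
The Parameters table lists an optional page range (e.g., `1-5`) for split/extract operations, while the split and extract-text blocks above walk every page. A minimal sketch of how such a range could be turned into page indexes follows. The helper name `parse_page_range` is hypothetical, and accepting comma-separated parts such as `2,4-6` is an assumption; the runbook itself only shows the `1-5` form.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Convert a 1-based range spec such as "1-5" into 0-based page indexes.

    Hypothetical helper: comma-separated parts ("2,4-6") are an assumed
    extension of the "1-5" form shown in the Parameters table.
    """
    indexes: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # a bare "3" means page 3 only
        if start < 1 or end < start or end > total_pages:
            raise ValueError(f"Page range {part!r} is outside 1-{total_pages}")
        indexes.extend(range(start - 1, end))
    return indexes


print(parse_page_range("1-5", 10))  # → [0, 1, 2, 3, 4]
```

With pypdf, the returned indexes can select pages before writing, for example `for i in parse_page_range("1-5", len(reader.pages)): writer.add_page(reader.pages[i])`.
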

### 3d: Extract Tables

```python
import pdfplumber, pandas as pd, pathlib

input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_tables.xlsx"

all_tables = []
with pdfplumber.open(input_path) as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                df.insert(0, "source_page", i + 1)
                all_tables.append(df)

if all_tables:
    combined = pd.concat(all_tables, ignore_index=True)
    combined.to_excel(output_path, index=False)
    print(f"Extracted {len(all_tables)} table(s) with {len(combined)} rows → {output_path}")
else:
    pathlib.Path(output_path).write_text("")
    print("No tables found in the PDF.")
```

### 3e: Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"
degrees = 90  # Replace with desired rotation

reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    page.rotate(degrees)
    writer.add_page(page)

with open(output_path, "wb") as out:
    writer.write(out)
print(f"Rotated all pages by {degrees}° → {output_path}")
```

### 3f: Add Watermark

```python
from pypdf import PdfReader, PdfWriter

watermark_path = "/app/results/watermark.pdf"
input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"

watermark = PdfReader(watermark_path).pages[0]
reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open(output_path, "wb") as out:
    writer.write(out)
print(f"Watermark applied → {output_path}")
```

### 3g: OCR Scanned PDFs

```python
# Requires: pip install pytesseract pdf2image
# Requires: apt-get install tesseract-ocr poppler-utils
import pytesseract, pathlib
from pdf2image import convert_from_path

input_path = "/app/results/input.pdf"
output_path = "/app/results/extracted_text.txt"

images = convert_from_path(input_path)
text_parts = []
for i, image in enumerate(images):
    page_text = pytesseract.image_to_string(image)
    text_parts.append(f"=== Page {i+1} ===\n{page_text}")

full_text = "\n\n".join(text_parts)
pathlib.Path(output_path).write_text(full_text)
print(f"OCR complete: {len(images)} pages, {len(full_text)} chars → {output_path}")
```

### 3h: Password Protection / Encryption

```python
from pypdf import PdfReader, PdfWriter

input_path = "/app/results/input.pdf"
output_path = "/app/results/output.pdf"
user_password = "userpassword"    # Replace
owner_password = "ownerpassword"  # Replace

reader = PdfReader(input_path)
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt(user_password, owner_password)

with open(output_path, "wb") as out:
    writer.write(out)
print(f"Encrypted → {output_path}")
```

### 3i: Decrypt / Remove Password

```bash
# Using qpdf
qpdf --password=mypassword --decrypt /app/results/input.pdf /app/results/output.pdf
echo "Decrypted → /app/results/output.pdf"
```

### 3j: Extract Images

```bash
# Using pdfimages (poppler-utils)
mkdir -p /app/results/images
pdfimages -j /app/results/input.pdf /app/results/images/img
echo "Images extracted to /app/results/images/"
ls /app/results/images/
```

### 3k: Create New PDF

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

output_path = "/app/results/output.pdf"
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Document Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Document body content goes here.", styles['Normal']))

# IMPORTANT: Never use Unicode subscript/superscript chars (₀¹²) in ReportLab.
# Use XML tags instead: <sub>2</sub> or <super>2</super> in Paragraph objects.
doc.build(story)
print(f"Created PDF → {output_path}")
```

---

## Step 4: Iterate on Errors (max 3 rounds)

If Step 3 raised an exception or produced an empty/corrupt output file:

1. Check the error message for the specific failure class
2. Apply the relevant fix from the Common Fixes table below
3. Re-run the affected Step 3 sub-section
4. Repeat up to **3 rounds total**

After 3 rounds, if the output is still invalid, record the failure in `summary.md` and `validation_report.json` with `overall_passed: false` and stop.

### Common Fixes

| Issue | Fix |
|-------|-----|
| `FileNotFoundError` on input PDF | Verify the input path; check for typos or missing uploads |
| `PdfReadError: EOF marker not found` | The PDF may be corrupt or truncated; try with qpdf: `qpdf --check input.pdf` |
| `PasswordError` on encrypted PDF | Pass the correct password to `PdfReader("file.pdf", password="...")` |
| Empty extracted text (not scanned) | Try pdfplumber instead of pypdf for text extraction |
| Empty extracted text (scanned PDF) | Install tesseract and use OCR workflow (Step 3g) |
| ReportLab renders black boxes for subscripts | Replace Unicode sub/superscript chars with `<sub>` / `<super>` XML tags |
| `qpdf` or `pdftotext` not found | Install system packages: `apt-get install -y qpdf poppler-utils pdftk` |
| Tables extracted with None values | Rows with merged cells return `None`; filter with `if cell is not None` |

---

## Step 5: Validate Outputs

Verify that all expected output files exist and are non-empty.

```python
import pathlib, json, sys

results_dir = pathlib.Path("/app/results")
operation = "extract-text"  # Replace with actual operation

# Define expected outputs per operation
EXPECTED = {
    "merge": ["output.pdf"],
    "split": [],  # Dynamic; check for page_*.pdf files
    "extract-text": ["extracted_text.txt"],
    "extract-tables": ["extracted_tables.xlsx"],
    "rotate": ["output.pdf"],
    "watermark": ["output.pdf"],
    "create": ["output.pdf"],
    "encrypt": ["output.pdf"],
    "decrypt": ["output.pdf"],
    "ocr": ["extracted_text.txt"],
    "extract-images": [],  # Check images/ subdir
    "fill-form": ["output.pdf"],
}

stages = []
overall_passed = True
for fname in EXPECTED.get(operation, []):
    fpath = results_dir / fname
    passed = fpath.exists() and fpath.stat().st_size > 0
    stages.append({
        "name": f"output:{fname}",
        "passed": passed,
        "message": f"{fpath} ({fpath.stat().st_size if fpath.exists() else 'MISSING'} bytes)",
    })
    if not passed:
        overall_passed = False

# Always check summary.md and validation_report.json
for fname in ["summary.md", "validation_report.json"]:
    fpath = results_dir / fname
    passed = fpath.exists() and fpath.stat().st_size > 0
    stages.append({"name": f"output:{fname}", "passed": passed, "message": str(fpath)})

print(json.dumps({"stages": stages, "overall_passed": overall_passed}, indent=2))
```

---

## Step 6: Write Executive Summary

Write `/app/results/summary.md` with a concise record of the run.

```python
import pathlib, datetime

# datetime.utcnow() is deprecated as of Python 3.12; build a timezone-aware UTC timestamp
run_date = datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z")

content = f"""# PDF Processing — Run Summary

## Operation
- **Operation**:
- **Input**:
- **Output**:
- **Date**: {run_date}

## Validation
| Check | Status | Notes |
|-------|--------|-------|
| Input file exists | ✓ PASS | |
| Operation completed | ✓ PASS | |
| Output file non-empty | ✓ PASS | |
| summary.md written | ✓ PASS | |
| validation_report.json written | ✓ PASS | |

## Issues / Notes
- None

## Provenance
- Skill: pdf by anthropics/skills
- Origin: https://skills.sh/anthropics/skills/pdf
- Imported by: skill-to-runbook-converter v1.0.0
"""
pathlib.Path("/app/results/summary.md").write_text(content)
print("summary.md written")
```

---

## Step 7: Write Validation Report

Write `/app/results/validation_report.json`.

```python
import json, pathlib, datetime

report = {
    "version": "1.0.0",
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated as of Python 3.12)
    "run_date": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
    "parameters": {
        "skill_url": "https://skills.sh/anthropics/skills/pdf",
        "operation": "",
        "input_files": [""],
    },
    "stages": [
        {"name": "setup", "passed": True, "message": "Dependencies installed"},
        {"name": "validation", "passed": True, "message": "Input files verified"},
        {"name": "execution", "passed": True, "message": "Operation completed"},
        {"name": "output_check", "passed": True, "message": "Output files non-empty"},
    ],
    "results": {"pass": 4, "partial": 0, "fail": 0},
    "overall_passed": True,
    "output_files": [
        "/app/results/output.pdf",
        "/app/results/summary.md",
        "/app/results/validation_report.json",
    ]
}
pathlib.Path("/app/results/validation_report.json").write_text(json.dumps(report, indent=2))
print("validation_report.json written")
```

---

## Step 8: Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="/app/results"

# Check mandatory files
for f in \
  "$RESULTS_DIR/summary.md" \
  "$RESULTS_DIR/validation_report.json"; do
  if [ ! -s "$f" ]; then
    echo "FAIL: $f is missing or empty"
  else
    echo "PASS: $f ($(wc -c < "$f") bytes)"
  fi
done

# Check operation-specific output (adjust as needed)
OPERATION="${OPERATION:-extract-text}"
case "$OPERATION" in
  merge|rotate|watermark|create|encrypt|decrypt|fill-form)
    OUT="$RESULTS_DIR/output.pdf"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  extract-text|ocr)
    OUT="$RESULTS_DIR/extracted_text.txt"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  extract-tables)
    OUT="$RESULTS_DIR/extracted_tables.xlsx"
    [ -s "$OUT" ] && echo "PASS: $OUT ($(wc -c < "$OUT") bytes)" || echo "FAIL: $OUT missing or empty"
    ;;
  split)
    COUNT=$(ls "$RESULTS_DIR"/page_*.pdf 2>/dev/null | wc -l)
    [ "$COUNT" -gt 0 ] && echo "PASS: $COUNT split page files found" || echo "FAIL: no split page files found"
    ;;
  extract-images)
    COUNT=$(ls "$RESULTS_DIR"/images/ 2>/dev/null | wc -l)
    [ "$COUNT" -gt 0 ] && echo "PASS: $COUNT image files extracted" || echo "FAIL: no images extracted"
    ;;
esac
echo "=== VERIFICATION COMPLETE ==="
```

### Checklist

- [ ] Input PDF(s) existed and were readable
- [ ] Correct operation was selected and executed without unhandled exceptions
- [ ] Primary output file is non-empty and valid (not corrupt)
- [ ] `extracted_text.txt` exists for text/OCR operations
- [ ] `extracted_tables.xlsx` exists for table-extraction operations
- [ ] `summary.md` documents the operation, inputs, outputs, and any warnings
- [ ] `validation_report.json` has `overall_passed: true` (or documents why it is `false`)
- [ ] Verification script above printed PASS for every applicable line
- [ ] No credentials or sensitive data were written to output files

**If ANY item fails, return to the relevant step and fix it.
Do NOT finish until all applicable items pass.**

---

## Tips

- **pypdf vs pdfplumber for text extraction**: Use `pdfplumber` for higher-fidelity text and table extraction (preserves layout). Use `pypdf` for structural operations (merge, split, rotate, encrypt).
- **Scanned PDFs**: If `page.extract_text()` returns empty or garbage, the PDF is likely scanned. Switch to the OCR workflow (Step 3g) with `pytesseract` and `pdf2image`.
- **ReportLab subscripts/superscripts**: Never use Unicode sub/superscript characters (₀¹²³) in ReportLab. Use `<sub>` and `<super>` XML tags inside `Paragraph` objects. For canvas-drawn text, manually adjust font size and baseline offset instead.
- **Large PDFs**: For PDFs with hundreds of pages, process in batches or use `qpdf` CLI tools, which are more memory-efficient than pure-Python approaches.
- **Password-protected PDFs**: Pass `password=` to the `PdfReader` constructor; use `qpdf --decrypt` for command-line decryption.
- **Form filling**: This runbook covers standard PDF operations. For form filling (AcroForms), follow the dedicated instructions in FORMS.md.
- **CLI tools as fallback**: If Python libraries fail on a malformed PDF, `qpdf` and `pdftotext` (poppler-utils) are often more robust and worth trying as fallbacks.
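
The scanned-PDF tip above can be turned into a rough automatic check. The sketch below is only a heuristic under stated assumptions: the helper name `looks_scanned`, the 20-character threshold, and the "more than half the pages" rule are all hypothetical choices, not part of the original skill. It covers the "empty" case only and does not try to detect garbage text.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from its per-page extracted text.

    Hypothetical heuristic: a page counts as "sparse" when it yields fewer
    than min_chars_per_page characters (an assumed threshold), and the PDF
    is flagged when more than half of its pages are sparse.
    """
    if not page_texts:
        return False  # no pages, so there is nothing to OCR
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5


# extract_text() may return None or an empty string for image-only pages
print(looks_scanned([None, "", "   "]))  # → True
```

With pdfplumber, the input list could be built as `[page.extract_text() for page in pdf.pages]`, and a `True` result would route the file to the OCR workflow in Step 3g.
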