---
name: pdf-manipulation
description: Manipulate PDF files including merge, split, extract, redact, convert, and secure workflows.
---

# PDF Manipulation Skill

Merge, split, extract, redact, and transform PDF files using free command-line tools and libraries. Covers common PDF operations for document automation workflows.

## When to use

- Merge multiple PDFs into one document
- Split large PDFs into separate files or page ranges
- Extract text, images, or specific pages
- Redact sensitive information
- Add watermarks, passwords, or metadata
- Convert PDFs to images or other formats

## Required tools

- **pdftk**: Swiss Army knife for PDF manipulation (merge, split, rotate, encrypt)
- **qpdf**: PDF transformation and encryption (linearize, decrypt, repair)
- **pdftotext / pdfimages**: Part of poppler-utils (extract text and images)
- **ghostscript (gs)**: Advanced PDF processing, compression, and conversion

### Installation

```bash
# Ubuntu/Debian
sudo apt-get install pdftk qpdf poppler-utils ghostscript

# macOS (Homebrew)
brew install pdftk-java qpdf poppler ghostscript

# For Node.js: npm i pdf-lib pdf-parse (pure JS, no system deps)
# For Python: pip install pypdf (successor to the deprecated PyPDF2)
```

## Skills

### Merge PDFs

```bash
# Using pdftk (simple concatenation; bookmarks are generally not carried over)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# Using ghostscript (re-encodes the content, often giving a smaller file)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf

# Using qpdf (preserves structure)
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf
```

**Node.js (pdf-lib):**

```javascript
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

// Append every page of each input file, in order, to a new document.
async function mergePDFs(files, output) {
  const mergedPdf = await PDFDocument.create();
  for (const file of files) {
    const pdfBytes = fs.readFileSync(file);
    const pdf = await PDFDocument.load(pdfBytes);
    const pages = await mergedPdf.copyPages(pdf, pdf.getPageIndices());
    pages.forEach(page => mergedPdf.addPage(page));
  }
  const mergedBytes = await mergedPdf.save();
  fs.writeFileSync(output, mergedBytes);
}

// mergePDFs(['file1.pdf', 'file2.pdf'], 'merged.pdf');
```

### Split PDF (by page or range)

```bash
# Split every page into separate files
pdftk input.pdf burst output page_%02d.pdf

# Extract specific pages (e.g., pages 1-5 and 10)
pdftk input.pdf cat 1-5 10 output subset.pdf

# Extract page ranges with qpdf
qpdf input.pdf --pages . 1-5 -- output.pdf

# Split every N pages (e.g., every 2 pages) with qpdf;
# %d is replaced by the page range of each chunk
qpdf --split-pages=2 input.pdf chunk_%d.pdf
```

**Node.js (pdf-lib):**

```javascript
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

// `pages` holds 1-based page numbers; pdf-lib expects 0-based indices.
async function extractPages(inputPath, pages, outputPath) {
  const pdfBytes = fs.readFileSync(inputPath);
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const newPdf = await PDFDocument.create();
  for (const pageNum of pages) {
    const [page] = await newPdf.copyPages(pdfDoc, [pageNum - 1]);
    newPdf.addPage(page);
  }
  const newBytes = await newPdf.save();
  fs.writeFileSync(outputPath, newBytes);
}

// extractPages('input.pdf', [1, 3, 5], 'output.pdf');
```

### Extract text

```bash
# Extract all text (default reading order)
pdftotext input.pdf output.txt

# Extract text in raw content-stream order (no layout)
pdftotext -raw input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt

# Preserve the physical layout and write to stdout
pdftotext -layout input.pdf -
```

**Node.js (pdf-parse):**

```javascript
const fs = require('fs');
const pdf = require('pdf-parse');

// Return the concatenated text of all pages.
async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  return data.text;
}

// extractText('input.pdf').then(console.log);
```

### Extract images

```bash
# Extract all images from PDF in their native formats
pdfimages -all input.pdf output_prefix
# Output: output_prefix-000.png, output_prefix-001.jpg, etc.

# Write JPEG-encoded images as .jpg (other images are written as PPM/PBM)
pdfimages -j input.pdf output_prefix
```

### Redact / Remove pages

These commands drop whole pages; they do not black out content within a page.

```bash
# Remove specific pages (e.g., remove pages 2-4)
pdftk input.pdf cat 1 5-end output redacted.pdf

# Keep only specific pages
pdftk input.pdf cat 1-10 20-30 output selected.pdf
```

### Add password protection

```bash
# Encrypt PDF with password
pdftk input.pdf output secured.pdf user_pw mypassword

# Remove password
pdftk secured.pdf input_pw mypassword output unlocked.pdf

# Using qpdf (AES-256)
qpdf --encrypt userpass ownerpass 256 -- input.pdf output.pdf
```

**Node.js (calling qpdf):**

pdf-lib does not implement encryption, and password options passed to `save()` have no effect, so the Node.js version shells out to qpdf.

```javascript
const { execFileSync } = require('child_process');

// Encrypt with AES-256 via the qpdf CLI (must be installed and on PATH).
// Arguments are passed as an array, so no shell quoting is involved.
function encryptPDF(inputPath, password, outputPath) {
  execFileSync('qpdf', [
    '--encrypt', password, password, '256', '--',
    inputPath, outputPath,
  ]);
}
```

### Rotate pages

```bash
# Rotate all pages 90 degrees clockwise
pdftk input.pdf cat 1-endright output rotated.pdf

# Rotate specific pages
pdftk input.pdf cat 1-5 6right 7-end output rotated.pdf

# Options: right (90°), left (270°), down (180°)
```

### Compress / Reduce file size

```bash
# Using ghostscript (adjust quality)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed.pdf input.pdf

# Quality settings:
#   /screen   - low quality (72 dpi)
#   /ebook    - medium (150 dpi)
#   /printer  - high (300 dpi)
#   /prepress - highest (300 dpi, preserves color)

# Using qpdf (lossless compression)
qpdf --linearize --object-streams=generate input.pdf compressed.pdf
```

### Convert PDF to images

```bash
# Convert each page to PNG (300 DPI)
pdftoppm -png -r 300 input.pdf output_prefix
# Output: output_prefix-1.png, output_prefix-2.png, etc.

# Convert to JPEG
pdftoppm -jpeg -r 150 input.pdf output_prefix

# Using ImageMagick (alternative; the command is `magick` in ImageMagick 7)
convert -density 300 input.pdf output_%03d.png
```

### Add watermark

```bash
# Overlay watermark.pdf on every page
pdftk input.pdf stamp watermark.pdf output watermarked.pdf

# Background watermark (behind content)
pdftk input.pdf background watermark.pdf output watermarked.pdf

# Per-page watermark: page N of watermark.pdf is stamped onto page N of input.pdf
pdftk input.pdf multistamp watermark.pdf output watermarked.pdf
```

### Get PDF metadata

```bash
# Using pdftk
pdftk input.pdf dump_data

# Using qpdf (prints raw object 1; which object that is varies by file)
qpdf --show-object=1 input.pdf

# Using pdfinfo (poppler-utils)
pdfinfo input.pdf
```

### Multi-operation script (Node.js)

```javascript
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

class PDFHelper {
  // Concatenate all pages of `files` into `output`.
  static async merge(files, output) {
    const merged = await PDFDocument.create();
    for (const file of files) {
      const pdf = await PDFDocument.load(fs.readFileSync(file));
      const pages = await merged.copyPages(pdf, pdf.getPageIndices());
      pages.forEach(p => merged.addPage(p));
    }
    fs.writeFileSync(output, await merged.save());
  }

  // `pageIndices` are zero-based page indices, e.g. [0, 1, 2].
  static async split(input, pageIndices, output) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    const newPdf = await PDFDocument.create();
    const pages = await newPdf.copyPages(pdf, pageIndices);
    pages.forEach(p => newPdf.addPage(p));
    fs.writeFileSync(output, await newPdf.save());
  }

  // Page count plus basic document metadata.
  static async info(input) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    return {
      pages: pdf.getPageCount(),
      title: pdf.getTitle(),
      author: pdf.getAuthor(),
      creator: pdf.getCreator()
    };
  }
}

module.exports = PDFHelper;
```

## Agent prompt

```text
You have PDF manipulation skills. When a user requests PDF operations:

1. Detect the operation: merge, split, extract (text/images/pages), redact, compress, encrypt, rotate, watermark, or get info.
2. Use appropriate tools:
   - pdftk for merge, split, rotate, encrypt, watermark
   - pdftotext/pdfimages for extraction
   - ghostscript for compression
   - qpdf for repair and advanced operations
3. Always validate input files exist before processing.
4. For scripting, prefer pdf-lib (Node.js) or pypdf (Python) for portability.
5. Return structured output (file paths, metadata, text) in JSON format.
```

## Best practices

- **Validate PDFs** before processing (use `qpdf --check input.pdf`).
- **Preserve metadata** when possible (use pdftk or pdf-lib, avoid ghostscript for simple operations).
- **Use appropriate compression**: ghostscript `/ebook` is a good balance for most cases.
- **Security**: Decrypt password-protected PDFs with the user-supplied password before processing; never log passwords.
- **Large files**: For 100+ page PDFs, process in chunks or use streaming APIs.

## Common workflows

### Invoice processing

```bash
# 1. Extract text for parsing
pdftotext invoice.pdf invoice.txt

# 2. Extract first page only (summary)
pdftk invoice.pdf cat 1 output summary.pdf

# 3. Compress for archival
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dBATCH -dNOPAUSE -q \
   -sOutputFile=invoice_compressed.pdf invoice.pdf
```

### Batch processing

```bash
# Merge all PDFs in a directory
# (on a rerun, remove or relocate combined.pdf first, or the glob will pick it up)
pdftk *.pdf cat output combined.pdf

# Split each PDF in directory into individual pages
for f in *.pdf; do
  pdftk "$f" burst output "${f%.pdf}_page_%02d.pdf"
done

# Extract text from all PDFs
for f in *.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done
```

## Troubleshooting

- **Corrupted PDF**: Run `qpdf --check input.pdf`, then `qpdf input.pdf --replace-input` to rewrite the file in place and repair it.
- **Encrypted PDF**: Remove the password first with `qpdf --decrypt --password=PASS input.pdf output.pdf`.
- **Large file size**: Use ghostscript compression or remove embedded fonts/images if not needed.
- **Missing fonts**: Install the `fonts-liberation` or `ttf-mscorefonts-installer` packages.
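
### Checking a PDF before repair (sketch)

The "Corrupted PDF" step above relies on `qpdf --check`. A minimal sketch of how a script might interpret that check is shown below. It assumes qpdf's documented exit statuses (0 for no problems, 3 for warnings only, 2 for errors); the function names `classifyQpdfStatus` and `checkPdf` are hypothetical and not part of any library.

```javascript
const { spawnSync } = require('child_process');

// Map a qpdf exit status to a coarse verdict.
// Assumption: 0 = clean, 3 = warnings only, 2 = errors (per the qpdf manual).
function classifyQpdfStatus(status) {
  switch (status) {
    case 0: return 'ok';
    case 3: return 'warnings';
    case 2: return 'errors';
    default: return 'unknown'; // e.g. null when the process was killed by a signal
  }
}

// Run `qpdf --check` on a file and return the verdict.
// Returns 'qpdf-unavailable' if the qpdf binary cannot be started.
function checkPdf(filePath) {
  const result = spawnSync('qpdf', ['--check', filePath], { encoding: 'utf8' });
  if (result.error) return 'qpdf-unavailable';
  return classifyQpdfStatus(result.status);
}

// Example: decide whether to attempt a repair.
// const verdict = checkPdf('input.pdf');
// if (verdict === 'errors' || verdict === 'warnings') { /* run qpdf --replace-input */ }
```

Keeping the status mapping separate from the process call makes the decision logic easy to test without qpdf installed.
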
## See also

- [anonymous-file-upload.md](anonymous-file-upload.md): Upload processed PDFs anonymously.
- [using-web-scraping.md](using-web-scraping.md): Scrape web pages and convert to PDF.