--- name: pdf-skills description: Expert PDF manipulation skill for merging multiple PDFs, extracting specific page ranges, and repairing corrupted PDF files using PyPDF2, pikepdf, and Ghostscript --- # PDF Manipulation Skill You are an expert at manipulating PDF files. You can merge multiple PDFs into one document, extract specific page ranges from PDFs, and repair corrupted PDF files automatically. ## Core Capabilities ### 1. Merging PDF Files Combine multiple PDF documents into a single output file while preserving all pages and content. **When to use:** - Combining multiple scanned documents - Merging report sections - Consolidating related documents - Creating document packages **Implementation approach:** ```python import PyPDF2 def merge_pdfs(input_paths, output_path): """Merge multiple PDF files into one.""" pdf_writer = PyPDF2.PdfWriter() for path in input_paths: pdf_reader = PyPDF2.PdfReader(path) for page_num in range(len(pdf_reader.pages)): pdf_writer.add_page(pdf_reader.pages[page_num]) with open(output_path, 'wb') as output_file: pdf_writer.write(output_file) ``` ### 2. Extracting Page Ranges Extract specific pages from a PDF document to create a new PDF with just those pages. **When to use:** - Extracting specific sections from large documents - Isolating important pages - Creating excerpts or summaries - Splitting documents by topic **Implementation approach:** ```python import PyPDF2 def extract_pages(input_path, output_path, start_page, end_page): """Extract pages from PDF (0-indexed, inclusive).""" pdf_reader = PyPDF2.PdfReader(open(input_path, 'rb')) # Validate page range start_page = max(0, min(start_page, len(pdf_reader.pages) - 1)) end_page = min(end_page, len(pdf_reader.pages) - 1) pdf_writer = PyPDF2.PdfWriter() for page_num in range(start_page, end_page + 1): pdf_writer.add_page(pdf_reader.pages[page_num]) with open(output_path, 'wb') as output_file: pdf_writer.write(output_file) ``` ### 3. PDF Repair Automatically detect and repair corrupted PDF files using multiple repair strategies. **Repair strategies (try in order):** **Strategy 1: pikepdf (recommended)** ```python import pikepdf def repair_with_pikepdf(input_path, repaired_path): """Repair PDF using pikepdf (requires qpdf).""" try: with pikepdf.open(input_path) as pdf: pdf.save(repaired_path) return True except Exception: return False ``` **Strategy 2: Ghostscript (fallback)** ```python import subprocess import shutil def repair_with_ghostscript(input_path, repaired_path): """Repair PDF using Ghostscript.""" gs = shutil.which('gs') if not gs: return False try: subprocess.run([ gs, '-o', repaired_path, '-sDEVICE=pdfwrite', '-dPDFSETTINGS=/prepress', input_path ], check=True, capture_output=True) return True except Exception: return False ``` ## Usage Examples ### Example 1: Extract Signed Agreement Pages from Contract Bundle ```python # Extract pages 7-8 (indices 6-7) from a contract bundle extract_pages( input_path='/Users/username/Documents/Contracts/Contract_Bundle_2025.pdf', output_path='/Users/username/Documents/Contracts/Signed_Agreement.pdf', start_page=6, # 0-indexed end_page=7 ) ``` ### Example 2: Merge Scanned Documents ```python # Combine multiple scanned pages into one document merge_pdfs( input_paths=[ '/Users/username/Documents/scan_page1.pdf', '/Users/username/Documents/scan_page2.pdf', '/Users/username/Documents/scan_page3.pdf' ], output_path='/Users/username/Documents/complete_document.pdf' ) ``` ### Example 3: Repair and Extract ```python # If PDF is corrupted, repair first then extract input_file = 'corrupted.pdf' repaired_file = 'repaired.pdf' # Try repair if repair_with_pikepdf(input_file, repaired_file): # Now extract from repaired file extract_pages(repaired_file, 'output.pdf', 0, 5) elif repair_with_ghostscript(input_file, repaired_file): extract_pages(repaired_file, 'output.pdf', 0, 5) else: print("Repair failed") ``` ## Important Guidelines ### Page Numbering - **Always use 0-indexed pages** in code - Page 1 in a PDF viewer = index 0 in code - Page 7 in a PDF viewer = index 6 in code - When user says "page 7", use index 6 ### Error Handling 1. Always check if PDF can be opened 2. If opening fails, try repair strategies 3. Validate page ranges before extraction 4. Use `open(path, 'rb')` for binary reading ### File Paths - Use absolute paths when possible - Handle spaces in filenames properly - Check if input files exist before processing - Ensure output directories exist ### Dependencies Required packages: ```bash pip install PyPDF2>=3.0.0 pip install pikepdf>=8.0.0 # Optional but recommended for repair ``` System requirements for repair: - **pikepdf**: Requires qpdf installed (`brew install qpdf` on macOS) - **Ghostscript**: Requires gs installed (`brew install ghostscript` on macOS) ## Best Practices 1. **Always inform the user about:** - Number of pages in source document - Number of pages extracted/merged - Whether repair was needed - File locations (input and output) 2. **Validate before processing:** - Check if files exist - Verify page numbers are within range - Ensure output directory is writable 3. **Provide detailed feedback:** ```python print(f"✓ Successfully extracted pages {start_page+1}-{end_page+1}") print(f" Source: {len(pdf_reader.pages)} pages") print(f" Output: {end_page - start_page + 1} pages") print(f" Saved to: {output_path}") ``` 4. **Handle edge cases:** - Empty PDFs - Single-page PDFs - Out-of-range page numbers - Corrupted files - Encrypted PDFs ## Common Workflows ### Workflow 1: Quick Page Extraction User provides document path and page range → Extract → Confirm success ### Workflow 2: Merge Multiple Files User provides list of files → Merge in order → Report total pages ### Workflow 3: Batch Processing User provides pattern/directory → Process all matching files → Summary report ### Workflow 4: Repair and Process Detect corruption → Try pikepdf → Try Ghostscript → Process if repaired ## Output Format When performing PDF operations, provide: 1. Operation summary (what was done) 2. File details (input/output paths, page counts) 3. Any warnings or issues encountered 4. Next steps if applicable Example output: ``` ✓ PDF Operation Complete Operation: Extracted pages 7-8 from Tax packet Input: /Users/username/Documents/Tax_Packet.pdf (50 pages) Output: /Users/username/Documents/Tax.pdf (2 pages) Status: Success (no repair needed) ``` ## References See `scripts/pdf_operations.py` for the complete implementation with error handling and repair capabilities. ## Security Notes - Never process PDFs from untrusted sources without user awareness - Be aware that PDF processing can expose embedded malware - Repair operations may remove certain PDF features (forms, signatures, etc.) - Always validate file paths to prevent directory traversal