---
name: import-pdf-processing-anthropic
description: Use when migrating a PDF document-processing skill originally built for the Anthropic Claude API into the mini-claude-for-legal format. The adapter maps legacy PDF extraction configuration — text layer parsing, OCR fallback, table extraction, form field reading, and annotation handling — into the standard skill model. Critical for legal workflows involving court filings, regulatory submissions, notarised documents, and scanned contracts.
license: MIT
metadata:
  id: import.pdf-processing-anthropic
  category: import
  jurisdictions: [__multi__]
  priority: P3
  intent: [__import__, pdf, document-processing, migration, anthropic]
  related: [import-docx-processing-anthropic, import-pptx-processing-anthropic, import-contract-review-anthropic, multimodal-document-ingestion]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Import: PDF Processing (Anthropic)

## What it does

This import adapter migrates a **PDF document-processing skill** originally built for Anthropic's Claude API into the `mini-claude-for-legal` standard format. PDF is the dominant format for legal documents globally: court filings, notarised instruments, regulatory submissions, signed contracts, government-issued licences, and certified translations all arrive as PDFs.

Unlike DOCX, PDFs may or may not have a selectable text layer. Scanned documents require OCR; encrypted PDFs require a password; form PDFs have interactive fields; certified documents may have a digital signature that must be preserved. The Anthropic-native skill may have used Claude's native PDF vision capability, a dedicated extraction library, or both.

## Import config

| Field | Source mapping | Default if absent |
|---|---|---|
| `extraction_mode` | Legacy `mode` | `text_first_ocr_fallback` |
| `ocr_engine` | Legacy `ocr` | `auto` (system default) |
| `tables` | Legacy `extract_tables` boolean | `true` |
| `form_fields` | Legacy `extract_forms` boolean | `true` |
| `annotations` | Legacy `extract_annotations` boolean | `true` |
| `digital_signature_check` | Legacy `check_signature` boolean | `true` |
| `language` | Legacy `lang` | `auto-detect` |
| `rtl_support` | Legacy `rtl` boolean | `true` (for MENA) |
| `chunk_size` | Legacy `chunk_tokens` | `2000` |
| `overlap` | Legacy `overlap_tokens` | `200` |
| `output_format` | Legacy `format` | `markdown` |
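
The mapping above can be sketched as a small lookup that copies each legacy key onto its standard field and falls back to the listed default. This is a minimal illustration, assuming the legacy config arrives as a flat dictionary; the names `FIELD_MAP` and `import_pdf_config` are hypothetical and not part of the adapter.

```python
from typing import Any, Dict, List, Tuple

# (standard field, legacy key, default), copied from the mapping table.
FIELD_MAP: List[Tuple[str, str, Any]] = [
    ("extraction_mode", "mode", "text_first_ocr_fallback"),
    ("ocr_engine", "ocr", "auto"),
    ("tables", "extract_tables", True),
    ("form_fields", "extract_forms", True),
    ("annotations", "extract_annotations", True),
    ("digital_signature_check", "check_signature", True),
    ("language", "lang", "auto-detect"),
    ("rtl_support", "rtl", True),
    ("chunk_size", "chunk_tokens", 2000),
    ("overlap", "overlap_tokens", 200),
    ("output_format", "format", "markdown"),
]


def import_pdf_config(legacy: Dict[str, Any]) -> Dict[str, Any]:
    """Map a flat legacy config onto the standard fields, applying defaults."""
    config: Dict[str, Any] = {}
    for field, legacy_key, default in FIELD_MAP:
        # dict.get keeps an explicit False; the default applies only when the key is absent.
        config[field] = legacy.get(legacy_key, default)
    return config
```

For example, `import_pdf_config({"mode": "ocr_only", "extract_tables": False})` keeps both supplied values and fills the other nine fields from the defaults column (`"ocr_only"` is an illustrative value, not a documented mode).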

## Dry-run preview

```
IMPORT PREVIEW — pdf-processing-anthropic
Source shape     : Anthropic PDF extraction config
Mode             : text_first_ocr_fallback
OCR              : auto
Tables           : extracted
Form fields      : extracted
Annotations      : extracted
Digital signature: checked (not verified cryptographically — visual only)
RTL              : enabled (Arabic/Hebrew support)
Chunk            : 2000 tokens / 200 overlap
Output           : markdown
```
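
As a rough sketch, the field rows of this preview could be rendered from an imported config as follows. The function name `render_preview_rows` and the labels used for disabled features (`skipped`, `not checked`, `disabled`) are assumptions; the preview above only shows the enabled state, and the header line and parenthetical notes are left out here.

```python
from typing import Any, Dict, List, Tuple


def render_preview_rows(config: Dict[str, Any]) -> str:
    """Render the field rows of the dry-run preview with aligned labels."""

    def extracted(flag: bool) -> str:
        # "skipped" is an assumed label for a disabled feature.
        return "extracted" if flag else "skipped"

    rows: List[Tuple[str, str]] = [
        ("Source shape", "Anthropic PDF extraction config"),
        ("Mode", str(config["extraction_mode"])),
        ("OCR", str(config["ocr_engine"])),
        ("Tables", extracted(config["tables"])),
        ("Form fields", extracted(config["form_fields"])),
        ("Annotations", extracted(config["annotations"])),
        ("Digital signature", "checked" if config["digital_signature_check"] else "not checked"),
        ("RTL", "enabled" if config["rtl_support"] else "disabled"),
        ("Chunk", f"{config['chunk_size']} tokens / {config['overlap']} overlap"),
        ("Output", str(config["output_format"])),
    ]
    # Labels are left-aligned to 17 characters, the width of "Digital signature".
    return "\n".join(f"{label:<17}: {value}" for label, value in rows)
```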

## Extraction pipeline (post-import)

1. **Detect PDF type**:
   - Native (selectable text) → direct text extraction
   - Scanned image → OCR pipeline
   - Mixed (some pages scanned, some native) → page-by-page detection
   - Encrypted → request password; log attempt; never store password

2. **Page-by-page extraction**: maintain page-number metadata for every extracted element; legal documents frequently reference "page X, clause Y".

3. **Table extraction**: detect table boundaries; serialize to markdown pipe tables; flag tables with merged cells as `[COMPLEX TABLE — manual verification recommended]`.

4. **Form field extraction**: identify interactive PDF form fields; extract field names and values; flag unsigned or blank required fields.

5. **Annotation extraction**: extract highlighted text, comments, and sticky notes; attribute to author if metadata available; append as footnote block.

6. **Digital signature check**: detect presence of digital signature (visual check only — not cryptographic validation); flag signed documents separately.

7. **RTL handling**: detect right-to-left text (Arabic, Hebrew); preserve paragraph ordering; ensure column order in tables reflects RTL layout.

8. **Chunking**: split by token count with overlap; preserve clause boundaries where possible.
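
Steps 1 and 8 can be illustrated with a short sketch. It assumes the text layer of each page has already been extracted into a list of strings (empty for pages with no selectable text) and that the document is already tokenised. The `min_chars` threshold and both function names are hypothetical, and the chunker splits purely by token count; it does not attempt the clause-boundary preservation described in step 8.

```python
from typing import List, Sequence


def classify_pdf(page_texts: Sequence[str], min_chars: int = 20) -> str:
    """Return 'native', 'scanned', or 'mixed' from per-page text-layer output.

    min_chars is an assumed cut-off below which a page is treated as
    having no usable text layer.
    """
    if not page_texts:
        raise ValueError("document has no pages")
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if all(has_text):
        return "native"
    if not any(has_text):
        return "scanned"
    return "mixed"  # handle page by page


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 2000, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into chunks of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a chunk reaches the end, to avoid a trailing overlap-only chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the default `2000` / `200` settings, each chunk after the first starts 1800 tokens after the previous one, so consecutive chunks share 200 tokens.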

## Legal document types and special handling

| Document type | Special requirement |
|---|---|
| Court filing (Lebanon / UAE) | Page/paragraph numbering must be preserved; filing stamps and seals should be flagged |
| Notarised document | Notary signature block and certification text must be extracted intact |
| Certified translation | Translator certification page must be preserved as a separate block |
| Government licence / certificate | Issue date, expiry date, and licence number should be extracted as structured metadata |
| Board resolution / POA | Signatory block and witness/notary endorsement are legally significant |
| Regulatory submission | Submission reference number and date should be extracted as metadata |

## Arabic / bilingual PDF considerations

PDFs from MENA jurisdictions frequently mix Arabic (RTL) and English (LTR) text:
- Set `rtl_support: true` to handle RTL paragraphs
- In bilingual contracts, Arabic is often the prevailing language (for example, in UAE and KSA government contracts); identify the prevailing language and record it in the output
- OCR quality for Arabic varies by font and scan quality; flag low-confidence passages
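
One possible way to detect the direction of an extracted paragraph is to count characters by their Unicode bidirectional class, using Python's standard `unicodedata` module. This is only a sketch: the function names and the `0.5` threshold are assumptions, and it classifies direction without reordering any text.

```python
import unicodedata


def rtl_ratio(text: str) -> float:
    """Share of strongly directional characters that are right-to-left."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew letters are "R", Arabic letters are "AL"
            rtl += 1
        elif bidi == "L":  # Latin and other left-to-right letters
            ltr += 1
        # Digits, spaces, and punctuation are neutral or weak and are ignored.
    strong = rtl + ltr
    return rtl / strong if strong else 0.0


def paragraph_direction(text: str, threshold: float = 0.5) -> str:
    """Return 'rtl' when more than `threshold` of the strong characters are RTL."""
    return "rtl" if rtl_ratio(text) > threshold else "ltr"
```

In a bilingual contract, running `paragraph_direction` on each paragraph separates the Arabic and English passages, and a ratio close to the threshold can mark a mixed-direction paragraph that deserves a closer look.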

## Failure modes

| Error | Likely cause | Resolution |
|---|---|---|
| `no_text_layer` | Scanned PDF without OCR | Switch to OCR mode; warn user of quality risk |
| `password_protected` | Encrypted PDF | Prompt for password; log the attempt, but never log or store the password |
| `rtl_reversal` | RTL paragraphs extracted as LTR | Ensure `rtl_support: true` |
| `table_parse_fail` | Complex merged-cell tables | Flag for manual review |
| `signature_not_found` | Digital signature present but not detected | Flag as `[SIGNATURE STATUS UNKNOWN]` |
| `arabic_ocr_low_quality` | Poor scan of Arabic text | Warn user; recommend re-scan at 300 DPI minimum |
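
A caller might keep this table as a lookup so that every failure surfaces its documented resolution. The sketch below is illustrative only: `RESOLUTIONS` and `resolution_for` are hypothetical names, and raising `ValueError` for an unlisted code is an assumed design choice rather than documented behaviour.

```python
from typing import Dict

# Error code -> resolution, paraphrased from the failure-mode table.
RESOLUTIONS: Dict[str, str] = {
    "no_text_layer": "Switch to OCR mode; warn user of quality risk",
    "password_protected": "Prompt for password; never log the password",
    "rtl_reversal": "Ensure rtl_support is true",
    "table_parse_fail": "Flag for manual review",
    "signature_not_found": "Flag as [SIGNATURE STATUS UNKNOWN]",
    "arabic_ocr_low_quality": "Warn user; recommend re-scan at 300 DPI minimum",
}


def resolution_for(error_code: str) -> str:
    """Return the documented resolution, or raise for an unlisted code."""
    try:
        return RESOLUTIONS[error_code]
    except KeyError:
        # Assumed behaviour: unknown codes are surfaced, not silently ignored.
        raise ValueError(f"unknown failure mode: {error_code}") from None
```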

## Related skills

- [[import-docx-processing-anthropic]]
- [[import-pptx-processing-anthropic]]
- [[import-contract-review-anthropic]]
- [[multimodal-document-ingestion]]
- [[review-contract-generic]]