---
name: form-filling
description: "Fill PDF and image forms using the Datalab Python SDK. Triggers: form filling, PDF forms, fillable documents, FormFillingOptions, batch fill forms."
compatibility: "Requires Python 3.10+, datalab-python-sdk, python-dotenv, and DATALAB_API_KEY (env or .env)."
---

# Datalab Form Filling

Fill PDF and image forms using the Datalab Python SDK (`datalab-python-sdk`).

## Prerequisites

```bash
pip install datalab-python-sdk python-dotenv
```

**API Key Setup**: The SDK requires `DATALAB_API_KEY`. Either:

- Set as environment variable: `export DATALAB_API_KEY=your_key`
- Or use a `.env` file in your project directory (recommended)

## Workflow

1. Gather field data from the user (field names, values, descriptions)
2. Determine form source (local file, URL, or image)
3. Configure options (context, confidence threshold, page range)
4. Fill the form using the SDK
5. Check results and handle unmatched fields

## When NOT to Use This Skill

- **Form creation** - This skill fills existing forms; it does not create new ones
- **OCR/text extraction** - Use Datalab's OCR endpoints instead
- **Non-form documents** - Regular PDFs without fillable fields or clear form structure

## Quick Start

Use this in a **script file** (`.py`). In a notebook or REPL, `__file__` is undefined, so use explicit paths for the form and output instead.

```python
import os
from pathlib import Path
from dotenv import load_dotenv
from datalab_sdk import DatalabClient, FormFillingOptions

# In a .py file: script_dir = Path(__file__).parent. In notebook/REPL: script_dir = Path(".")
script_dir = Path(__file__).parent
load_dotenv(script_dir / ".env")

client = DatalabClient(api_key=os.getenv("DATALAB_API_KEY"))

options = FormFillingOptions(
    field_data={
        "full_name": {"value": "John Doe", "description": "Full legal name"},
        "date_of_birth": {"value": "1990-01-15", "description": "Date of birth"},
    },
    context="Employment application form",
    confidence_threshold=0.5,
)

form_path = script_dir / "form.pdf"
result = client.fill(str(form_path), options=options)
result.save_output(str(script_dir / "filled_form.pdf"))

print(f"Filled: {result.fields_filled}")
print(f"Not found: {result.fields_not_found}")
```

## Using the Fill Form Script

For quick command-line filling, use the bundled script. Run from the skill directory or use the full path:

```bash
# From skill directory (form.pdf and field_data.json in current dir)
python scripts/fill_form.py form.pdf field_data.json -o filled.pdf

# From another directory: use full paths for script, form, and field data
python /path/to/form-filling/scripts/fill_form.py /path/to/form.pdf /path/to/field_data.json -o filled.pdf
```

Options: `-o output.pdf`, `-c "context string"`, `-t 0.7` (confidence threshold), `-p "0-2"` (0-indexed page range; here the first three pages), `--async`
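
Because `-p` is 0-indexed, page numbers read off a PDF viewer need shifting by one. A minimal sketch of that conversion follows; `to_zero_indexed_range` is a hypothetical helper, not part of the bundled script, and whether the script accepts a single-page span such as `"4-4"` is an assumption to verify.

```python
def to_zero_indexed_range(first_page: int, last_page: int) -> str:
    """Convert a 1-based, inclusive page span to the 0-indexed "start-end" form."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return f"{first_page - 1}-{last_page - 1}"


# Pages 1-3 as shown in a PDF viewer become "0-2" for the -p option
print(to_zero_indexed_range(1, 3))  # 0-2
```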

See `scripts/sample_field_data.json` for a template. The `field_data.json` format:

```json
{
  "name": { "value": "Jane Smith", "description": "Full name" },
  "ssn": { "value": "123-45-6789", "description": "Social Security Number" }
}
```
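
Checking the JSON before sending it can save a round trip. The sketch below is one possible pre-flight check, assuming the shape shown above (an object keyed by field name, each entry holding a string `value` and a `description`). `check_field_data` is a hypothetical helper, and treating a missing `description` as a problem is a strictness choice made here, not an SDK rule.

```python
import json


def check_field_data(data):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    if not isinstance(data, dict):
        return ["top level must be a JSON object keyed by field name"]
    problems = []
    for name, entry in data.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: expected an object")
            continue
        if not isinstance(entry.get("value"), str):
            problems.append(f"{name}: 'value' should be a string")
        if not entry.get("description"):
            problems.append(f"{name}: missing 'description'")
    return problems


# Illustrative input: the second entry has a numeric value and no description
sample = json.loads(
    '{"name": {"value": "Jane Smith", "description": "Full name"},'
    ' "age": {"value": 42}}'
)
for problem in check_field_data(sample):
    print(problem)
```

To check a file, load it with `json.loads(Path("field_data.json").read_text())` and pass the result in.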

## Key Guidance

### Field Data Design

- Always include `description` for each field to improve matching accuracy
- Use `context` to describe the form type (e.g., "IRS W-4 Employee's Withholding Certificate")
- Field values are always strings, even for numbers and dates

### Supported Field Types

Text, date, numeric, checkbox (`"Yes"`/`"No"`), and signature (rendered as text).
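
Since every value must be a string and checkboxes take `"Yes"`/`"No"`, a small conversion step can keep native Python values out of `field_data`. The following is a sketch only: `to_field_value` and `build_field_data` are hypothetical helpers, and the ISO date format is an assumption carried over from the Quick Start example (`"1990-01-15"`); a given form may expect a different format.

```python
from datetime import date, datetime


def to_field_value(value):
    """Convert a native Python value to the string form used in field_data."""
    # bool is checked first because bool is a subclass of int
    if isinstance(value, bool):
        return "Yes" if value else "No"
    # datetime is checked before date because datetime is a subclass of date
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    return str(value)


def build_field_data(raw):
    """raw maps field name -> (value, description)."""
    return {
        name: {"value": to_field_value(value), "description": description}
        for name, (value, description) in raw.items()
    }


field_data = build_field_data({
    "date_of_birth": (date(1990, 1, 15), "Date of birth"),
    "us_citizen": (True, "Checkbox: applicant is a US citizen"),
    "dependents": (2, "Number of dependents"),
})
print(field_data["us_citizen"]["value"])  # Yes
```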

### Handling Unmatched Fields

If `result.fields_not_found` is non-empty:

1. Improve field descriptions to better match the form's labels
2. Add or refine the `context` parameter
3. Lower `confidence_threshold` to catch more matches
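
The third step can be automated as a stepwise retry. The sketch below assumes only that the result exposes `fields_not_found` as shown in Quick Start; `fill_with_fallback` and the `fill_at` callback are hypothetical names, and the threshold values are illustrative. Each attempt is a separate API call, and a lower threshold presumably accepts weaker matches, so the output of a loosened run deserves a manual check.

```python
def fill_with_fallback(fill_at, thresholds=(0.7, 0.5, 0.3)):
    """Try thresholds from strictest to loosest; stop once every field is matched.

    fill_at(threshold) must perform one fill and return its result.
    Returns the last result obtained.
    """
    if not thresholds:
        raise ValueError("thresholds must not be empty")
    result = None
    for threshold in thresholds:
        result = fill_at(threshold)
        if not result.fields_not_found:
            break
    return result


# Hypothetical wiring with the SDK objects from Quick Start:
#
# def fill_at(threshold):
#     options = FormFillingOptions(
#         field_data=field_data,
#         context="Employment application form",
#         confidence_threshold=threshold,
#     )
#     return client.fill("form.pdf", options=options)
#
# result = fill_with_fallback(fill_at)
```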

### URL Source

```python
result = client.fill(file_url="https://example.com/form.pdf", options=options)
```

### Image Forms (Scanned PDFs, PNG, JPG)

The SDK handles image-based forms automatically:

```python
# Scanned form or image file
result = client.fill("scanned_form.png", options=options)
result.save_output("filled_form.png")  # Output matches input format
```

### Async Processing

Use the async client for batch operations or non-blocking calls. Paths are relative to the current working directory.

```python
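import asyncio
import os
from datalab_sdk import AsyncDatalabClient, FormFillingOptions

async def main(options: FormFillingOptions) -> None:
    # `async with` is only valid inside a coroutine
    async with AsyncDatalabClient(api_key=os.getenv("DATALAB_API_KEY")) as client:
        result = await client.fill("form.pdf", options=options)
        result.save_output("filled.pdf")

asyncio.run(main(options))  # `options` built as in Quick Start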
```
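
For batches, one possible pattern is to run several fills concurrently behind a semaphore. This is a sketch under assumptions: `fill_many` is a hypothetical helper, the client is anything with an awaitable `fill(path, options=...)` whose result has `save_output(path)` (as in the async example above), and the concurrency limit of 4 is arbitrary; check your account's rate limits before raising it.

```python
import asyncio


async def fill_many(client, jobs, options, max_concurrent=4):
    """Fill several forms concurrently.

    jobs is a list of (input_path, output_path) pairs. Returns one entry per
    job, in order: the fill result, or the exception raised for that job.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fill_one(src, dst):
        # The semaphore caps how many requests are in flight at once
        async with semaphore:
            result = await client.fill(src, options=options)
            result.save_output(dst)
            return result

    return await asyncio.gather(
        *(fill_one(src, dst) for src, dst in jobs),
        return_exceptions=True,  # one failed form does not cancel the rest
    )


# Hypothetical usage inside a coroutine, with an AsyncDatalabClient as `client`:
# results = await fill_many(client, [("a.pdf", "a_filled.pdf")], options)
```

Because `return_exceptions=True` is set, check each entry with `isinstance(entry, Exception)` before reading `fields_filled` or `fields_not_found` from it.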

## Common Pitfalls

### API Key Not Found

**Problem**: `DatalabAPIError: You must pass in an api_key or set DATALAB_API_KEY`

**Solution**: The `.env` file isn't auto-loaded. Always:

1. Use `load_dotenv()` with explicit path: `load_dotenv(Path(__file__).parent / ".env")`
2. Pass API key explicitly: `DatalabClient(api_key=os.getenv("DATALAB_API_KEY"))`
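
Failing early with a clear message can make this easier to diagnose than waiting for the SDK error. A minimal sketch, assuming the key lives in `DATALAB_API_KEY`; `require_api_key` is a hypothetical helper, not an SDK function.

```python
import os


def require_api_key(env=None):
    """Return the API key, or raise with a hint if it is missing or blank."""
    env = os.environ if env is None else env
    key = (env.get("DATALAB_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "DATALAB_API_KEY is not set. Export it, or call load_dotenv() "
            "with the path to your .env file before creating the client."
        )
    return key


# Usage, after load_dotenv(...):
# client = DatalabClient(api_key=require_api_key())
```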

### File Not Found When Running Script

**Problem**: Relative paths like `"form.pdf"` fail when the script runs from a different directory.

**Solution**: Use absolute paths based on script location:

```python
script_dir = Path(__file__).parent
form_path = script_dir / "form.pdf"
result = client.fill(str(form_path), options=options)
```
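
Checking that the file exists before calling the API gives a local error that names the resolved path. The helper below is a sketch; `resolve_input` is a hypothetical name. It falls back to the current working directory when no base directory is given, which also covers notebooks and REPLs where `__file__` is undefined.

```python
from pathlib import Path


def resolve_input(name, base_dir=None):
    """Resolve `name` against base_dir (default: cwd) and verify the file exists."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # An absolute `name` overrides `base`, per pathlib's join rules
    path = (base / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Form not found: {path}")
    return path


# In a script:
# form_path = resolve_input("form.pdf", Path(__file__).parent)
# result = client.fill(str(form_path), options=options)
```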

### Module Not Found

**Problem**: `ModuleNotFoundError: No module named 'datalab_sdk'`

**Solution**: Install the SDK first:

```bash
pip install datalab-python-sdk python-dotenv
```
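
If the error persists after installing, `pip` may have installed into a different environment than the interpreter running the script. The sketch below reports which interpreter is active and which modules it cannot find; `missing_modules` is a hypothetical helper, and `dotenv` is the import name of the `python-dotenv` package.

```python
import importlib.util
import sys


def missing_modules(names=("datalab_sdk", "dotenv")):
    """Return the top-level module names the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(f"Interpreter: {sys.executable}")
missing = missing_modules()
if missing:
    # Installing via the same interpreter avoids the environment mismatch
    print(f"Missing: {missing}. Try: {sys.executable} -m pip install datalab-python-sdk python-dotenv")
```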

## References

- **Full API details**: See [references/api-reference.md](references/api-reference.md) for installation/prerequisites, FormFillingOptions, confidence threshold tuning, image form handling, batch async patterns, result fields, error handling, and client configuration