---
description: "Convert PDF datasets into interleaved Parquet output using NVIDIA's Nemotron-Parse VLM with the four-stage NemotronParsePDFReader composite pipeline"
categories: ["how-to-guides"]
tags: ["pdf", "nemotron-parse", "interleaved", "vllm", "parsing"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "universal"
---

# Nemotron-Parse PDF Pipeline

Convert PDF datasets into interleaved Parquet output using NVIDIA's [Nemotron-Parse](https://huggingface.co/nvidia) vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the [interleaved dataset](/curate-text/process-data/interleaved) format.

## How it Works

`NemotronParsePDFReader` is a composite stage that expands into four underlying sub-stages:

1. **`PDFPartitioningStage`** — reads a JSONL manifest of PDF entries and packs them into `FileGroupTask` objects.
2. **`PDFPreprocessStage`** — extracts PDF bytes from the configured source and renders pages to images, scaling oversized pages to fit so that large pages do not cause out-of-memory (OOM) errors.
3. **`NemotronParseInferenceStage`** — runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with `text_in_pic` and `enforce_eager` flags and free-port retry on collisions.
4. **`NemotronParsePostprocessStage`** — parses model output, aligns images and captions, crops images, and emits the final interleaved rows.

The output is interleaved Parquet ready to be filtered with [Interleaved Filters](/curate-text/process-data/interleaved/filters) and written to MINT-1T-style WebDataset shards.
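
The scale-to-fit safeguard in the preprocess stage can be illustrated with a small calculation. This is a sketch of the general idea, not the stage's actual code: the `fit_render_scale` helper and the `max_side_px` cap are hypothetical, and only the 300 DPI default is taken from the parameter table in this guide.

```python
def fit_render_scale(
    width_pt: float,
    height_pt: float,
    dpi: int = 300,
    max_side_px: int = 4096,  # hypothetical cap, chosen for illustration
) -> float:
    """Return a render scale (pixels per PDF point) capped so neither side exceeds max_side_px."""
    scale = dpi / 72.0  # PDF page sizes are expressed in points, 1/72 inch each
    longest_px = max(width_pt, height_pt) * scale
    if longest_px > max_side_px:
        # Shrink uniformly so the longest side lands exactly on the cap.
        scale *= max_side_px / longest_px
    return scale


# A US Letter page (612 x 792 pt) at 300 DPI is 2550 x 3300 px, under the cap.
letter = fit_render_scale(612, 792)

# A poster-sized page (3600 x 4800 pt) would be 15000 x 20000 px, so it is scaled down.
poster = fit_render_scale(3600, 4800)
```

Capping the pixel dimensions rather than the DPI keeps ordinary pages at full resolution while bounding the memory used by unusually large ones.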

## Before You Start

Choose your PDF source and confirm the prerequisites:

- **GPU**: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
- **vLLM**: Strongly recommended for throughput. Set `backend="hf"` to run on Hugging Face Transformers instead.
- **`pypdfium2`**: Required Python dependency for PDF rendering. Installed automatically with the `interleaved_cpu` or `interleaved_cuda12` extras (e.g., `uv sync --extra interleaved_cuda12`).
- **Manifest**: A JSONL file listing the PDFs to process. Each line names a PDF (by default in the `file_name` field, or a list of PDFs in `cc_pdf_file_names` for the CC-MAIN layout) relative to the source directory you choose.
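
A manifest for a plain directory of PDFs could be generated along these lines. This is a sketch under assumptions: it supposes each manifest line is a JSON object whose `file_name` value (the default `file_name_field`) is a path relative to `pdf_dir`, and `build_manifest` is a hypothetical helper, so verify the expected fields against your installed version.

```python
import json
from pathlib import Path


def build_manifest(pdf_dir: str, manifest_path: str) -> int:
    """Write one JSON line per PDF found under pdf_dir and return the number of entries."""
    root = Path(pdf_dir)
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        # Sort for a reproducible manifest; rglob also picks up PDFs in subdirectories.
        for pdf in sorted(root.rglob("*.pdf")):
            # Assumption: paths are stored relative to pdf_dir under "file_name".
            entry = {"file_name": pdf.relative_to(root).as_posix()}
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Calling `build_manifest("/data/pdfs", "./pdfs.jsonl")` would then produce the file passed as `manifest_path`.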

### Choosing a PDF Source

Pass exactly one of `pdf_dir`, `zip_base_dir`, or `jsonl_base_dir` so the preprocess stage knows where to find the PDF bytes:

| Parameter | Source Layout | When to Use |
|-----------|---------------|-------------|
| `pdf_dir` | A directory of `.pdf` files | Local or mounted directories of standalone PDFs |
| `zip_base_dir` | A `CC-MAIN-2021-31-PDF-UNTRUNCATED` zip hierarchy | Common Crawl PDF dumps |
| `jsonl_base_dir` | JSONL-encoded PDF datasets where each line carries the PDF bytes | GitHub-hosted PDF datasets, custom JSONL collections |
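
Because exactly one source may be set, a small pre-flight check in your own driver script can catch misconfiguration before any GPU work starts. The `check_pdf_source` helper is hypothetical and not part of the library; it only mirrors the "exactly one of" rule stated here.

```python
from typing import Optional


def check_pdf_source(
    pdf_dir: Optional[str] = None,
    zip_base_dir: Optional[str] = None,
    jsonl_base_dir: Optional[str] = None,
) -> str:
    """Return the name of the single configured source, or raise if zero or several are set."""
    sources = {
        "pdf_dir": pdf_dir,
        "zip_base_dir": zip_base_dir,
        "jsonl_base_dir": jsonl_base_dir,
    }
    chosen = [name for name, value in sources.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError(
            f"expected exactly one PDF source, got {len(chosen)}: {chosen or 'none'}"
        )
    return chosen[0]


source = check_pdf_source(pdf_dir="/data/pdfs")
```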

### Backend Selection

| Backend | When to Use |
|---------|-------------|
| `vllm` (recommended) | High-throughput GPU inference with batching. Set `enforce_eager=True` if you hit compilation issues. |
| `hf` | Hugging Face Transformers fallback when vLLM is unavailable or for debugging. |

The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
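
The free-port retry pattern can be sketched generically as follows. This illustrates the technique only and is not the inference stage's implementation: `pick_free_port` and `start_with_port_retry` are hypothetical names, and `start` stands for whatever callable launches a server on a given port and raises `OSError` if the port is taken.

```python
import socket
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is unused at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose
        return s.getsockname()[1]


def start_with_port_retry(
    start: Callable[[int], T],
    pick_port: Callable[[], int] = pick_free_port,
    max_attempts: int = 5,
) -> T:
    """Call start(port) with fresh ports until it succeeds or attempts run out."""
    last_error: Optional[OSError] = None
    for _ in range(max_attempts):
        port = pick_port()
        try:
            return start(port)
        except OSError as err:
            # Another replica may have grabbed the port between pick and bind.
            last_error = err
    raise RuntimeError(f"no usable port after {max_attempts} attempts") from last_error
```

The retry is needed because a port reported as free can be claimed by another process before the server binds it, which is the race that co-located replicas run into.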

---

## Usage

A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)
```

For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends).

### Example: CC-MAIN PDF Dump

Parse a Common Crawl PDF dump from its zip hierarchy:

```python
NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)
```

### Example: JSONL-Encoded PDFs

Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):

```python
NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)
```

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `manifest_path` | str \| None | `None` | JSONL manifest listing PDF entries. |
| `pdf_dir` | str \| None | `None` | Directory containing `.pdf` files. |
| `zip_base_dir` | str \| None | `None` | Root directory of CC-MAIN PDF zip hierarchy. |
| `jsonl_base_dir` | str \| None | `None` | Root directory of JSONL-encoded PDF datasets. |
| `model_path` | str | (default model) | Local path or HF repo ID for the Nemotron-Parse weights. |
| `backend` | str | `"vllm"` | Inference backend (`vllm` or `hf`). |
| `pdfs_per_task` | int | `10` | Number of PDFs grouped into each `FileGroupTask`. |
| `max_pdfs` | int \| None | `None` | Hard cap on total PDFs processed (debug aid). |
| `dpi` | int | `300` | Render DPI for PDF pages. |
| `max_pages` | int | `50` | Maximum pages rendered per PDF; longer PDFs are truncated. |
| `inference_batch_size` | int | `4` | vLLM/HF batch size. |
| `max_num_seqs` | int | `64` | Maximum concurrent vLLM sequences. |
| `text_in_pic` | bool | `False` | When `True`, treat embedded text within rendered images as part of the text content. |
| `enforce_eager` | bool | `False` | Disable vLLM compilation for compatibility with restricted environments. |
| `min_crop_px` | int | `10` | Minimum dimension (pixels) for cropped image regions. |
| `dataset_name` | str | `"pdf_dataset"` | Logical dataset label written to output rows. |
| `file_name_field` | str | `"file_name"` | Manifest field naming a single PDF file. |
| `file_names_field` | str | `"cc_pdf_file_names"` | Manifest field naming a list of PDF files (CC-MAIN layout). |
| `url_field` | str | `"url"` | Manifest field for the source URL passthrough. |

## Output Format

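Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a `sample_id` belong to the same document. The example shows three consecutive rows of one document, rendered as JSON objects for readability (the stored format is Parquet, and `<bytes>` stands in for the raw image bytes):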

```json
{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "<bytes>",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}
```

### Output Schema

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | string | PDF identifier; rows sharing a `sample_id` belong to the same document. |
| `position` | int | Zero-based item position within the sample, used to reconstruct ordering. |
| `modality` | string | One of `text`, `image`, or `metadata`. |
| `text_content` | string \| null | Text payload for `text` and `metadata` rows. |
| `binary_content` | bytes \| null | Image payload for `image` rows. |
| `source_files` | list[string] | Source PDF files that produced this row (for lineage tracking). |

The output is directly compatible with [Interleaved IO](/curate-text/process-data/interleaved/io) readers and writers — the schema matches `INTERLEAVED_SCHEMA` exactly.
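
To make the role of `sample_id` and `position` concrete, the sketch below regroups rows into documents in reading order. It works on plain dictionaries using the column names from the schema; the helper names are hypothetical, and in practice the rows would come from the Parquet files (for example through a DataFrame library).

```python
from collections import defaultdict


def group_rows_by_sample(rows):
    """Group interleaved rows by sample_id and order each group by position."""
    samples = defaultdict(list)
    for row in rows:
        samples[row["sample_id"]].append(row)
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in samples.items()
    }


def render_sample(items):
    """Join text rows in order, marking each image row with a placeholder."""
    parts = []
    for row in items:
        if row["modality"] == "text":
            parts.append(row["text_content"])
        elif row["modality"] == "image":
            parts.append("[image]")
        # metadata rows are skipped in this preview
    return "\n\n".join(parts)


# Rows may arrive in any order; position restores the reading order.
rows = [
    {"sample_id": "doc_42", "position": 2, "modality": "text",
     "text_content": "Figure 1 shows the architecture..."},
    {"sample_id": "doc_42", "position": 0, "modality": "text",
     "text_content": "# Introduction"},
    {"sample_id": "doc_42", "position": 1, "modality": "image",
     "text_content": None},
]
documents = group_rows_by_sample(rows)
preview = render_sample(documents["doc_42"])
```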

## Render Timeout

Instead of a `signal.SIGALRM` handler, the preprocess stage enforces its render timeout with a fork-based `multiprocessing` child process (`_RENDER_TIMEOUT_S = 60` by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where installing a signal handler raises `ValueError: signal only works in main thread`. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, which reliably escapes any hung C-extension code inside `pypdfium2`.

The timeout needs no configuration. If you find legitimate PDFs that take longer than 60 seconds to render, the `_RENDER_TIMEOUT_S` constant is defined in `nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py`.
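
The fork-and-kill pattern can be sketched with the standard library alone. This is an illustration of the technique, not the code in `preprocess.py`: `run_with_timeout` and `_child` are hypothetical names, and the sketch assumes a POSIX platform where the `fork` start method is available.

```python
import multiprocessing as mp


def _child(conn, func, args):
    """Run func in the child and send back either the result or the error text."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout_s=60.0):
    """Run func(*args) in a forked child; kill it and raise TimeoutError if it overruns."""
    ctx = mp.get_context("fork")  # child shares the parent's memory copy-on-write
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    try:
        if not parent_conn.poll(timeout_s):
            raise TimeoutError(f"call exceeded {timeout_s} seconds")
        try:
            status, payload = parent_conn.recv()
        except EOFError:
            raise RuntimeError("child exited without returning a result") from None
    finally:
        if proc.is_alive():
            proc.kill()  # SIGKILL also stops code stuck inside a C extension
        proc.join()
        parent_conn.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload
```

Unlike a signal handler, this approach does not depend on which thread makes the call, and killing the child process works even when the hung code never returns control to the Python interpreter.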

## Benchmarking

A standalone benchmark script ships at `benchmarking/scripts/nemotron_parse_pdf_benchmark.py`. Use it to measure throughput on representative datasets before scaling to your full corpus.

## Best Practices

- **Use vLLM unless you can't**: the `vllm` backend is substantially faster than `hf`. Only fall back to `hf` for debugging or in environments where vLLM is unavailable.
- **Cap `max_pages` for outliers**: very long PDFs (1000+ pages) can dominate runtime. The default 50 pages handles most academic papers and articles; raise to 200+ for book-length sources.
- **Tune `pdfs_per_task` for parallelism**: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
- **Set `enforce_eager=True` in restricted environments**: vLLM's `torch.compile` path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
- **Pair with interleaved filters**: PDF parsing produces noisy output. Chain with the [Interleaved Filters](/curate-text/process-data/interleaved/filters) (blur, CLIP score) to drop low-quality samples before training.
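
The trade-off behind `pdfs_per_task` and `max_pages` can be estimated with simple arithmetic before launching a run. The sketch assumes tasks are formed by chunking the manifest into groups of `pdfs_per_task`; `plan_tasks` is a hypothetical planning helper, and the corpus size and GPU count are made-up inputs.

```python
import math


def plan_tasks(num_pdfs: int, pdfs_per_task: int, num_gpus: int, max_pages: int = 50) -> dict:
    """Estimate task count, tasks per GPU, and an upper bound on rendered pages."""
    num_tasks = math.ceil(num_pdfs / pdfs_per_task)
    return {
        "num_tasks": num_tasks,
        "tasks_per_gpu": num_tasks / num_gpus,
        # Each PDF contributes at most max_pages pages after truncation.
        "max_pages_rendered": num_pdfs * max_pages,
    }


# 10,000 PDFs on 8 GPUs: small chunks give many tasks to balance across workers.
small_chunks = plan_tasks(10_000, pdfs_per_task=10, num_gpus=8)

# Larger chunks cut the task count, reducing per-task overhead.
large_chunks = plan_tasks(10_000, pdfs_per_task=50, num_gpus=8)
```

With these inputs, a chunk size of 10 yields 1,000 tasks (125 per GPU) and a chunk size of 50 yields 200 tasks (25 per GPU), while the page bound stays at 500,000 in both cases.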

## Related Topics

- **[Interleaved IO](/curate-text/process-data/interleaved/io)** — readers and writers that consume the Parquet output of this pipeline.
- **[Interleaved Filters](/curate-text/process-data/interleaved/filters)** — sample-level filters to apply after parsing.
- **[Common Crawl](/curate-text/load-data/common-crawl)** — companion source for web-scale PDF input via CC-MAIN dumps.