---
description: "Convert PDF datasets into interleaved Parquet output using NVIDIA's Nemotron-Parse VLM with the four-stage NemotronParsePDFReader composite pipeline"
categories: ["how-to-guides"]
tags: ["pdf", "nemotron-parse", "interleaved", "vllm", "parsing"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "universal"
---

# Nemotron-Parse PDF Pipeline

Convert PDF datasets into interleaved Parquet output using NVIDIA's [Nemotron-Parse](https://huggingface.co/nvidia) vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass, producing rows directly compatible with the [interleaved dataset](/curate-text/process-data/interleaved) format.

## How it Works

`NemotronParsePDFReader` is a composite stage that expands into four underlying sub-stages:

1. **`PDFPartitioningStage`**: reads a JSONL manifest of PDF entries and packs them into `FileGroupTask` objects.
2. **`PDFPreprocessStage`**: extracts PDF bytes from the configured source and renders pages to images, scaling them to fit to guard against OOM on large pages.
3. **`NemotronParseInferenceStage`**: runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, exposes the `text_in_pic` and `enforce_eager` flags, and retries on a free port after a port collision.
4. **`NemotronParsePostprocessStage`**: parses model output, aligns images and captions, crops images, and emits the final interleaved rows.

The output is interleaved Parquet ready to be filtered with [Interleaved Filters](/curate-text/process-data/interleaved/filters) and written to MINT-1T-style WebDataset shards.

## Before You Start

Choose your PDF source and confirm the prerequisites:

- **GPU**: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
- **vLLM**: Strongly recommended for throughput. The pipeline uses HF Transformers instead when `backend="hf"` is set.
- **`pypdfium2`**: Required Python dependency for PDF rendering. Installed automatically with the `interleaved_cpu` or `interleaved_cuda12` extras (e.g., `uv sync --extra interleaved_cuda12`).
- **Manifest**: A JSONL file listing the PDFs to process. Each line should specify the PDF location relative to the source directory you choose.

### Choosing a PDF Source

Pass exactly one of `pdf_dir`, `zip_base_dir`, or `jsonl_base_dir` so the preprocess stage knows where to find the PDF bytes:

| Parameter | Source Layout | When to Use |
|-----------|---------------|-------------|
| `pdf_dir` | A directory of `.pdf` files | Local or mounted directories of standalone PDFs |
| `zip_base_dir` | A `CC-MAIN-2021-31-PDF-UNTRUNCATED` zip hierarchy | Common Crawl PDF dumps |
| `jsonl_base_dir` | JSONL-encoded PDF datasets where each line carries the PDF bytes | GitHub-hosted PDF datasets, custom JSONL collections |

### Backend Selection

| Backend | When to Use |
|---------|-------------|
| `vllm` (recommended) | High-throughput GPU inference with batching. Set `enforce_eager=True` if you hit compilation issues. |
| `hf` | Hugging Face Transformers fallback when vLLM is unavailable or for debugging. |

The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.

---

## Usage

A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)
```

For executor options and configuration, refer to [Execution Backends](/reference/infra/execution-backends).

### Example: CC-MAIN PDF Dump

Parse a Common Crawl PDF dump from its zip hierarchy:

```python
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)
```

### Example: JSONL-Encoded PDFs

Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):

```python
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)
```

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `manifest_path` | str \| None | `None` | JSONL manifest listing PDF entries. |
| `pdf_dir` | str \| None | `None` | Directory containing `.pdf` files. |
| `zip_base_dir` | str \| None | `None` | Root directory of CC-MAIN PDF zip hierarchy. |
| `jsonl_base_dir` | str \| None | `None` | Root directory of JSONL-encoded PDF datasets. |
| `model_path` | str | (default model) | Local path or HF repo ID for the Nemotron-Parse weights. |
| `backend` | str | `"vllm"` | Inference backend (`vllm` or `hf`). |
| `pdfs_per_task` | int | `10` | Number of PDFs grouped into each `FileGroupTask`. |
| `max_pdfs` | int \| None | `None` | Hard cap on total PDFs processed (debug aid). |
| `dpi` | int | `300` | Render DPI for PDF pages. |
| `max_pages` | int | `50` | Maximum pages rendered per PDF; longer PDFs are truncated. |
| `inference_batch_size` | int | `4` | vLLM/HF batch size. |
| `max_num_seqs` | int | `64` | Maximum concurrent vLLM sequences. |
| `text_in_pic` | bool | `False` | When `True`, treat embedded text within rendered images as part of the text content. |
| `enforce_eager` | bool | `False` | Disable vLLM compilation for compatibility with restricted environments. |
| `min_crop_px` | int | `10` | Minimum dimension (pixels) for cropped image regions. |
| `dataset_name` | str | `"pdf_dataset"` | Logical dataset label written to output rows. |
| `file_name_field` | str | `"file_name"` | Manifest field naming a single PDF file. |
| `file_names_field` | str | `"cc_pdf_file_names"` | Manifest field naming a list of PDF files (CC-MAIN layout). |
| `url_field` | str | `"url"` | Manifest field for the source URL passthrough. |

## Output Format

Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a `sample_id` belong to the same document.

Example output rows, shown as JSON:

```json
{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}
```

### Output Schema

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | string | PDF identifier; rows sharing a `sample_id` belong to the same document. |
| `position` | int | Zero-based item position within the sample, used to reconstruct ordering. |
| `modality` | string | One of `text`, `image`, or `metadata`. |
| `text_content` | string \| null | Text payload for `text` and `metadata` rows. |
| `binary_content` | bytes \| null | Image payload for `image` rows. |
| `source_files` | list[string] | Source PDF files that produced this row (for lineage tracking). |

The output is directly compatible with [Interleaved IO](/curate-text/process-data/interleaved/io) readers and writers: the schema matches `INTERLEAVED_SCHEMA` exactly.

## Render Timeout

The preprocess stage replaces `signal.SIGALRM` with a `multiprocessing` fork-based timeout (`_RENDER_TIMEOUT_S = 60` by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where `SIGALRM` raises `ValueError: signal only works in main thread`. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside `pypdfium2`.

You don't need to configure this; it works automatically. If you find legitimate PDFs that take longer than 60 seconds to render, the constant is defined in `nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py`.

## Benchmarking

A standalone benchmark script ships at `benchmarking/scripts/nemotron_parse_pdf_benchmark.py`. Use it to measure throughput on representative datasets before scaling to your full corpus.

## Best Practices

- **Use vLLM unless you can't**: the `vllm` backend is substantially faster than `hf`. Only fall back to `hf` for debugging or in environments where vLLM is unavailable.
- **Cap `max_pages` for outliers**: very long PDFs (1000+ pages) can dominate runtime. The default of 50 pages handles most academic papers and articles; raise it to 200+ for book-length sources.
- **Tune `pdfs_per_task` for parallelism**: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
- **Set `enforce_eager=True` in restricted environments**: vLLM's torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
- **Pair with interleaved filters**: PDF parsing produces noisy output. Chain the output with [Interleaved Filters](/curate-text/process-data/interleaved/filters) (blur, CLIP score) to drop low-quality samples before training.

## Related Topics

- **[Interleaved IO](/curate-text/process-data/interleaved/io)**: readers and writers that consume the Parquet output of this pipeline.
- **[Interleaved Filters](/curate-text/process-data/interleaved/filters)**: sample-level filters to apply after parsing.
- **[Common Crawl](/curate-text/load-data/common-crawl)**: companion source for web-scale PDF input via CC-MAIN dumps.
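
## Example: Reassembling Documents from Output Rows

The output schema states that rows sharing a `sample_id` belong to one document and that `position` reconstructs ordering. The sketch below illustrates that grouping on plain Python dictionaries shaped like the example output rows. It is not part of NeMo Curator: the helper names `reassemble_documents` and `document_text` and the `<image>` marker are hypothetical, and loading the rows from Parquet (for example with a Parquet reader of your choice) is left out.

```python
from collections import defaultdict
from typing import Any


def reassemble_documents(rows: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group interleaved rows by sample_id and order each group by position."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        grouped[row["sample_id"]].append(row)
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in grouped.items()
    }


def document_text(items: list[dict[str, Any]], image_marker: str = "<image>") -> str:
    """Join text rows in order, substituting a marker for each image row.

    Metadata rows are skipped; the marker string is an illustrative choice.
    """
    parts = []
    for item in items:
        if item["modality"] == "text" and item["text_content"]:
            parts.append(item["text_content"])
        elif item["modality"] == "image":
            parts.append(image_marker)
    return "\n\n".join(parts)


# Rows shaped like the example output, deliberately out of order.
rows = [
    {"sample_id": "doc_42", "position": 2, "modality": "text",
     "text_content": "Figure 1 shows the architecture...", "binary_content": None},
    {"sample_id": "doc_42", "position": 0, "modality": "text",
     "text_content": "# Introduction", "binary_content": None},
    {"sample_id": "doc_42", "position": 1, "modality": "image",
     "text_content": None, "binary_content": b""},
]

documents = reassemble_documents(rows)
print(document_text(documents["doc_42"]))
```

Sorting inside each group rather than sorting the whole table keeps the helper independent of how the rows were read, which may matter when Parquet files are loaded shard by shard.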