---
description: "Complete installation guide for NeMo Curator with system requirements, package extras, verification steps, and troubleshooting"
categories: ["getting-started"]
tags: ["installation", "system-requirements", "pypi", "source-install", "container", "verification", "troubleshooting"]
personas: ["admin-focused", "devops-focused", "data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "universal"
---
# Install (All Modalities)
This guide covers installing NeMo Curator with support for **all modalities** and verifying your installation is working correctly. For a single-modality install or a 30-minute walkthrough, start with one of the [modality quickstarts](/get-started) instead.
## Before You Start
### System Requirements
For comprehensive system requirements and production deployment specifications, refer to [Production Deployment Requirements](/admin/deployment/requirements).
**Quick Start Requirements:**
- **OS**: Ubuntu 24.04/22.04/20.04 (recommended)
- **Python**: 3.10, 3.11, or 3.12
- **Memory**: 16GB+ RAM for basic text processing
- **GPU** (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
- **CUDA 12** (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)
**Python 3.10 support will be removed in NeMo Curator 26.06.** 26.04 is the last release to support Python 3.10. If you are setting up a new environment, install a newer supported Python version (3.11+) so you do not need to upgrade when moving to 26.06. See the [26.04 release notes](/about/release-notes) for details.
### Development vs Production
| Use Case | Requirements | See |
|----------|-------------|-----|
| **Local Development** | Minimum specs listed above | Continue below |
| **Production Clusters** | Detailed hardware, network, storage specs | [Deployment Requirements](/admin/deployment/requirements) |
| **Multi-node Setup** | Advanced infrastructure planning | [Deployment Options](/admin/deployment) |
---
## Installation Methods
Choose one of the following installation methods based on your needs:
**Docker is the recommended installation method** for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the [Container Installation](#container-installation) tab below.
Install NeMo Curator from the Python Package Index using `uv` for proper dependency resolution.
1. Install uv:
```bash
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
source $HOME/.local/bin/env
```
2. Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate
```
3. Install NeMo Curator:
```bash
uv pip install torch wheel_stub psutil setuptools setuptools_scm
echo "transformers==4.55.2" > override.txt
uv pip install --no-build-isolation "nemo-curator[all]" --override override.txt
```
Install the latest development version directly from GitHub:
```bash
# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install with all extras using uv
uv sync --all-extras --all-groups
```
NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.
```bash
# Pull the container from NGC
docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}
# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
```
After entering the container, activate the virtual environment before running any NeMo Curator commands:
source /opt/venv/env.sh
The container uses a virtual environment at `/opt/venv`. If you see `No module named nemo_curator`, the environment has not been activated.
Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:
```bash
# Build the container locally
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
docker build -t nemo-curator:latest -f docker/Dockerfile .
# Run the container with GPU support
docker run --gpus all -it --rm nemo-curator:latest
```
**Benefits:**
- Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
- Consistent runtime across different systems
- Ideal for production deployments
### Install FFmpeg and Encoders (Required for Video)
Curator’s video pipelines rely on `FFmpeg` for decoding and encoding. If you plan to encode clips (using `--transcode-encoder h264_nvenc` or `--transcode-encoder libvpx-vp9`), install `FFmpeg` with NVENC and `libvpx-vp9` support. The maintained install script bundles both.
Use the maintained script in the repository to build and install `FFmpeg` with NVIDIA NVENC and `libvpx-vp9` support. The script enables `--enable-cuda-nvcc`, `--enable-libnpp`, and `--enable-libvpx`.
- Script source: [docker/common/install_ffmpeg.sh](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/common/install_ffmpeg.sh)
```bash
curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.sh
```
Confirm that `FFmpeg` is on your `PATH` and that at least one supported encoder is available:
```bash
ffmpeg -hide_banner -version | head -n 5
ffmpeg -encoders | grep -E "h264_nvenc|libvpx-vp9" | cat
```
If encoders are missing, reinstall `FFmpeg` with the required options or use the Debian/Ubuntu script above.
**FFmpeg build requires CUDA toolkit (nvcc):** If you encounter `ERROR: failed checking for nvcc` during FFmpeg installation, ensure that the CUDA toolkit is installed and `nvcc` is available on your `PATH`. You can verify with `nvcc --version`. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
**Processing H.264/HEVC/AV1 inputs? You might still need a software decoder — even with NVENC/NVDEC.**
Curator's pipeline runs `ffprobe` for metadata extraction inside CPU-only Ray actors (`VideoReader` and `ClipWriter`). Those actors don't have GPU visibility, so the bundled `h264_cuvid` / `hevc_cuvid` / `av1_cuvid` decoders can't be opened from there. Without a software decoder, `ffprobe` exits non-zero and your h264/hevc/av1 inputs are silently skipped (you'll see a `SoftwareCodecMissingError` in the logs).
**Recommended fix:** run the bundled installer inside the container — no image rebuild needed:
```bash
bash /opt/Curator/docker/common/install_h264_support.sh
```
See [Software H.264/HEVC/AV1 Codec Support](#software-h264hevcav1-codec-support-advanced) below for the full picture (other paths, license notes, opt-in `libopenh264` encoder).
### Software H.264/HEVC/AV1 Codec Support (Advanced)
Curator's default FFmpeg build deliberately ships **NVDEC-only** decoders for `h264`, `hevc`, and `av1`, and **excludes** software H.264 encoders (`libopenh264`, `libx264`, `libx265`). This keeps the codec footprint tight and routes every H.264/HEVC/AV1 decode through the GPU.
You may need to add software codec support in two cases:
- **H.264 inputs in CPU-only pipeline stages.** `VideoReader` and `ClipWriter` invoke `ffprobe` from CPU-only Ray actors that can't see the GPU; they need a software `h264`/`hevc`/`av1` decoder to extract metadata. Without it you'll get a `SoftwareCodecMissingError` pointing back here.
- **H.264 software encoding** (for example, on GPUs without an NVENC encoder block such as A100 or H100, when VP9 isn't acceptable).
#### Option 1: Run the bundled installer inside the container (Recommended)
The repository ships a runtime opt-in script that recompiles FFmpeg with software h264/hevc/av1 decoders enabled, optionally including the `libopenh264` encoder. It runs **inside an existing container** — no image rebuild required.
```bash
# Inside the container — adds h264/hevc/av1 software decoders only (LGPLv3):
bash /opt/Curator/docker/common/install_h264_support.sh
# Same plus the libopenh264 software h264 ENCODER, so --transcode-encoder=libopenh264 works:
bash /opt/Curator/docker/common/install_h264_support.sh --with-libopenh264
```
The build takes ~5–10 minutes, replaces `/usr/local/bin/{ffmpeg,ffprobe}` in place, and pins to the same FFmpeg tag as the image build. Script source: [docker/common/install_h264_support.sh](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/common/install_h264_support.sh).
License notice: the default mode adds only FFmpeg-internal decoders (LGPL). With `--with-libopenh264` the binary additionally links Cisco's OpenH264 (BSD-2-Clause + Cisco-distributed binary license — see https://www.openh264.org/BINARY_LICENSE.txt). You are responsible for any license obligations the resulting binaries impose on your distribution.
#### Option 2: Use the System FFmpeg
If you're not using the Curator container, most Linux distributions ship FFmpeg with `libx264` (and sometimes `libopenh264`) preinstalled:
```bash
sudo apt-get install -y ffmpeg
ffmpeg -hide_banner -encoders | grep -E "libx264|libopenh264"
```
Make sure the `ffmpeg` on your `PATH` is the one you want — it must shadow Curator's bundled build.
#### Option 3: Edit `install_ffmpeg.sh` and Rebuild the Image
For users distributing customized images, edit [`docker/common/install_ffmpeg.sh`](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/common/install_ffmpeg.sh) before building the container:
- For software h264/hevc/av1 decoders: append `h264,hevc,av1` to the `--enable-decoder=...` list.
- For `libopenh264` encoder: add `libopenh264-dev` to the apt list, `libopenh264` to `--enable-encoder=...`, and `--enable-libopenh264` to the configure flags.
- For `libx264` encoder: add `libx264-dev` to the apt list and `--enable-libx264 --enable-gpl` to the configure flags. Note that `--enable-gpl` makes the resulting FFmpeg binary GPL-licensed.
Then rebuild your image.
#### Use the Encoder in `ClipTranscodingStage`
`libopenh264` is accepted by `ClipTranscodingStage` out of the box. At setup time, the stage probes the local FFmpeg build and raises a clear error pointing back to this section if the encoder is not actually compiled in. Once your FFmpeg build includes it, just pass:
```bash
python video_split_clip_example.py ... --transcode-encoder libopenh264
```
For other custom encoders not in `SUPPORTED_ENCODERS` (for example, `libx264`), edit `nemo_curator/stages/video/clipping/clip_extraction_stages.py` to extend the tuple, and add the encoder name to the `--transcode-encoder` argparse `choices` list in `tutorials/video/getting-started/video_split_clip_example.py`:
```python
SUPPORTED_ENCODERS = ("h264_nvenc", "libvpx-vp9", "libopenh264", "libx264") # add yours
```
#### Caveats
- **Default options for these encoders are not tuned.** `ClipTranscodingStage` only sets quality presets for `h264_nvenc` and `libvpx-vp9`. Other encoders run with FFmpeg defaults, which may produce different quality/size trade-offs than you expect — see [Configure encoders](/curate-video/process-data/transcoding#configure) for how to pass an explicit bitrate.
- **The NeMo Curator team does not test custom encoder configurations.** Issues filed against custom encoder builds may be closed.
---
## Package Extras
NeMo Curator provides several installation extras to install only the components you need:
| Extra | Installation Command | Description |
| --- | --- | --- |
| **text_cpu** | `uv pip install nemo-curator[text_cpu]` | CPU-only text processing and filtering |
| **text_cuda12** | `uv pip install nemo-curator[text_cuda12]` | GPU-accelerated text processing with RAPIDS |
| **audio_cpu** | `uv pip install nemo-curator[audio_cpu]` | CPU-only audio curation with NeMo Toolkit ASR |
| **audio_cuda12** | `uv pip install nemo-curator[audio_cuda12]` | GPU-accelerated audio curation. When using `uv`, requires `transformers==4.55.2` override. |
| **image_cpu** | `uv pip install nemo-curator[image_cpu]` | CPU-only image processing |
| **image_cuda12** | `uv pip install nemo-curator[image_cuda12]` | GPU-accelerated image processing with NVIDIA DALI |
| **video_cpu** | `uv pip install nemo-curator[video_cpu]` | CPU-only video processing |
| **video_cuda12** | `uv pip install --no-build-isolation nemo-curator[video_cuda12]` | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using `uv`. |
| **inference_server** | `uv pip install nemo-curator[inference_server]` | Ray Serve + vLLM for serving LLMs alongside curation pipelines |
| **sdg_cpu** | `uv pip install nemo-curator[sdg_cpu]` | CPU-only synthetic data generation with Data Designer |
| **sdg_cuda12** | `uv pip install nemo-curator[sdg_cuda12]` | GPU-accelerated synthetic data generation with local inference server support |
**Development Dependencies**: For development tools (pre-commit, ruff, pytest), use `uv sync --group dev --group linting --group test` instead of pip extras. Development dependencies are managed as dependency groups, not optional dependencies.
**pip is not supported for installing all extras together.** Some optional dependencies have conflicting transitive version requirements (for example, `nemo-toolkit[asr]` and `vllm` require incompatible versions of `transformers`). NeMo Curator uses [uv dependency overrides](https://docs.astral.sh/uv/concepts/dependencies/#dependency-overrides) to resolve these conflicts, which pip does not support. If you must use pip, install only one modality extra at a time (for example, `pip install nemo-curator[text_cpu]`). For multi-modality installations, use `uv` or the [NeMo Curator container](#container-installation).
---
## Installation Verification
After installation, verify that NeMo Curator is working correctly:
### 1. Basic Import Test
```python
# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")
# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")
```
### 2. GPU Availability Check
If you installed GPU support, verify GPU access:
```python
# Check GPU availability
try:
import torch
if torch.cuda.is_available():
print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
print("⚠ No GPU detected")
# Check cuDF for GPU deduplication
import cudf
print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
print(f"⚠ Some GPU modules not available: {e}")
```
### 3. Run a Quickstart Tutorial
Try a modality-specific quickstart to see NeMo Curator in action:
- [Text Curation Quickstart](/get-started/text) - Set up and run your first text curation pipeline
- [Audio Curation Quickstart](/get-started/audio) - Get started with audio dataset curation
- [Image Curation Quickstart](/get-started/image) - Curate image-text datasets for generative models
- [Video Curation Quickstart](/get-started/video) - Split, encode, and curate video clips at scale