---
description: "Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition and processing"
categories: ["concepts-architecture"]
tags: ["pipeline", "architecture", "text-curation", "distributed", "gpu-accelerated", "overview"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "text-only"
---
# Text Data Curation Pipeline
This guide provides a comprehensive overview of NeMo Curator's text curation pipeline architecture, from data acquisition through final dataset preparation.
## Architecture Overview
The following diagram provides a high-level outline of NeMo Curator's text curation architecture:
```mermaid
flowchart LR
A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
B --> C["Content Processing<br/>& Cleaning"]
C --> D["Quality Assessment<br/>& Filtering"]
D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
E --> F["Curated Dataset<br/>(JSONL/Parquet)"]
G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
G -.->|"Distributed Execution"| C
G -.->|"GPU Acceleration"| D
G -.->|"GPU Acceleration"| E
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B,C,D,E stage
class F output
class G infra
```
## Pipeline Stages
NeMo Curator's text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:
### 1. Data Sources
Multiple input sources provide the foundation for text curation:
- **Cloud storage**: Amazon S3, Azure
- **Local workstation**: JSONL, Parquet
- **Public datasets**: Common Crawl, arXiv, Wikipedia
### 2. Data Acquisition & Processing
Raw data is downloaded, extracted, and converted into standardized formats:
- **Download & Extraction**: Retrieve and process remote data sources
- **Cleaning & Pre-processing**: Convert formats and normalize text
- **DocumentBatch Creation**: Standardize data into NeMo Curator's core data structure
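
The cleaning and batching steps above can be illustrated with a minimal, standard-library-only sketch. It does not use NeMo Curator's API: `normalize_text`, `read_jsonl`, and `batch_documents` are hypothetical helpers, and a plain list of dictionaries stands in for the real `DocumentBatch` structure.

```python
import json
import unicodedata
from typing import Dict, Iterable, Iterator, List


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def read_jsonl(lines: Iterable[str], text_field: str = "text") -> Iterator[Dict]:
    """Parse JSONL lines, skipping blank lines and records without a text field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if text_field in record:
            record[text_field] = normalize_text(record[text_field])
            yield record


def batch_documents(docs: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group documents into fixed-size batches.

    Assumption: a list of dicts is only a stand-in for a real batch type.
    """
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Passing an open file handle to `read_jsonl` works as well as a list of strings, because both yield one line at a time.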
### 3. Quality Assessment & Filtering
Multiple filtering stages ensure data quality:
- **Heuristic Quality Filtering**: Rule-based filters for basic quality checks
- **Model-based Quality Filtering**: Classification models trained to distinguish high-quality from low-quality text
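
To make the idea of rule-based filtering concrete, here is a small sketch of three heuristic checks. The function names and every threshold are illustrative assumptions, not NeMo Curator's built-in filters or defaults.

```python
def word_count_ok(text: str, min_words: int = 5, max_words: int = 100_000) -> bool:
    """Reject documents that are too short or too long (thresholds are assumed)."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words


def symbol_ratio_ok(text: str, max_ratio: float = 0.3) -> bool:
    """Reject documents dominated by punctuation or other symbols."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio


def repeated_line_ratio_ok(text: str, max_ratio: float = 0.5) -> bool:
    """Reject documents in which most non-empty lines are duplicates."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    duplicates = len(lines) - len(set(lines))
    return duplicates / len(lines) <= max_ratio


def passes_heuristics(text: str) -> bool:
    """A document is kept only if every rule accepts it."""
    return (
        word_count_ok(text)
        and symbol_ratio_ok(text)
        and repeated_line_ratio_ok(text)
    )
```

Each rule is cheap and independent, so checks like these can run on CPU before any model-based classifier is applied.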
### 4. Deduplication
Remove duplicate and near-duplicate content:
- **Exact Deduplication**: Remove identical documents using MD5 hashing
- **Fuzzy Deduplication**: Remove near-duplicates using MinHash signatures and locality-sensitive hashing (LSH)
- **Semantic Deduplication**: Remove semantically similar content using embeddings
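
The exact and fuzzy techniques named above can be sketched in plain Python. This is a single-process illustration under stated assumptions (word 3-gram shingles, 64 hash functions, 16 LSH bands, MD5 reused as the MinHash hash family); it is not NeMo Curator's distributed, GPU-accelerated implementation, and all function names are hypothetical.

```python
import hashlib
from typing import List, Set, Tuple


def exact_dedup(docs: List[str]) -> List[str]:
    """Keep the first occurrence of each document, keyed by its MD5 digest."""
    seen: Set[str] = set()
    unique: List[str] = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text: str, k: int = 3) -> Set[str]:
    """Split text into overlapping word k-grams (k=3 is an assumed choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(seed: int, value: str) -> int:
    """Seeded 64-bit hash built from MD5; the seed selects the hash function."""
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> List[int]:
    """For each seeded hash function, record the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    return [min(_hash64(seed, s) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature: List[int], bands: int = 16) -> Set[Tuple]:
    """Split a signature into bands; documents sharing any key are candidate pairs."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}
```

In this sketch, two documents become a near-duplicate candidate pair when their `lsh_band_keys` sets intersect, and `estimated_jaccard` can then confirm or reject the pair against a chosen similarity threshold.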
### 5. Final Preparation
Prepare the curated dataset for training:
- **Format Standardization**: Ensure consistent output format
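
As a rough illustration of format standardization, the sketch below writes documents to JSONL with a fixed set of fields. The field names `id` and `text` and both helper functions are assumptions for this example, not a schema that NeMo Curator requires.

```python
import json
from typing import Dict, Iterable, Sequence, TextIO


def standardize_record(doc: Dict, fields: Sequence[str] = ("id", "text")) -> Dict:
    """Keep only the expected fields, in a fixed order, filling gaps with None."""
    return {field: doc.get(field) for field in fields}


def write_jsonl(
    docs: Iterable[Dict],
    handle: TextIO,
    fields: Sequence[str] = ("id", "text"),
) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for doc in docs:
        record = standardize_record(doc, fields)
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Because every line carries the same keys in the same order, downstream readers can rely on a stable schema regardless of which source a document came from.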
## Infrastructure Foundation
The entire pipeline runs on a distributed infrastructure stack that supports both CPU and GPU execution:
- **Ray**: Distributed computing framework for parallelization
- **RAPIDS**: GPU-accelerated data processing (cuDF, cuGraph, cuML)
- **Flexible Deployment**: Support for both CPU-only and GPU-accelerated execution
## Key Components
The pipeline leverages several core component types:
- Core concepts for loading and managing text datasets from local files
- Components for downloading and extracting data from remote sources
- Concepts for filtering, deduplication, and classification
## Processing Modes
The pipeline supports different processing approaches:
**GPU Acceleration**: Leverage NVIDIA GPUs for:
- High-throughput data processing
- ML model inference for classification
- Embedding generation for semantic operations
**CPU Processing**: Scale across multiple CPU cores for:
- Text parsing and cleaning
- Rule-based filtering
- Large-scale data transformations
**Hybrid Workflows**: Combine CPU and GPU processing for optimal performance based on the specific operation.
## Scalability & Deployment
The architecture scales from single machines to large clusters:
- **Single Node**: Process datasets on laptops or workstations
- **Multi-Node**: Distribute processing across cluster resources
- **Cloud Native**: Deploy on cloud platforms
- **HPC Integration**: Run on high-performance computing (HPC) clusters
---
For hands-on experience, refer to the [Text Curation Getting Started Guide](/get-started/text).