---
description: "Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition and processing"
categories: ["concepts-architecture"]
tags: ["pipeline", "architecture", "text-curation", "distributed", "gpu-accelerated", "overview"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "text-only"
---

# Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator's text curation pipeline architecture, from data acquisition through final dataset preparation.

## Architecture Overview

The following diagram provides a high-level outline of NeMo Curator's text curation architecture:

```mermaid
flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]

    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
```

## Pipeline Stages

NeMo Curator's text curation pipeline consists of several stages that work together to transform raw data sources into high-quality datasets ready for LLM training:

### 1. Data Sources

Multiple input sources provide the foundation for text curation:

- **Cloud storage**: Amazon S3, Azure
- **Local workstation**: JSONL, Parquet

### 2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

- **Download & Extraction**: Retrieve and process remote data sources
- **Cleaning & Pre-processing**: Convert formats and normalize text
- **DocumentBatch Creation**: Standardize data into NeMo Curator's core data structure

### 3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

- **Heuristic Quality Filtering**: Rule-based filters for basic quality checks
- **Model-based Quality Filtering**: Classification models trained to distinguish high-quality from low-quality text

### 4. Deduplication

Remove duplicate and near-duplicate content:

- **Exact Deduplication**: Remove identical documents using MD5 hashing
- **Fuzzy Deduplication**: Remove near-duplicates using MinHash and LSH similarity
- **Semantic Deduplication**: Remove semantically similar content using embeddings

### 5. Final Preparation

Prepare the curated dataset for training:

- **Format Standardization**: Ensure consistent output format

## Infrastructure Foundation

The entire pipeline runs on a scalable, distributed infrastructure:

- **Ray**: Distributed computing framework for parallelization
- **RAPIDS**: GPU-accelerated data processing (cuDF, cuGraph, cuML)
- **Flexible Deployment**: Support for both CPU and GPU execution

## Key Components

The pipeline leverages several core component types:

- Core concepts for loading and managing text datasets from local files
- Components for downloading and extracting data from remote sources
- Concepts for filtering, deduplication, and classification

## Processing Modes

The pipeline supports different processing approaches:

**GPU Acceleration**: Leverage NVIDIA GPUs for:

- High-throughput data processing
- ML model inference for classification
- Embedding generation for semantic operations

**CPU Processing**: Scale across multiple CPU cores for:

- Text parsing and cleaning
- Rule-based filtering
- Large-scale data transformations

**Hybrid Workflows**: Combine CPU and GPU processing, choosing the better fit for each operation.

## Scalability & Deployment

The architecture scales from single machines to large clusters:

- **Single Node**: Process datasets on laptops or workstations
- **Multi-Node**: Distribute processing across cluster resources
- **Cloud Native**: Deploy on cloud platforms
- **HPC Integration**: Run on HPC clusters

---

For hands-on experience, refer to the [Text Curation Getting Started Guide](/get-started/text).
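
To make the filtering and deduplication stages described in this guide more concrete, the following is a minimal pure-Python sketch of the ideas. It does not use NeMo Curator's API: every name in it (`curate`, `passes_heuristic`, `minhash_signature`, and so on), the word-count heuristic, and the thresholds are illustrative assumptions. It also compares MinHash signatures pairwise instead of using LSH banding, so it only suits small inputs, and it leaves out semantic deduplication entirely.

```python
import hashlib


def passes_heuristic(text: str, min_words: int = 5) -> bool:
    """Toy rule-based quality check: keep documents with enough words."""
    return len(text.split()) >= min_words


def md5_key(text: str) -> str:
    """Exact-duplicate key: MD5 hex digest of the raw text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams used as the MinHash input."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_perm: int = 128) -> list:
    """One minimum hash value per salted hash function (hypothetical scheme)."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode("utf-8") + b":"
        signature.append(
            min(
                int.from_bytes(
                    hashlib.md5(salt + g.encode("utf-8")).digest()[:8], "big"
                )
                for g in grams
            )
        )
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def curate(docs: list, min_words: int = 5, fuzzy_threshold: float = 0.5) -> list:
    """Heuristic filter, then exact (MD5) and fuzzy (MinHash) deduplication."""
    seen_hashes = set()
    kept, kept_signatures = [], []
    for doc in docs:
        text = doc["text"]
        if not passes_heuristic(text, min_words):
            continue  # dropped by the quality filter
        key = md5_key(text)
        if key in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(key)
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, s) >= fuzzy_threshold for s in kept_signatures):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        kept_signatures.append(sig)
    return kept


docs = [
    {"id": 0, "text": "the quick brown fox jumps over the lazy dog near the river bank today"},
    {"id": 1, "text": "the quick brown fox jumps over the lazy dog near the river bank today"},
    {"id": 2, "text": "the quick brown fox jumps over the lazy dog near the river bank yesterday"},
    {"id": 3, "text": "distributed gpu pipelines scale text curation across many cluster nodes efficiently"},
    {"id": 4, "text": "too short"},
]
curated = curate(docs)
print([d["id"] for d in curated])
```

In this toy input, document 1 is a byte-for-byte copy of document 0, document 2 differs from it by a single word, and document 4 is too short for the heuristic. A production pipeline would distribute these steps across workers and replace the pairwise signature comparison with LSH bucketing.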