---
name: nemo-curator
description: GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data]
dependencies: [nemo-curator, cudf, dask, rapids]
---

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

**Use NeMo Curator when:**
- Preparing LLM training data from web scrapes (Common Crawl)
- Need fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

**Performance**:
- **16× faster** fuzzy deduplication (8TB RedPajama v2)
- **40% lower TCO** vs CPU alternatives
- **Near-linear scaling** across GPU nodes

**Use alternatives instead**:
- **datatrove**: CPU-based, open-source data processing
- **dolma**: Allen AI's data toolkit
- **Ray Data**: general ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
import pandas as pd
import dask.dataframe as dd

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modules import ExactDuplicates

# Load data (DocumentDataset wraps a Dask dataframe)
df = pd.DataFrame({
    "id": [0, 1, 2],
    "text": ["Good document", "Bad doc", "Excellent text"],
})
dataset = DocumentDataset(dd.from_pandas(df, npartitions=1))

# Quality filtering: keep documents with at least 5 words
filtered = ScoreFilter(WordCountFilter(min_words=5), text_field="text")(dataset)

# Exact deduplication
deduped = ExactDuplicates(id_field="id", text_field="text")(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ are available) by wrapping them in ScoreFilter

# Word count filter
dataset = ScoreFilter(WordCountFilter(min_words=50, max_words=100000), text_field="text")(dataset)

# Remove repetitive content
dataset = ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.3), text_field="text")(dataset)

# URL ratio filter
dataset = ScoreFilter(UrlRatioFilter(max_url_ratio=0.2), text_field="text")(dataset)
```

### Stage 2: Deduplication

**Exact deduplication**:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication** (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,   # MinHash signature length
    num_buckets=20,   # LSH buckets
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)
```

**Semantic deduplication**:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
```
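For intuition about what MinHash-based fuzzy deduplication is doing, here is a minimal pure-Python sketch (an illustration of the technique, not NeMo Curator's implementation): documents become character-shingle sets, a short MinHash signature estimates Jaccard similarity, and pairs with high estimated similarity are treated as near-duplicates. All names below are illustrative.

```python
import hashlib
from itertools import combinations

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shingle_set: set[str], num_hashes: int = 64) -> list[int]:
    """MinHash signature: for each seeded hash, keep the smallest value over the shingle set."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{sh}".encode()).digest()[:8], "big")
            for sh in shingle_set
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",   # near-duplicate of "a"
    "c": "completely different content about GPU clusters",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}

for x, y in combinations(docs, 2):
    # The near-duplicate pair scores far higher than the unrelated pairs
    print(f"{x} vs {y}: estimated Jaccard ≈ {estimated_jaccard(sigs[x], sigs[y]):.2f}")
```

Banding the signature into buckets (the LSH step) is what lets a large-scale implementation avoid comparing every pair of documents: only documents that collide in at least one bucket are checked.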
### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
```
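For a sense of what entity-level PII redaction involves outside of NeMo Curator, here is a minimal standalone sketch using Microsoft Presidio, a common open-source choice for this task. It is not NeMo Curator's API; it assumes `presidio-analyzer`, `presidio-anonymizer`, and a spaCy English model are installed, and the sample text is illustrative.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # NER-based PII detection
anonymizer = AnonymizerEngine()  # replaces detected spans with placeholders

text = "Contact Jane Doe at jane.doe@example.com or +1-555-010-9999."
findings = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    language="en",
)
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)  # e.g. "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>."
```

The `PIIRedactor` example above exposes the same entity types (`PERSON`, `EMAIL_ADDRESS`, and so on) through NeMo Curator's `Modify` interface, so it scales across the Dask cluster with the rest of the pipeline.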
### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset

# Quality classification (batched GPU inference that adds a prediction column)
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)
scored = quality_clf(dataset)

# Keep documents the classifier labels as high quality
# (prediction column and label names follow the classifier's configuration)
high_quality = DocumentDataset(scored.df[scored.df["quality_pred"] == "High"])
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU | GPU | Speedup |
|-----------|-----|-----|---------|
| Fuzzy dedup (8TB) | 120 hours (256 cores) | 7.5 hours (8× A100) | 16× |
| Exact dedup (1TB) | 8 hours (64 cores) | 0.5 hours (4× A100) | 16× |
| Quality filtering (100GB) | 2 hours (32 cores) | 0.2 hours (2× A100) | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
from nemo_curator.modules import FuzzyDuplicates

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scene boundaries
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import *
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.modules import *

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering (heuristic filters wrapped in ScoreFilter)
    ScoreFilter(WordCountFilter(min_words=100, max_words=50000), text_field="text"),
    ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.2), text_field="text"),
    ScoreFilter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3), text_field="text"),
    ScoreFilter(UrlRatioFilter(max_url_ratio=0.3), text_field="text"),

    # 2. Language filtering
    ScoreFilter(LanguageIdentificationFilter(target_languages=["en"]), text_field="text"),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    Modify(PIIRedactor()),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute stages sequentially
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

### Distributed processing

For the equivalent setup using Dask primitives directly, see the standalone sketch in the appendix at the end of this document.

```python
from nemo_curator import get_client
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import FuzzyDuplicates
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process a large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

### Fuzzy deduplication (8TB RedPajama v2)
- **CPU (256 cores)**: 120 hours
- **GPU (8× A100)**: 7.5 hours
- **Speedup**: 16×

### Exact deduplication (1TB)
- **CPU (64 cores)**: 8 hours
- **GPU (4× A100)**: 0.5 hours
- **Speedup**: 16×

### Quality filtering (100GB)
- **CPU (32 cores)**: 2 hours
- **GPU (2× A100)**: 0.2 hours
- **Speedup**: 10×

## Cost comparison

**CPU-based curation** (AWS c5.18xlarge × 10):
- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- **Total**: $4,320

**GPU-based curation** (AWS p4d.24xlarge × 2):
- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- **Total**: $491.55

**Savings**: 89% reduction ($3,828 saved)

## Supported data formats

- **Input**: Parquet, JSONL, CSV
- **Output**: Parquet (recommended), JSONL
- **WebDataset**: TAR archives for multi-modal data

## Use cases

**Production deployments**:
- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- **[Filtering Guide](references/filtering.md)** - 30+ quality filters and heuristics
- **[Deduplication Guide](references/deduplication.md)** - Exact, fuzzy, and semantic methods

## Resources

- **GitHub**: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- **Docs**: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- **Version**: 0.4.0+
- **License**: Apache 2.0
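## Appendix: standalone Dask-CUDA sketch

Referenced from the distributed-processing pattern above: a minimal sketch of the same multi-GPU setup expressed with `dask_cuda` and `dask_cudf` directly, without NeMo Curator. Paths and the filtering rule are illustrative, not part of any real dataset.

```python
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU on the local machine
cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

# Read Parquet shards into a cuDF-backed distributed dataframe
ddf = dask_cudf.read_parquet("data/*.parquet")  # illustrative path

# A trivial length filter executed in parallel across GPU workers
ddf = ddf[ddf["text"].str.len() > 200]

# Write results back to Parquet
ddf.to_parquet("filtered/")  # illustrative output path

client.close()
cluster.close()
```

This mirrors, roughly, the kind of cluster that `get_client(cluster_type="gpu", ...)` manages for NeMo Curator's GPU-accelerated modules.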