---
description: "Process text data using comprehensive filtering, deduplication, content processing, and specialized tools for high-quality datasets"
categories: ["workflows"]
tags: ["data-processing", "filtering", "deduplication", "content-processing", "quality-assessment", "distributed"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "text-only"
---

# Process Data for Text Curation

Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

## How it Works

NeMo Curator's text processing capabilities are organized into six main categories:

1. **Language Management**: Handle multilingual content and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks
6. **Interleaved Datasets**: Read, write, and filter MINT-1T-style image-text datasets

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.

---

## Language Management

Handle multilingual content and language-specific processing requirements.

- Identify document languages and separate multilingual datasets (tags: `fasttext`, `176-languages`, `detection`)
- Manage high-frequency words to enhance text extraction and content detection (tags: `preprocessing`, `filtering`, `language-specific`)

## Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.

- Fix Unicode issues, standardize spacing, and remove URLs (tags: `unicode`, `normalization`, `preprocessing`)

## Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

- Identify and remove character-for-character duplicates using MD5 hashing (tags: `hashing`, `fast`, `gpu-accelerated`)
- Identify and remove near-duplicates using MinHash and LSH similarity (tags: `minhash`, `lsh`, `gpu-accelerated`)
- Identify and remove semantically similar documents using embeddings and clustering (tags: `embeddings`, `meaning-based`, `gpu-accelerated`)

## Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

- Filter text using configurable rules and metrics (tags: `rules`, `metrics`, `fast`)
- Filter text using trained quality classifiers (tags: `ml-models`, `quality`, `scoring`)
- GPU-accelerated classification with pre-trained models (tags: `gpu`, `distributed`, `scalable`)

## Specialized Processing

Domain-specific processing for code and advanced curation tasks.

- Specialized filters for programming content and source code (tags: `programming`, `syntax`, `comments`)

## Interleaved Datasets

Read, write, and filter MINT-1T-style image-text interleaved datasets across WebDataset and Parquet formats.

- Round-trip readers and writers between WebDataset tar shards and Parquet (tags: `parquet`, `webdataset`, `schema-utilities`)
- Sample-level filters for image quality, QR-code detection, CLIP alignment, and image-to-text ratio (tags: `blur`, `clip`, `qr-detection`)
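
The Deduplication section describes exact deduplication as MD5 hashing with separate identification and removal workflows. As a rough illustration of that split, the sketch below uses only Python's standard library. It is not NeMo Curator's API: the `id`/`text` record layout and the function names `md5_of`, `identify_exact_duplicates`, and `remove_exact_duplicates` are assumptions made for this example, and it runs on a single CPU process rather than on GPUs.

```python
import hashlib
from collections import defaultdict


def md5_of(text: str) -> str:
    """Return the hex MD5 digest of a document's UTF-8 bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def identify_exact_duplicates(docs):
    """Identification: group document ids that share the same MD5 digest.

    Returns only the groups with more than one member.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[md5_of(doc["text"])].append(doc["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def remove_exact_duplicates(docs):
    """Removal: keep the first document seen for each digest, drop the rest."""
    seen = set()
    kept = []
    for doc in docs:
        digest = md5_of(doc["text"])
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Hypothetical records; the field names are assumptions for this sketch.
docs = [
    {"id": "a", "text": "The quick brown fox."},
    {"id": "b", "text": "A different document."},
    {"id": "c", "text": "The quick brown fox."},
]

duplicate_groups = identify_exact_duplicates(docs)
deduplicated = remove_exact_duplicates(docs)
```

Keeping identification separate from removal lets you inspect the duplicate groups before deciding which copies to drop. Because the digest covers the raw bytes, documents that differ by even one character are treated as distinct, which is the case the near-duplicate and semantic methods are meant to cover.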