---
description: "Essential concepts for text data curation including loading and processing."
categories: ["concepts-architecture"]
tags: ["concepts", "text-curation", "data-processing", "distributed"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "text-only"
---

# Text Curation Concepts

This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. These concepts assume basic familiarity with data science and machine learning principles.

## Core Concept Areas

Text curation in NeMo Curator focuses on these key areas:

- Comprehensive overview of the end-to-end text curation architecture and workflow. (`overview`, `architecture`)
- Core concepts for loading and managing text datasets from local files. (`local-files`, `formats`)
- Components for downloading and extracting data from remote sources. (`remote-sources`, `download`)
- Concepts for filtering, deduplication, and classification. (`filtering`, `quality`)

## Infrastructure Components

The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities. These components include:

- Optimize memory usage when processing large datasets. (`partitioning`, `batching`, `monitoring`)
- Leverage NVIDIA GPUs for faster data processing. (`cuda`, `rmm`, `performance`)
- Continue interrupted operations across large datasets. (`checkpoints`, `recovery`, `batching`)
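
The batching and checkpoint-recovery ideas listed above can be illustrated with a minimal, library-agnostic sketch. It does not use the NeMo Curator API: `batched`, `process_resumable`, `normalize`, and the JSON checkpoint layout are hypothetical names and choices made only for this illustration. Storing results inside the checkpoint file is a simplification that suits small examples only.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (batch_index, batch) pairs so only one batch is processed at a time."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size, items[start:start + batch_size]


def process_resumable(
    docs: list[str],
    batch_size: int,
    checkpoint: Path,
    transform: Callable[[str], str],
) -> list[str]:
    """Transform docs batch by batch, skipping batches already recorded in the checkpoint."""
    # Hypothetical checkpoint layout: {"done": {"<batch index>": [outputs...]}}
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {"done": {}}
    for index, batch in batched(docs, batch_size):
        key = str(index)
        if key in state["done"]:
            continue  # finished in an earlier run, so do not redo the work
        state["done"][key] = [transform(doc) for doc in batch]
        # Persist progress after every batch so an interruption loses at most one batch.
        checkpoint.write_text(json.dumps(state))
    # Reassemble outputs in the original batch order.
    return [doc for key in sorted(state["done"], key=int) for doc in state["done"][key]]


calls: list[str] = []


def normalize(text: str) -> str:
    """Toy curation step: lowercase and collapse whitespace."""
    calls.append(text)
    return " ".join(text.lower().split())


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    docs = ["  Hello ", "WORLD", " text  curation "]
    first = process_resumable(docs, 2, ckpt, normalize)
    # A second run finds every batch in the checkpoint and performs no new work.
    second = process_resumable(docs, 2, ckpt, normalize)
```

In this sketch the second call returns the same results without invoking `normalize` again, which is the behavior a resumable pipeline aims for after an interruption.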