---
description: "Essential concepts for text data curation including loading and processing."
categories: ["concepts-architecture"]
tags: ["concepts", "text-curation", "data-processing", "distributed"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "text-only"
---
# Text Curation Concepts
This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
## Core Concept Areas
Text curation in NeMo Curator focuses on these key areas:
- Comprehensive overview of the end-to-end text curation architecture and workflow. Topics: overview, architecture.
- Core concepts for loading and managing text datasets from local files. Topics: local files, formats.
- Components for downloading and extracting data from remote sources. Topics: remote sources, download.
- Concepts for filtering, deduplication, and classification. Topics: filtering, quality.
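
To make the areas above concrete, here is a minimal sketch of a load, filter, and deduplicate sequence. It uses only the Python standard library and is not NeMo Curator's API: the function names (`parse_jsonl`, `filter_by_word_count`, `exact_dedup`), the `text` field, and the word-count threshold are all assumptions chosen for illustration.

```python
import hashlib
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings (for example an open file) into dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_word_count(docs, min_words=3):
    """Keep documents whose `text` has at least `min_words` whitespace-separated words."""
    return [d for d in docs if len(d["text"].split()) >= min_words]


def exact_dedup(docs):
    """Drop documents whose normalized text was already seen (exact-match deduplication)."""
    seen, unique = set(), []
    for d in docs:
        # Normalize lightly so trivial case/whitespace differences do not hide duplicates.
        normalized = d["text"].strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def curate(lines, min_words=3):
    """Run the three hypothetical stages in order: load, filter, deduplicate."""
    return exact_dedup(filter_by_word_count(parse_jsonl(lines), min_words))
```

With a local file, this could be called as `curate(open("data.jsonl", encoding="utf-8"))`, where the file name is a placeholder. A real curation pipeline would typically replace each stage with a distributed equivalent, but the ordering of stages is the point of the sketch.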
## Infrastructure Components
The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities. These components include:
- Optimize memory usage when processing large datasets. Topics: partitioning, batching, monitoring.
- Leverage NVIDIA GPUs for faster data processing. Topics: CUDA, RMM, performance.
- Continue interrupted operations across large datasets. Topics: checkpoints, recovery, batching.
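
As an illustration of how batching and checkpoints combine to let an interrupted job continue, the following sketch processes a dataset in fixed-size batches and records progress in a small JSON file. It is a standard-library example under stated assumptions, not NeMo Curator's implementation: `batched`, `process_with_checkpoint`, and the `batches_done` checkpoint field are hypothetical names.

```python
import json
from pathlib import Path


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_with_checkpoint(items, batch_size, process_batch, checkpoint_path):
    """Process batches in order and record how many have completed.

    If the checkpoint file already exists, batches it marks as done are skipped,
    so a rerun after a failure resumes at the first unfinished batch.
    """
    checkpoint = Path(checkpoint_path)
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["batches_done"]

    for index, batch in enumerate(batched(items, batch_size)):
        if index < done:
            continue  # completed in an earlier run
        process_batch(batch)
        # Persist progress only after the batch succeeds.
        checkpoint.write_text(json.dumps({"batches_done": index + 1}))
```

Processing one batch at a time bounds peak memory to roughly one batch, and writing the checkpoint after each successful batch means at most one batch of work is repeated after a failure. This assumes the input order and batch size are identical between runs.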