---
description: "Essential concepts for image data curation including loading, processing, and export with GPU acceleration"
categories: ["concepts-architecture"]
tags: ["concepts", "image-curation", "tar-archives", "gpu-accelerated", "embedding", "classification"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "concept"
modality: "image-only"
---

# Image Curation Concepts

This document covers the essential concepts for image data curation in NVIDIA NeMo Curator. These concepts assume basic familiarity with data science and machine learning principles.

## Core Concept Areas

Image curation in NVIDIA NeMo Curator focuses on these key areas:

- Core concepts for loading and managing image datasets
- Concepts for embedding generation, classification, filtering, and deduplication
- Concepts for saving, exporting, and resharding curated image datasets

## Infrastructure Components

The image curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities (text, image, video). These components include:

- Optimize memory usage when processing large datasets (tags: partitioning, batching, monitoring)
- Leverage NVIDIA GPUs for faster data processing (tags: cuda, dali, performance)
- Continue interrupted operations across large datasets (tags: checkpoints, recovery, batching)