---
description: "Load and process JPEG images from tar archives using DALI-powered GPU acceleration with distributed processing"
categories: ["how-to-guides"]
tags: ["tar-archives", "data-loading", "dali", "gpu-acceleration", "distributed"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "image-only"
---

# Loading Images from Tar Archives

Load and process JPEG images from tar archives using NeMo Curator's DALI-powered `ImageReaderStage`.

The `ImageReaderStage` uses NVIDIA DALI for high-performance image decoding with GPU acceleration and automatic CPU fallback, designed for processing large collections of images stored in tar files.

## How it Works

The `ImageReaderStage` processes directories containing `.tar` files with JPEG images. While tar files may contain other file types (text, JSON, etc.), the stage extracts only JPEG images for processing.

**Directory Structure Example**

```text
dataset/
├── 00000.tar
│   ├── 000000000.jpg
│   ├── 000000001.jpg
│   ├── 000000002.jpg
│   ├── ...
├── 00001.tar
│   ├── 000001000.jpg
│   ├── 000001001.jpg
│   ├── ...
```

**What gets processed:**

- **JPEG images**: All `.jpg` files within tar archives

**What gets ignored:**

- Text files (`.txt`), JSON files (`.json`), and other non-JPEG content within tar archives
- Any files outside the tar archives (like standalone Parquet files)

---

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create pipeline
pipeline = Pipeline(name="image_loading", description="Load images from tar archives")

# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read JPEG images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,
    verbose=True,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Run the pipeline
results = pipeline.run()
```

`ImageReaderStage` is compatible with both `XennaExecutor` and `RayDataExecutor`. When using `RayDataExecutor`, the stage automatically signals that it fans out (one tar file can produce multiple `ImageBatch` objects), which enables Ray Data to repartition batches across downstream workers for parallel processing.
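
To make the fan-out concrete, the sketch below estimates how many `ImageBatch` objects one tar file would yield for a given `dali_batch_size`. It is plain Python, not NeMo Curator code: the helper name `fan_out_batch_sizes` is hypothetical, and it assumes images are grouped in order with the final partial batch kept, which the stage may or may not do in exactly this way.

```python
import math


def fan_out_batch_sizes(num_images: int, dali_batch_size: int) -> list[int]:
    """Return the size of each batch produced from one tar file.

    Assumption for illustration: images are grouped in order and a final
    partial batch is kept rather than dropped.
    """
    if dali_batch_size <= 0:
        raise ValueError("dali_batch_size must be positive")
    num_batches = math.ceil(num_images / dali_batch_size)
    return [
        min(dali_batch_size, num_images - i * dali_batch_size)
        for i in range(num_batches)
    ]


# A tar with 1000 images and dali_batch_size=100 fans out into 10 batches.
sizes = fan_out_batch_sizes(num_images=1000, dali_batch_size=100)
print(len(sizes))  # 10

# A tar with 250 images ends with one smaller batch.
print(fan_out_batch_sizes(num_images=250, dali_batch_size=100))  # [100, 100, 50]
```

Under these assumptions, a smaller `dali_batch_size` produces more batches per tar file, giving downstream workers more units to share, at the cost of more per-batch overhead.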

**Parameters:**

- `file_paths`: Path to directory containing tar files
- `files_per_partition`: Number of tar files to process per partition (controls parallelism)
- `dali_batch_size`: Number of images per ImageBatch for processing

---

## ImageReaderStage Details

The `ImageReaderStage` is the core component that handles tar archive loading with the following capabilities:

### DALI Integration

- **Automatic Device Selection**: Uses GPU decoding when CUDA is available, CPU decoding otherwise
- **Tar Archive Reader**: Leverages DALI's tar archive reader to process tar files
- **Batch Processing**: Processes images in configurable batch sizes for memory efficiency
- **JPEG-Only Processing**: Extracts only JPEG files (`ext=["jpg"]`) from tar archives

### Image Processing

- **Format Support**: Reads only JPEG images (`.jpg`) from tar files
- **Size Preservation**: Maintains original image dimensions (no automatic resizing)
- **RGB Output**: Converts images to RGB format for consistent downstream processing
- **Metadata Extraction**: Creates ImageObject instances with image paths and generated IDs

### Error Handling

- **Missing Components**: Skips missing or corrupted images with `missing_component_behavior="skip"`
- **Graceful Fallback**: Automatically falls back to CPU processing if GPU is unavailable
- **Validation**: Validates tar file paths and provides clear error messages
- **Non-JPEG Filtering**: Silently ignores non-JPEG files within tar archives

---

## Parameters

### ImageReaderStage Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `dali_batch_size` | int | 100 | Number of images per ImageBatch for processing |
| `verbose` | bool | True | Enable verbose logging for debugging |
| `num_threads` | int | 8 | Number of threads for DALI operations |
| `num_gpus_per_worker` | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU) |

---

## Output Format

The pipeline produces `ImageBatch` objects containing `ImageObject` instances for downstream curation tasks. Each `ImageObject` contains:

- `image_data`: Raw image pixel data as numpy array (H, W, C) in RGB format
- `image_path`: Path to the original image file in the tar
- `image_id`: Unique identifier extracted from the filename
- `metadata`: Additional metadata dictionary

**Example ImageObject structure:**

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Shape: (H, W, 3)
    metadata={}
)
```

**Note**: Only JPEG images are extracted from tar files. Other content (text files, JSON metadata, etc.) within the tar archives is ignored during processing.
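
To preview which archive members would be picked up and what their `image_path` and `image_id` values might look like, a standard-library sketch such as the following can help. It does not use DALI or NeMo Curator. The helper names `preview_jpeg_members` and `build_demo_tar` are hypothetical, and deriving `image_id` from the filename stem is an assumption based on the example structure above rather than a confirmed description of the stage's internals.

```python
import io
import tarfile
from pathlib import PurePosixPath


def preview_jpeg_members(tar_bytes: bytes, tar_name: str) -> list[tuple[str, str]]:
    """List (image_path, image_id) pairs for the .jpg members of a tar archive."""
    pairs = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        for member in archive.getmembers():
            name = PurePosixPath(member.name)
            # Keep regular files ending in .jpg; everything else is skipped.
            if member.isfile() and name.suffix == ".jpg":
                # Assumed convention: image_id is the filename without extension.
                pairs.append((f"{tar_name}/{member.name}", name.stem))
    return pairs


def build_demo_tar() -> bytes:
    """Build a tiny in-memory tar with one fake JPEG and two sidecar files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for member_name in ("000000031.jpg", "000000031.txt", "000000031.json"):
            payload = b"placeholder"  # not real image data; names are what matter here
            info = tarfile.TarInfo(name=member_name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


print(preview_jpeg_members(build_demo_tar(), "00000.tar"))
# [('00000.tar/000000031.jpg', '000000031')]
```

Running a check like this on a sample archive before launching the pipeline is a cheap way to confirm that the tar files contain the `.jpg` members you expect.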