---
description: "Core concepts for saving and exporting curated image datasets including metadata and resharding"
categories: ["concepts-architecture"]
tags: ["data-export", "tar-files", "parquet", "resharding", "metadata"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "image-only"
---

# Data Export Concepts (Image)

This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.

## Key Topics

- Saving curated images and metadata
- Understanding output format structure
- Configuring output sharding
- Preparing data for downstream training or analysis

## Saving Results

After the pipeline has processed your images, save the curated images and their metadata with the `ImageWriterStage`.

**Example:**

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add the writer stage to an existing pipeline (`pipeline` is assumed to be defined earlier)
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,  # Images per tar file
    remove_image_data=True,
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
```

**Key Parameters:**

- `output_dir`: Directory where tar archives and metadata files are written
- `images_per_tar`: Number of images written to each tar file, which sets the shard size
- `remove_image_data`: Whether to remove image data from memory after writing
- `verbose`: Whether to log additional detail while writing
- `deterministic_name`: Ensures reproducible file naming based on input content

**Behavior:**

- The writer stage creates tar files containing the curated images
- Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside the tar archives
- Adjust `images_per_tar` to balance I/O, parallelism, and storage efficiency:
  - Smaller values create more files but enable better parallelism
  - Larger values reduce file count, which can limit how many shards are loaded in parallel
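
The trade-off above can be made concrete with a rough estimate of how many tar files a given `images_per_tar` produces. The sketch below is not part of NeMo Curator: `estimate_shard_count` is a hypothetical helper, the 250,000-image dataset is an invented example, and the calculation assumes every tar is filled to `images_per_tar`. If the writer closes tar files at batch boundaries, the real count can be higher.

```python
import math


def estimate_shard_count(total_images: int, images_per_tar: int) -> int:
    """Return the minimum number of tar shards needed to hold total_images.

    Hypothetical helper for planning only; it assumes each tar is filled
    completely before the next one is started.
    """
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return math.ceil(total_images / images_per_tar)


# Compare candidate settings for an example dataset of 250,000 images.
for images_per_tar in (100, 1000, 10000):
    shards = estimate_shard_count(250_000, images_per_tar)
    print(f"images_per_tar={images_per_tar:>6} -> at least {shards} tar files")
```

A useful rule of thumb is to keep the shard count comfortably above the number of parallel readers you expect downstream, so that no reader sits idle.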

## Output Format

The `ImageWriterStage` creates tar archives containing curated images with accompanying metadata files:

**Output Structure:**

```text
output/
├── images-{hash}-000000.tar    # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

**Format Details:**

- **Tar contents**: JPEG images with sequential or ID-based filenames
- **Metadata storage**: Separate Parquet files containing image paths, IDs, scores, and processing metadata
- **Naming**: File names are deterministic when `deterministic_name=True`; otherwise they are randomly generated
- **Sharding**: Controlled by `images_per_tar`, which sets how many images go into each tar file
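
Because each tar file has a Parquet file with the same stem, a consumer can pair them by file name. The following is a minimal sketch using only the Python standard library. `pair_shards` and `count_tar_members` are hypothetical helper names, not NeMo Curator APIs, and the sketch assumes the `.tar`/`.parquet` naming shown in the output structure above.

```python
import tarfile
from pathlib import Path


def pair_shards(output_dir: str) -> list[tuple[Path, Path]]:
    """Return (tar, parquet) pairs that share a file stem, sorted by name.

    Tar files without a matching Parquet file are skipped.
    """
    root = Path(output_dir)
    pairs = []
    for tar_path in sorted(root.glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        if parquet_path.exists():
            pairs.append((tar_path, parquet_path))
    return pairs


def count_tar_members(tar_path: Path) -> int:
    """Count the regular files stored in one tar shard."""
    with tarfile.open(tar_path) as archive:
        return sum(1 for member in archive if member.isfile())
```

Calling `pair_shards("/output/curated_dataset")` would return one pair per complete shard. The Parquet side of each pair can then be loaded with any Parquet reader, for example `pandas.read_parquet`, and its row count compared against `count_tar_members` for the matching tar.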

## Preparing for Downstream Use

- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
- Test loading the exported dataset before large-scale training.
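
One lightweight way to test loading before a large training run is to open every tar shard and confirm that each member looks like a JPEG. The sketch below is an illustration under stated assumptions rather than a NeMo Curator utility: `find_bad_members` and `smoke_test` are hypothetical names, and the check only inspects the two-byte JPEG start-of-image marker, so it will not catch images that are truncated later in the file.

```python
import tarfile
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8"  # Start-of-image marker at the beginning of a JPEG file


def find_bad_members(tar_path: str) -> list[str]:
    """Return names of tar members that are empty or do not start like a JPEG."""
    bad = []
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not member.isfile():
                continue  # Skip directories and other non-file entries
            handle = archive.extractfile(member)
            header = handle.read(len(JPEG_MAGIC)) if handle else b""
            if header != JPEG_MAGIC:
                bad.append(member.name)
    return bad


def smoke_test(output_dir: str) -> dict[str, list[str]]:
    """Map each tar shard name to its suspicious members (empty list if clean)."""
    shards = sorted(Path(output_dir).glob("*.tar"))
    return {shard.name: find_bad_members(str(shard)) for shard in shards}
```

Running `smoke_test` on the export directory and checking that every value is an empty list gives a quick signal that the shards are readable. A full decode with an image library is a stronger check when time allows.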
