---
description: "Save metadata, export filtered datasets, and configure output sharding for downstream use after image curation"
categories: ["how-to-guides"]
tags: ["data-export", "parquet", "tar-files", "filtering", "sharding", "metadata"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "image-only"
---

# Saving and Exporting Image Datasets

After processing and filtering your image datasets using NeMo Curator's pipeline stages, you can save results and export curated data for downstream use. The pipeline-based approach provides flexible options for saving and exporting your curated image data.

## Saving Results with ImageWriterStage

The `ImageWriterStage` is the primary method for saving curated images and metadata to tar archives with accompanying Parquet files. This stage is typically the final step in your image curation pipeline.

```python
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add ImageWriterStage to your pipeline
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_images",  # Output directory for tar files and metadata
    images_per_tar=1000,                  # Number of images per tar file
    remove_image_data=True,               # Remove image data from memory after writing
    verbose=True,                         # Enable progress logging
))
```

### Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `output_dir` | str | Required | Output directory for tar files and metadata |
| `images_per_tar` | int | 1000 | Number of images per tar file (controls sharding) |
| `verbose` | bool | False | Enable verbose logging for debugging |
| `deterministic_name` | bool | True | Use deterministic hash-based naming for output files |
| `remove_image_data` | bool | False | Remove image data from memory after writing (saves memory) |

**Set `images_per_tar` ≤ upstream `batch_size`.** Each upstream batch produces one tar shard.
If `images_per_tar` is larger than the reader's `batch_size`, every tar contains only `batch_size` images instead of `images_per_tar` and your output is silently under-packed. The pipeline emits a warning when this happens. Verify that your reader's `batch_size` matches or exceeds `images_per_tar`.

## Output Format

The `ImageWriterStage` creates:

* **Tar Archives**: `.tar` files containing JPEG images
* **Parquet Files**: `.parquet` files with metadata for each corresponding tar file
* **Deterministic Naming**: Files named with content-based hashes for reproducibility
* **Preserved Metadata**: All scores and metadata from processing stages stored in Parquet files

**Output Structure:**

```bash
output/
├── images-{hash}-000000.tar      # Contains JPEG images
├── images-{hash}-000000.parquet  # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
```

Each tar file contains JPEG images with sequential or ID-based filenames, while metadata (including `aesthetic_score`, `nsfw_score`, and other processing data) is stored in the accompanying Parquet files.

---

For more details on stage parameters and customization options, see the [ImageWriterStage documentation](/curate-images/process-data) and the [Complete Tutorial](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py).

{/* More details and examples will be added here. */}
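
One way to check a finished run for the under-packing problem described above is to count the images in each tar shard and confirm that a Parquet file with the same stem sits next to it. The following is a minimal sketch using only the Python standard library. The helper name `summarize_shards` is hypothetical and not part of NeMo Curator, and the sketch assumes that every regular file inside a tar is one image and that each tar and its Parquet file share a filename stem, as in the output structure shown earlier.

```python
import tarfile
from pathlib import Path


def summarize_shards(output_dir, images_per_tar):
    """Return one summary dict per tar shard found in output_dir.

    Hypothetical helper, not a NeMo Curator API. Assumes each regular
    file inside a tar is one image and that the metadata file shares
    the tar's stem with a .parquet extension.
    """
    summaries = []
    for tar_path in sorted(Path(output_dir).glob("*.tar")):
        with tarfile.open(tar_path) as tar:
            # Count regular files only, skipping directories and links.
            n_images = sum(1 for member in tar.getmembers() if member.isfile())
        summaries.append({
            "shard": tar_path.stem,
            "images": n_images,
            # True when the matching metadata file exists beside the tar.
            "has_parquet": tar_path.with_suffix(".parquet").exists(),
            # Flags shards holding fewer images than the configured target.
            "under_packed": n_images < images_per_tar,
        })
    return summaries
```

Calling `summarize_shards("/output/curated_images", images_per_tar=1000)` after a run lists every shard with its image count. A single under-packed shard can be legitimate, for example when the last batch of a dataset is smaller than the rest, but many under-packed shards with the same count suggest that the reader's `batch_size` is below `images_per_tar`.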