---
description: "Generate CLIP embeddings for images using OpenAI's ViT-L/14 model for downstream classification and filtering tasks"
categories: ["how-to-guides"]
tags: ["embedding", "clip", "vit", "gpu-accelerated", "pipeline-stage"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "image-only"
---

# CLIP ImageEmbeddingStage

The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.

## Model Details

- **Architecture:** [OpenAI CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14)
- **Output Field:** `embedding` (stored in `ImageObject.embedding`)
- **Embedding Dimension:** 768 (the size of the ViT-L/14 image-text projection space)
- **Input Requirements:** RGB images loaded by `ImageReaderStage`

## How It Works

The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.
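
The batching step described above can be pictured with a minimal, dependency-free sketch. `FakeImageObject`, `embed_in_batches`, and `fake_embed` are hypothetical stand-ins invented for illustration, not NeMo Curator APIs; the real stage runs CLIP preprocessing and the ViT-L/14 model where `fake_embed` is called here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class FakeImageObject:
    """Hypothetical stand-in for ImageObject with only the fields used here."""
    image_id: str
    image_data: Sequence[float]
    embedding: Optional[List[float]] = None


def embed_in_batches(
    images: List[FakeImageObject],
    embed_fn: Callable[[List[Sequence[float]]], List[List[float]]],
    batch_size: int = 32,
    remove_image_data: bool = False,
) -> None:
    """Run embed_fn over fixed-size chunks and attach each result in place."""
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        vectors = embed_fn([img.image_data for img in chunk])
        for img, vec in zip(chunk, vectors):
            img.embedding = vec
            if remove_image_data:
                img.image_data = ()  # drop pixels once the embedding exists


def fake_embed(batch):
    """Placeholder model: returns a 2-element vector per image (not CLIP)."""
    return [[float(len(pixels)), float(sum(pixels))] for pixels in batch]


# 70 toy "images" are processed as chunks of 32, 32, and 6.
images = [FakeImageObject(str(i), [i, i + 1]) for i in range(70)]
embed_in_batches(images, fake_embed, batch_size=32)
```

Mutating each object in place mirrors the description above, where results land in the `embedding` attribute of objects the batch already holds.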

## Prerequisites

Before using the `ImageEmbeddingStage`, make sure the model setup and system requirements below are in place.

### Model Setup

The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:
1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
2. Cache the model for subsequent runs
3. Load the model onto the GPU, or onto the CPU if no GPU is available

**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.

### System Requirements

- **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
- **Memory:** At least 8GB GPU memory for batch processing
- **Disk Space:** ~4GB for model weights
- **Python Dependencies:** PyTorch, transformers (installed with NeMo Curator)
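
A quick pre-flight check for the disk-space requirement might look like the sketch below. It uses only the Python standard library; `has_free_space` is a hypothetical helper, and the 4 GB default simply mirrors the figure listed above.

```python
import shutil
from pathlib import Path


def has_free_space(model_dir: str, required_gb: float = 4.0) -> bool:
    """Return True if the filesystem holding model_dir has required_gb free."""
    path = Path(model_dir).expanduser()
    # Walk up to the nearest existing directory so the check also works
    # before model_dir has been created.
    while not path.exists() and path != path.parent:
        path = path.parent
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if not has_free_space("/path/to/models"):
    print("Less than 4 GB free for model weights; the first download may fail.")
```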

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

## Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_dir` | str | None | Path to the directory where CLIP model weights are downloaded and cached |
| `model_inference_batch_size` | int | 32 | Batch size for model inference |
| `num_gpus_per_worker` | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU) |
| `remove_image_data` | bool | False | Whether to remove image data after embedding generation (saves memory) |
| `verbose` | bool | False | Enable verbose logging for debugging |

## Performance Notes

- The stage can fall back to CPU, but GPU acceleration is needed for practical throughput.
- Increase `model_inference_batch_size` for better throughput if GPU memory allows.
- Set `remove_image_data=True` if you don't need the raw image data for downstream stages.
- The stage automatically handles different image sizes by preprocessing them to 224x224.

## Best Practices

- Use GPU-enabled environments for best performance.
- Adjust `model_inference_batch_size` based on available GPU memory (start with 32, increase if memory allows).
- Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
- Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.
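
As a back-of-the-envelope aid when choosing `num_gpus_per_worker`, the sketch below computes how many workers of one stage could share a single GPU at a given fraction. It assumes workers are packed purely by fractional share, which simplifies what an executor actually does; `workers_per_gpu` is a hypothetical helper, not part of NeMo Curator.

```python
import math
from fractions import Fraction


def workers_per_gpu(num_gpus_per_worker: float) -> int:
    """Upper bound on workers that fit on one GPU at the given share."""
    # Going through str() avoids float artifacts such as 1 // 0.1 == 9.0.
    share = Fraction(str(num_gpus_per_worker))
    if not 0 < share <= 1:
        raise ValueError("expected a GPU share in the range (0, 1]")
    return math.floor(1 / share)


print(workers_per_gpu(0.25))  # 4
print(workers_per_gpu(0.5))   # 2
```

If GPU utilization stays low at the current setting, a smaller share allows more concurrent workers, provided their combined memory still fits on the device.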

## Output Format

After processing, each `ImageObject` will have:

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={}
)
```
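
Downstream tasks such as semantic deduplication typically compare embeddings by cosine similarity. The sketch below shows that comparison on plain Python sequences so it has no dependencies; `cosine_similarity` is an illustrative helper, not a NeMo Curator function, and the toy vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare an all-zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-element vectors; real CLIP embeddings are much longer.
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

Values near 1.0 indicate near-duplicate content, while values near 0.0 indicate unrelated images.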

## Additional Resources

- [Complete Pipeline Example](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)
- [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020)