---
description: "Generate CLIP embeddings for images using OpenAI's ViT-L/14 model for downstream classification and filtering tasks"
categories: ["how-to-guides"]
tags: ["embedding", "clip", "vit", "gpu-accelerated", "pipeline-stage"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "image-only"
---

# CLIP ImageEmbeddingStage

The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's ViT-L/14 model. Downstream stages such as aesthetic filtering, NSFW detection, and semantic deduplication consume these embeddings.

## Model Details

- **Architecture:** [OpenAI CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14)
- **Output Field:** `embedding` (stored in `ImageObject.embedding`)
- **Embedding Dimension:** 768 (the ViT-L/14 image-text projection size)
- **Input Requirements:** RGB images loaded by `ImageReaderStage`

## How It Works

The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.

## Prerequisites

Before using the `ImageEmbeddingStage`, make sure the model and system requirements below are met.

### Model Setup

The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:

1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
2. Cache the model for subsequent runs
3. Load the model onto the GPU (or the CPU if no GPU is available)

**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.

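Before the first run, it can help to confirm that `model_dir` exists and has room for the download. The sketch below is not part of NeMo Curator: `check_model_dir` and `APPROX_MODEL_BYTES` are hypothetical names, and the size is only the approximate figure quoted above. It relies solely on the Python standard library.

```python
import shutil
from pathlib import Path

# Approximate download size quoted above (~3.5GB); an assumption, not an exact value.
APPROX_MODEL_BYTES = int(3.5 * 1024**3)


def check_model_dir(model_dir: str, required_bytes: int = APPROX_MODEL_BYTES) -> bool:
    """Hypothetical pre-flight helper.

    Creates model_dir if it does not exist, then reports whether the
    filesystem that holds it has at least required_bytes of free space.
    """
    path = Path(model_dir)
    path.mkdir(parents=True, exist_ok=True)
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # Placeholder path; replace with the directory passed as model_dir.
    if check_model_dir("/path/to/models"):
        print("Enough free space for the model download.")
    else:
        print("Not enough free space; free up disk or choose another model_dir.")
```

Running a check like this ahead of the pipeline surfaces disk problems before any worker starts downloading weights.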
### System Requirements

- **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
- **Memory:** At least 8GB GPU memory for batch processing
- **Disk Space:** ~4GB for model weights
- **Python Dependencies:** PyTorch, transformers (installed with NeMo Curator)

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

## Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_dir` | str | None | Path to directory containing CLIP model weights |
| `model_inference_batch_size` | int | 32 | Batch size for model inference |
| `num_gpus_per_worker` | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU) |
| `remove_image_data` | bool | False | Whether to remove image data after embedding generation (saves memory) |
| `verbose` | bool | False | Enable verbose logging for debugging |

## Performance Notes

- The CLIP model requires GPU acceleration for reasonable performance.
- Increase `model_inference_batch_size` for better throughput if GPU memory allows.
- Set `remove_image_data=True` if you don't need the raw image data for downstream stages.
- The stage automatically handles different image sizes by preprocessing them to 224x224.

## Best Practices

- Use GPU-enabled environments for best performance.
- Adjust `model_inference_batch_size` based on available GPU memory (start with 32, increase if memory allows).
- Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
- Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.

## Output Format

After processing, each `ImageObject` will have:

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={}
)
```

## Additional Resources

- [Complete Pipeline Example](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)
- [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020)