---
description: "Generate text embeddings using vLLM, Sentence Transformers, or Hugging Face models for deduplication, similarity search, and downstream tasks"
categories: ["how-to-guides"]
tags: ["embeddings", "vllm", "sentence-transformers", "gpu-accelerated", "similarity-search"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Text Embedding

Generate text embeddings for large-scale datasets using NeMo Curator's built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.

## How It Works

NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:

1. **`EmbeddingCreatorStage`**: A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers' `SentenceTransformer` and Hugging Face's `AutoModel` classes via the `use_sentence_transformer` flag.
2. **`VLLMEmbeddingModelStage`**: A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM's batching and GPU utilization provide significant throughput gains.
3. **`SentenceTransformerEmbeddingModelStage`**: A model stage that uses the `sentence-transformers` library directly. Used internally by `EmbeddingCreatorStage` when `use_sentence_transformer=True`.
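
To make the "tokenize, then embed" sequence more concrete, the following is a minimal pure-Python sketch of one common pooling strategy: masked mean pooling followed by L2 normalization, which collapses per-token vectors into a single embedding. It is illustrative only. The helper names `mean_pool` and `l2_normalize` are hypothetical, the toy vectors are made up, and the sketch does not claim to reflect the pooling that NeMo Curator's stages actually apply.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy input (made up): three token vectors, the last one is padding.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)  # averages only the first two rows
embedding = l2_normalize(pooled)                   # unit-length sentence embedding
```

Normalizing to unit length is a common convention because it makes the dot product of two embeddings equal to their cosine similarity.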

## Choosing an Embedding Backend

| Backend | Best For | GPU Utilization | Setup |
| --- | --- | --- | --- |
| `EmbeddingCreatorStage` (Sentence Transformers) | Small to medium models (e.g., `all-MiniLM-L6-v2`) | Good | Included in `text_cuda12` extra |
| `VLLMEmbeddingModelStage` | Large models (e.g., `google/embeddinggemma-300m`) and semantic deduplication | Excellent | Included in `text_cuda12` extra |
| `EmbeddingCreatorStage` (AutoModel) | Custom pooling strategies | Good | Set `use_sentence_transformer=False` |

Benchmarks on 5 GB of Common Crawl data show that vLLM outperforms Sentence Transformers for larger embedding models, while Sentence Transformers is faster for smaller models. The vLLM `pretokenize` mode provides the best per-task throughput across both model sizes when amortized over many tasks.

## Quick Start

### EmbeddingCreatorStage

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="text_embeddings",
    stages=[
        # Read Parquet files from input_data/, loading only the "text" field.
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        # Tokenize and embed the "text" field; results go to "embeddings".
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            embedding_field="embeddings",
            model_inference_batch_size=256,
        ),
        # Write the original text alongside its embedding.
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
```

### VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)

`VLLMEmbeddingModelStage` is the default embedding backend for semantic deduplication, using `google/embeddinggemma-300m`. It provides better GPU utilization and throughput for large embedding models.

See the [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder) guide for setup, configuration, and code examples.

---

## Available Embedding Tools

- [vLLM Embedder](/curate-text/process-data/embeddings/vllm-embedder): Generate embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models.

---

## Integration with Semantic Deduplication

Text embeddings are a key input for [semantic deduplication](/curate-text/process-data/deduplication/semdedup). The `TextSemanticDeduplicationWorkflow` uses `VLLMEmbeddingModelStage` internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process.
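
To illustrate why embeddings are a useful input for deduplication, here is a toy pure-Python sketch that drops any document whose embedding is nearly parallel to one already kept. It is not how `TextSemanticDeduplicationWorkflow` is implemented. The helpers `cosine_similarity` and `drop_near_duplicates`, the `0.95` threshold, and the sample vectors are all assumptions made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(embeddings, threshold=0.95):
    """Return indices of embeddings to keep, scanning in order.

    An embedding is kept only if its similarity to every previously
    kept embedding is below the (hypothetical) threshold.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Made-up embeddings: the second vector is almost identical in direction to the first.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]

kept_indices = drop_near_duplicates(embeddings)
```

The pairwise scan is quadratic in the number of documents, so it only works for a handful of vectors. It is meant to show the role embeddings play, not to stand in for a scalable workflow.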