--- description: "Remove duplicate and near-duplicate documents efficiently using GPU-accelerated and semantic deduplication modules" categories: ["workflows"] tags: ["deduplication", "fuzzy-dedup", "semantic-dedup", "exact-dedup", "gpu-accelerated", "minhash"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "explanation" modality: "text-only" --- # Deduplication Remove duplicate and near-duplicate documents from text datasets using NeMo Curator's GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training. NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the [data processing pipeline ](/about/concepts/text/data/processing). ## How It Works NeMo Curator provides three deduplication approaches, each optimized for different duplicate types: **Method**: MD5 hashing **Detects**: Character-for-character identical documents **Speed**: Fastest Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies. ```python from nemo_curator.core.client import RayClient from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow ray_client = RayClient() ray_client.start() exact_workflow = ExactDeduplicationWorkflow( input_path="/path/to/input/data", output_path="/path/to/output", text_field="text", perform_removal=False, # Identification only assign_id=True, input_filetype="parquet" ) result = exact_workflow.run() # result.metadata contains: total_time, num_duplicates, identification_time, id_generator_path ``` For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Exact Duplicate Removal ](/curate-text/process-data/deduplication/exact) for details. **Method**: MinHash + Locality Sensitive Hashing (LSH) **Detects**: Near-duplicates with minor edits (~80% similarity) **Speed**: Fast Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos. ```python from nemo_curator.core.client import RayClient from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow ray_client = RayClient() ray_client.start() fuzzy_workflow = FuzzyDeduplicationWorkflow( input_path="/path/to/input/data", cache_path="/path/to/cache", output_path="/path/to/output", text_field="text", perform_removal=False, # Identification only input_blocksize="1GiB", seed=42, char_ngrams=24, num_bands=20, minhashes_per_band=13 ) result = fuzzy_workflow.run() # result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path ``` For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Fuzzy Duplicate Removal ](/curate-text/process-data/deduplication/fuzzy) for details. **Method**: Embeddings + clustering + pairwise similarity **Detects**: Semantically similar content (paraphrases, translations) **Speed**: Moderate Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication. ```python from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow text_workflow = TextSemanticDeduplicationWorkflow( input_path="/path/to/input/data", output_path="/path/to/output", cache_path="/path/to/cache", text_field="text", n_clusters=100, eps=0.01, # Similarity threshold perform_removal=True # Complete deduplication ) result = text_workflow.run() # result.metadata contains: total_time, num_duplicates, num_duplicates_removed ``` **Note**: Two workflows available: - `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation - `SemanticDeduplicationWorkflow`: For pre-computed embeddings See [Semantic Deduplication ](/curate-text/process-data/deduplication/semdedup) for details. For fine-grained control, break semantic deduplication into separate stages: ```python from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow # 1. Create ID generator create_id_generator_actor() # 2. Generate embeddings separately (using vLLM) embedding_pipeline = Pipeline( stages=[ ParquetReader(file_paths=input_path, _generate_ids=True), VLLMEmbeddingModelStage( model_identifier="google/embeddinggemma-300m", text_field="text", embedding_field="embeddings", ), ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]) ] ) embedding_out = embedding_pipeline.run() # 3. Run clustering and pairwise similarity semantic_workflow = SemanticDeduplicationWorkflow( input_path=embedding_output_path, output_path=semantic_workflow_path, n_clusters=100, id_field="_curator_dedup_id", embedding_field="embeddings", eps=None # Skip duplicate identification for analysis ) result = semantic_workflow.run() # 4. Analyze results and choose eps parameter # 5. Identify and remove duplicates ``` This approach enables analysis of intermediate results and fine-grained control. --- ## Deduplication Methods Choose a deduplication method based on your needs: Identify and remove character-for-character duplicates using MD5 hashing hashing fast gpu-accelerated Identify and remove near-duplicates using MinHash and LSH similarity minhash lsh gpu-accelerated Remove semantically similar documents using embeddings embeddings gpu-accelerated meaning-based advanced ## Common Operations ### Document IDs Duplicate removal workflows require stable document identifiers. Choose one approach: - **Use `AddId`** to add IDs at the start of your pipeline - **Use reader-based ID generation** (`_generate_ids`, `_assign_ids`) backed by the ID Generator actor for stable integer IDs - **Use existing IDs** if your documents already have unique identifiers Some workflows write an ID generator state file (`*_id_generator.json`) for later removal when IDs are auto-assigned. ### Removing Duplicates Use `TextDuplicatesRemovalWorkflow` to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication. ```python from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow removal_workflow = TextDuplicatesRemovalWorkflow( input_path="/path/to/input", ids_to_remove_path="/path/to/duplicates", # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/ output_path="/path/to/clean", input_filetype="parquet", input_id_field="_curator_dedup_id", ids_to_remove_duplicate_id_field="_curator_dedup_id", id_generator_path="/path/to/id_generator.json" # Required when IDs were auto-assigned ) result = removal_workflow.run() # result.metadata contains: total_time, num_duplicates_removed ``` **When `assign_id=True`** (IDs auto-assigned): - Duplicate IDs file contains `_curator_dedup_id` column - Set `ids_to_remove_duplicate_id_field="_curator_dedup_id"` - `id_generator_path` is required **When `assign_id=False`** (using existing IDs): - Duplicate IDs file contains the column specified by `id_field` (e.g., `"id"`) - Set `ids_to_remove_duplicate_id_field` to match your `id_field` value - `id_generator_path` not required ### Workflow Results All deduplication workflows return a `WorkflowRunResult` object with timing and duplicate count metadata: ```python from nemo_curator.pipeline.workflow import WorkflowRunResult result = exact_workflow.run() print(result.metadata) # {"total_time": 42.1, "num_duplicates": 1500, ...} ``` Available metadata varies by workflow. Common keys include `total_time` and `num_duplicates`. ### Outputs and Artifacts Each deduplication method produces specific output files and directories: | Method | Duplicate IDs Location | ID Generator File | Deduplicated Output | | --- | --- | --- | --- | | Exact | `ExactDuplicateIds/` (parquet) | `exact_id_generator.json` (if `assign_id=True`) | Via `TextDuplicatesRemovalWorkflow` | | Fuzzy | `FuzzyDuplicateIds/` (parquet) | `fuzzy_id_generator.json` (if IDs auto-assigned) | Via `TextDuplicatesRemovalWorkflow` | | Semantic | `output_path/duplicates/` (parquet) | N/A | `output_path/deduplicated/` (if `perform_removal=True`) | **Column names**: - `_curator_dedup_id` when `assign_id=True` or IDs are auto-assigned - Matches `id_field` parameter when `assign_id=False` ## Choosing a Deduplication Method Compare deduplication methods to select the best approach for your dataset: | Method | Best For | Speed | Duplicate Types | GPU Required | | --- | --- | --- | --- | --- | | **Exact** | Identical copies | Very fast | Character-for-character matches | Required | | **Fuzzy** | Near-duplicates with small changes | Fast | Minor edits, reformatting (~80% similarity) | Required | | **Semantic** | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required | ### Quick Decision Guide Use this guide to quickly select the right method: - **Start with Exact** if you have numerous identical documents or need the fastest speed - **Use Fuzzy** if you need to catch near-duplicates with minor formatting differences - **Use Semantic** for meaning-based deduplication on large, diverse datasets **Exact Deduplication**: - Removing identical copies of documents - Fast initial deduplication pass - Datasets with numerous exact duplicates - When speed is more important than detecting near-duplicates **Fuzzy Deduplication**: - Removing near-duplicate documents with minor formatting differences - Detecting documents with small edits or typos - Fast deduplication when exact matching misses numerous duplicates - When speed is important but some near-duplicate detection is needed **Semantic Deduplication**: - Removing semantically similar content (paraphrases, translations) - Large, diverse web-scale datasets - When meaning-based deduplication is more important than speed - Advanced use cases requiring embedding-based similarity detection You can combine deduplication methods for comprehensive duplicate removal: 1. **Exact → Fuzzy → Semantic**: Start with fastest methods, then apply more sophisticated methods 2. **Exact → Semantic**: Use exact for quick wins, then semantic for meaning-based duplicates 3. **Fuzzy → Semantic**: Use fuzzy for near-duplicates, then semantic for paraphrases Run each method independently, then combine duplicate IDs before removal. For detailed implementation guides, see: - [Exact Duplicate Removal ](/curate-text/process-data/deduplication/exact) - [Fuzzy Duplicate Removal ](/curate-text/process-data/deduplication/fuzzy) - [Semantic Deduplication ](/curate-text/process-data/deduplication/semdedup) ### GPU Acceleration All deduplication workflows require GPU acceleration: - **Exact**: Ray backend with GPU support for MD5 hashing operations - **Fuzzy**: Ray backend with GPU support for MinHash computation and LSH operations - **Semantic**: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation GPU acceleration provides significant speedup for large datasets through parallel processing. ### Hardware Requirements - **GPU**: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic) - **Memory**: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions - **Executors**: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support ### Backend Setup For optimal performance with large datasets, configure Ray backend: ```python from nemo_curator.core.client import RayClient client = RayClient( num_cpus=64, # Adjust based on available cores num_gpus=4 # Should be roughly 2x the memory of embeddings ) client.start() try: result = workflow.run() finally: client.stop() ``` For TB-scale datasets, consider distributed GPU clusters with Ray. ### ID Generator for Large-Scale Operations For large-scale duplicate removal, persist the ID Generator for consistent document tracking: ```python from nemo_curator.stages.deduplication.id_generator import ( create_id_generator_actor, write_id_generator_to_disk, kill_id_generator_actor ) create_id_generator_actor() id_generator_path = "semantic_id_generator.json" write_id_generator_to_disk(id_generator_path) kill_id_generator_actor() # Use saved ID generator in removal workflow removal_workflow = TextDuplicatesRemovalWorkflow( input_path=input_path, ids_to_remove_path=duplicates_path, output_path=output_path, id_generator_path=id_generator_path, # ... other parameters ) ``` The ID Generator ensures consistent IDs across workflow stages. ## Next Steps **Ready to use deduplication?** - **New to deduplication**: Start with [Exact Duplicate Removal ](/curate-text/process-data/deduplication/exact) for the fastest approach - **Need near-duplicate detection**: See [Fuzzy Duplicate Removal ](/curate-text/process-data/deduplication/fuzzy) for MinHash-based matching - **Require semantic matching**: Explore [Semantic Deduplication ](/curate-text/process-data/deduplication/semdedup) for meaning-based deduplication **For hands-on guidance**: See [Text Curation Tutorials ](/curate-text/tutorials) for step-by-step examples.