---
description: "Remove duplicate and near-duplicate documents efficiently using GPU-accelerated and semantic deduplication modules"
categories: ["workflows"]
tags: ["deduplication", "fuzzy-dedup", "semantic-dedup", "exact-dedup", "gpu-accelerated", "minhash"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "explanation"
modality: "text-only"
---
# Deduplication
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator's GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the [data processing pipeline](/about/concepts/text/data/processing).
## How It Works
Each approach targets a different type of duplicate:
### Exact Deduplication
- **Method**: MD5 hashing
- **Detects**: Character-for-character identical documents
- **Speed**: Fastest

Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies.
```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

ray_client = RayClient()
ray_client.start()

exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    assign_id=True,
    input_filetype="parquet",
)
result = exact_workflow.run()
# result.metadata contains: total_time, num_duplicates, identification_time, id_generator_path
```
For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for details.
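The idea behind exact matching can be illustrated without NeMo Curator. The following is a minimal, CPU-only sketch using Python's `hashlib`; the helper name `find_exact_duplicates` and the keep-the-first-copy policy are illustrative assumptions, not the workflow's actual implementation.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents whose text repeats an earlier document.

    Hypothetical helper: keeps the first document in each hash group.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    duplicates: list[str] = []
    for ids in groups.values():
        duplicates.extend(ids[1:])  # keep ids[0], flag the rest
    return duplicates


docs = {
    "a": "The quick brown fox.",
    "b": "The quick brown fox.",  # exact copy of "a"
    "c": "The quick brown fox!",  # differs by one character
}
print(find_exact_duplicates(docs))  # ['b']
```

Because a single differing character changes the hash, document `c` is not flagged here; catching it is the job of fuzzy deduplication.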
### Fuzzy Deduplication
- **Method**: MinHash + Locality Sensitive Hashing (LSH)
- **Detects**: Near-duplicates with minor edits (~80% similarity)
- **Speed**: Fast

Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

ray_client = RayClient()
ray_client.start()

fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    input_blocksize="1GiB",
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
result = fuzzy_workflow.run()
# result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path
```
For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for details.
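To see how `num_bands` and `minhashes_per_band` relate to the ~80% similarity figure, the textbook LSH banding analysis can be worked through in a few lines. This is a sketch under the assumption that the workflow follows the standard banding model (two documents become candidates when all hashes in at least one band match); the function names are hypothetical.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that two documents agree on every hash in at least one band."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Common approximation of the similarity threshold: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


num_bands, minhashes_per_band = 20, 13  # values from the example above
print(f"approximate threshold: {approximate_threshold(num_bands, minhashes_per_band):.2f}")
for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands, minhashes_per_band)
    print(f"similarity={s:.1f} -> candidate probability={p:.3f}")
```

With 20 bands of 13 hashes, the approximate threshold works out to about 0.79, which matches the ~80% similarity quoted above. Under this model, more bands lower the threshold and more hashes per band raise it.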
### Semantic Deduplication
- **Method**: Embeddings + clustering + pairwise similarity
- **Detects**: Semantically similar content (paraphrases, translations)
- **Speed**: Moderate

Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
```python
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

text_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    cache_path="/path/to/cache",
    text_field="text",
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    perform_removal=True,  # Complete deduplication
)
result = text_workflow.run()
# result.metadata contains: total_time, num_duplicates, num_duplicates_removed
```
**Note**: Two workflows are available:
- `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation
- `SemanticDeduplicationWorkflow`: For pre-computed embeddings

See [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for details.
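The pairwise similarity step can be pictured with a small pure-Python sketch. It assumes, in the style of SemDeDup, that a document is treated as a duplicate when its cosine similarity to another document in the same cluster exceeds `1 - eps`; that interpretation of `eps`, the greedy keep-first policy, and the helper names are assumptions for illustration rather than the workflow's confirmed behavior.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flag_semantic_duplicates(cluster: dict[str, list[float]], eps: float) -> list[str]:
    """Flag documents that are too similar to an already-kept document.

    Assumption: a pair counts as duplicate when cosine similarity > 1 - eps.
    """
    kept: list[str] = []
    flagged: list[str] = []
    for doc_id, embedding in cluster.items():
        if any(cosine_similarity(embedding, cluster[k]) > 1.0 - eps for k in kept):
            flagged.append(doc_id)
        else:
            kept.append(doc_id)
    return flagged


# Toy 2-D embeddings for documents that landed in the same cluster
cluster = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.999, 0.01],  # nearly the same direction as doc_a
    "doc_c": [0.0, 1.0],     # orthogonal to doc_a
}
print(flag_semantic_duplicates(cluster, eps=0.01))  # ['doc_b']
```

Clustering first keeps this comparison tractable: pairwise similarity is only computed among documents that share a cluster, not across the whole dataset.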
For fine-grained control, break semantic deduplication into separate stages:
```python
# Import paths for Pipeline, ParquetReader, and ParquetWriter assume the current package layout
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# Placeholder paths; replace with your own locations
input_path = "/path/to/input/data"
embedding_output_path = "/path/to/embeddings"
semantic_workflow_path = "/path/to/semantic/output"

# 1. Create ID generator
create_id_generator_actor()

# 2. Generate embeddings separately (using vLLM)
embedding_pipeline = Pipeline(
    name="embedding_pipeline",
    stages=[
        ParquetReader(file_paths=input_path, _generate_ids=True),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]),
    ],
)
embedding_out = embedding_pipeline.run()

# 3. Run clustering and pairwise similarity
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None,  # Skip duplicate identification for analysis
)
result = semantic_workflow.run()

# 4. Analyze results and choose eps parameter
# 5. Identify and remove duplicates
```
This approach enables analysis of intermediate results and fine-grained control.
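One way to approach step 4 is to sweep candidate `eps` values over the per-document similarity scores and see how many documents each setting would flag. The sketch below uses made-up scores and assumes the `similarity > 1 - eps` rule; how you load the real scores depends on the pairwise output files, which are not shown here.

```python
def count_flagged(max_similarities: list[float], eps: float) -> int:
    """Count documents whose nearest in-cluster neighbor exceeds 1 - eps."""
    return sum(1 for sim in max_similarities if sim > 1.0 - eps)


# Hypothetical "similarity to nearest neighbor in the same cluster" per document
max_similarities = [0.9995, 0.995, 0.97, 0.93, 0.80, 0.42]

for eps in (0.001, 0.01, 0.05, 0.1):
    flagged = count_flagged(max_similarities, eps)
    print(f"eps={eps}: {flagged} of {len(max_similarities)} documents flagged")
```

A larger `eps` flags more documents, so inspecting a sample of flagged pairs at each setting helps find the point where distinct documents start being removed.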
---
## Deduplication Methods
Choose a deduplication method based on your needs:
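- [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact): Identify and remove character-for-character duplicates using MD5 hashing. Tags: `hashing`, `fast`, `gpu-accelerated`
- [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy): Identify and remove near-duplicates using MinHash and LSH similarity. Tags: `minhash`, `lsh`, `gpu-accelerated`
- [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup): Remove semantically similar documents using embeddings. Tags: `embeddings`, `gpu-accelerated`, `meaning-based`, `advanced`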
## Common Operations
### Document IDs
Duplicate removal workflows require stable document identifiers. Choose one approach:
- **Use `AddId`** to add IDs at the start of your pipeline
- **Use reader-based ID generation** (`_generate_ids`, `_assign_ids`) backed by the ID Generator actor for stable integer IDs
- **Use existing IDs** if your documents already have unique identifiers
Some workflows write an ID generator state file (`*_id_generator.json`) for later removal when IDs are auto-assigned.
### Removing Duplicates
Use `TextDuplicatesRemovalWorkflow` to remove the documents listed in a duplicate IDs output from your original dataset. It works with IDs produced by exact, fuzzy, or semantic deduplication.
```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",  # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/id_generator.json",  # Required when IDs were auto-assigned
)
result = removal_workflow.run()
# result.metadata contains: total_time, num_duplicates_removed
```
**When `assign_id=True`** (IDs auto-assigned):
- Duplicate IDs file contains `_curator_dedup_id` column
- Set `ids_to_remove_duplicate_id_field="_curator_dedup_id"`
- `id_generator_path` is required
**When `assign_id=False`** (using existing IDs):
- Duplicate IDs file contains the column specified by `id_field` (e.g., `"id"`)
- Set `ids_to_remove_duplicate_id_field` to match your `id_field` value
- `id_generator_path` not required
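The two cases above can be captured in a small helper that builds the ID-related keyword arguments. This is a hypothetical convenience function, not part of NeMo Curator; it also assumes that `input_id_field` should match `id_field` when you bring your own IDs.

```python
from typing import Optional


def removal_id_kwargs(
    assign_id: bool,
    id_field: str = "id",
    id_generator_path: Optional[str] = None,
) -> dict:
    """Build ID-related keyword arguments for TextDuplicatesRemovalWorkflow.

    Hypothetical helper encoding the assign_id=True / assign_id=False cases.
    """
    if assign_id:
        if id_generator_path is None:
            raise ValueError("id_generator_path is required when IDs were auto-assigned")
        return {
            "input_id_field": "_curator_dedup_id",
            "ids_to_remove_duplicate_id_field": "_curator_dedup_id",
            "id_generator_path": id_generator_path,
        }
    # Existing IDs: both fields point at the user's own ID column (assumption)
    return {
        "input_id_field": id_field,
        "ids_to_remove_duplicate_id_field": id_field,
    }


print(removal_id_kwargs(assign_id=False, id_field="doc_id"))
```

The resulting dictionary can be unpacked into the workflow constructor alongside the path arguments, for example `TextDuplicatesRemovalWorkflow(input_path=..., **kwargs)`.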
### Workflow Results
All deduplication workflows return a `WorkflowRunResult` object with timing and duplicate count metadata:
```python
from nemo_curator.pipeline.workflow import WorkflowRunResult
result = exact_workflow.run()
print(result.metadata) # {"total_time": 42.1, "num_duplicates": 1500, ...}
```
Available metadata varies by workflow. Common keys include `total_time` and `num_duplicates`.
### Outputs and Artifacts
Each deduplication method produces specific output files and directories:
| Method | Duplicate IDs Location | ID Generator File | Deduplicated Output |
| --- | --- | --- | --- |
| Exact | `ExactDuplicateIds/` (parquet) | `exact_id_generator.json` (if `assign_id=True`) | Via `TextDuplicatesRemovalWorkflow` |
| Fuzzy | `FuzzyDuplicateIds/` (parquet) | `fuzzy_id_generator.json` (if IDs auto-assigned) | Via `TextDuplicatesRemovalWorkflow` |
| Semantic | `output_path/duplicates/` (parquet) | N/A | `output_path/deduplicated/` (if `perform_removal=True`) |
**Column names**:
- `_curator_dedup_id` when `assign_id=True` or IDs are auto-assigned
- Matches `id_field` parameter when `assign_id=False`
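When scripting a removal step, it can help to resolve the duplicate IDs directory from the method name. The sketch below takes the directory names from the table above; placing `ExactDuplicateIds/` and `FuzzyDuplicateIds/` directly under `output_path` is an assumption, and `duplicate_ids_dir` is a hypothetical helper.

```python
from pathlib import Path

# Directory names from the table above; their location directly under
# output_path for the exact and fuzzy methods is an assumption.
DUPLICATE_ID_DIRS = {
    "exact": "ExactDuplicateIds",
    "fuzzy": "FuzzyDuplicateIds",
    "semantic": "duplicates",
}


def duplicate_ids_dir(method: str, output_path: str) -> Path:
    """Return the directory expected to hold duplicate IDs for a method."""
    try:
        return Path(output_path) / DUPLICATE_ID_DIRS[method]
    except KeyError:
        raise ValueError(f"unknown deduplication method: {method!r}") from None


print(duplicate_ids_dir("fuzzy", "/path/to/output"))
```

The returned path is what you would pass as `ids_to_remove_path` to `TextDuplicatesRemovalWorkflow`.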
## Choosing a Deduplication Method
Compare deduplication methods to select the best approach for your dataset:
| Method | Best For | Speed | Duplicate Types | GPU Required |
| --- | --- | --- | --- | --- |
| **Exact** | Identical copies | Very fast | Character-for-character matches | Required |
| **Fuzzy** | Near-duplicates with small changes | Fast | Minor edits, reformatting (~80% similarity) | Required |
| **Semantic** | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |
### Quick Decision Guide
Use this guide to quickly select the right method:
- **Start with Exact** if you have numerous identical documents or need the fastest speed
- **Use Fuzzy** if you need to catch near-duplicates with minor formatting differences
- **Use Semantic** for meaning-based deduplication on large, diverse datasets
**Exact Deduplication**:
- Removing identical copies of documents
- Fast initial deduplication pass
- Datasets with numerous exact duplicates
- When speed is more important than detecting near-duplicates
**Fuzzy Deduplication**:
- Removing near-duplicate documents with minor formatting differences
- Detecting documents with small edits or typos
- Fast deduplication when exact matching misses numerous duplicates
- When speed is important but some near-duplicate detection is needed
**Semantic Deduplication**:
- Removing semantically similar content (paraphrases, translations)
- Large, diverse web-scale datasets
- When meaning-based deduplication is more important than speed
- Advanced use cases requiring embedding-based similarity detection
You can combine deduplication methods for comprehensive duplicate removal:
1. **Exact → Fuzzy → Semantic**: Start with fastest methods, then apply more sophisticated methods
2. **Exact → Semantic**: Use exact for quick wins, then semantic for meaning-based duplicates
3. **Fuzzy → Semantic**: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
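Combining the results amounts to a set union over the duplicate ID lists. The following sketch assumes every method reports IDs from the same ID space (for example, your own `id_field`, or one shared ID generator); `combine_duplicate_ids` is a hypothetical helper, and loading the IDs from the parquet outputs is left out.

```python
from typing import Iterable


def combine_duplicate_ids(*id_lists: Iterable[int]) -> list[int]:
    """Union duplicate IDs from several methods into one sorted list."""
    combined: set[int] = set()
    for ids in id_lists:
        combined.update(ids)
    return sorted(combined)


exact_ids = [3, 7]
fuzzy_ids = [7, 12]
semantic_ids = [12, 20, 21]

print(combine_duplicate_ids(exact_ids, fuzzy_ids, semantic_ids))  # [3, 7, 12, 20, 21]
```

If two methods keep different representatives of the same duplicate group, the union can flag every copy in that group. Running the methods in sequence on already-cleaned output avoids that at the cost of extra passes.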
For detailed implementation guides, see:
- [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact)
- [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy)
- [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup)
### GPU Acceleration
All deduplication workflows require GPU acceleration:
- **Exact**: Ray backend with GPU support for MD5 hashing operations
- **Fuzzy**: Ray backend with GPU support for MinHash computation and LSH operations
- **Semantic**: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
### Hardware Requirements
- **GPU**: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
- **Memory**: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
- **Executors**: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
### Backend Setup
For optimal performance with large datasets, configure the Ray backend explicitly and stop the client when the workflow finishes:
```python
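from nemo_curator.core.client import RayClient

client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4,  # Adjust based on available GPUs; total GPU memory should be roughly 2x the size of the embeddings
)
client.start()
try:
    # `workflow` is any deduplication workflow constructed as shown earlier
    result = workflow.run()
finally:
    client.stop()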
```
For TB-scale datasets, consider distributed GPU clusters with Ray.
### ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
```python
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    kill_id_generator_actor,
    write_id_generator_to_disk,
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

create_id_generator_actor()

# ... run the pipeline or workflow that assigns document IDs here ...

# Persist the ID generator state, then shut the actor down
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use saved ID generator in removal workflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)
```
The ID Generator ensures consistent IDs across workflow stages.
## Next Steps
**Ready to use deduplication?**
- **New to deduplication**: Start with [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for the fastest approach
- **Need near-duplicate detection**: See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for MinHash-based matching
- **Require semantic matching**: Explore [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for meaning-based deduplication
**For hands-on guidance**: See [Text Curation Tutorials](/curate-text/tutorials) for step-by-step examples.