---
description: "Identify and remove near-duplicate documents using MinHash and LSH with GPU acceleration"
categories: ["how-to-guides"]
tags: ["fuzzy-dedup", "minhash", "lsh", "gpu", "ray"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "how-to"
modality: "text-only"
---

# Fuzzy Duplicate Removal

Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold, at scale, on GPUs.

For other approaches, refer to [Deduplication](/curate-text/process-data/deduplication).

## How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

1. Computes MinHash signatures over character n-grams
2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
3. Builds a graph of duplicate relationships
4. Identifies groups of near-duplicate documents

This approach is ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.

## Before You Start

**Prerequisites**:

- Ray cluster with GPU support (required for distributed processing)
- Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.

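For intuition about the four steps listed under How It Works, the following is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's implementation: the function names, the `blake2b`-based hashing, and the small `n=5`, 20 bands of 4 hashes settings are assumptions chosen so the example runs on short strings.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n):
    """Return the set of character n-grams of a document."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(seed, shingle):
    """64-bit hash of an n-gram; each seed stands in for one MinHash permutation."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_hashes):
    """Step 1: for every seed, keep the smallest hash over all n-grams."""
    return tuple(
        min(seeded_hash(seed, s) for s in shingles) for seed in range(num_hashes)
    )


def candidate_pairs(signatures, num_bands, rows_per_band):
    """Step 2: documents that agree on an entire band share a bucket."""
    buckets = defaultdict(list)
    for doc_id, signature in signatures.items():
        for band in range(num_bands):
            start = band * rows_per_band
            key = (band, signature[start:start + rows_per_band])
            buckets[key].append(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def duplicate_groups(doc_ids, edges):
    """Steps 3 and 4: treat pairs as graph edges, return connected components."""
    parent = {doc_id: doc_id for doc_id in doc_ids}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        parent[find(left)] = find(right)

    groups = defaultdict(set)
    for doc_id in doc_ids:
        groups[find(doc_id)].add(doc_id)
    # Singletons have no near-duplicates, so they are dropped.
    return [group for group in groups.values() if len(group) > 1]


def find_near_duplicates(docs, n=5, num_bands=20, rows_per_band=4):
    """Run all four steps over a {doc_id: text} mapping."""
    signatures = {
        doc_id: minhash_signature(char_ngrams(text, n), num_bands * rows_per_band)
        for doc_id, text in docs.items()
    }
    edges = candidate_pairs(signatures, num_bands, rows_per_band)
    return duplicate_groups(list(docs), edges)


docs = {
    "doc1": "The quick brown fox jumps over the lazy dog near the river bank.",
    "doc2": "The quick brown fox jumps over the lazy dog near the river bank!",
    "doc3": "0123456789 9876543210 0123456789 9876543210",
}
print(find_near_duplicates(docs))
```

Because `doc1` and `doc2` differ by a single character, they share almost all of their n-grams and are very likely to collide in at least one band, while `doc3` shares no n-grams with either. The GPU workflow performs the same logical steps on distributed data with much larger n-grams (`char_ngrams=24` by default).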
## Quick Start

The following example first identifies duplicates, then removes them:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
result = fuzzy_workflow.run()
# result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json"
)
result = removal_workflow.run()
# result.metadata contains: total_time, num_duplicates_removed
```

## Configuration

Configure fuzzy deduplication using these key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | str \| list[str] | None | Path(s) to input files or directories |
| `cache_path` | str | Required | Directory to cache intermediate results |
| `output_path` | str | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | str | "text" | Name of the text field in input data |
| `char_ngrams` | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | int | 20 | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | int | 13 | Number of hashes per LSH band |
| `bands_per_iteration` | int | 5 | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | int | 42 | Random seed for MinHash permutations |
| `input_filetype` | str | "parquet" | Input file format ("parquet" or "jsonl") |
| `input_blocksize` | str \| int | "1GiB" | Size of input blocks for processing |
| `lsh_num_output_partitions` | int \| None | None | Total number of partitions to write during the LSH shuffle. If `None`, the partition count is chosen automatically as the closest power of 2 <= the number of input tasks. |
| `lsh_rmm_pool_size` | int \| "auto" \| None | "auto" | Size of the RMM GPU memory pool in bytes for the LSH stage. `"auto"` sets the pool to 90% of free GPU memory. `None` sets the pool to 50% of free GPU memory and allows expansion. |
| `lsh_spill_memory_limit` | int \| "auto" \| None | "auto" | Device memory limit in bytes for spilling to host during the LSH stage. `"auto"` sets the limit to 80% of the RMM pool size. `None` disables spilling. |
| `perform_removal` | bool | False | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |

### Similarity Threshold

Control matching strictness with `num_bands` and `minhashes_per_band`:

- **More strict matching** (higher similarity required): Decrease `num_bands` or increase `minhashes_per_band`
- **Less strict matching** (lower similarity required): Increase `num_bands` or decrease `minhashes_per_band`

With standard LSH banding, a pair becomes a candidate when every hash in at least one band matches. More bands give a pair more chances to collide, while more hashes per band make each collision harder.

Default (`num_bands=20`, `minhashes_per_band=13`) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.

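To see how `num_bands` and `minhashes_per_band` interact, the sketch below evaluates the textbook LSH banding formulas. It assumes the standard scheme in which a pair becomes a candidate when all hashes in at least one band agree, and that each MinHash agrees with probability equal to the Jaccard similarity of the two documents. The helper names `candidate_probability` and `approximate_threshold` are illustrative and not part of NeMo Curator.

```python
def candidate_probability(similarity, num_bands, minhashes_per_band):
    """Probability that a pair with the given Jaccard similarity
    matches in at least one band: 1 - (1 - s^r)^b."""
    band_match = similarity ** minhashes_per_band
    return 1.0 - (1.0 - band_match) ** num_bands


def approximate_threshold(num_bands, minhashes_per_band):
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


# Settings used in this guide: the default plus two alternatives.
for bands, rows in [(20, 13), (25, 10), (15, 15)]:
    threshold = approximate_threshold(bands, rows)
    p_075 = candidate_probability(0.75, bands, rows)
    print(
        f"num_bands={bands}, minhashes_per_band={rows}: "
        f"threshold ~ {threshold:.2f}, P(candidate | s=0.75) = {p_075:.2f}"
    )
```

Under these assumptions the default configuration has an approximate threshold of 0.79, 25 bands of 10 hashes lowers it to about 0.72, and 15 bands of 15 hashes raises it to about 0.83. These are theoretical estimates; measured behavior still depends on the data.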
```python
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    num_bands=15,           # Fewer bands = stricter matching
    minhashes_per_band=15,  # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    num_bands=25,           # More bands = less strict matching
    minhashes_per_band=10,  # Fewer hashes per band = less strict matching
)
```

## Removing Duplicates

After identifying duplicates, use `TextDuplicatesRemovalWorkflow` to remove them:

```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json"  # Required if IDs were auto-assigned
)
result = removal_workflow.run()
```

**When IDs were auto-assigned**:

- `id_generator_path` is required
- Ensures consistent ID mapping between identification and removal stages

## Output Format

The fuzzy deduplication process produces the following directory structure:

```text
cache_path/
├── MinHashStage/                    # MinHash signatures
│   └── *.parquet
├── LSHStage/                        # LSH buckets
│   └── *.parquet
├── BucketsToEdges/                  # Graph edges
│   └── *.parquet
└── ConnectedComponents/             # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/               # Duplicate identification results
│   └── *.parquet                    # Parquet files with document IDs to remove
└── fuzzy_id_generator.json          # ID generator mapping (if IDs were auto-assigned)
```

### File Formats

The workflow produces these output files:

1. **Duplicate IDs** (`FuzzyDuplicateIds/*.parquet`):
   - Contains document IDs to remove
   - Format: Parquet files with column: `["_curator_dedup_id"]`
   - **Important**: Contains only the IDs of documents to remove, not the full document content

2. **ID Generator** (`fuzzy_id_generator.json`):
   - JSON file containing ID generator state
   - Required for removal workflow when IDs were auto-assigned
   - Ensures consistent ID mapping across workflow stages

3. **Cache Files** (`cache_path/`):
   - Intermediate results for debugging and analysis
   - Can be reused if re-running with different parameters
   - Clear cache between runs if parameters change significantly

**Performance characteristics**:

- GPU-accelerated MinHash and LSH operations
- Scales across multiple GPUs and nodes using Ray
- `bands_per_iteration` controls memory usage
- Intermediate results are cached for efficiency

**GPU requirements**:

- NVIDIA GPU with CUDA support
- Ray cluster with GPU workers

**Performance tuning**:

- **Memory**: Adjust `bands_per_iteration` (lower = less memory, more iterations)
- **GPU memory (LSH)**: Use `lsh_rmm_pool_size` to control GPU memory allocation and `lsh_spill_memory_limit` to tune host-spilling behavior during the LSH stage. Reducing the pool size or lowering the spill threshold can prevent out-of-memory errors on smaller GPUs.
- **Shuffle partitions**: Set `lsh_num_output_partitions` to control the number of output partitions during the LSH shuffle. More partitions reduce per-partition memory but increase I/O overhead.
- **Accuracy**: Use `char_ngrams >= 20` to reduce false positives
- **Best practices**: Clear cache between runs, use `input_blocksize="1GiB"`

**Note**: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as `bands_per_iteration`, `char_ngrams`, and `input_blocksize`.

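As a rough illustration of the memory tuning knobs above, the sketch below assembles a conservative set of keyword arguments for a smaller GPU. The parameter names come from the configuration table, but every value is a hypothetical starting point rather than a tuned recommendation, and `lsh_iterations` assumes the LSH stage simply processes `bands_per_iteration` bands per pass.

```python
import math

GIB = 1024 ** 3


def lsh_iterations(num_bands, bands_per_iteration):
    """Passes needed to cover all bands (assumes plain ceiling division)."""
    if num_bands <= 0 or bands_per_iteration <= 0:
        raise ValueError("num_bands and bands_per_iteration must be positive")
    return math.ceil(num_bands / bands_per_iteration)


# Hypothetical settings for a GPU with limited free memory.
pool_bytes = 8 * GIB
low_memory_kwargs = {
    "num_bands": 20,
    "minhashes_per_band": 13,
    "bands_per_iteration": 2,                     # fewer bands per pass, less memory
    "lsh_rmm_pool_size": pool_bytes,              # fixed pool instead of "auto"
    "lsh_spill_memory_limit": pool_bytes * 8 // 10,  # mirrors the 80% "auto" rule
    "lsh_num_output_partitions": 64,              # illustrative partition count
}

print(lsh_iterations(20, 5))  # default setting -> 4 passes
print(lsh_iterations(20, 2))  # low-memory setting -> 10 passes

# These keyword arguments could then be passed alongside the required paths:
# FuzzyDeduplicationWorkflow(input_path=..., cache_path=..., output_path=..., **low_memory_kwargs)
```

Lowering `bands_per_iteration` trades run time for memory: each pass handles fewer bands, so more passes are needed to cover all of them.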
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the [Deduplication overview](/curate-text/process-data/deduplication).