---
description: "Comprehensive overview of deduplication techniques across text, image, and video modalities including exact, fuzzy, and semantic approaches"
categories: ["concepts-architecture"]
tags: ["deduplication", "exact-dedup", "fuzzy-dedup", "semantic-dedup", "multimodal", "gpu-accelerated"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "multimodal"
---

# Deduplication Concepts

This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.

## Overview

Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides deduplication capabilities that work across text, image, and video modalities.

Removing duplicates offers several benefits:

- **Improved Training Efficiency**: Prevents overrepresentation of repeated content
- **Reduced Dataset Size**: Significantly reduces storage and processing requirements
- **Better Model Performance**: Eliminates redundant examples that can bias training

## Deduplication Approaches

NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:

### Exact Deduplication

- **Method**: Hash-based matching (MD5)
- **Best For**: Identical copies and character-for-character matches
- **Speed**: Very fast
- **Scale**: Unlimited size
- **GPU Required**: Yes (for distributed processing)

Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.

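The hash-based idea described above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not NeMo Curator's implementation: `exact_dedup` is a hypothetical helper, and it assumes the documents are already loaded as strings in memory.

```python
import hashlib


def exact_dedup(documents):
    """Keep the first occurrence of each distinct text, compared by MD5 digest.

    Hypothetical helper for illustration only.
    """
    seen = set()
    unique = []
    for text in documents:
        # Identical strings always produce identical digests
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["the cat sat", "a dog ran", "the cat sat", "The cat sat"]
deduped = exact_dedup(docs)
# deduped == ["the cat sat", "a dog ran", "The cat sat"]
```

Note that `"The cat sat"` survives because it differs by one character from `"the cat sat"`. Catching variations like this is the job of the fuzzy and semantic approaches.
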
**Modalities Supported**: Text

### Fuzzy Deduplication

- **Method**: MinHash and Locality-Sensitive Hashing (LSH)
- **Best For**: Near-duplicates with minor changes (reformatting, small edits)
- **Speed**: Fast
- **Scale**: Up to petabyte scale
- **GPU Required**: Yes

Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.

**Modalities Supported**: Text

### Semantic Deduplication

- **Method**: Embedding-based similarity using neural networks
- **Best For**: Content with similar meaning but different expression
- **Speed**: Moderate (due to embedding generation step)
- **Scale**: Up to terabyte scale
- **GPU Required**: Yes

Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.

**Modalities Supported**: Text, Image, Video

## Multimodal Applications

### Text Deduplication

Text deduplication is the most mature implementation, offering all three approaches:

- **Exact**: Remove identical documents using MD5 hashing
- **Fuzzy**: Remove near-duplicates using MinHash and LSH similarity
- **Semantic**: Remove semantically similar content using embeddings

Text deduplication can handle web-scale datasets and is commonly used for:

- Web crawl data (Common Crawl)
- Academic papers (ArXiv)
- Code repositories
- General text corpora

### Video Deduplication

Video deduplication uses the semantic deduplication workflow with video embeddings:

- **Semantic Clustering**: Uses the general K-means clustering workflow on video embeddings
- **Pairwise Similarity**: Computes within-cluster similarity using the semantic deduplication pipeline
- **Representative Selection**: Leverages the semantic workflow to identify and remove redundant content

Video deduplication is particularly effective for:

- Educational content with similar presentations
- News clips covering the same events
- Entertainment content with repeated segments

### Image Deduplication

Semantic duplicates are images that contain almost the same information content, but are perceptually different. Curator deduplicates images in four steps:

- **Generate Embeddings**: Generate CLIP embeddings for images
- **Convert to Text**: Convert the `ImageBatch` embeddings to `DocumentBatch` objects
- **Identify Semantic Duplicates**: Run the text-based semantic deduplication workflow and save the results
- **Remove Duplicates**: Read back the data and remove the identified duplicates

## Architecture and Performance

### Distributed Processing

All deduplication workflows leverage distributed computing frameworks:

- **Ray Backend**: Provides scalable distributed processing
- **GPU Acceleration**: Essential for embedding generation and similarity computation
- **Memory Optimization**: Streaming processing for large datasets

### Scalability Characteristics

| Method | Dataset Size | Memory Requirements | Processing Time |
| --- | --- | --- | --- |
| Exact | Unlimited | Low (hash storage) | Linear with data size |
| Fuzzy | Petabyte-scale | Moderate (LSH tables) | Sub-linear with LSH |
| Semantic | Terabyte-scale | High (embeddings) | Depends on model inference |

## Implementation Patterns

### Workflow-Based Processing

NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:

```python
# Text-based workflow for identifying exact duplicates
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text-based workflow for identifying fuzzy duplicates
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text-based workflow for identifying (and optionally removing) semantic duplicates
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Text-based workflow for removing identified duplicates
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
```

### Stage-Based Processing

For fine-grained control, individual stages can be composed into custom pipelines:

```python
# Semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
```

## Integration with Pipeline Architecture

Deduplication workflows should be run separately from traditional pipelines in Curator. Other Curator modules are purely map-style operations that process one chunk of the data at a time, whereas deduplication workflows must identify duplicates across the entire dataset. This requires logic outside of Curator's map-style functions, so the deduplication modules are implemented as separate workflows.

Each deduplication workflow takes input and output file path parameters, which makes it more self-contained than other Curator modules. Workflows read and write JSONL or Parquet files, so they are compatible with Curator's traditional pipeline-based read and write stages. You can therefore write the intermediate results of a pipeline to disk, use them as input to a deduplication workflow, then read the deduplicated output and resume curation.

As a high-level example:

```python
# Import paths for Pipeline, JsonlReader, and JsonlWriter are assumed here
# and may differ between Curator versions
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter

# Define first pipeline
pipeline_1 = Pipeline(name="text_curation_1", description="...")
# Read input data
pipeline_1.add_stage(JsonlReader(...))
# Add more stages like heuristic filters, etc.
pipeline_1.add_stage(...)
# Save intermediate results to JSONL
pipeline_1.add_stage(JsonlWriter(...))
pipeline_1.run()

# Create and run semantic deduplication workflow
workflow = SemanticDeduplicationWorkflow(...)
workflow.run()

# Define second pipeline
pipeline_2 = Pipeline(name="text_curation_2", description="...")
# Read deduplicated data
pipeline_2.add_stage(JsonlReader(...))
# Add more stages like classifiers, etc.
pipeline_2.add_stage(...)
# Save final results to JSONL
pipeline_2.add_stage(JsonlWriter(...))
pipeline_2.run()
```
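
To make the map-style limitation described in this section concrete, the toy sketch below compares deduplicating each chunk independently with deduplicating across all chunks. It is plain Python with hypothetical helper names (`dedup_chunk`, `dedup_global`) and does not use any Curator API. It only illustrates why a stage that sees one chunk at a time cannot catch duplicates that straddle chunks.

```python
def dedup_chunk(chunk):
    """Map-style dedup: only sees the records inside one chunk."""
    seen = set()
    kept = []
    for text in chunk:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def dedup_global(chunks):
    """Global dedup: shares one 'seen' set across every chunk."""
    seen = set()
    kept = []
    for chunk in chunks:
        for text in chunk:
            if text not in seen:
                seen.add(text)
                kept.append(text)
    return kept


# "beta" appears once in each chunk, so no single chunk contains a duplicate
chunks = [["alpha", "beta"], ["beta", "gamma"]]

per_chunk = [text for chunk in chunks for text in dedup_chunk(chunk)]
global_result = dedup_global(chunks)
# per_chunk     == ["alpha", "beta", "beta", "gamma"]  (cross-chunk duplicate survives)
# global_result == ["alpha", "beta", "gamma"]
```

The shared state in `dedup_global` is what a map-style stage lacks, which is why the deduplication step sits between two pipelines in the example above instead of inside one.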