--- name: fiftyone-find-duplicates description: Finds duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation. Use when deduplicating datasets, finding similar images, or removing redundant samples. --- # Find Duplicates in FiftyOne Datasets ## Key Directives **ALWAYS follow these rules:** ### 1. Set context first ```python set_context(dataset_name="my-dataset") ``` ### 2. Launch FiftyOne App Brain operators are delegated and require the app: ```python launch_app() ``` Wait 5-10 seconds for initialization. ### 3. Discover operators dynamically ```python # List all brain operators list_operators(builtin_only=False) # Get schema for specific operator get_operator_schema(operator_uri="@voxel51/brain/compute_similarity") ``` ### 4. Compute embeddings before finding duplicates ```python execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={"brain_key": "img_sim", "model": "mobilenet-v2-imagenet-torch"} ) ``` ### 5. Close app when done ```python close_app() ``` ## Complete Workflow ### Step 1: Setup ```python # Set context set_context(dataset_name="my-dataset") # Launch app (required for brain operators) launch_app() ``` ### Step 2: Verify Brain Plugin ```python # Check if brain plugin is available list_plugins(enabled=True) # If not installed: download_plugin( url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/brain"] ) enable_plugin(plugin_name="@voxel51/brain") ``` ### Step 3: Discover Brain Operators ```python # List all available operators list_operators(builtin_only=False) # Get schema for compute_similarity get_operator_schema(operator_uri="@voxel51/brain/compute_similarity") # Get schema for find_duplicates get_operator_schema(operator_uri="@voxel51/brain/find_duplicates") ``` ### Step 4: Compute Similarity ```python # Execute operator to compute embeddings execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "img_duplicates", "model": "mobilenet-v2-imagenet-torch" } ) ``` ### Step 5: Find Near Duplicates ```python execute_operator( operator_uri="@voxel51/brain/find_near_duplicates", params={ "similarity_index": "img_duplicates", "threshold": 0.3 } ) ``` **Threshold guidelines (distance-based, lower = more similar):** - `0.1` = Very similar (near-exact duplicates) - `0.3` = Near duplicates (recommended default) - `0.5` = Similar images - `0.7` = Loosely similar This operator creates two saved views automatically: - `near duplicates`: all samples that are near duplicates - `representatives of near duplicates`: one representative from each group ### Step 6: View Duplicates in App After finding duplicates, use `set_view` to display them in the FiftyOne App: **Option A: Filter by near_dup_id field** ```python # Show all samples that have a near_dup_id (all duplicates) set_view(exists=["near_dup_id"]) ``` **Option B: Show specific duplicate group** ```python # Show samples with a specific duplicate group ID set_view(filters={"near_dup_id": 1}) ``` **Option C: Load saved view (if available)** ```python # Load the automatically created saved view set_view(view_name="near duplicates") ``` **Option D: Clear filter to show all samples** ```python clear_view() ``` The `find_near_duplicates` operator adds a `near_dup_id` field to samples. Samples with the same ID are duplicates of each other. ### Step 7: Delete Duplicates **Option A: Use deduplicate operator (keeps one representative per group)** ```python execute_operator( operator_uri="@voxel51/brain/deduplicate_near_duplicates", params={} ) ``` **Option B: Manual deletion from App UI** 1. Use `set_view(exists=["near_dup_id"])` to show duplicates 2. Review samples in the App at http://localhost:5151/ 3. Select samples to delete 4. Use the delete action in the App ### Step 8: Clean Up ```python close_app() ``` ## Available Tools ### Session View Tools | Tool | Description | |------|-------------| | `set_view(exists=[...])` | Filter samples where field(s) have non-None values | | `set_view(filters={...})` | Filter samples by exact field values | | `set_view(tags=[...])` | Filter samples by tags | | `set_view(sample_ids=[...])` | Select specific sample IDs | | `set_view(view_name="...")` | Load a saved view by name | | `clear_view()` | Clear filters, show all samples | ### Brain Operators for Duplicates Use `list_operators()` to discover and `get_operator_schema()` to see parameters: | Operator | Description | |----------|-------------| | `@voxel51/brain/compute_similarity` | Compute embeddings and similarity index | | `@voxel51/brain/find_near_duplicates` | Find near-duplicate samples | | `@voxel51/brain/deduplicate_near_duplicates` | Delete duplicates, keep representatives | | `@voxel51/brain/find_exact_duplicates` | Find exact duplicate media files | | `@voxel51/brain/deduplicate_exact_duplicates` | Delete exact duplicates | | `@voxel51/brain/compute_uniqueness` | Compute uniqueness scores | ## Common Use Cases ### Use Case 1: Remove Exact Duplicates For accidentally duplicated files (identical bytes): ```python set_context(dataset_name="my-dataset") launch_app() execute_operator( operator_uri="@voxel51/brain/find_exact_duplicates", params={} ) execute_operator( operator_uri="@voxel51/brain/deduplicate_exact_duplicates", params={} ) close_app() ``` ### Use Case 2: Find and Review Near Duplicates For visually similar but not identical images: ```python set_context(dataset_name="my-dataset") launch_app() # Compute embeddings execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={"brain_key": "near_dups", "model": "mobilenet-v2-imagenet-torch"} ) # Find duplicates execute_operator( operator_uri="@voxel51/brain/find_near_duplicates", params={"similarity_index": "near_dups", "threshold": 0.3} ) # View duplicates in the App set_view(exists=["near_dup_id"]) # After review, deduplicate execute_operator( operator_uri="@voxel51/brain/deduplicate_near_duplicates", params={} ) # Clear view and close clear_view() close_app() ``` ### Use Case 3: Sort by Similarity Find images similar to a specific sample: ```python set_context(dataset_name="my-dataset") launch_app() execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={"brain_key": "search"} ) execute_operator( operator_uri="@voxel51/brain/sort_by_similarity", params={ "brain_key": "search", "query_id": "sample_id_here", "k": 20 } ) close_app() ``` ## Troubleshooting **Error: "No executor available"** - Cause: Delegated operators require the App executor for UI triggers - Solution: Direct user to App UI to view results and complete deletion manually - Affected operators: `find_near_duplicates`, `deduplicate_near_duplicates` **Error: "Brain key not found"** - Cause: Embeddings not computed - Solution: Run `compute_similarity` first with a `brain_key` **Error: "Operator not found"** - Cause: Brain plugin not installed - Solution: Install with `download_plugin()` and `enable_plugin()` **Error: "Missing dependency" (e.g., torch, tensorflow)** - The MCP server detects missing dependencies automatically - Response includes `missing_package` and `install_command` - Example response: ```json { "error_type": "missing_dependency", "missing_package": "torch", "install_command": "pip install torch" } ``` - Offer to run the install command for the user - After installation, restart MCP server and retry **Similarity computation is slow** - Use faster model: `mobilenet-v2-imagenet-torch` - Use GPU if available - Process large datasets in batches ## Best Practices 1. **Discover dynamically** - Use `list_operators()` and `get_operator_schema()` to get current operator names and parameters 2. **Start with default threshold** (0.3) and adjust as needed 3. **Review before deleting** - Direct user to App to inspect duplicates 4. **Store embeddings** - Reuse for multiple operations via `brain_key` 5. **Handle executor errors gracefully** - Guide user to App UI when needed ## Performance Notes **Embedding computation time:** - 1,000 images: ~1-2 minutes - 10,000 images: ~10-15 minutes - 100,000 images: ~1-2 hours **Memory requirements:** - ~2KB per image for embeddings - ~4-8KB per image for similarity index ## Resources - [FiftyOne Brain Documentation](https://docs.voxel51.com/user_guide/brain.html) - [Brain Plugin Source](https://github.com/voxel51/fiftyone-plugins/tree/main/plugins/brain)