--- name: fiftyone-embeddings-visualization description: Visualizes datasets in 2D using embeddings with UMAP or t-SNE dimensionality reduction. Use when exploring dataset structure, finding clusters, identifying outliers, or understanding data distribution. --- # Embeddings Visualization in FiftyOne ## Key Directives **ALWAYS follow these rules:** ### 1. Set context first ```python set_context(dataset_name="my-dataset") ``` ### 2. Launch FiftyOne App Brain operators are delegated and require the app: ```python launch_app() ``` Wait 5-10 seconds for initialization. ### 3. Discover operators dynamically ```python # List all brain operators list_operators(builtin_only=False) # Get schema for specific operator get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") ``` ### 4. Compute embeddings before visualization Embeddings are required for dimensionality reduction: ```python execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "img_sim", "model": "clip-vit-base32-torch", "embeddings": "clip_embeddings", "backend": "sklearn", "metric": "cosine" } ) ``` ### 5. Close app when done ```python close_app() ``` ## Complete Workflow ### Step 1: Setup ```python # Set context set_context(dataset_name="my-dataset") # Launch app (required for brain operators) launch_app() ``` ### Step 2: Verify Brain Plugin ```python # Check if brain plugin is available list_plugins(enabled=True) # If not installed: download_plugin( url_or_repo="voxel51/fiftyone-plugins", plugin_names=["@voxel51/brain"] ) enable_plugin(plugin_name="@voxel51/brain") ``` ### Step 3: Discover Brain Operators ```python # List all available operators list_operators(builtin_only=False) # Get schema for compute_visualization get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") ``` ### Step 4: Check for Existing Embeddings or Compute New Ones First, check if the dataset already has embeddings by looking at the operator schema: ```python get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # Look for existing embeddings fields in the "embeddings" choices # (e.g., "clip_embeddings", "dinov2_embeddings") ``` **If embeddings exist:** Skip to Step 5 and use the existing embeddings field. **If no embeddings exist:** Compute them: ```python execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "img_viz", "model": "clip-vit-base32-torch", "embeddings": "clip_embeddings", # Field name to store embeddings "backend": "sklearn", "metric": "cosine" } ) ``` **Required parameters for compute_similarity:** - `brain_key` - Unique identifier for this brain run - `model` - Model from FiftyOne Model Zoo to generate embeddings - `embeddings` - Field name where embeddings will be stored - `backend` - Similarity backend (use `"sklearn"`) - `metric` - Distance metric (use `"cosine"` or `"euclidean"`) **Recommended embedding models:** - `clip-vit-base32-torch` - Best for general visual + semantic similarity - `dinov2-vits14-torch` - Best for visual similarity only - `resnet50-imagenet-torch` - Classic CNN features - `mobilenet-v2-imagenet-torch` - Fast, lightweight option ### Step 5: Compute 2D Visualization Use existing embeddings field OR the brain_key from Step 4: ```python # Option A: Use existing embeddings field (e.g., clip_embeddings) execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "img_viz", "embeddings": "clip_embeddings", # Use existing field "method": "umap", "num_dims": 2 } ) # Option B: Use brain_key from compute_similarity execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "img_viz", # Same key used in compute_similarity "method": "umap", "num_dims": 2 } ) ``` **Dimensionality reduction methods:** - `umap` - (Recommended) Preserves local and global structure, faster. Requires `umap-learn` package. - `tsne` - Better local structure, slower on large datasets. No extra dependencies. - `pca` - Linear reduction, fastest but less informative ### Step 6: Direct User to Embeddings Panel After computing visualization, direct the user to open the FiftyOne App at http://localhost:5151/ and: 1. Click the **Embeddings** panel icon (scatter plot icon, looks like a grid of dots) in the top toolbar 2. Select the brain key (e.g., `img_viz`) from the dropdown 3. Points represent samples in 2D embedding space 4. Use the **"Color by"** dropdown to color points by a field (e.g., `ground_truth`, `predictions`) 5. Click points to select samples, use lasso tool to select groups **IMPORTANT:** Do NOT use `set_view(exists=["brain_key"])` - this filters samples and is not needed for visualization. The Embeddings panel automatically shows all samples with computed coordinates. ### Step 7: Explore and Filter (Optional) To filter samples while viewing in the Embeddings panel: ```python # Filter to specific class set_view(filters={"ground_truth.label": "dog"}) # Filter by tag set_view(tags=["validated"]) # Clear filter to show all clear_view() ``` These filters will update the Embeddings panel to show only matching samples. ### Step 8: Find Outliers Outliers appear as isolated points far from clusters: ```python # Compute uniqueness scores (higher = more unique/outlier) execute_operator( operator_uri="@voxel51/brain/compute_uniqueness", params={ "brain_key": "img_viz" } ) # View most unique samples (potential outliers) set_view(sort_by="uniqueness", reverse=True, limit=50) ``` ### Step 9: Find Clusters Use the App's Embeddings panel to visually identify clusters, then: **Option A: Lasso selection in App** 1. Use lasso tool to select a cluster 2. Selected samples are highlighted 3. Tag or export selected samples **Option B: Use similarity to find cluster members** ```python # Sort by similarity to a representative sample execute_operator( operator_uri="@voxel51/brain/sort_by_similarity", params={ "brain_key": "img_viz", "query_id": "sample_id_from_cluster", "k": 100 } ) ``` ### Step 10: Clean Up ```python close_app() ``` ## Available Tools ### Session View Tools | Tool | Description | |------|-------------| | `set_view(filters={...})` | Filter samples by field values | | `set_view(tags=[...])` | Filter samples by tags | | `set_view(sort_by="...", reverse=True)` | Sort samples by field | | `set_view(limit=N)` | Limit to N samples | | `clear_view()` | Clear filters, show all samples | ### Brain Operators for Visualization Use `list_operators()` to discover and `get_operator_schema()` to see parameters: | Operator | Description | |----------|-------------| | `@voxel51/brain/compute_similarity` | Compute embeddings and similarity index | | `@voxel51/brain/compute_visualization` | Reduce embeddings to 2D/3D for visualization | | `@voxel51/brain/compute_uniqueness` | Score samples by uniqueness (outlier detection) | | `@voxel51/brain/sort_by_similarity` | Sort by similarity to a query sample | ## Common Use Cases ### Use Case 1: Basic Dataset Exploration Visualize dataset structure and explore clusters: ```python set_context(dataset_name="my-dataset") launch_app() # Check for existing embeddings in schema get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # If embeddings exist (e.g., clip_embeddings), use them directly: execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "exploration", "embeddings": "clip_embeddings", "method": "umap", # or "tsne" if umap-learn not installed "num_dims": 2 } ) # Direct user to App Embeddings panel at http://localhost:5151/ # 1. Click Embeddings panel icon # 2. Select "exploration" from dropdown # 3. Use "Color by" to color by ground_truth or predictions ``` ### Use Case 2: Find Outliers in Dataset Identify anomalous or mislabeled samples: ```python set_context(dataset_name="my-dataset") launch_app() # Check for existing embeddings in schema get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # If no embeddings exist, compute them: execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "outliers", "model": "clip-vit-base32-torch", "embeddings": "clip_embeddings", "backend": "sklearn", "metric": "cosine" } ) # Compute uniqueness scores execute_operator( operator_uri="@voxel51/brain/compute_uniqueness", params={"brain_key": "outliers"} ) # Generate visualization (use existing embeddings field or brain_key) execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "outliers", "embeddings": "clip_embeddings", # Use existing field if available "method": "umap", # or "tsne" if umap-learn not installed "num_dims": 2 } ) # Direct user to App at http://localhost:5151/ # 1. Click Embeddings panel icon # 2. Select "outliers" from dropdown # 3. Outliers appear as isolated points far from clusters # 4. Optionally sort by uniqueness field in the App sidebar ``` ### Use Case 3: Compare Classes in Embedding Space See how different classes cluster: ```python set_context(dataset_name="my-dataset") launch_app() # Check for existing embeddings in schema get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # If no embeddings exist, compute them: execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "class_viz", "model": "clip-vit-base32-torch", "embeddings": "clip_embeddings", "backend": "sklearn", "metric": "cosine" } ) # Generate visualization (use existing embeddings field or brain_key) execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "class_viz", "embeddings": "clip_embeddings", # Use existing field if available "method": "umap", # or "tsne" if umap-learn not installed "num_dims": 2 } ) # Direct user to App at http://localhost:5151/ # 1. Click Embeddings panel icon # 2. Select "class_viz" from dropdown # 3. Use "Color by" dropdown to color by ground_truth or predictions # Look for: # - Well-separated clusters = good class distinction # - Overlapping clusters = similar classes or confusion # - Scattered points = high variance within class ``` ### Use Case 4: Analyze Model Predictions Compare ground truth vs predictions in embedding space: ```python set_context(dataset_name="my-dataset") launch_app() # Check for existing embeddings in schema get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # If no embeddings exist, compute them: execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "pred_analysis", "model": "clip-vit-base32-torch", "embeddings": "clip_embeddings", "backend": "sklearn", "metric": "cosine" } ) # Generate visualization (use existing embeddings field or brain_key) execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "pred_analysis", "embeddings": "clip_embeddings", # Use existing field if available "method": "umap", # or "tsne" if umap-learn not installed "num_dims": 2 } ) # Direct user to App at http://localhost:5151/ # 1. Click Embeddings panel icon # 2. Select "pred_analysis" from dropdown # 3. Color by ground_truth - see true class distribution # 4. Color by predictions - see model's view # 5. Look for mismatches to find errors ``` ### Use Case 5: t-SNE for Publication-Quality Plots Use t-SNE for better local structure (no extra dependencies): ```python set_context(dataset_name="my-dataset") launch_app() # Check for existing embeddings in schema get_operator_schema(operator_uri="@voxel51/brain/compute_visualization") # If no embeddings exist, compute them (DINOv2 for visual similarity): execute_operator( operator_uri="@voxel51/brain/compute_similarity", params={ "brain_key": "tsne_viz", "model": "dinov2-vits14-torch", "embeddings": "dinov2_embeddings", "backend": "sklearn", "metric": "cosine" } ) # Generate t-SNE visualization (no umap-learn dependency needed) execute_operator( operator_uri="@voxel51/brain/compute_visualization", params={ "brain_key": "tsne_viz", "embeddings": "dinov2_embeddings", # Use existing field if available "method": "tsne", "num_dims": 2 } ) # Direct user to App at http://localhost:5151/ # 1. Click Embeddings panel icon # 2. Select "tsne_viz" from dropdown # 3. t-SNE provides better local cluster structure than UMAP ``` ## Troubleshooting **Error: "No executor available"** - Cause: Delegated operators require the App executor - Solution: Ensure `launch_app()` was called and wait 5-10 seconds **Error: "Brain key not found"** - Cause: Embeddings not computed - Solution: Run `compute_similarity` first with a `brain_key` **Error: "Operator not found"** - Cause: Brain plugin not installed - Solution: Install with `download_plugin()` and `enable_plugin()` **Error: "You must install the `umap-learn>=0.5` package"** - Cause: UMAP method requires the `umap-learn` package - Solutions: 1. **Install umap-learn**: Ask user if they want to run `pip install umap-learn` 2. **Use t-SNE instead**: Change `method` to `"tsne"` (no extra dependencies) 3. **Use PCA instead**: Change `method` to `"pca"` (fastest, no extra dependencies) - After installing umap-learn, restart Claude Code/MCP server and retry **Visualization is slow** - Use UMAP instead of t-SNE for large datasets - Use faster embedding model: `mobilenet-v2-imagenet-torch` - Process subset first: `set_view(limit=1000)` **Embeddings panel not showing** - Ensure visualization was computed (not just embeddings) - Check brain_key matches in both compute_similarity and compute_visualization - Refresh the App page **Points not colored correctly** - Verify the field exists on samples - Check field type is compatible (Classification, Detections, or string) ## Best Practices 1. **Discover dynamically** - Use `list_operators()` and `get_operator_schema()` to get current operator names and parameters 2. **Choose the right model** - CLIP for semantic similarity, DINOv2 for visual similarity 3. **Start with UMAP** - Faster and often better than t-SNE for exploration 4. **Use uniqueness for outliers** - More reliable than visual inspection alone 5. **Store embeddings** - Reuse for multiple visualizations via `brain_key` 6. **Subset large datasets** - Compute on subset first, then full dataset ## Performance Notes **Embedding computation time:** - 1,000 images: ~1-2 minutes - 10,000 images: ~10-15 minutes - 100,000 images: ~1-2 hours **Visualization computation time:** - UMAP: ~30 seconds for 10,000 samples - t-SNE: ~5-10 minutes for 10,000 samples - PCA: ~5 seconds for 10,000 samples **Memory requirements:** - ~2KB per image for embeddings - ~16 bytes per image for 2D coordinates ## Resources - [FiftyOne Brain Documentation](https://docs.voxel51.com/user_guide/brain.html) - [Visualizing Embeddings Guide](https://docs.voxel51.com/user_guide/embeddings.html) - [Brain Plugin Source](https://github.com/voxel51/fiftyone-plugins/tree/main/plugins/brain)