{ "cells": [ { "cell_type": "markdown", "id": "674e0a70", "metadata": {}, "source": [ "# Clustering and Labeling with Embeddings\n", "\n", "In this notebook, we will explore how to use **embeddings** to cluster and assign labels to dataset samples.\n", "This technique is valuable when dealing with unlabeled datasets or when refining existing labels.\n", "\n", "![clustering](https://cdn.voxel51.com/getting_started_manufacturing/notebook3/clustering.webp)\n", "\n", "## Learning Objectives:\n", "- Understand clustering techniques and their role in dataset labeling.\n", "- Compute embeddings and cluster dataset samples.\n", "- Use a FiftyOne plugin to cluster the MVTec AD dataset.\n", "- Use FiftyOne for visualization and dataset management.\n", "\n" ] }, { "cell_type": "markdown", "id": "0b1b43a1", "metadata": {}, "source": [ "\n", "## Why Use Clustering for Labeling?\n", "\n", "Labeling datasets manually is expensive and time-consuming. **Clustering** helps by automatically grouping similar samples based on embeddings.\n", "This method is useful for:\n", "- **Unsupervised Learning**: Automatically discovering patterns in unlabeled data.\n", "- **Dataset Cleanup**: Detecting mislabeled samples.\n", "- **Efficient Annotation**: Pre-labeling groups for human annotators.\n", "\n", "Common clustering techniques include:\n", "- **K-Means** (partition-based clustering)\n", "- **DBSCAN** (density-based clustering)\n", "- **Hierarchical Clustering** (tree-based grouping)\n", "\n", "**Relevant Documentation:** \n", "- [Clustering in Machine Learning](https://en.wikipedia.org/wiki/Cluster_analysis)\n", "- [Introduction to Unsupervised Learning](https://towardsdatascience.com/a-guide-to-clustering-algorithms-e28af85da0b7/)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required Step: Remove Category Labels (Keep Only Defects)\n", "\n", "Since category labels in your dataset 
are stored as `category.label`, you can remove them as shown in the following cell. This clears `category.label` on every sample while keeping `defect.label`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import fiftyone.utils.huggingface as fouh # Hugging Face integration\n", "\n", "# Define the new dataset name\n", "dataset_name = \"MVTec_AD_nc\"\n", "\n", "# Check if the dataset exists\n", "if dataset_name in fo.list_datasets():\n", "    print(f\"Dataset '{dataset_name}' exists. Loading...\")\n", "    dataset = fo.load_dataset(dataset_name)\n", "else:\n", "    print(f\"Dataset '{dataset_name}' does not exist. Creating a new one...\")\n", "    # Clone the dataset with a new name and make it persistent\n", "    # Alternative source: fouh.load_from_hub(\"Voxel51/mvtec-ad\", persistent=True, overwrite=True)\n", "    dataset_ = fo.load_dataset(\"MVTec_AD\")\n", "    dataset = dataset_.clone(dataset_name, persistent=True)\n", "\n", "    # Iterate over samples and remove category labels\n", "    for sample in dataset:\n", "        sample[\"category\"] = None # Remove category classification\n", "        sample.save()\n", "\n", "    print(\"✅ All category labels removed. 
Only defect labels remain.\")\n", "\n", "print(dataset)\n", "print(dataset.last())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Imports used to compute embeddings for MVTec AD with CLIP\n", "import fiftyone as fo\n", "import fiftyone.brain as fob\n", "import fiftyone.zoo.models as fozm\n", "import fiftyone.utils.huggingface as fouh # Hugging Face integration\n", "\n", "# List all datasets\n", "datasets = fo.list_datasets()\n", "\n", "# Print datasets\n", "print(datasets)\n" ] }, { "cell_type": "markdown", "id": "7784cd5a", "metadata": {}, "source": [ "\n", "## Generating Embeddings\n", "\n", "Before clustering, we must **generate embeddings** that capture meaningful feature representations.\n", "FiftyOne provides multiple ways to obtain embeddings:\n", "- **Using pre-trained models** from FiftyOne's Model Zoo (e.g., CLIP, ResNet).\n", "- **Extracting embeddings from custom models** (e.g., OpenAI, Anomalib, or self-trained networks)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiftyone.zoo as foz\n", "\n", "model = foz.load_zoo_model(\"clip-vit-base32-torch\")\n", "\n", "# Compute embeddings and store them in the `mvtec_emb_nc` sample field\n", "dataset.compute_embeddings(model, embeddings_field=\"mvtec_emb_nc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Relevant Documentation:** [Computing Embeddings in FiftyOne](https://voxel51.com/docs/fiftyone/user_guide/brain.html#computing-embeddings)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fiftyone.brain as fob\n", "import fiftyone.zoo.models as fozm\n", "\n", "# Load a pretrained model (e.g., CLIP)\n", "model = fozm.load_zoo_model(\"clip-vit-base32-torch\")\n", "\n", "fob.compute_visualization(\n", "    dataset,\n", "    model=model,\n", "    embeddings=\"mvtec_emb_nc\",\n", "    brain_key=\"mvtec_embeddings_nc\",\n", "    method=\"umap\", # Change to \"tsne\" for t-SNE\n", "    num_dims=2 # Reduce to 
2D\n", ")\n", "\n", "print(\"✅ Embeddings computed.\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.reload()\n", "print(dataset)\n", "print(dataset.last())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = fo.launch_app(dataset, port=5153, auto=False)" ] }, { "cell_type": "markdown", "id": "02284e08", "metadata": {}, "source": [ "\n", "## Clustering Dataset Samples with a FiftyOne Plugin\n", "\n", "Once we have embeddings, we can **cluster dataset samples** based on similarity. The theory is not complex, but running and tuning the clustering can take some time. For example, to cluster samples by similarity yourself, you might write code like the following:\n", "\n", "Example using **K-Means Clustering**:\n", "```python\n", "import numpy as np\n", "from sklearn.cluster import KMeans\n", "\n", "# Load the embeddings stored in the `mvtec_emb_nc` field (one vector per sample)\n", "embeddings = np.array(dataset.values(\"mvtec_emb_nc\"))\n", "\n", "num_clusters = 5\n", "kmeans = KMeans(n_clusters=num_clusters, random_state=42)\n", "cluster_labels = kmeans.fit_predict(embeddings)\n", "```\n", "\n", "After clustering, we can assign labels to our dataset:\n", "\n", "```python\n", "import fiftyone as fo\n", "\n", "# Assign clustering results to FiftyOne dataset\n", "for sample, label in zip(dataset, cluster_labels):\n", "    sample[\"cluster\"] = int(label) # Store as an integer field\n", "    sample.save()\n", "```\n", "\n", "## But What If You Don't Need to Write Any Code?\n", "\n", "FiftyOne provides an extensible plugin system that allows users to enhance their workflows with additional functionality, such as dataset visualization, embeddings analysis, and automated clustering. A FiftyOne Plugin is a modular component that extends the capabilities of the FiftyOne UI or backend. 
These plugins can be used for various tasks, including visualization, dataset processing, and machine learning model integration.\n", "\n", "#### Clustering Plugin in FiftyOne\n", "One powerful plugin available in FiftyOne is the Clustering Plugin, which allows users to group similar samples together based on embeddings. This is useful for tasks like:\n", "- Understanding patterns in your dataset\n", "- Identifying redundant or mislabeled data\n", "- Grouping anomalies or outliers\n", "\n", "**Relevant Documentation:** \n", "- [Clustering Algorithms](https://en.wikipedia.org/wiki/Cluster_analysis) \n", "- [FiftyOne Dataset Fields](https://docs.voxel51.com/user_guide/using_datasets.html#fields)\n", "- [FiftyOne Clustering Plugin](https://github.com/jacobmarks/clustering-plugin)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install the Clustering Plugin\n", "\n", "The plugin lets you cluster your dataset from the FiftyOne App using a variety of algorithms: [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans), [Birch](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html#sklearn.cluster.Birch), [Agglomerative](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering), [HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). 
It also serves as a proof of concept for adding new \"types\" of runs to FiftyOne.\n", "\n", "You will also need to have `scikit-learn` installed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!fiftyone plugins download https://github.com/jacobmarks/clustering-plugin\n", "!pip install -U scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = fo.launch_app(dataset, port=5153, auto=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![clustering_video](https://cdn.voxel51.com/getting_started_manufacturing/notebook3/clustering_video.webp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.reload()\n", "print(dataset)\n", "print(dataset.last())" ] }, { "cell_type": "markdown", "id": "c994105a", "metadata": {}, "source": [ "### Next Steps:\n", "Try experimenting with different clustering methods (e.g., HDBSCAN, Birch clustering) and evaluate their impact on labeling quality! 🚀\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "manu_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.17" } }, "nbformat": 4, "nbformat_minor": 2 }