{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Zero-Shot Image Classification with Multimodal Models and FiftyOne" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Traditionally, computer vision models are trained to predict a fixed set of categories. For image classification, for instance, many standard models are trained on the ImageNet dataset, which contains 1,000 categories. All images *must* be assigned to one of these 1,000 categories, and the model is trained to predict the correct category for each image.\n", "\n", "For object detection, many popular models like YOLOv5, YOLOv8, and YOLO-NAS are trained on the MS COCO dataset, which contains 80 categories. In other words, the model is trained to detect objects in any of these categories, and ignore all other objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to the recent advances in multimodal models, it is now possible to perform zero-shot learning, which allows us to predict categories that were *not* seen during training. This can be especially useful when:\n", "\n", "- We want to roughly pre-label images with a new set of categories\n", "- Obtaining labeled data for all categories is impractical or impossible.\n", "- The categories are changing over time, and we want to predict new categories without retraining the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this walkthrough, you will learn how to apply and evaluate zero-shot image classification models to your data with FiftyOne, Hugging Face [Transformers](https://docs.voxel51.com/integrations/huggingface.html), and [OpenCLIP](https://docs.voxel51.com/integrations/openclip.html)!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It covers the following:\n", "\n", "- Loading zero-shot image classification models from Hugging Face and OpenCLIP with the [FiftyOne Model Zoo](https://docs.voxel51.com/user_guide/model_zoo/index.html)\n", "- Using the models to predict categories for images in your dataset\n", "- Evaluating the predictions with FiftyOne" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to illustrate how to work with many multimodal models:\n", "\n", "- [OpenAI CLIP](https://arxiv.org/abs/2103.00020)\n", "- [AltCLIP](https://arxiv.org/abs/2211.06679v2)\n", "- [ALIGN](https://arxiv.org/abs/2102.05918)\n", "- [CLIPA](https://arxiv.org/abs/2305.07017)\n", "- [SigLIP](https://arxiv.org/abs/2303.15343)\n", "- [MetaCLIP](https://arxiv.org/abs/2309.16671)\n", "- [EVA-CLIP](https://arxiv.org/abs/2303.15389v1)\n", "- [Data Filtering Network (DFN)](https://arxiv.org/abs/2309.17425)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a breakdown of what each model brings to the table, check out our [🕶️ comprehensive collection of Awesome CLIP Papers](https://github.com/jacobmarks/awesome-clip-papers?tab=readme-ov-file)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this walkthrough, we will use the [Caltech-256 dataset](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#caltech-256), which contains 30,607 images across 257 categories. We will use 1000 randomly selected images from the dataset for demonstration purposes. The zero-shot models were not explicitly trained on the Caltech-256 dataset, so we will use this as a test of the models' zero-shot capabilities. Of course, you can use any dataset you like! 
\n", "\n", "💡 Your results may depend on how similar your dataset is to the training data of the zero-shot models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we start, let's install the required packages:\n", "\n", "```bash\n", "pip install -U torch torchvision fiftyone transformers timm open_clip_torch\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's import the relevant modules and load the dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import fiftyone.zoo as foz\n", "import fiftyone.brain as fob\n", "from fiftyone import ViewField as F" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "dataset = foz.load_zoo_dataset(\n", " \"caltech256\", \n", " max_samples=1000, \n", " shuffle=True, \n", " persistent=True\n", ")\n", "dataset.name = \"CLIP-Comparison\"\n", "\n", "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Initial Dataset](./images/zero_shot_classification_initial_dataset.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are using the `shuffle=True` option to randomly select 1000 images from the dataset, and are persisting the dataset to disk so that we can use it in future sessions. We also change the name of the dataset to reflect the experiment we are running." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's use the dataset's `distinct()` method to get a list of the distinct categories in the dataset, which we will give to the zero-shot models to predict:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classes = dataset.distinct(\"ground_truth.label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zero-Shot Image Classification with the FiftyOne Zero-Shot Prediction Plugin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a moment, we'll switch gears to a more explicit demonstration of how to load and apply zero-shot models in FiftyOne. This programmatic approach is useful for more advanced use cases, and illustrates how to use the models in a more flexible manner.\n", "\n", "For simpler scenarios, check out the [FiftyOne Zero-Shot Prediction Plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin), which provides a convenient graphical interface for applying zero-shot models to your dataset. The plugin supports all of the models we are going to use in this walkthrough, and is a great way to quickly experiment with zero-shot models in FiftyOne. In addition to classificaion, the plugin also supports zero-shot object detection, instance segmentation, and semantic segmentation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have the [FiftyOne Plugin Utils Plugin](https://github.com/voxel51/fiftyone-plugins) installed, you can install the Zero-Shot Prediction Plugin from the FiftyOne App:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Installing Zero-Shot Prediction Plugin](./images/zero_shot_classification_install_plugin.gif)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If not, you can install the plugin from the command line:\n", "\n", "```bash\n", "fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the plugin is installed, you can run zero-shot models from the FiftyOne App by pressing the backtick key ('\\`') to open the command palette, selecting `zero-shot-predict` or `zero-shot-classify` from the dropdown, and following the prompts:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Running Zero-Shot Prediction Plugin](./images/zero_shot_classification_run_plugin.gif)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Zero-Shot Image Classification with the FiftyOne Model Zoo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will show how to explicitly load and apply a variety of zero-shot classification models to your dataset with FiftyOne. Our models will come from three places:\n", "\n", "1. OpenAI's [CLIP](https://github.com/openai/CLIP) model, which is natively supported by FiftyOne\n", "2. [OpenCLIP](https://github.com/mlfoundations/open_clip), which is a collection of open-source CLIP-style models\n", "3. Hugging Face's [Transformers library](https://huggingface.co/docs/transformers/index), which provides a wide variety of zero-shot models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these models can be loaded from the FiftyOne Model Zoo via the `load_zoo_model()` function, although the arguments you pass to the function will depend on the model you are loading!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Recipe for Loading a Zero-Shot Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardless of the model you are loading, the basic recipe for loading a zero-shot model is as follows:\n", "\n", "```python\n", "model = foz.load_zoo_model(\n", " \"\",\n", " classes=classes,\n", " **kwargs\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The zoo model name is the name under which the model is registered in the FiftyOne Model Zoo. \n", "\n", "- `\"clip-vit-base32-torch\"` specifies the natively supported CLIP model, CLIP-ViT-B/32\n", "- `\"open-clip-torch\"` specifies that you want to load a specific model (architecture and pretrained checkpoint) from the OpenCLIP library. You can then specify the architecture with `clip_model=\"\"` and the checkpoint with `pretrained=\"\"`. We will see examples of this in a moment. For a list of allowed architecture-checkpoint pairs, check out this [results table](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv) from the OpenCLIP documentation. The `name` column contains the value for `clip_model`.\n", "- `\"zero-shot-classification-transformer-torch\"` specifies that you want to a zero-shot image classification model from the Hugging Face Transformers library. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "💡 While we won't be exploring this degree of freedom, all of these models accept a `text_prompt` keyword argument, which allows you to override the prompt template used to embed the class names. Zero-shot classification results can vary based on this text!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our model loaded (and classes set), we can use it like any other image classification model in FiftyOne by calling the dataset's `apply_model()` method:\n", "\n", "```python\n", "dataset.apply_model(\n", " model,\n", " label_field=\"<field-name>\",\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For efficiency, we will also set our default batch size to 32, which will speed up the predictions:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "fo.config.default_batch_size = 32" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zero-Shot Image Classification with OpenAI CLIP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting off with the natively supported CLIP model, we can load and apply it to our dataset as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clip = foz.load_zoo_model(\n", " \"clip-vit-base32-torch\",\n", " classes=classes,\n", ")\n", "\n", "dataset.apply_model(clip, label_field=\"clip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After adding our predictions to the specified field, we can optionally attach some high-level information describing what the field contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "field = dataset.get_field(\"clip\")\n", "field.description = \"OpenAI CLIP predictions\"\n", "field.info = {\"clip_model\": \"CLIP-ViT-B-32\"}\n", "field.save()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = fo.launch_app(dataset, auto=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the FiftyOne App, open a tab in your browser and navigate to `http://localhost:5151`!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will then see this information when we hover over the \"clip\" field in the FiftyOne App. This can be useful if you want to use shorthand field names, or if you want to provide additional context to other users of the dataset.\n", "\n", "For the rest of the tutorial, we will omit this step for brevity, but you can add this information to any field in your dataset!"
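, "\n", "\n", "If you'd like a quick sanity check on the new predictions before moving on, you can peek at the distribution of predicted labels and at a single sample's `Classification` object. This is an optional, minimal sketch, assuming the `clip` field populated above:\n", "\n", "```python\n", "# Distribution of CLIP's predicted labels across the dataset\n", "print(dataset.count_values(\"clip.label\"))\n", "\n", "# Inspect the prediction stored on a single sample\n", "print(dataset.first().clip)\n", "```"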
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zero-Shot Image Classification with OpenCLIP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make life interesting, we will be running inference with 5 different OpenCLIP models:\n", "\n", "- CLIPA\n", "- Data Filtering Network (DFN)\n", "- EVA-CLIP\n", "- MetaCLIP\n", "- SigLIP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To reduce the repetition, we're just going to create a dictionary for the `clip_model` and `pretrained` arguments, and then loop through the dictionary to load and apply the models to our dataset:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "open_clip_args = {\n", " \"clipa\": {\n", " \"clip_model\": 'hf-hub:UCSC-VLAA/ViT-L-14-CLIPA-datacomp1B',\n", " \"pretrained\": '',\n", " },\n", " \"dfn\": {\n", " \"clip_model\": 'ViT-B-16',\n", " \"pretrained\": 'dfn2b',\n", " },\n", " \"eva02_clip\": {\n", " \"clip_model\": 'EVA02-B-16',\n", " \"pretrained\": 'merged2b_s8b_b131k',\n", " },\n", " \"metaclip\": {\n", " \"clip_model\": 'ViT-B-32-quickgelu',\n", " \"pretrained\": 'metaclip_400m',\n", " },\n", " \"siglip\": {\n", " \"clip_model\": 'hf-hub:timm/ViT-B-16-SigLIP',\n", " \"pretrained\": '',\n", " },\n", " }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for name, args in open_clip_args.items():\n", " clip_model = args[\"clip_model\"]\n", " pretrained = args[\"pretrained\"]\n", " model = foz.load_zoo_model(\n", " \"open-clip-torch\",\n", " clip_model=clip_model,\n", " pretrained=pretrained,\n", " classes=classes,\n", " )\n", "\n", " dataset.apply_model(model, label_field=name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zero-Shot Image Classification with Hugging Face Transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we will load and apply zero-shot image classification model sfrom the Hugging Face Transformers library. Once again, we will loop through a dictionary of model names and apply the models to our dataset:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "transformer_model_repo_ids = {\n", " \"altclip\": \"BAAI/AltCLIP\",\n", " \"align\": \"kakaobrain/align-base\"\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for name, repo_id in transformer_model_repo_ids.items():\n", " model = foz.load_zoo_model(\n", " \"zero-shot-classification-transformer-torch\",\n", " name_or_path=repo_id,\n", " classes=classes,\n", " )\n", "\n", " dataset.apply_model(model, label_field=name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating Zero-Shot Image Classification Predictions with FiftyOne" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using FiftyOne's Evaluation API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have applied all of our zero-shot models to our dataset, we can evaluate the predictions with FiftyOne! As a first step, let's use FiftyOne's [Evaluation API](https://docs.voxel51.com/user_guide/evaluation.html) to assign True/False labels to the predictions based on whether they match the ground truth labels." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will use the dataset's schema to get a list of all of the fields that contain predictions:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "classification_fields = sorted(list(\n", " dataset.get_field_schema(\n", " ftype=fo.EmbeddedDocumentField, embedded_doc_type=fo.Classification\n", " ).keys())\n", ")\n", "\n", "prediction_fields = [f for f in classification_fields if f != \"ground_truth\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we will loop through these prediction fields and apply the dataset's `evaluate_classifications()` method to each one, evaluating against the `ground_truth` field:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for pf in prediction_fields:\n", " eval_key = f\"{pf}_eval\"\n", " dataset.evaluate_classifications(\n", " pf,\n", " gt_field=\"ground_truth\",\n", " eval_key=eval_key,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then easily filter the dataset based on which models predicted the ground truth labels correctly, either programmatically in Python, or in the FiftyOne App. For example, here is how we could specify the view into the dataset containing all samples where SigLIP predicted the ground truth label correctly and CLIP did not:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "dataset = fo.load_dataset(\"CLIP-Comparison\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "siglip_not_clip_view = dataset.match((F(\"siglip_eval\") == True) & (F(\"clip_eval\") == False))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There were 57 samples where the SigLIP model predicted correctly and the CLIP model did not.\n" ] } ], "source": [ "num_siglip_not_clip = len(siglip_not_clip_view)\n", "print(f\"There were {num_siglip_not_clip} samples where the SigLIP model predicted correctly and the CLIP model did not.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is how we would accomplish the same thing in the FiftyOne App:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Samples where SigLIP is Right and CLIP isn't](./images/zero_shot_classification_siglip_not_clip_view.gif)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### High-Level Insights using Aggregations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the predictions evaluated, we can use FiftyOne's [aggregation](https://docs.voxel51.com/user_guide/using_aggregations.html?highlight=aggregation) capabilities to get high-level insights into the performance of the zero-shot models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will allow us to answer questions like:\n", "\n", "- Which model was \"correct\" most often?\n", "- What models were most or least confident in their predictions?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first question, we can use the `count_values()` aggregation on the evaluation fields for our predictions, which will give us a count of the number of times each model was correct or incorrect. 
As an example:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{False: 197, True: 803}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.count_values(f\"clip_eval\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looping over our prediction fields and turning these raw counts into percentages, we can get a high-level view of the performance of our models:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "align: 83.7% correct\n", "altclip: 87.6% correct\n", "clip: 80.3% correct\n", "clipa: 88.9% correct\n", "dfn: 91.0% correct\n", "eva02_clip: 85.6% correct\n", "metaclip: 84.3% correct\n", "siglip: 64.9% correct\n" ] } ], "source": [ "for pf in prediction_fields:\n", " eval_results = dataset.count_values(f\"{pf}_eval\")\n", " percent_correct = eval_results.get(True, 0) / sum(eval_results.values())\n", " print(f\"{pf}: {percent_correct:.1%} correct\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At least on this dataset, it looks like the DFN model was the clear winner, with the highest percentage of correct predictions. The other strong performers were CLIPA and AltCLIP." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To answer the second question, we can use the `mean()` aggregation to get the average confidence of each model's predictions. This will give us a sense of how confident each model was in its predictions:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean confidence for align: 0.774\n", "Mean confidence for altclip: 0.883\n", "Mean confidence for clip: 0.770\n", "Mean confidence for clipa: 0.912\n", "Mean confidence for dfn: 0.926\n", "Mean confidence for eva02_clip: 0.843\n", "Mean confidence for metaclip: 0.824\n", "Mean confidence for siglip: 0.673\n" ] } ], "source": [ "for pf in prediction_fields:\n", " mean_conf = dataset.mean(F(f\"{pf}.confidence\"))\n", " print(f\"Mean confidence for {pf}: {mean_conf:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the most part, mean model confidence seems pretty strongly correlated with model accuracy. The DFN model, which was the most accurate, also had the highest mean confidence!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Advanced Insights using ViewExpressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These high-level insights are useful, but as always, they only tell part of the story. To get a more nuanced understanding of the performance of our zero-shot models — and how the models interface with our data — we can use FiftyOne's [ViewExpressions](https://docs.voxel51.com/user_guide/using_views.html#filtering) to construct rich views of our data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing we might want to see is where all of the models were correct or incorrect. 
To probe these questions, we can construct a list with one `ViewExpression` for each model, and then use the `any()` and `all()` methods:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "exprs = [F(f\"{pf}_eval\") == True for pf in prediction_fields]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's see how many samples every model got correct:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "498 samples were right for all models\n", "Session launched. Run `session.show()` to open the App in a cell output.\n" ] } ], "source": [ "all_right_view = dataset.match(F().all(exprs))\n", "print(f\"{len(all_right_view)} samples were right for all models\")\n", "\n", "session = fo.launch_app(all_right_view, auto=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fact that about half of the time, all of the models are \"correct\" and in agreement is good validation of both our data quality and the capabilities of our zero-shot models!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about when all of the models are incorrect?" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "45 samples were wrong for all models\n", "Session launched. Run `session.show()` to open the App in a cell output.\n" ] } ], "source": [ "all_wrong_view = dataset.match(~F().any(exprs))\n", "print(f\"{len(all_wrong_view)} samples were wrong for all models\")\n", "\n", "session = fo.launch_app(all_wrong_view, auto=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![All Wrong View](./images/zero_shot_classification_all_wrong_view.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The samples where all of the models are supposedly incorrect are interesting and merit further investigation. It could be that the ground truth labels are incorrect, or that the images are ambiguous and difficult to classify. It could also be that the zero-shot models are not well-suited to the dataset, or that the models are not well-suited to the task. In any case, these samples are worth a closer look!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at some of these samples in the FiftyOne App, we can see that some of the ground truth labels are indeed ambiguous or incorrect. Take the second image, for example. It is labeled as `\"treadmill\"`, while all but one of the zero-shot models predict `\"horse\"`. To a human, the image does indeed look like a horse, and the ground truth label is likely incorrect.\n", "\n", "The seventh image is a prime example of ambiguity. The ground truth label is `\"sneaker\"`, but almost all of the zero-shot models predict `\"tennis-shoes\"`. It is difficult to say which label is correct, and it is likely that the ground truth label is not specific enough to be useful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a more precise view into the relative quality of our zero-shot models, we would need to handle these edge cases and re-evaluate on the improved dataset. \n", "\n", "💡 This is a great example of how the combination of zero-shot models and FiftyOne can be used to iteratively improve the quality of your data and your models!" 
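, "\n", "\n", "One lightweight way to act on these findings is to tag the suspicious samples so they can be routed for review or re-annotation. Here is a minimal sketch, assuming the `all_wrong_view` from above and a hypothetical tag name:\n", "\n", "```python\n", "# Flag the samples that every model got \"wrong\" for manual review\n", "all_wrong_view.tag_samples(\"possible_label_issue\")\n", "\n", "# Later, pull these samples back up by tag\n", "review_view = dataset.match_tags(\"possible_label_issue\")\n", "print(len(review_view))\n", "```"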
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we wrap up, let's construct one even more nuanced view of our data: the samples where just *one* of the models was correct. This will really help us understand the strengths and weaknesses of each model.\n", "\n", "To construct this view, we will copy the array of expressions, remove one model from the array, and see where that model was correct and the others were not. We will then loop through the models, and find the samples where each *any* of these conditions is met:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = len(prediction_fields)\n", "sub_exprs = []\n", "for i in range(n):\n", " tmp_exprs = exprs.copy()\n", " expr = tmp_exprs.pop(i)\n", " sub_exprs.append((expr & ~F().any(tmp_exprs)))\n", "\n", "one_right_view = dataset.match(F().any(sub_exprs))\n", "\n", "session = fo.launch_app(one_right_view, auto=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![One Right View](./images/zero_shot_classification_one_right_view.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at these samples in the FiftyOne App, a few things stand out:\n", "\n", "- First, the vast majority of these samples are primarily images of people's faces. A lot of the \"wrong\" predictions are related to people, faces, or facial features, like `\"eye-glasses\"`, `\"iris\"`, `\"yarmulke\"`, and `\"human-skeleton\"`. This is a good reminder that zero-shot models are not perfect, and that they are not well-suited to all types of images.\n", "\n", "- Second, of all 22 samples where only one model was correct, 11 of them were correctly predicted by the DFN model. This is more validation of the DFN model's strong performance on this dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zero-shot image classification is a powerful tool for predicting categories that were not seen during training. But it is not a panacea, and it is important to understand the strengths and weaknesses of zero-shot models, and how they interface with your data.\n", "\n", "In this walkthrough, we showed how to not only apply a variety of zero-shot image classification models to your data, but also how to evaluate them and choose the best model for your use case.\n", "\n", "The same principles can be applied to other types of zero-shot models, like zero-shot object detection, instance segmentation, and semantic segmentation. 
If you're interested in these use cases, check out the [FiftyOne Zero-Shot Prediction Plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin).\n", "\n", "For zero-shot object detection, here are some resources to get you started:\n", "\n", "- [YOLO-World](https://docs.voxel51.com/integrations/ultralytics.html#open-vocabulary-detection) from Ultralytics\n", "- [Zero-Shot Detection Transformers](https://docs.voxel51.com/integrations/huggingface.html#zero-shot-object-detection) from Hugging Face\n", "- [Evaluating Object Detections](https://docs.voxel51.com/tutorials/evaluate_detections.html) tutorial\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "fdev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 2 }