{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring Image Uniqueness with FiftyOne\n", "\n", "Models generally train best on *unique data samples*. For example, finding and removing similar samples in your dataset can prevent accidental concept imbalance that biases what your model learns. Likewise, if duplicate or near-duplicate data is present in both the training and validation/test splits, evaluation results may not be reliable.\n", "\n", "In this tutorial, we explore how FiftyOne's image uniqueness tool can be used to analyze and extract insights from raw (unlabeled) datasets.\n", "\n", "We'll cover the following concepts:\n", "\n", "- Loading a dataset from the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_zoo/index.html)\n", "- Applying FiftyOne's [uniqueness method](https://voxel51.com/docs/fiftyone/user_guide/brain.html#image-uniqueness) to your dataset\n", "- Launching the [FiftyOne App](https://voxel51.com/docs/fiftyone/user_guide/app.html) and visualizing/exploring your data\n", "- Identifying duplicate and near-duplicate images in your dataset\n", "- Identifying the most unique/representative images in your dataset\n", "\n", "**So, what's the takeaway?**\n", "\n", "This tutorial shows how FiftyOne can help you find and remove near-duplicate images in your datasets and recommend the most unique samples in your data, enabling you to start your model training off right with a high-quality bootstrapped training set." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "If you haven't already, install FiftyOne:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install fiftyone" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial requires either [Torchvision Datasets](https://pytorch.org/docs/stable/torchvision/datasets.html) or [TensorFlow Datasets](https://www.tensorflow.org/datasets) to download the CIFAR-10 dataset used below.\n", "\n", "You can, for example, install PyTorch as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install torch torchvision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Finding duplicate and near-duplicate images\n", "\n", "A common problem in dataset creation is duplicated data. Exact duplicates can be\n", "found using file hashing, as in the [image_deduplication](https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/image_deduplication.ipynb) walkthrough, but hashing\n", "breaks down when small manipulations have occurred in the data. Even more\n", "critical for workflows involving model training is the need to get as much\n", "value out of each data sample as possible; near-duplicates, which are samples\n", "that are exceptionally similar to one another, are intrinsically less valuable\n", "for the training scenario. Let's see if we can find such duplicates and\n", "near-duplicates in a common dataset: CIFAR-10." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the dataset\n", "\n", "Open a Python shell to begin. 
We will use the CIFAR-10 dataset, which is\n", "available in the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_zoo/datasets.html#dataset-zoo-cifar10):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Split 'test' already downloaded\n", "Loading 'cifar10' split 'test'\n", " 100% |█████████████| 10000/10000 [9.6s elapsed, 0s remaining, 1.0K samples/s] \n", "Dataset 'cifar10-test' created\n" ] } ], "source": [ "import fiftyone as fo\n", "import fiftyone.zoo as foz\n", "\n", "# Load the CIFAR-10 test split\n", "# Downloads the dataset from the web if necessary\n", "dataset = foz.load_zoo_dataset(\"cifar10\", split=\"test\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains the ground truth labels in a `ground_truth` field:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: cifar10-test\n", "Media type: image\n", "Num samples: 10000\n", "Persistent: False\n", "Tags: ['test']\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n" ] } ], "source": [ "print(dataset)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", "}>\n" ] } ], "source": [ "print(dataset.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's launch the [FiftyOne App](https://voxel51.com/docs/fiftyone/user_guide/app.html) and use the GUI to explore the dataset visually before we go any further:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { 
"data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute uniqueness\n", "\n", "Now we can process the entire dataset for uniqueness. This is a fairly expensive operation, but it should finish in a few minutes at most. The method processes every sample in the dataset, builds a representation that relates the samples to each other, and then analyzes this representation to output a uniqueness score for each sample." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating embeddings...\n", " 100% |█████████████| 10000/10000 [1.2m elapsed, 0s remaining, 166.0 samples/s] \n", "Computing uniqueness...\n", "Uniqueness computation complete\n" ] } ], "source": [ "import fiftyone.brain as fob\n", "\n", "fob.compute_uniqueness(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above method populates a `uniqueness` field on each sample that contains\n", "the sample's uniqueness score. 
Let's confirm this by printing some information\n", "about the dataset:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: cifar10-test\n", "Media type: image\n", "Num samples: 10000\n", "Persistent: False\n", "Tags: ['test']\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n", " uniqueness: fiftyone.core.fields.FloatField\n" ] } ], "source": [ "# Now the samples have a \"uniqueness\" field on them\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", " 'uniqueness': 0.4978482190810026,\n", "}>\n" ] } ], "source": [ "print(dataset.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize to find duplicate and near-duplicate images\n", "\n", "Now, let's visually inspect the least unique images in the dataset to see if\n", "our dataset has any issues:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Sort in increasing order of uniqueness (least unique first)\n", "dups_view = dataset.sort_by(\"uniqueness\")\n", "\n", "# Open view in the App\n", "session.view = dups_view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will easily see some near-duplicates in the App. It surprised us that\n", "there are duplicates in CIFAR-10, too!\n", "\n", "Of course, in this scenario, near-duplicates are identified by visual\n", "inspection. So, how do we get that information out of FiftyOne and back into\n", "your working environment? Easy! The `session` variable provides a bidirectional\n", "bridge between the App and your Python environment. In this case, we will\n", "use the `session.selected` property, which lists the IDs of the samples\n", "currently selected in the App. So, in the App, select some of the\n", "duplicates and near-duplicates. Then, execute the following code in the Python\n", "shell." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get currently selected images from App\n", "dup_ids = session.selected\n", "\n", "# Mark as duplicates\n", "dups_view = dataset.select(dup_ids)\n", "dups_view.tag_samples(\"dups\")\n", "\n", "# Visualize duplicates-only in App\n", "session.view = dups_view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The App now shows only these samples. We can, of course, access\n", "the filepaths and other information about these samples programmatically so you\n", "can act on the findings. We'll do that at the end of Part 2 below!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Bootstrapping a dataset of unique samples\n", "\n", "When building a dataset, it is important to create a diverse dataset with\n", "unique and representative samples. Here, we explore FiftyOne's ability to help\n", "identify the most unique samples in a raw dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download some images\n", "\n", "This walkthrough will process a directory of images and compute their\n", "uniqueness. The first thing we need to do is get some images. Let's get some\n", "images from Flickr, to keep this interesting!\n", "\n", "You need a Flickr API key to do this. If you already have a Flickr API key,\n", "then skip to step 4.\n", "\n", "1. Go to \n", "2. Click on Request API Key.\n", " () You will need to\n", " log in (create a free account if needed).\n", "3. Click on \"Non-Commercial API Key\" (this is just for test usage) and fill\n", " in the information on the next page. You do not need to be very descriptive;\n", " your API key will automatically appear on the following page.\n", "4. 
Install the Flickr API client:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install flickrapi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will also need to enable ETA's storage support to run this script, if you\n", "haven't yet:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"voxel51-eta[storage]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's download three sets of images to process together. We suggest using\n", "three distinct object nouns like \"badger\", \"wolverine\", and \"kitten\". For the\n", "actual downloading, we will use the provided [query_flickr.py](https://raw.githubusercontent.com/voxel51/fiftyone/develop/docs/source/tutorials/query_flickr.py) script:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from query_flickr import query_flickr\n", "\n", "# Your credentials here\n", "KEY = \"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\"\n", "SECRET = \"XXXXXXXXXXXXXXXX\"\n", "\n", "query_flickr(KEY, SECRET, \"badger\")\n", "query_flickr(KEY, SECRET, \"wolverine\")\n", "query_flickr(KEY, SECRET, \"kitten\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rest of this walkthrough assumes you've downloaded some images to your\n", "local `data/` directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the data into FiftyOne\n", "\n", "Let's now work through getting this data into FiftyOne and\n", "working with it." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100% |█████████████████| 167/167 [160.7ms elapsed, 0s remaining, 1.0K samples/s] \n", "Name: flickr-images\n", "Media type: image\n", "Num samples: 167\n", "Persistent: False\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", "\n" ] } ], "source": [ "import fiftyone as fo\n", "\n", "dataset = fo.Dataset.from_images_dir(\"data\", recursive=True, name=\"flickr-images\")\n", "\n", "print(dataset)\n", "print(dataset.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above command uses a [factory method](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=from_images_dir#fiftyone.core.dataset.Dataset.from_images_dir) on the `Dataset` class to traverse a directory of images (including subdirectories) and generate a dataset instance in FiftyOne containing those images.\n", "\n", "Note that the image contents are not loaded from disk, so this operation is fast. The first argument is the path to the directory of images on disk, `recursive=True` includes subdirectories, and `name` sets the name of the dataset.\n", "\n", "With the dataset loaded into FiftyOne, we can easily launch the App and visualize it:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Refer to the [User Guide](https://voxel51.com/docs/fiftyone/user_guide/index.html) for more\n", "useful things you can do with the dataset and App." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute uniqueness and analyze\n", "\n", "Now, let's analyze the data. For example, we may want to know which images\n", "are the most unique, since they may help or harm model training, or we may\n", "want to discover duplicate or redundant samples.\n", "\n", "Continuing in the same Python shell, let's compute and visualize uniqueness." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating embeddings...\n", " 100% |█████████████████| 167/167 [1.8s elapsed, 0s remaining, 94.6 samples/s] \n", "Computing uniqueness...\n", "Uniqueness computation complete\n", "Name: flickr-images\n", "Media type: image\n", "Num samples: 167\n", "Persistent: False\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " uniqueness: fiftyone.core.fields.FloatField\n" ] } ], "source": [ "import fiftyone.brain as fob\n", "\n", "fob.compute_uniqueness(dataset)\n", "\n", "# Now the samples have a \"uniqueness\" field on them\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(dataset.first())" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Sort by uniqueness (most unique first)\n", "rank_view = dataset.sort_by(\"uniqueness\", reverse=True)\n", "\n", "# Visualize in the App\n", "session.view = rank_view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, just visualizing the samples is interesting, but we want more. We want to\n", "get the most unique samples from our dataset so that we can use them in our\n", "work. Let's do just that. In the same Python session, execute the following\n", "code." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "2428280852_6c77fe2877_c.jpg\n", "49733688496_b6fc5cde41_c.jpg\n", "2843545851_6e1dc16dfc_c.jpg\n", "7466201514_0a3c7d615a_c.jpg\n", "6176873587_d0744926cb_c.jpg\n", "33891021626_4cfe3bf1d2_c.jpg\n", "8303699893_a7c14c04d3_c.jpg\n", "388994554_34d60d1b18_c.jpg\n", "5880167199_906172bc50_c.jpg\n", "8538740443_a587bfe75c_c.jpg\n" ] } ], "source": [ "import os\n", "\n", "# Verify that the most unique sample has the maximal uniqueness of 1.0\n", "print(rank_view.first())\n", "\n", "# Extract paths to 10 most unique samples\n", "ten_best = [x.filepath for x in rank_view.limit(10)]\n", "\n", "# Print just the filename of each path\n", "for filepath in ten_best:\n", "    print(os.path.basename(filepath))\n", "\n", "# Then you can do what you want with these.\n", "# Output to csv or json, send images to your annotation team, seek additional\n", "# similar data, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can simply tag the most unique samples and persist the dataset so you can return to it later in FiftyOne." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "rank_view.limit(10).tag_samples(\"unique\")\n", "\n", "dataset.persistent = True" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "session.freeze() # screenshot the active App for sharing" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 4 }