{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Image Deduplication with FiftyOne\n", "\n", "This recipe demonstrates how to use FiftyOne to detect and\n", "remove duplicate images from your dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "If you haven't already, install FiftyOne:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install fiftyone" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook also requires the `tensorflow` package:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install tensorflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the data\n", "\n", "First we download the dataset to disk. The dataset is a 1000-sample subset of\n", "CIFAR-100, a dataset of 32x32-pixel images, each with one of 100 different\n", "classification labels such as `apple`, `bicycle`, `porcupine`, etc. You can download it with this [helper script](https://raw.githubusercontent.com/voxel51/fiftyone/develop/docs/source/recipes/image_deduplication_helpers.py)."
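] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The helper's `download_dataset()` function is not reproduced here, but, as its output below shows, it downloads the samples and then corrupts the data by duplicating roughly 5% of the files. A minimal sketch of such a corruption step, using only the standard library (a hypothetical illustration, not the actual helper implementation):\n",
"\n",
"```python\n",
"# Hypothetical sketch: overwrite a random fraction of the files\n",
"# on disk with copies of other files to create exact duplicates\n",
"import os\n",
"import random\n",
"import shutil\n",
"\n",
"def corrupt_with_duplicates(root, fraction=0.05, seed=51):\n",
"    # Collect every file under the dataset root\n",
"    paths = [\n",
"        os.path.join(dirpath, name)\n",
"        for dirpath, _, names in os.walk(root)\n",
"        for name in names\n",
"    ]\n",
"    rng = random.Random(seed)\n",
"    for dst in rng.sample(paths, int(fraction * len(paths))):\n",
"        # Replace this file with a copy of some other file\n",
"        src = rng.choice([p for p in paths if p != dst])\n",
"        shutil.copy(src, dst)\n",
"```\n",
"\n",
"Now run the helper to download (and corrupt) the data:"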
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading dataset of 1000 samples to:\n", "\t/tmp/fiftyone/cifar100_with_duplicates\n", "and corrupting the data (5% duplicates)\n", "Download successful\n" ] } ], "source": [ "from image_deduplication_helpers import download_dataset\n", "\n", "download_dataset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above script uses `tensorflow.keras.datasets` to download the dataset, so\n", "you must have [TensorFlow installed](https://www.tensorflow.org/install).\n", "\n", "The dataset is organized on disk as follows:\n", "\n", "```\n", "/tmp/fiftyone/\n", "└── cifar100_with_duplicates/\n", "    ├── <classA>/\n", "    │   ├── <image1>.jpg\n", "    │   ├── <image2>.jpg\n", "    │   └── ...\n", "    ├── <classB>/\n", "    │   ├── <image1>.jpg\n", "    │   ├── <image2>.jpg\n", "    │   └── ...\n", "    └── ...\n", "```\n", "\n", "As we will soon come to discover, some of these samples are duplicates and we\n", "have no clue which they are!"
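] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Before loading anything into FiftyOne, you can sanity-check this layout with a few lines of standard-library Python (a quick illustrative snippet; it assumes the download above succeeded):\n",
"\n",
"```python\n",
"import os\n",
"\n",
"def summarize_tree(root):\n",
"    # One subdirectory per class; images are the leaf files\n",
"    num_classes = len(os.listdir(root))\n",
"    num_images = sum(len(files) for _, _, files in os.walk(root))\n",
"    return num_classes, num_images\n",
"\n",
"root = '/tmp/fiftyone/cifar100_with_duplicates'\n",
"if os.path.isdir(root):\n",
"    print(summarize_tree(root))  # total image count should be 1000\n",
"```"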
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a dataset\n", "\n", "Let's start by importing the FiftyOne library:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use a utility method provided by FiftyOne to load the image\n", "classification dataset from disk:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100% |████████████| 1000/1000 [1.2s elapsed, 0s remaining, 718.5 samples/s] \n" ] } ], "source": [ "import os\n", "\n", "dataset_name = \"cifar100_with_duplicates\"\n", "dataset_dir = os.path.join(\"/tmp/fiftyone\", dataset_name)\n", "\n", "dataset = fo.Dataset.from_dir(\n", " dataset_dir,\n", " fo.types.ImageClassificationDirectoryTree,\n", " name=dataset_name\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the dataset\n", "\n", "We can poke around in the dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: cifar100_with_duplicates\n", "Media type: image\n", "Num samples: 1000\n", "Persistent: False\n", "Info: {'classes': ['apple', 'aquarium_fish', 'baby', ...]}\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n" ] } ], "source": [ "# Print summary information about the dataset\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", "}>\n" ] } ], "source": [ "# Print a sample\n", 
"print(dataset.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a view that contains only samples whose ground truth label is\n", "`mountain`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset: cifar100_with_duplicates\n", "Media type: image\n", "Num samples: 8\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n", "View stages:\n", " 1. Match(filter={'$expr': {'$eq': [...]}})\n" ] } ], "source": [ "# Used to write view expressions that involve sample fields\n", "from fiftyone import ViewField as F\n", "\n", "view = dataset.match(F(\"ground_truth.label\") == \"mountain\")\n", "\n", "# Print summary information about the view\n", "print(view)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", "}>\n" ] } ], "source": [ "# Print the first sample in the view\n", "print(view.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a view with samples sorted by their ground truth labels in reverse\n", "alphabetical order:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset: cifar100_with_duplicates\n", "Media type: image\n", "Num samples: 1000\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: 
fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n", "View stages:\n", " 1. SortBy(field_or_expr='ground_truth', reverse=True)\n" ] } ], "source": [ "view = dataset.sort_by(\"ground_truth\", reverse=True)\n", "\n", "# Print summary information about the view\n", "print(view)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", "}>\n" ] } ], "source": [ "# Print the first sample in the view\n", "print(view.first())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the dataset\n", "\n", "Start browsing the dataset:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session = fo.launch_app(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Narrow your scope to 10 random samples:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.view = dataset.take(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click on some samples in the App to select them and access their IDs from\n", "code!" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Get the IDs of the currently selected samples in the App\n", "sample_ids = session.selected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a view that contains your currently selected samples:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "selected_view = dataset.select(session.selected)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update the App to only show your selected samples:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.view = selected_view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute file hashes\n", "\n", "Iterate over the samples and compute their file hashes:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: cifar100_with_duplicates\n", "Media type: image\n", "Num samples: 1000\n", "Persistent: False\n", "Info: {'classes': ['apple', 'aquarium_fish', 'baby', ...]}\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n", " file_hash: fiftyone.core.fields.IntField\n" ] } ], "source": [ "import fiftyone.core.utils as fou\n", "\n", "for sample in dataset:\n", " sample[\"file_hash\"] = fou.compute_filehash(sample.filepath)\n", " sample.save()\n", "\n", "print(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have two ways to visualize this new information.\n", "\n", "First, you can view the sample from your Terminal:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",\n", " 'file_hash': -6346202338154836458,\n", "}>\n" ] } ], "source": [ "sample = dataset.first()\n", "print(sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can refresh the App and toggle on the new `file_hash` field:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.dataset = dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for duplicates\n", "\n", "Now let's use a simple Python statement to locate the duplicate files in the\n", "dataset, i.e., those with the same file hashes:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of duplicate file hashes: 36\n" ] } ], "source": [ "from collections import Counter\n", "\n", "filehash_counts = Counter(sample.file_hash for sample in dataset)\n", "dup_filehashes = [k for k, v in filehash_counts.items() if v > 1]\n", "\n", "print(\"Number of duplicate file hashes: %d\" % len(dup_filehashes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's create a view that contains only the samples with these duplicate\n", "file hashes:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of images that have a duplicate: 72\n", "Number of duplicates: 36\n" ] } ], "source": [ "dup_view = (dataset\n", " # Extract samples with duplicate file hashes\n", " .match(F(\"file_hash\").is_in(dup_filehashes))\n", " # Sort by file hash so duplicates will be adjacent\n", " .sort_by(\"file_hash\")\n", ")\n", "\n", "print(\"Number of images that have a duplicate: %d\" % len(dup_view))\n", "print(\"Number of duplicates: %d\" % (len(dup_view) - len(dup_filehashes)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we can always use the App to visualize our work!" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", "
\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "session.view = dup_view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete duplicates\n", "\n", "Now let's delete the duplicate samples from the dataset using our `dup_view` to\n", "restrict our attention to known duplicates:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of dataset before: 1000\n", "Length of dataset after: 964\n", "Number of unique file hashes: 964\n" ] } ], "source": [ "print(\"Length of dataset before: %d\" % len(dataset))\n", "\n", "_dup_filehashes = set()\n", "for sample in dup_view:\n", " if sample.file_hash not in _dup_filehashes:\n", " _dup_filehashes.add(sample.file_hash)\n", " continue\n", "\n", " del dataset[sample.id]\n", "\n", "print(\"Length of dataset after: %d\" % len(dataset))\n", "\n", "# Verify that the dataset no longer contains any duplicates\n", "print(\"Number of unique file hashes: %d\" % len({s.file_hash for s in dataset}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export the deduplicated dataset\n", "\n", "Finally, let's export a fresh copy of our now-duplicate-free dataset:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100% |████████████| 964/964 [408.8ms elapsed, 0s remaining, 2.4K samples/s] \n" ] } ], "source": [ "EXPORT_DIR = \"/tmp/fiftyone/image-deduplication\"\n", "\n", "dataset.export(\n", " export_dir=EXPORT_DIR,\n", " dataset_type=fo.types.FiftyOneImageClassificationDataset,\n", " label_field=\"ground_truth\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check out the contents of `/tmp/fiftyone/image-deduplication` on disk to see how the data is\n", "organized.\n", "\n", "You can load the deduplicated dataset that you exported back into FiftyOne at\n", "any 
time as follows:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100% |████████████| 964/964 [1.1s elapsed, 0s remaining, 838.4 samples/s] \n", "Name: no_duplicates\n", "Media type: image\n", "Num samples: 964\n", "Persistent: False\n", "Info: {'classes': ['apple', 'aquarium_fish', 'baby', ...]}\n", "Tags: []\n", "Sample fields:\n", " filepath: fiftyone.core.fields.StringField\n", " tags: fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n", " metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n", " ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n" ] } ], "source": [ "no_dups_dataset = fo.Dataset.from_dir(\n", " EXPORT_DIR,\n", " fo.types.FiftyOneImageClassificationDataset,\n", " name=\"no_duplicates\",\n", ")\n", "\n", "print(no_dups_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup\n", "\n", "You can clean up the files generated by this recipe by running:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "!rm -rf /tmp/fiftyone" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "session.freeze() # screenshot the active App for sharing" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }