{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- Autogenerated by `scripts/make_examples.py` -->\n",
    "<table align=\"left\">\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/image_uniqueness.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791629-6e618700-5769-11eb-857f-d176b37d2496.png\" height=\"32\" width=\"32\">\n",
    "            Try in Google Colab\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://nbviewer.jupyter.org/github/voxel51/fiftyone-examples/blob/master/examples/image_uniqueness.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791634-6efa1d80-5769-11eb-8a4c-71d6cb53ccf0.png\" height=\"32\" width=\"32\">\n",
    "            Share via nbviewer\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://github.com/voxel51/fiftyone-examples/blob/master/examples/image_uniqueness.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791633-6efa1d80-5769-11eb-8ee3-4b2123fe4b66.png\" height=\"32\" width=\"32\">\n",
    "            View on GitHub\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a href=\"https://github.com/voxel51/fiftyone-examples/raw/master/examples/image_uniqueness.ipynb\" download>\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104792428-60f9cc00-576c-11eb-95a4-5709d803023a.png\" height=\"32\" width=\"32\">\n",
    "            Download notebook\n",
    "        </a>\n",
    "    </td>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring Image Uniqueness\n",
    "\n",
    "This example provides a brief overivew of using FiftyOne's [image uniqueness method](https://voxel51.com/docs/fiftyone/user_guide/brain.html#image-uniqueness) to analyze and extract insights from unlabeled datasets.\n",
    "\n",
    "For more details, check out the in-depth [image uniqueness tutorial](https://voxel51.com/docs/fiftyone/tutorials/uniqueness.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "If you haven't already, install FiftyOne:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install fiftyone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load dataset\n",
    "\n",
    "We'll work with the test split of the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), which is\n",
    "conveniently available in the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/zoo.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Split 'test' already downloaded\n",
      "Loading existing dataset 'cifar10-test'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use\n",
      "Name:           image-uniqueness-example\n",
      "Media type:     None\n",
      "Num samples:    10000\n",
      "Persistent:     True\n",
      "Info:           {'classes': ['airplane', 'automobile', 'bird', ...]}\n",
      "Tags:           ['test']\n",
      "Sample fields:\n",
      "    media_type:   fiftyone.core.fields.StringField\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "    uniqueness:   fiftyone.core.fields.FloatField\n"
     ]
    }
   ],
   "source": [
    "import fiftyone as fo\n",
    "import fiftyone.zoo as foz\n",
    "\n",
    "# Load the CIFAR-10 test split\n",
    "# This will download the dataset from the web, if necessary\n",
    "dataset = foz.load_zoo_dataset(\"cifar10\", split=\"test\")\n",
    "dataset.name = \"image-uniqueness-example\"\n",
    "\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index by visual uniqueness\n",
    "\n",
    "Next we'll index the dataset by visual uniqueness using a\n",
    "[builtin method](https://voxel51.com/docs/fiftyone/user_guide/brain.html#image-uniqueness)\n",
    "from the FiftyOne Brain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading uniqueness model...\n",
      "Loaded default deployment config for model 'simple_resnet_cifar10'\n",
      "Applied 0 setting(s) from default deployment config\n",
      "Preparing data...\n",
      "Generating embeddings...\n",
      " 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [2.5m elapsed, 0s remaining, 56.4 samples/s]      \n",
      "Computing uniqueness...\n",
      "Saving results...\n",
      " 100% |████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [18.3s elapsed, 0s remaining, 559.9 samples/s]      \n",
      "Uniqueness computation complete\n",
      "Name:           image-uniqueness-example\n",
      "Media type:     None\n",
      "Num samples:    10000\n",
      "Persistent:     True\n",
      "Info:           {'classes': ['airplane', 'automobile', 'bird', ...]}\n",
      "Tags:           ['test']\n",
      "Sample fields:\n",
      "    media_type:   fiftyone.core.fields.StringField\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "    uniqueness:   fiftyone.core.fields.FloatField\n"
     ]
    }
   ],
   "source": [
    "import fiftyone.brain as fob\n",
    "\n",
    "fob.compute_uniqueness(dataset)\n",
    "\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the dataset now has a `uniqueness` field that contains a numeric measure of the visual uniqueness of each sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'id': '5f89c1e54937ecdaa3ffa1a4',\n",
      "    'media_type': 'image',\n",
      "    'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000001.jpg',\n",
      "    'tags': BaseList(['test']),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {\n",
      "        'id': '5f89c1e54937ecdaa3ffa1a3',\n",
      "        'label': 'cat',\n",
      "        'confidence': None,\n",
      "        'logits': None,\n",
      "    }>,\n",
      "    'uniqueness': 0.4978481475892659,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "# View a sample from the dataset\n",
    "print(dataset.first())"
   ]
  },
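  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the new field, we can look at the range of scores and count how many samples fall below a cutoff. This is only a sketch: it assumes the `dataset` from the cells above and a FiftyOne version that provides the `bounds()` aggregation, and the `0.2` threshold is an arbitrary illustrative value, not a recommended setting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fiftyone import ViewField as F\n",
    "\n",
    "# Smallest and largest uniqueness scores in the dataset\n",
    "min_score, max_score = dataset.bounds(\"uniqueness\")\n",
    "print(min_score, max_score)\n",
    "\n",
    "# Illustrative cutoff (an assumption); tune it by inspecting samples in the App\n",
    "threshold = 0.2\n",
    "low_uniqueness_view = dataset.match(F(\"uniqueness\") < threshold)\n",
    "print(len(low_uniqueness_view))"
   ]
  },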
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize near-duplicate samples in the App\n",
    "\n",
    "Let's open the dataset in the App:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "App launched\n"
     ]
    }
   ],
   "source": [
    "# View dataset in the App\n",
    "session = fo.launch_app(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![uniqueness-01](https://user-images.githubusercontent.com/25985824/97113820-1adc2180-16c3-11eb-97b8-474878099522.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the App, we can show the most visually similar images in the dataset by creating a `SortBy(\"uniqueness\", reverse=False)` stage in the [view bar](https://voxel51.com/docs/fiftyone/user_guide/app.html#using-the-view-bar).\n",
    "\n",
    "Alternatively, this same operation can be performed programmatically via Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show least unique images first\n",
    "least_unique_view = dataset.sort_by(\"uniqueness\", reverse=False)\n",
    "\n",
    "# Open view in App\n",
    "session.view = least_unique_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![uniqueness-02](https://user-images.githubusercontent.com/25985824/97113818-1a438b00-16c3-11eb-96a7-4307d65ddc1f.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Omit near-duplicate samples from the dataset\n",
    "\n",
    "Next, we'll show how to omit visually similar samples from a dataset.\n",
    "\n",
    "First, use the App to select visually similar samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![uniqueness-03](https://user-images.githubusercontent.com/25985824/97113816-19125e00-16c3-11eb-856d-8720d4bf50df.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming the visually similar samples are currently selected in the App, we can easily add a `duplicate` tag to these samples via Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['5f89c1f54937ecdaa3fffb11', '5f89c1f04937ecdaa3ffde28', '5f89c1eb4937ecdaa3ffc52f', '5f89c1f84937ecdaa30010ec', '5f89c1f94937ecdaa3001458', '5f89c1f24937ecdaa3ffe959', '5f89c1ec4937ecdaa3ffcd45', '5f89c1ec4937ecdaa3ffce32', '5f89c1f04937ecdaa3ffe0da']\n"
     ]
    }
   ],
   "source": [
    "# Get currently selected images from App\n",
    "dup_ids = session.selected\n",
    "print(dup_ids)\n",
    "\n",
    "# Get view containing selected samples\n",
    "dups_view = dataset.select(dup_ids)\n",
    "\n",
    "# Mark as duplicates\n",
    "for sample in dups_view:\n",
    "    sample.tags.append(\"duplicate\")\n",
    "    sample.save()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can, for example, then use the `MatchTag(\"duplicate\")` stage in the [view bar](https://voxel51.com/docs/fiftyone/user_guide/app.html#using-the-view-bar) to re-isolate the duplicate samples.\n",
    "\n",
    "Alternatively, this same operation can be performed programmatically via Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select samples with `duplicate` tag\n",
    "dups_tag_view = dataset.match_tags(\"duplicate\")\n",
    "\n",
    "# Open view in App\n",
    "session.view = dups_tag_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![uniqueness-04](https://user-images.githubusercontent.com/25985824/97113813-16b00400-16c3-11eb-9031-a097e24ecd5a.png)"
   ]
  },
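  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before exporting, it can be worth confirming how many samples actually carry the tag. A minimal sketch, assuming the `dataset` from the cells above and a FiftyOne version that provides `count_sample_tags()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Map each sample tag to the number of samples that carry it\n",
    "print(dataset.count_sample_tags())\n",
    "\n",
    "# Number of samples tagged as duplicates\n",
    "print(len(dataset.match_tags(\"duplicate\")))"
   ]
  },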
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export de-duplicated dataset\n",
    "\n",
    "Now let's [create a view](https://voxel51.com/docs/fiftyone/user_guide/using_views.html#filtering)\n",
    "that omits samples with the `duplicate` tag, and then export them to disk as an [image classification directory tree](https://voxel51.com/docs/fiftyone/user_guide/export_datasets.html#imageclassificationdirectorytree):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |██████████████████████████████████████████████████████████████████████████████████████████████████████████| 9991/9991 [13.1s elapsed, 0s remaining, 779.2 samples/s]       \n"
     ]
    }
   ],
   "source": [
    "from fiftyone import ViewField as F\n",
    "\n",
    "# Get samples that do not have the `duplicate` tag\n",
    "no_dups_view = dataset.match(~F(\"tags\").contains(\"duplicate\"))\n",
    "\n",
    "# Export dataset to disk as a classification directory tree\n",
    "no_dups_view.export(\n",
    "    \"/tmp/fiftyone-examples/cifar10-no-dups\",\n",
    "    fo.types.ImageClassificationDirectoryTree\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's list the contents of the exported dataset on disk to verify the export:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 0\r\n",
      "drwxr-xr-x    12 Brian  wheel   384B Oct 25 13:03 \u001b[34m.\u001b[m\u001b[m\r\n",
      "drwxr-xr-x     3 Brian  wheel    96B Oct 25 13:03 \u001b[34m..\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1001 Brian  wheel    31K Oct 25 13:03 \u001b[34mairplane\u001b[m\u001b[m\r\n",
      "drwxr-xr-x   995 Brian  wheel    31K Oct 25 13:03 \u001b[34mautomobile\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mbird\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mcat\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mdeer\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mdog\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mfrog\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1001 Brian  wheel    31K Oct 25 13:03 \u001b[34mhorse\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mship\u001b[m\u001b[m\r\n",
      "drwxr-xr-x  1002 Brian  wheel    31K Oct 25 13:03 \u001b[34mtruck\u001b[m\u001b[m\r\n"
     ]
    }
   ],
   "source": [
    "# Check the top-level directory structure\n",
    "!ls -lah /tmp/fiftyone-examples/cifar10-no-dups"
   ]
  },
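  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The per-class totals can also be tallied in Python, which works on systems where `ls` is unavailable. A minimal sketch, assuming the export path used above and that the exported images keep a `.jpg` extension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "export_dir = Path(\"/tmp/fiftyone-examples/cifar10-no-dups\")\n",
    "\n",
    "# Count the exported images in each class directory\n",
    "counts = {\n",
    "    class_dir.name: sum(1 for _ in class_dir.glob(\"*.jpg\"))\n",
    "    for class_dir in sorted(export_dir.iterdir())\n",
    "    if class_dir.is_dir()\n",
    "}\n",
    "\n",
    "print(counts)\n",
    "print(\"Total images:\", sum(counts.values()))"
   ]
  },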
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 7992\r\n",
      "drwxr-xr-x  1001 Brian  wheel    31K Oct 25 13:03 .\r\n",
      "drwxr-xr-x    12 Brian  wheel   384B Oct 25 13:03 ..\r\n",
      "-rw-r--r--     1 Brian  wheel   1.2K Oct 25 13:03 000004.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.1K Oct 25 13:03 000011.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.1K Oct 25 13:03 000022.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.3K Oct 25 13:03 000028.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.2K Oct 25 13:03 000045.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.2K Oct 25 13:03 000053.jpg\r\n",
      "-rw-r--r--     1 Brian  wheel   1.3K Oct 25 13:03 000075.jpg\r\n"
     ]
    }
   ],
   "source": [
    "# View the contents of a class directory\n",
    "!ls -lah /tmp/fiftyone-examples/cifar10-no-dups/airplane | head"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}