{
"cells": [
{
"cell_type": "markdown",
"id": "db662146",
"metadata": {},
"source": [
"[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)"
]
},
{
"cell_type": "markdown",
"id": "b74da379",
"metadata": {},
"source": [
"# Clean Image Folder\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)\n",
"[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb)\n",
"\n",
"This notebook shows how to use fastdup to analyze an image folder for potential issues and export a list of problematic files for further action.\n",
"\n",
"By the end of this notebook you will learn how to:\n",
"\n",
"+ Find various dataset issues with fastdup.\n",
"+ Export a list of problematic images for further action."
]
},
{
"cell_type": "markdown",
"id": "45bae965",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"If you're new, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb) for the best experience. If you'd just like to skim through the notebook, we recommend viewing it with [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb).\n",
"\n",
"Let's start with the installation:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1e6a3577",
"metadata": {},
"outputs": [],
"source": [
"!pip install fastdup -Uq"
]
},
{
"cell_type": "markdown",
"id": "fc9b6b89",
"metadata": {},
"source": [
"Now, test the installation by printing out the version. If there's no error message, we are ready to go!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6a49b5eb",
"metadata": {
"executionInfo": {
"elapsed": 2034,
"status": "ok",
"timestamp": 1677668109538,
"user": {
"displayName": "Tom Shani",
"userId": "00667426488827942961"
},
"user_tz": -120
},
"id": "6a49b5eb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/usr/bin/dpkg\n"
]
},
{
"data": {
"text/plain": [
"'1.23'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import fastdup\n",
"fastdup.__version__"
]
},
{
"cell_type": "markdown",
"id": "ff4dfa80-d1e4-46d1-ae10-e8715c16bb07",
"metadata": {
"id": "ff4dfa80-d1e4-46d1-ae10-e8715c16bb07"
},
"source": [
"## Download Dataset\n",
"\n",
"In this notebook, let's use the widely available and relatively well-curated [Food-101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) dataset.\n",
"\n",
"The Food-101 dataset consists of 101 food classes with 1,000 images per class, for a total of 101,000 images.\n",
"\n",
"Let's download the dataset and extract it into our local directory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a45d227",
"metadata": {},
"outputs": [],
"source": [
"!wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz\n",
"!tar -xf food-101.tar.gz"
]
},
{
"cell_type": "markdown",
"id": "7e2e70a3",
"metadata": {
"id": "7e2e70a3",
"tags": []
},
"source": [
"## Run fastdup"
]
},
{
"cell_type": "markdown",
"id": "b04063f6",
"metadata": {},
"source": [
"Once the extraction completes, we can run fastdup on the images.\n",
"\n",
"To do that, let's create a `fastdup` object and set `input_dir` to the folder that contains the images."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "29858edc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path.\n"
]
}
],
"source": [
"fd = fastdup.create(input_dir=\"food-101/images/\")"
]
},
{
"cell_type": "markdown",
"id": "13777969",
"metadata": {},
"source": [
"> **NOTE**: If you're running this example on Google Colab, we recommend running with `num_images=40000` in the following cell. This limits fastdup to 40,000 images instead of the entire dataset, so the run takes less time to complete on Google Colab."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "637c1650",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n",
"2023-07-11 15:23:39 [INFO] Going to loop over dir food-101/images\n",
"2023-07-11 15:23:40 [INFO] Found total 101000 images to run on, 101000 train, 0 test, name list 101000, counter 101000 \n",
"2023-07-11 15:27:12 [INFO] Found total 101000 images to run onmated: 0 Minutes\n",
"Finished histogram 29.784\n",
"Finished bucket sort 29.980\n",
"2023-07-11 15:27:37 [INFO] 24498) Finished write_index() NN model\n",
"2023-07-11 15:27:37 [INFO] Stored nn model index file work_dir/nnf.index\n",
"2023-07-11 15:27:49 [INFO] Total time took 249055 ms\n",
"2023-07-11 15:27:49 [INFO] Found a total of 230 fully identical images (d>0.990), which are 0.11 %\n",
"2023-07-11 15:27:49 [INFO] Found a total of 88 nearly identical images(d>0.980), which are 0.04 %\n",
"2023-07-11 15:27:49 [INFO] Found a total of 5296 above threshold images (d>0.900), which are 2.62 %\n",
"2023-07-11 15:27:49 [INFO] Found a total of 10103 outlier images (d<0.050), which are 5.00 %\n",
"2023-07-11 15:27:49 [INFO] Min distance found 0.379 max distance 1.000\n",
"2023-07-11 15:27:49 [INFO] Running connected components for ccthreshold 0.900000 \n",
".0\n",
" ########################################################################################\n",
"\n",
"Dataset Analysis Summary: \n",
"\n",
" Dataset contains 101000 images\n",
" Valid images are 100.00% (101,000) of the data, invalid are 0.00% (0) of the data\n",
" Similarity: 1.93% (1,946) belong to 23 similarity clusters (components).\n",
" 98.07% (99,054) images do not belong to any similarity cluster.\n",
" Largest cluster has 2,572 (2.55%) images.\n",
" For a detailed analysis, use `.connected_components()`\n",
"(similarity threshold used is 0.9, connected component threshold used is 0.9).\n",
"\n",
" Outliers: 5.97% (6,031) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n",
" For a detailed list of outliers, use `.outliers()`.\n"
]
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fd.run(num_images=40000, ccthreshold=0.9) # runs fastdup on a subset of 40000 images from the dataset\n",
"fd.run(ccthreshold=0.9) # runs fastdup on the entire dataset"
]
},
{
"cell_type": "markdown",
"id": "1a4ec356",
"metadata": {},
"source": [
"> **Note**: `ccthreshold` is a parameter of the connected components algorithm. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup#threshold-for-similarity-clusters) on how to set an appropriate value for your dataset."
]
},
{
"cell_type": "markdown",
"id": "fa3b8707",
"metadata": {},
"source": [
"Get a summary of the run showing potentially problematic files."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8ec0a195",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" ########################################################################################\n",
"\n",
"Dataset Analysis Summary: \n",
"\n",
" Dataset contains 101000 images\n",
" Valid images are 100.00% (101,000) of the data, invalid are 0.00% (0) of the data\n",
" Similarity: 1.93% (1,946) belong to 23 similarity clusters (components).\n",
" 98.07% (99,054) images do not belong to any similarity cluster.\n",
" Largest cluster has 2,572 (2.55%) images.\n",
" For a detailed analysis, use `.connected_components()`\n",
"(similarity threshold used is 0.9, connected component threshold used is 0.9).\n",
"\n",
" Outliers: 5.97% (6,031) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n",
" For a detailed list of outliers, use `.outliers()`.\n"
]
},
{
"data": {
"text/plain": [
"['Dataset contains 101000 images',\n",
" 'Valid images are 100.00% (101,000) of the data, invalid are 0.00% (0) of the data',\n",
" 'Similarity: 1.93% (1,946) belong to 23 similarity clusters (components).',\n",
" '98.07% (99,054) images do not belong to any similarity cluster.',\n",
" 'Largest cluster has 2,572 (2.55%) images.',\n",
" 'For a detailed analysis, use `.connected_components()`\\n(similarity threshold used is 0.9, connected component threshold used is 0.9).\\n',\n",
" 'Outliers: 5.97% (6,031) of images are possible outliers, and fall in the bottom 5.00% of similarity values.',\n",
" 'For a detailed list of outliers, use `.outliers()`.']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fd.summary()"
]
},
{
"cell_type": "markdown",
"id": "be16bf24",
"metadata": {
"tags": []
},
"source": [
"## Broken Images\n",
"\n",
"The lowest-hanging fruit is to find the broken images and remove them from your dataset. These are most likely corrupted files that could not be loaded."
]
},
{
"cell_type": "markdown",
"id": "9754a904",
"metadata": {},
"source": [
"To get the broken images, simply run:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "71f062ba",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Empty DataFrame\n",
"Columns: [filename, index, error_code, is_valid, fd_index]\n",
"Index: []"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"broken_images = fd.invalid_instances()\n",
"broken_images"
]
},
{
"cell_type": "markdown",
"id": "1625710f",
"metadata": {},
"source": [
"This dataset is carefully curated, so we did not find any broken images, which is great!"
]
},
{
"cell_type": "markdown",
"id": "396e9f68",
"metadata": {},
"source": [
"## List of Broken Images\n",
"If there are broken images, however, you can easily get a list of their filenames."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2827cbcb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list_of_broken_images = broken_images['filename'].to_list()\n",
"list_of_broken_images"
]
},
{
"cell_type": "markdown",
"id": "ba78da12",
"metadata": {
"tags": []
},
"source": [
"## Duplicate Image Pairs\n",
"\n",
"Show a gallery of duplicate image pairs. A distance of `1.0` indicates that the two images in a pair are exact copies."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "516ee9fc",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 137.27it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stored similarity visual view in work_dir/galleries/duplicates.html\n"
]
},
{
"data": {
"text/html": [
"<h2>Duplicates Report</h2>\n",
"<table>\n",
"<tr><th>Distance</th><th>From</th><th>To</th></tr>\n",
"<tr><td>1.0</td><td>/fried_rice/2820757.jpg</td><td>/fried_rice/2899815.jpg</td></tr>\n",
"<tr><td>1.0</td><td>/chocolate_cake/55122.jpg</td><td>/chocolate_cake/51717.jpg</td></tr>\n",
"<tr><td>1.0</td><td>/grilled_salmon/606368.jpg</td><td>/grilled_salmon/599021.jpg</td></tr>\n",
"<tr><td>1.0</td><td>/paella/2199941.jpg</td><td>/paella/2199939.jpg</td></tr>\n",
"</table>"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fd.vis.duplicates_gallery(num_images=5)"
]
},
{
"cell_type": "markdown",
"id": "85cc68f7",
"metadata": {
"tags": []
},
"source": [
"## Image Clusters\n",
"\n",
"Visualize image clusters from the dataset.\n",
"\n",
"> **Note**: Setting `num_images=5` shows a gallery with 5 rows. Change this value to view more or fewer rows."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c08827bf",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 4.66it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg\n",
"Stored components visual view in work_dir/galleries/components.html\n",
"Execution time in seconds 7.7\n"
]
},
{
"data": {
"text/html": [
"<h2>Components Report</h2>\n",
"<p>Showing groups of similar images</p>\n",
"<table>\n",
"<tr><th>component</th><th>num_images</th><th>mean_distance</th></tr>\n",
"<tr><td>18214</td><td>754</td><td>0.9</td></tr>\n",
"<tr><td>24712</td><td>464</td><td>0.9</td></tr>\n",
"<tr><td>31543</td><td>139</td><td>0.9001</td></tr>\n",
"</table>\n",
"