{ "cells": [ { "cell_type": "markdown", "id": "201bf295-3c31-4348-9429-893dcab6be94", "metadata": {}, "source": [ "
\n", " \n", " \n", " \n", " \n", " \"vl\n", " \n", "
\n", " GitHub •\n", " Join Discord Community •\n", " Discussion Forum \n", "
\n", "\n", "
\n", " Blog •\n", " Documentation •\n", " About Us \n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "
\n", " \n", " \"site\"\n", " \n", " \"blog\"\n", " \n", " \"github\"\n", " \n", " \"slack\"\n", " \n", " \"linkedin\"\n", " \n", " \"youtube\"\n", " \n", " \"twitter\"\n", "
\n", "
" ] }, { "cell_type": "markdown", "id": "pN6wiKBax7Pa", "metadata": { "id": "pN6wiKBax7Pa", "tags": [] }, "source": [ "# Finding and Removing Duplicates\n", "\n", "[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb)\n", "[![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-duplicates.ipynb)\n", "[![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/finding-removing-duplicates)\n", "\n", "This notebook shows how to analyze an image dataset for duplicates and near-duplicates using [fastdup](https://github.com/visual-layer/fastdup)." ] }, { "cell_type": "markdown", "id": "c0727302-dbe5-46b3-a5ff-b039811a7e7e", "metadata": { "tags": [] }, "source": [ "## Installation\n", "First, let's start with the installation:\n", "\n", "> ✅ **Tip** - If you're new to fastdup, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/quick-dataset-analysis.ipynb) for the best experience. If you'd just like to skim through the notebook, we recommend viewing it with [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb). 
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8e6dd3e6-0f72-456b-9b16-2e53d5d5c099", "metadata": {}, "outputs": [], "source": [ "!pip install fastdup -Uq" ] }, { "cell_type": "markdown", "id": "488abfbf", "metadata": {}, "source": [ "Now, test the installation by printing out the version. If there's no error message, we are ready to go!" ] }, { "cell_type": "code", "execution_count": 1, "id": "e301485f", "metadata": { "id": "e301485f", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'2.0.21'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "fastdup.__version__" ] }, { "cell_type": "markdown", "id": "2d30a901-4ba8-48cf-9a2f-37e0f70fa1ae", "metadata": { "tags": [] }, "source": [ "## Download Dataset\n", "\n", "For demonstration, we will use the generally well-curated [Oxford-IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/). Feel free to swap this dataset with your own.\n", "\n", "The dataset consists of images and annotations for 37 pet categories, with roughly 200 images per class.\n", "\n", "> 🗒 **Note** - fastdup works on both unlabeled and labeled images. For now, we are only interested in finding issues in the images, not the annotations. 
\n", "> If you're interested in finding annotation issues, head to:\n", "> + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)\n", "> + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb).\n", "\n", "\n", "Let's download only the images from the dataset and extract them into the local directory:" ] }, { "cell_type": "code", "execution_count": null, "id": "d91abfc1", "metadata": {}, "outputs": [], "source": [ "!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz\n", "!tar xf images.tar.gz" ] }, { "cell_type": "markdown", "id": "8cd8a7da-2e05-4c38-aa37-33fd466a61e2", "metadata": { "tags": [] }, "source": [ "## Run fastdup\n", "\n", "Once the extraction completes, we can run fastdup on the images.\n", "\n", "To do that, let's initialize fastdup and specify the input directory, which points to the folder of images." ] }, { "cell_type": "code", "execution_count": 2, "id": "fe4d8211-89b2-4a2f-91f4-8074d2314aef", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: fastdup create() without work_dir argument, output is stored in a folder named work_dir in your current working path.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "fastdup By Visual Layer, Inc. 2024. All rights reserved.\n", "\n", "A fastdup dataset object was created!\n", "\n", "Input directory is set to \u001b[0;35m\"images\"\u001b[0m\n", "Work directory is set to \u001b[0;35m\"work_dir\"\u001b[0m\n", "\n", "The next steps are:\n", " 1. Analyze your dataset with the \u001b[0;35m.run()\u001b[0m function of the dataset object\n", " 2. 
Interactively explore your data on your local machine with the \u001b[0;35m.explore()\u001b[0m function of the dataset object\n", "\n", "For more information, use \u001b[0;35mhelp(fastdup)\u001b[0m or check our documentation https://docs.visual-layer.com/docs/getting-started-with-fastdup.\n", "\n" ] } ], "source": [ "fd = fastdup.create(input_dir=\"images/\")" ] }, { "cell_type": "markdown", "id": "4acb64a1-ab06-4fa2-8111-65b5d4f2a335", "metadata": {}, "source": [ "> 🗒 **Note** - The `.create` method also has an optional `work_dir` parameter which specifies the directory to store artifacts from the run.\n", "\n", "In other words you can run `fastdup.create(input_dir=\"images/\", work_dir=\"my_work_dir/\")` if you'd like to store the artifacts in a `my_work_dir`.\n", "\n", "Now, let's run fastdup." ] }, { "cell_type": "code", "execution_count": null, "id": "2ef4b508-62ac-4eae-a9f7-5c8370a9c623", "metadata": {}, "outputs": [], "source": [ "fd.run()" ] }, { "cell_type": "markdown", "id": "24b9d94d-7458-42f0-bf77-1b33491279f2", "metadata": {}, "source": [ "## View Run Summary\n", "\n", "After the run is completed, you can optionally view the summary with:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b546398f-e555-42b7-83ad-fd9ba9286d41", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " ########################################################################################\n", "\n", "Dataset Analysis Summary: \n", "\n", " Dataset contains 7390 images\n", " Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data\n", " For a detailed analysis, use `.invalid_instances()`.\n", "\n", " Components: failed to find images clustered into components, try to run with lower cc_threshold.\n", " Outliers: 6.14% (454) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n", " For a detailed list of outliers, use `.outliers()`.\n", "\n" ] }, { "data": { 
"text/plain": [ "['Dataset contains 7390 images',\n", " 'Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data',\n", " 'For a detailed analysis, use `.invalid_instances()`.\\n',\n", " 'Components: failed to find images clustered into components, try to run with lower cc_threshold.',\n", " 'Outliers: 6.14% (454) of images are possible outliers, and fall in the bottom 5.00% of similarity values.',\n", " 'For a detailed list of outliers, use `.outliers()`.\\n']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.summary()" ] }, { "cell_type": "markdown", "id": "8149c870", "metadata": {}, "source": [ "## Removing Duplicates\n", "\n", "Let's first get the information about which cluster each image belongs to." ] }, { "cell_type": "code", "execution_count": 5, "id": "56a64f4c-f873-4838-91dc-31b0ecf9f051", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " index component_id count mean_distance min_distance max_distance filename error_code is_valid fd_index\n", "0 21 21 4 0.968095 0.968095 0.968095 images/Abyssinian_11.jpg VALID True 21\n", "1 80 21 4 0.968095 0.968095 0.968095 images/Abyssinian_177.jpg VALID True 80\n", "2 162 161 4 0.977554 0.977554 0.977554 images/Abyssinian_66.jpg VALID True 162\n", "3 180 161 4 0.977554 0.977554 0.977554 images/Abyssinian_82.jpg VALID True 180\n", "4 1307 1305 4 0.994964 0.994964 0.994964 images/Birman_199.jpg VALID True 1307\n", ".. ... ... ... ... ... ... ... ... ... ...\n", "143 6493 6414 4 0.999907 0.999907 0.999907 images/Siamese_203.jpg VALID True 6493\n", "144 7272 7199 4 0.960815 0.960815 0.960815 images/yorkshire_terrier_175.jpg VALID True 7272\n", "145 7274 7201 4 0.962909 0.962909 0.962909 images/yorkshire_terrier_177.jpg VALID True 7274\n", "146 7278 7201 4 0.962909 0.962909 0.962909 images/yorkshire_terrier_180.jpg VALID True 7278\n", "147 7280 7199 4 0.960815 0.960815 0.960815 images/yorkshire_terrier_182.jpg VALID True 7280\n", "\n", "[148 rows x 10 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "connected_components_df , _ = fd.connected_components()\n", "connected_components_df" ] }, { "cell_type": "markdown", "id": "bfab311d-2e83-4b1a-bfb2-a5e610e5f449", "metadata": {}, "source": [ "Duplicates are stored in a cluster (`component_id`). Let's group the images based on the `component_id`." 
] }, { "cell_type": "code", "execution_count": 6, "id": "b11df26d-ab20-4b8f-adf9-f3b5ee2f428a", "metadata": {}, "outputs": [], "source": [ "duplicates_df = (\n", " connected_components_df\n", " .groupby('component_id')\n", " .agg(\n", " filenames=('filename', list),\n", " count=('filename', 'size'),\n", " mean_distance=('mean_distance', 'mean')\n", " )\n", " .sort_values('mean_distance', ascending=False)\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "id": "fd72435e-22ee-4465-9a91-7f51e8ce519d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " filenames count mean_distance\n", "component_id \n", "2420 [images/english_cocker_spaniel_152.jpg, images/english_cocker_spaniel_163.jpg] 2 1.000000\n", "1484 [images/Bombay_194.jpg, images/Bombay_32.jpg] 2 1.000000\n", "1487 [images/Bombay_200.jpg, images/Bombay_85.jpg] 2 1.000000\n", "1488 [images/Bombay_201.jpg, images/Bombay_92.jpg] 2 1.000000\n", "1489 [images/Bombay_202.jpg, images/Bombay_99.jpg] 2 1.000000\n", "... ... ... ...\n", "1459 [images/Bombay_166.jpg, images/Bombay_177.jpg] 2 0.962989\n", "7201 [images/yorkshire_terrier_177.jpg, images/yorkshire_terrier_180.jpg] 2 0.962909\n", "7199 [images/yorkshire_terrier_175.jpg, images/yorkshire_terrier_182.jpg] 2 0.960815\n", "2959 [images/great_pyrenees_103.jpg, images/great_pyrenees_99.jpg] 2 0.960410\n", "5266 [images/Ragdoll_33.jpg, images/Ragdoll_34.jpg] 2 0.960083\n", "\n", "[73 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicates_df" ] }, { "cell_type": "markdown", "id": "ff28b176", "metadata": {}, "source": [ "Above, we see that there are 73 clusters. Each cluster represents a set of images that are duplicates or near-duplicates of each other." ] }, { "cell_type": "markdown", "id": "95f8be06", "metadata": {}, "source": [ "Now, let's simplify the above dataframe by keeping only the first image from each cluster and treat the rest as duplicates." ] }, { "cell_type": "code", "execution_count": 8, "id": "fb9607ce-9100-452c-9e99-62f2800051ba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " image duplicates\n", "component_id \n", "2420 images/english_cocker_spaniel_152.jpg [images/english_cocker_spaniel_163.jpg]\n", "1484 images/Bombay_194.jpg [images/Bombay_32.jpg]\n", "1487 images/Bombay_200.jpg [images/Bombay_85.jpg]\n", "1488 images/Bombay_201.jpg [images/Bombay_92.jpg]\n", "1489 images/Bombay_202.jpg [images/Bombay_99.jpg]\n", "... ... ...\n", "1459 images/Bombay_166.jpg [images/Bombay_177.jpg]\n", "7201 images/yorkshire_terrier_177.jpg [images/yorkshire_terrier_180.jpg]\n", "7199 images/yorkshire_terrier_175.jpg [images/yorkshire_terrier_182.jpg]\n", "2959 images/great_pyrenees_103.jpg [images/great_pyrenees_99.jpg]\n", "5266 images/Ragdoll_33.jpg [images/Ragdoll_34.jpg]\n", "\n", "[73 rows x 2 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "def extract_image_duplicates(row):\n", "    # Keep the first file of each cluster; treat the rest as duplicates\n", "    filenames = row['filenames']\n", "    image = filenames[0]\n", "    duplicates = filenames[1:]\n", "    return pd.Series({'image': image, 'duplicates': duplicates})\n", "\n", "df = duplicates_df.apply(extract_image_duplicates, axis=1)\n", "df" ] }, { "cell_type": "markdown", "id": "ee1f189d", "metadata": {}, "source": [ "## Visualizing Duplicates\n", "\n", "The following steps are optional. They render a thumbnail of each kept image next to its duplicates so you can check the clusters visually." 
] }, { "cell_type": "code", "execution_count": 9, "id": "6b49dbce", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 imageduplicatesimage_previewduplicates_preview
component_id    
2420images/english_cocker_spaniel_152.jpg['images/english_cocker_spaniel_163.jpg']
1484images/Bombay_194.jpg['images/Bombay_32.jpg']
1487images/Bombay_200.jpg['images/Bombay_85.jpg']
1488images/Bombay_201.jpg['images/Bombay_92.jpg']
1489images/Bombay_202.jpg['images/Bombay_99.jpg']
1587images/boxer_114.jpg['images/boxer_82.jpg']
2179images/Egyptian_Mau_10.jpg['images/Egyptian_Mau_183.jpg']
2203images/Egyptian_Mau_131.jpg['images/Egyptian_Mau_202.jpg']
2288images/Egyptian_Mau_224.jpg['images/Egyptian_Mau_71.jpg']
2419images/english_cocker_spaniel_151.jpg['images/english_cocker_spaniel_162.jpg']
2422images/english_cocker_spaniel_154.jpg['images/english_cocker_spaniel_164.jpg']
2443images/english_cocker_spaniel_176.jpg['images/english_cocker_spaniel_179.jpg']
3691images/keeshond_54.jpg['images/keeshond_59.jpg']
4377images/newfoundland_137.jpg['images/newfoundland_153.jpg']
4378images/newfoundland_138.jpg['images/newfoundland_154.jpg']
4379images/newfoundland_139.jpg['images/newfoundland_155.jpg']
4388images/newfoundland_147.jpg['images/newfoundland_152.jpg']
1485images/Bombay_198.jpg['images/Bombay_69.jpg']
2277images/Egyptian_Mau_210.jpg['images/Egyptian_Mau_41.jpg']
1483images/Bombay_193.jpg['images/Bombay_22.jpg']
1429images/Bombay_131.jpg['images/Bombay_217.jpg']
1397images/Bombay_100.jpg['images/Bombay_11.jpg', 'images/Bombay_192.jpg']
1458images/Bombay_164.jpg['images/Bombay_189.jpg']
1399images/Bombay_102.jpg['images/Bombay_203.jpg']
1478images/Bombay_185.jpg['images/Bombay_190.jpg']
1406images/Bombay_109.jpg['images/Bombay_206.jpg']
1416images/Bombay_118.jpg['images/Bombay_209.jpg']
1419images/Bombay_121.jpg['images/Bombay_210.jpg']
1423images/Bombay_126.jpg['images/Bombay_220.jpg']
3550images/keeshond_103.jpg['images/keeshond_167.jpg']
3626images/keeshond_175.jpg['images/keeshond_27.jpg']
3548images/keeshond_101.jpg['images/keeshond_162.jpg']
3592images/keeshond_141.jpg['images/keeshond_47.jpg']
3394images/japanese_chin_137.jpg['images/japanese_chin_85.jpg']
2814images/german_shorthaired_150.jpg['images/german_shorthaired_3.jpg']
3604images/keeshond_152.jpg['images/keeshond_99.jpg']
3398images/japanese_chin_140.jpg['images/japanese_chin_88.jpg']
2845images/german_shorthaired_179.jpg['images/german_shorthaired_19.jpg']
3450images/japanese_chin_188.jpg['images/japanese_chin_78.jpg']
3449images/japanese_chin_187.jpg['images/japanese_chin_200.jpg']
3417images/japanese_chin_158.jpg['images/japanese_chin_81.jpg']
3834images/leonberger_187.jpg['images/leonberger_1.jpg']
2847images/german_shorthaired_180.jpg['images/german_shorthaired_20.jpg']
3437images/japanese_chin_176.jpg['images/japanese_chin_20.jpg']
4390images/newfoundland_149.jpg['images/newfoundland_2.jpg']
3036images/great_pyrenees_173.jpg['images/great_pyrenees_89.jpg']
6414images/Siamese_196.jpg['images/Siamese_203.jpg']
3457images/japanese_chin_194.jpg['images/japanese_chin_79.jpg']
3551images/keeshond_104.jpg['images/keeshond_170.jpg']
3591images/keeshond_140.jpg['images/keeshond_97.jpg']
3836images/leonberger_189.jpg['images/leonberger_2.jpg']
2208images/Egyptian_Mau_138.jpg['images/Egyptian_Mau_219.jpg']
1824images/British_Shorthair_160.jpg['images/British_Shorthair_278.jpg']
1433images/Bombay_136.jpg['images/Bombay_150.jpg']
1553images/Bombay_79.jpg['images/Bombay_97.jpg']
1305images/Birman_199.jpg['images/Birman_25.jpg']
1512images/Bombay_38.jpg['images/Bombay_57.jpg']
6103images/scottish_terrier_78.jpg['images/scottish_terrier_94.jpg']
1851images/British_Shorthair_186.jpg['images/British_Shorthair_271.jpg']
2255images/Egyptian_Mau_186.jpg['images/Egyptian_Mau_6.jpg']
1442images/Bombay_146.jpg['images/Bombay_82.jpg']
1411images/Bombay_113.jpg['images/Bombay_157.jpg']
161images/Abyssinian_66.jpg['images/Abyssinian_82.jpg']
1404images/Bombay_107.jpg['images/Bombay_132.jpg', 'images/Bombay_19.jpg']
1412images/Bombay_114.jpg['images/Bombay_139.jpg']
5188images/Ragdoll_160.jpg['images/Ragdoll_161.jpg']
21images/Abyssinian_11.jpg['images/Abyssinian_177.jpg']
5592images/saint_bernard_157.jpg['images/saint_bernard_158.jpg']
1459images/Bombay_166.jpg['images/Bombay_177.jpg']
7201images/yorkshire_terrier_177.jpg['images/yorkshire_terrier_180.jpg']
7199images/yorkshire_terrier_175.jpg['images/yorkshire_terrier_182.jpg']
2959images/great_pyrenees_103.jpg['images/great_pyrenees_99.jpg']
5266images/Ragdoll_33.jpg['images/Ragdoll_34.jpg']
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import base64\n", "from io import BytesIO\n", "from PIL import Image\n", "\n", "def resize_and_encode_image(image_path, width=100):\n", "    with Image.open(image_path) as img:\n", "        # Scale the height to preserve the aspect ratio\n", "        wpercent = (width / float(img.size[0]))\n", "        height = int((float(img.size[1]) * float(wpercent)))\n", "        resized_img = img.resize((width, height))\n", "        buffered = BytesIO()\n", "        resized_img.save(buffered, format=\"PNG\")\n", "        encoded_string = base64.b64encode(buffered.getvalue()).decode('utf-8')\n", "        # Inline the thumbnail as a base64 data URI so it renders in the HTML table\n", "        return f'<img src=\"data:image/png;base64,{encoded_string}\">'\n", "\n", "def display_image_list(image_list, width=100):\n", "    if isinstance(image_list, list):\n", "        return ''.join([resize_and_encode_image(image, width) for image in image_list])\n", "    else:\n", "        return ''\n", "\n", "# Apply the resize_and_encode_image function to the 'image' column\n", "df['image_preview'] = df['image'].apply(lambda x: resize_and_encode_image(x, width=100))\n", "\n", "# Apply the display_image_list function to the 'duplicates' column\n", "df['duplicates_preview'] = df['duplicates'].apply(lambda x: display_image_list(x, width=100))\n", "\n", "display(df.style)" ] }, { "cell_type": "markdown", "id": "edcb94d6", "metadata": {}, "source": [ "## Get Duplicates List" ] }, { "cell_type": "code", "execution_count": 10, "id": "089ceddc-632a-4fdf-8535-3843ff120726", "metadata": {}, "outputs": [], "source": [ "duplicates_to_remove = df['duplicates'].tolist()" ] }, { "cell_type": "code", "execution_count": 11, "id": "a3ceaac2-b9e5-4a56-8bcd-c745e394f856", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['images/english_cocker_spaniel_163.jpg'],\n", " ['images/Bombay_32.jpg'],\n", " ['images/Bombay_85.jpg'],\n", " ['images/Bombay_92.jpg'],\n", " ['images/Bombay_99.jpg'],\n", " ['images/boxer_82.jpg'],\n", " ['images/Egyptian_Mau_183.jpg'],\n", " ['images/Egyptian_Mau_202.jpg'],\n", " ['images/Egyptian_Mau_71.jpg'],\n", " 
['images/english_cocker_spaniel_162.jpg'],\n", " ['images/english_cocker_spaniel_164.jpg'],\n", " ['images/english_cocker_spaniel_179.jpg'],\n", " ['images/keeshond_59.jpg'],\n", " ['images/newfoundland_153.jpg'],\n", " ['images/newfoundland_154.jpg'],\n", " ['images/newfoundland_155.jpg'],\n", " ['images/newfoundland_152.jpg'],\n", " ['images/Bombay_69.jpg'],\n", " ['images/Egyptian_Mau_41.jpg'],\n", " ['images/Bombay_22.jpg'],\n", " ['images/Bombay_217.jpg'],\n", " ['images/Bombay_11.jpg', 'images/Bombay_192.jpg'],\n", " ['images/Bombay_189.jpg'],\n", " ['images/Bombay_203.jpg'],\n", " ['images/Bombay_190.jpg'],\n", " ['images/Bombay_206.jpg'],\n", " ['images/Bombay_209.jpg'],\n", " ['images/Bombay_210.jpg'],\n", " ['images/Bombay_220.jpg'],\n", " ['images/keeshond_167.jpg'],\n", " ['images/keeshond_27.jpg'],\n", " ['images/keeshond_162.jpg'],\n", " ['images/keeshond_47.jpg'],\n", " ['images/japanese_chin_85.jpg'],\n", " ['images/german_shorthaired_3.jpg'],\n", " ['images/keeshond_99.jpg'],\n", " ['images/japanese_chin_88.jpg'],\n", " ['images/german_shorthaired_19.jpg'],\n", " ['images/japanese_chin_78.jpg'],\n", " ['images/japanese_chin_200.jpg'],\n", " ['images/japanese_chin_81.jpg'],\n", " ['images/leonberger_1.jpg'],\n", " ['images/german_shorthaired_20.jpg'],\n", " ['images/japanese_chin_20.jpg'],\n", " ['images/newfoundland_2.jpg'],\n", " ['images/great_pyrenees_89.jpg'],\n", " ['images/Siamese_203.jpg'],\n", " ['images/japanese_chin_79.jpg'],\n", " ['images/keeshond_170.jpg'],\n", " ['images/keeshond_97.jpg'],\n", " ['images/leonberger_2.jpg'],\n", " ['images/Egyptian_Mau_219.jpg'],\n", " ['images/British_Shorthair_278.jpg'],\n", " ['images/Bombay_150.jpg'],\n", " ['images/Bombay_97.jpg'],\n", " ['images/Birman_25.jpg'],\n", " ['images/Bombay_57.jpg'],\n", " ['images/scottish_terrier_94.jpg'],\n", " ['images/British_Shorthair_271.jpg'],\n", " ['images/Egyptian_Mau_6.jpg'],\n", " ['images/Bombay_82.jpg'],\n", " ['images/Bombay_157.jpg'],\n", " 
['images/Abyssinian_82.jpg'],\n", " ['images/Bombay_132.jpg', 'images/Bombay_19.jpg'],\n", " ['images/Bombay_139.jpg'],\n", " ['images/Ragdoll_161.jpg'],\n", " ['images/Abyssinian_177.jpg'],\n", " ['images/saint_bernard_158.jpg'],\n", " ['images/Bombay_177.jpg'],\n", " ['images/yorkshire_terrier_180.jpg'],\n", " ['images/yorkshire_terrier_182.jpg'],\n", " ['images/great_pyrenees_99.jpg'],\n", " ['images/Ragdoll_34.jpg']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicates_to_remove" ] }, { "cell_type": "markdown", "id": "98a0333c", "metadata": {}, "source": [ "## Interactive Exploration\n", "In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.\n", "\n", "To explore the dataset and issues interactively in a browser, run:" ] }, { "cell_type": "code", "execution_count": null, "id": "1f1c8b89-cf96-4130-b09e-b257904445d1", "metadata": {}, "outputs": [], "source": [ "fd.explore()" ] }, { "cell_type": "markdown", "id": "609b7114-9bae-46f5-be4d-0b86c920770e", "metadata": {}, "source": [ "> 🗒 **Note** - Interactive exploration currently requires you to sign up (for free). Alternatively, you can visualize the results in a non-interactive way using fastdup's built-in galleries.\n", "\n", "You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.\n", "\n", "\n", "![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)" ] }, { "cell_type": "markdown", "id": "6c3135e1", "metadata": {}, "source": [ "## Wrap Up\n", "\n", "That's a wrap! In this notebook, we showed how to run fastdup on a dataset or any folder of images. 
\n", "\n", "We've seen how to use fastdup to find:\n", "\n", "+ Broken images.\n", "+ Duplicate/near-duplicates.\n", "+ Outliers.\n", "+ Dark, bright and blurry images.\n", "+ Image clusters.\n", "\n", "Next, feel free to check out other tutorials -\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. \n", "\n", "As usual, feedback is welcome! Questions? 
Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).\n" ] }, { "cell_type": "markdown", "id": "6034a6ad-2aa2-454e-ad2d-bd320e7fe6bb", "metadata": {}, "source": [ "
Copyright © 2024 Visual Layer. All rights reserved.
\n", "
\n", "\n", "
" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }