{ "cells": [ { "cell_type": "markdown", "id": "201bf295-3c31-4348-9429-893dcab6be94", "metadata": {}, "source": [ "
\n", " \n", " \n", " \n", " \n", " \"vl\n", " \n", "
\n", " GitHub •\n", " Join Discord Community •\n", " Discussion Forum \n", "
\n", "\n", "
\n", " Blog •\n", " Documentation •\n", " About Us \n", "
\n", "\n", "\n", "
\n", "\n", "
\n", "
\n", " \n", " \"site\"\n", " \n", " \"blog\"\n", " \n", " \"github\"\n", " \n", " \"slack\"\n", " \n", " \"linkedin\"\n", " \n", " \"youtube\"\n", " \n", " \"twitter\"\n", "
\n", "
" ] }, { "cell_type": "markdown", "id": "pN6wiKBax7Pa", "metadata": { "id": "pN6wiKBax7Pa", "tags": [] }, "source": [ "# Finding and Removing Mislabels\n", "\n", "[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb)\n", "[![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb)\n", "[![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/finding-removing-mislabels)\n", "\n", "This notebook shows how to quickly analyze an image dataset for potential image mislabels and export the list of mislabeled images for further inspection." ] }, { "cell_type": "markdown", "id": "c0727302-dbe5-46b3-a5ff-b039811a7e7e", "metadata": { "tags": [] }, "source": [ "## Installation\n", "First, let's start with the installation:\n", "\n", "> ✅ **Tip** - If you're new to fastdup, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/quick-dataset-analysis.ipynb) for the best experience. If you'd like to just view and skim through the notebook, we recommend viewing using [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb). 
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8e6dd3e6-0f72-456b-9b16-2e53d5d5c099", "metadata": {}, "outputs": [], "source": [ "!pip install fastdup -Uq" ] }, { "cell_type": "markdown", "id": "488abfbf", "metadata": {}, "source": [ "Now, test the installation by printing out the version. If there's no error message, we are ready to go!" ] }, { "cell_type": "code", "execution_count": 1, "id": "e301485f", "metadata": { "id": "e301485f", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'2.0.21'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "fastdup.__version__" ] }, { "cell_type": "markdown", "id": "2d30a901-4ba8-48cf-9a2f-37e0f70fa1ae", "metadata": { "tags": [] }, "source": [ "## Download Dataset\n", "\n", "\n", "In this notebook, let's use the widely available and relatively well-curated [Food-101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) dataset.\n", "\n", "The Food-101 dataset consists of 101 food classes with 1,000 images per class. That is a total of 101,000 images.\n", "\n", "> 🗒 **Note** - fastdup works on both unlabeled and labeled images. But for now, we are only interested in finding issues in the images and not the annotations. 
\n", "> If you're interested in finding annotation issues, head to:\n", "> + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)\n", "> + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb).\n", "\n", "\n", "Let's download the dataset and extract it into the local directory:" ] }, { "cell_type": "code", "execution_count": null, "id": "d91abfc1", "metadata": {}, "outputs": [], "source": [ "!wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz\n", "!tar -xf food-101.tar.gz" ] }, { "cell_type": "markdown", "id": "1af55d41", "metadata": {}, "source": [ "## Create Annotations DataFrame\n", "\n", "The Food-101 dataset stores its images in folders named after their class. Let's use that structure to create a DataFrame with `filename` and `label` columns as the annotations." ] }, { "cell_type": "code", "execution_count": 2, "id": "19206aac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " filename label\n", "0 food-101/images/gnocchi/1642469.jpg gnocchi\n", "1 food-101/images/gnocchi/1598303.jpg gnocchi\n", "2 food-101/images/gnocchi/79585.jpg gnocchi\n", "3 food-101/images/gnocchi/2397771.jpg gnocchi\n", "4 food-101/images/gnocchi/2388954.jpg gnocchi\n", "... ... ...\n", "100995 food-101/images/bread_pudding/2415610.jpg bread_pudding\n", "100996 food-101/images/bread_pudding/723067.jpg bread_pudding\n", "100997 food-101/images/bread_pudding/1051348.jpg bread_pudding\n", "100998 food-101/images/bread_pudding/3607583.jpg bread_pudding\n", "100999 food-101/images/bread_pudding/1907181.jpg bread_pudding\n", "\n", "[101000 rows x 2 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "import pandas as pd\n", "\n", "dataset_dir = 'food-101/images/'\n", "\n", "filenames = []\n", "labels = []\n", "\n", "# Iterate over the directory and subdirectories\n", "for root, dirs, files in os.walk(dataset_dir):\n", "    # Skip the root directory\n", "    if root == dataset_dir:\n", "        continue\n", "\n", "    # The folder name is the class label\n", "    label = os.path.basename(root)\n", "\n", "    for filename in files:\n", "        filenames.append(os.path.join(root, filename))\n", "        labels.append(label)\n", "\n", "data = {'filename': filenames, 'label': labels}\n", "df = pd.DataFrame(data)\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "8cd8a7da-2e05-4c38-aa37-33fd466a61e2", "metadata": { "tags": [] }, "source": [ "## Run fastdup\n", "\n", "Once the extraction completes, we can run fastdup on the images.\n", "\n", "To do that, let's initialize fastdup and specify the input directory, which points to the folder of images.\n", "\n", "> 🗒 **Note** - The `.create` method also has an optional `work_dir` parameter which specifies the directory to store artifacts from the run.\n", "\n", "In other words, you can run `fastdup.create(input_dir=\"images/\", work_dir=\"my_work_dir/\")` if you'd like to store the artifacts in `my_work_dir`.\n", "\n", 
"Now, let's run fastdup." ] }, { "cell_type": "code", "execution_count": null, "id": "ddfcf3d5", "metadata": {}, "outputs": [], "source": [ "fd = fastdup.create(input_dir=\"food-101/images/\")\n", "fd.run(annotations=df)" ] }, { "cell_type": "code", "execution_count": 44, "id": "bb83a238-9618-4900-a907-49ddbb60eb2b", "metadata": {}, "outputs": [], "source": [ "outliers_df = fd.outliers()" ] }, { "cell_type": "code", "execution_count": 45, "id": "5aa3f292-58cf-4311-808f-8ca5076765fa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " outlier nearest distance filename_outlier label_outlier index_x error_code_outlier is_valid_outlier fd_index_outlier filename_nearest label_nearest index_y error_code_nearest is_valid_nearest fd_index_nearest\n", "0 75368 13490 0.379365 food-101/images/breakfast_burrito/462294.jpg breakfast_burrito 75368 VALID True 75368 food-101/images/tacos/1505262.jpg tacos 13490 VALID True 13490\n", "1 41508 16764 0.429240 food-101/images/macarons/2117640.jpg macarons 41508 VALID True 41508 food-101/images/fish_and_chips/2079080.jpg fish_and_chips 16764 VALID True 16764\n", "2 13490 19357 0.515785 food-101/images/tacos/1505262.jpg tacos 13490 VALID True 13490 food-101/images/red_velvet_cake/3143813.jpg red_velvet_cake 19357 VALID True 19357\n", "3 3049 98686 0.528563 food-101/images/shrimp_and_grits/1047420.jpg shrimp_and_grits 3049 VALID True 3049 food-101/images/club_sandwich/2465517.jpg club_sandwich 98686 VALID True 98686\n", "4 30949 65310 0.547157 food-101/images/sushi/3100962.jpg sushi 30949 VALID True 30949 food-101/images/deviled_eggs/3145324.jpg deviled_eggs 65310 VALID True 65310\n", "... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 
...\n", "6045 40611 40758 0.772242 food-101/images/chocolate_cake/2533462.jpg chocolate_cake 40611 VALID True 40611 food-101/images/chocolate_cake/652245.jpg chocolate_cake 40758 VALID True 40758\n", "6046 22826 66582 0.772263 food-101/images/dumplings/1325469.jpg dumplings 22826 VALID True 22826 food-101/images/escargots/1488896.jpg escargots 66582 VALID True 66582\n", "6047 96748 96668 0.772266 food-101/images/steak/513129.jpg steak 96748 VALID True 96748 food-101/images/steak/3113772.jpg steak 96668 VALID True 96668\n", "6048 61641 78643 0.772278 food-101/images/chocolate_mousse/1463326.jpg chocolate_mousse 61641 VALID True 61641 food-101/images/tiramisu/849295.jpg tiramisu 78643 VALID True 78643\n", "6049 84483 84139 0.772282 food-101/images/baby_back_ribs/645544.jpg baby_back_ribs 84483 VALID True 84483 food-101/images/baby_back_ribs/1571645.jpg baby_back_ribs 84139 VALID True 84139\n", "\n", "[6050 rows x 15 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outliers_df" ] }, { "cell_type": "code", "execution_count": 46, "id": "e86067b1-9c89-41ba-9a65-b9eec57f4de2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " filename_outlier filename_nearest distance label_outlier label_nearest\n", "0 food-101/images/breakfast_burrito/462294.jpg food-101/images/tacos/1505262.jpg 0.379365 breakfast_burrito tacos\n", "1 food-101/images/macarons/2117640.jpg food-101/images/fish_and_chips/2079080.jpg 0.429240 macarons fish_and_chips\n", "2 food-101/images/tacos/1505262.jpg food-101/images/red_velvet_cake/3143813.jpg 0.515785 tacos red_velvet_cake\n", "3 food-101/images/shrimp_and_grits/1047420.jpg food-101/images/club_sandwich/2465517.jpg 0.528563 shrimp_and_grits club_sandwich\n", "4 food-101/images/sushi/3100962.jpg food-101/images/deviled_eggs/3145324.jpg 0.547157 sushi deviled_eggs\n", "... ... ... ... ... ...\n", "6045 food-101/images/chocolate_cake/2533462.jpg food-101/images/chocolate_cake/652245.jpg 0.772242 chocolate_cake chocolate_cake\n", "6046 food-101/images/dumplings/1325469.jpg food-101/images/escargots/1488896.jpg 0.772263 dumplings escargots\n", "6047 food-101/images/steak/513129.jpg food-101/images/steak/3113772.jpg 0.772266 steak steak\n", "6048 food-101/images/chocolate_mousse/1463326.jpg food-101/images/tiramisu/849295.jpg 0.772278 chocolate_mousse tiramisu\n", "6049 food-101/images/baby_back_ribs/645544.jpg food-101/images/baby_back_ribs/1571645.jpg 0.772282 baby_back_ribs baby_back_ribs\n", "\n", "[6050 rows x 5 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outliers_df = outliers_df[['filename_outlier', 'filename_nearest', 'distance', 'label_outlier', 'label_nearest']]\n", "outliers_df" ] }, { "cell_type": "markdown", "id": "71f66a14", "metadata": {}, "source": [ "Let's select the top 30 outliers and display them." 
] }, { "cell_type": "code", "execution_count": 47, "id": "51462e5c", "metadata": {}, "outputs": [], "source": [ "outliers_df = outliers_df.head(30)" ] }, { "cell_type": "code", "execution_count": 48, "id": "784ee26d-265c-4383-8d95-543651e617f0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_54998/3709838206.py:21: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " outliers_df['filename_outlier_preview'] = outliers_df['filename_outlier'].apply(lambda x: display_image(x, width=100))\n", "/tmp/ipykernel_54998/3709838206.py:22: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " outliers_df['filename_nearest_preview'] = outliers_df['filename_nearest'].apply(lambda x: display_image(x, width=100))\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filename_outlier filename_nearest distance label_outlier label_nearest filename_outlier_preview filename_nearest_preview
0 food-101/images/breakfast_burrito/462294.jpg food-101/images/tacos/1505262.jpg 0.379365 breakfast_burrito tacos
1 food-101/images/macarons/2117640.jpg food-101/images/fish_and_chips/2079080.jpg 0.429240 macarons fish_and_chips
2 food-101/images/tacos/1505262.jpg food-101/images/red_velvet_cake/3143813.jpg 0.515785 tacos red_velvet_cake
3 food-101/images/shrimp_and_grits/1047420.jpg food-101/images/club_sandwich/2465517.jpg 0.528563 shrimp_and_grits club_sandwich
4 food-101/images/sushi/3100962.jpg food-101/images/deviled_eggs/3145324.jpg 0.547157 sushi deviled_eggs
5 food-101/images/pho/2399877.jpg food-101/images/hot_dog/1823010.jpg 0.573438 pho hot_dog
6 food-101/images/pho/1840846.jpg food-101/images/chocolate_mousse/456162.jpg 0.574433 pho chocolate_mousse
7 food-101/images/chocolate_cake/2518457.jpg food-101/images/paella/3838854.jpg 0.576987 chocolate_cake paella
8 food-101/images/tacos/1091159.jpg food-101/images/ice_cream/618711.jpg 0.583393 tacos ice_cream
9 food-101/images/red_velvet_cake/2894652.jpg food-101/images/red_velvet_cake/2750594.jpg 0.589379 red_velvet_cake red_velvet_cake
10 food-101/images/waffles/720603.jpg food-101/images/lasagna/1142842.jpg 0.591061 waffles lasagna
11 food-101/images/pad_thai/2614597.jpg food-101/images/apple_pie/2008772.jpg 0.592497 pad_thai apple_pie
12 food-101/images/prime_rib/587532.jpg food-101/images/poutine/529562.jpg 0.594438 prime_rib poutine
13 food-101/images/macarons/2591602.jpg food-101/images/cup_cakes/3299930.jpg 0.594465 macarons cup_cakes
14 food-101/images/hamburger/1608876.jpg food-101/images/cheese_plate/2206573.jpg 0.596191 hamburger cheese_plate
15 food-101/images/macaroni_and_cheese/912672.jpg food-101/images/falafel/2666983.jpg 0.596902 macaroni_and_cheese falafel
16 food-101/images/peking_duck/388951.jpg food-101/images/macarons/2710408.jpg 0.601192 peking_duck macarons
17 food-101/images/steak/2788759.jpg food-101/images/pulled_pork_sandwich/2098588.jpg 0.605568 steak pulled_pork_sandwich
18 food-101/images/ice_cream/1837798.jpg food-101/images/chocolate_cake/662729.jpg 0.610101 ice_cream chocolate_cake
19 food-101/images/grilled_salmon/795787.jpg food-101/images/prime_rib/3286982.jpg 0.611880 grilled_salmon prime_rib
20 food-101/images/miso_soup/881247.jpg food-101/images/fried_calamari/440673.jpg 0.615745 miso_soup fried_calamari
21 food-101/images/creme_brulee/1661605.jpg food-101/images/pork_chop/1569230.jpg 0.616932 creme_brulee pork_chop
22 food-101/images/ice_cream/1793992.jpg food-101/images/hot_dog/502977.jpg 0.619381 ice_cream hot_dog
23 food-101/images/cup_cakes/1005580.jpg food-101/images/chocolate_cake/2480326.jpg 0.622404 cup_cakes chocolate_cake
24 food-101/images/onion_rings/2447676.jpg food-101/images/donuts/921183.jpg 0.622859 onion_rings donuts
25 food-101/images/bread_pudding/1375816.jpg food-101/images/chocolate_mousse/2177988.jpg 0.624186 bread_pudding chocolate_mousse
26 food-101/images/chicken_curry/2523126.jpg food-101/images/pulled_pork_sandwich/1782028.jpg 0.625223 chicken_curry pulled_pork_sandwich
27 food-101/images/pho/3642399.jpg food-101/images/grilled_cheese_sandwich/1709486.jpg 0.628450 pho grilled_cheese_sandwich
28 food-101/images/cheesecake/2160930.jpg food-101/images/mussels/2039320.jpg 0.632786 cheesecake mussels
29 food-101/images/takoyaki/914304.jpg food-101/images/grilled_salmon/2429320.jpg 0.633307 takoyaki grilled_salmon
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import base64\n", "from io import BytesIO\n", "from PIL import Image\n", "\n", "def resize_and_encode_image(image_path, width=100):\n", "    with Image.open(image_path) as img:\n", "        # Scale the height to keep the aspect ratio at the requested width\n", "        wpercent = (width / float(img.size[0]))\n", "        height = int((float(img.size[1]) * float(wpercent)))\n", "        resized_img = img.resize((width, height))\n", "        buffered = BytesIO()\n", "        resized_img.save(buffered, format=\"PNG\")\n", "        encoded_string = base64.b64encode(buffered.getvalue()).decode('utf-8')\n", "        # Embed the PNG as a base64 data URI so it renders inline in the table\n", "        return f'<img src=\"data:image/png;base64,{encoded_string}\">'\n", "\n", "def display_image(image_path, width=100):\n", "    if isinstance(image_path, str):\n", "        return resize_and_encode_image(image_path, width)\n", "    else:\n", "        return ''\n", "\n", "# Work on a copy so the column assignments below do not trigger SettingWithCopyWarning\n", "outliers_df = outliers_df.copy()\n", "outliers_df['filename_outlier_preview'] = outliers_df['filename_outlier'].apply(lambda x: display_image(x, width=100))\n", "outliers_df['filename_nearest_preview'] = outliers_df['filename_nearest'].apply(lambda x: display_image(x, width=100))\n", "\n", "display(outliers_df.style)" ] }, { "cell_type": "markdown", "id": "d573ceac", "metadata": {}, "source": [ "Now we can export the results to a CSV file for further analysis and correction of labels." 
] }, { "cell_type": "code", "execution_count": 49, "id": "a263073e", "metadata": {}, "outputs": [], "source": [ "outliers_df.drop(columns=['filename_outlier_preview', 'filename_nearest_preview']).to_csv('outliers.csv', index=False)" ] }, { "cell_type": "markdown", "id": "98a0333c", "metadata": {}, "source": [ "## Interactive Exploration\n", "In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.\n", "\n", "To explore the dataset and issues interactively in a browser, run:" ] }, { "cell_type": "code", "execution_count": null, "id": "1f1c8b89-cf96-4130-b09e-b257904445d1", "metadata": {}, "outputs": [], "source": [ "fd.explore()" ] }, { "cell_type": "markdown", "id": "609b7114-9bae-46f5-be4d-0b86c920770e", "metadata": {}, "source": [ "> 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup in a non-interactive way using fastdup's built-in galleries.\n", "\n", "You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.\n", "\n", "\n", "![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)" ] }, { "cell_type": "markdown", "id": "6c3135e1", "metadata": {}, "source": [ "## Wrap Up\n", "\n", "That's a wrap! In this notebook, we showed how to find potential mislabels in a labeled dataset and export them for further inspection.\n", "\n", "\n", "Next, feel free to check out the other tutorials:\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. 
If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze a folder of images, clean it of potential issues, and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. \n", "\n", "As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).\n" ] }, { "cell_type": "markdown", "id": "6034a6ad-2aa2-454e-ad2d-bd320e7fe6bb", "metadata": {}, "source": [ "
\n", "
\n", "
\n", "
\n", " \"logo\"\n", "
Copyright © 2024 Visual Layer. All rights reserved.
\n", "
\n", "\n", "
" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }