\n",
" "
],
"text/plain": [
" filename col_x row_y width height label ext split\n",
"0 coco_minitrain_25k/images/train2017/000000131075.jpg 20.23 55.98 313.49 326.50 tv 0 train\n",
"1 coco_minitrain_25k/images/train2017/000000131075.jpg 176.90 381.12 286.20 136.63 laptop 0 train\n",
"2 coco_minitrain_25k/images/train2017/000000131075.jpg 369.96 361.35 72.76 73.91 laptop 0 train"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coco_annotations.head(3)"
]
},
{
"cell_type": "markdown",
"id": "1149696e",
"metadata": {
"id": "1149696e"
},
"source": [
"## Run fastdup"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "604e19f2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "604e19f2",
"outputId": "2c5cbb8e-3310-402a-82b9-497dd1897388",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n",
"fastdup C++ info received: 2023-05-20 04:46:25 [INFO] Going to loop over dir /tmp/tmpaeboyuub.csv\n",
"2023-05-20 04:46:26 [INFO] Found total 10000 images to run on, 10000 train, 0 test, name list 10000, counter 10000 \n",
"2023-05-20 04:48:59 [ERROR] Error: found invalid bounding box for image coco_minitrain_25k/images/train2017/000000528201.jpg. Please check bounding box file 264 341 0 5\n",
"Error: found invalid bounding box for image coco_minitrain_25k/images/train2017/000000528201.jpg. Please check bounding box file 264 341 0 5\n",
" \n",
"\n",
"FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n",
"fastdup C++ info received: 2023-05-20 04:50:46 [INFO] Going to loop over dir /tmp/crops_input.csv\n",
"2023-05-20 04:50:46 [INFO] Found total 9999 images to run on, 9999 train, 0 test, name list 9999, counter 9999 \n",
"2023-05-20 04:50:46 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:47 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - file does not existMissing file missing_file - file does not exist2023-05-20 04:50:48 [ERROR] Missing file missing_file - fil \n",
"\n",
"\n",
" ########################################################################################\n",
"\n",
"Dataset Analysis Summary: \n",
"\n",
" Dataset contains 183544 images\n",
" Valid images are 4.94% (9,067) of the data, invalid are 95.06% (174,477) of the data\n",
" For a detailed analysis, use `.invalid_instances()`.\n",
"\n",
" Similarity: 0.26% (476) belong to 5 similarity clusters (components).\n",
" 99.74% (183,068) images do not belong to any similarity cluster.\n",
" Largest cluster has 1,940 (1.06%) images.\n",
" For a detailed analysis, use `.connected_components()`\n",
"(similarity threshold used is 0.9, connected component threshold used is 0.96).\n",
"\n",
" Outliers: 0.67% (1,228) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n",
" For a detailed list of outliers, use `.outliers()`.\n"
]
}
],
"source": [
"# Run fastdup with annotations\n",
"# This may take a while on a colab node with 2 cores..\n",
"input_dir = '.'\n",
"work_dir = 'fastdup_minicoco'\n",
"\n",
"fd = fastdup.create(work_dir=work_dir, input_dir=input_dir)\n",
"fd.run(annotations=coco_annotations, overwrite=True, num_images=10000)"
]
},
{
"cell_type": "markdown",
"id": "3b4f5823",
"metadata": {
"id": "3b4f5823"
},
"source": [
"## Class distribution\n",
"The dataset contains 25k images and 183k objects, an average of 7.3 objects per image. \n",
"\n",
"Interestingly, we see a highly unbalanced class distribution, where all 80 coco classes are present here, but there is a strong balance towards the person class, that accounts for over 56k instances (30.6%). Car and Chair classes also contain over 8k instances each, while at the bottom of the list the toaster and hair drier classes contain as few as 40 instances. \n",
"\n",
"Using `Plotly` we get a useful interactive histogram. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f87b7057",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 542
},
"id": "f87b7057",
"outputId": "fd417b92-da68-4e00-982a-2f44f780b9e9"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# visualize outliers\n",
"fd.vis.outliers_gallery()"
]
},
{
"cell_type": "markdown",
"id": "c0f1fade",
"metadata": {
"id": "c0f1fade"
},
"source": [
"## Size and shape issues\n",
"Objects come in various shapes and sizes, and sometimes objects might be incorrectly labeled or too small to be useful. We will now find the smallest, narrowest and widest objects, and asses their usefulness. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a2d00424",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "a2d00424",
"outputId": "ba812fe8-0f9f-4e14-ad38-96b4bb751ac9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" col_x_x row_y_x width_x height_x label ext split index filename crop_filename col_x_y row_y_y width_y height_y error_code is_valid fd_index\n",
"0 20.23 55.98 313.49 326.50 tv 0 train 0 coco_minitrain_25k/images/train2017/000000131075.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000131075.jpg_20_55_313_326.jpg NaN NaN NaN NaN VALID True 0\n",
"1 176.90 381.12 286.20 136.63 laptop 0 train 1 coco_minitrain_25k/images/train2017/000000131075.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000131075.jpg_176_381_286_136.jpg NaN NaN NaN NaN VALID True 1\n",
"2 369.96 361.35 72.76 73.91 laptop 0 train 2 coco_minitrain_25k/images/train2017/000000131075.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000131075.jpg_369_361_72_73.jpg NaN NaN NaN NaN VALID True 2\n",
"3 411.68 417.87 66.32 129.44 chair 0 train 3 coco_minitrain_25k/images/train2017/000000131075.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000131075.jpg_411_417_66_129.jpg NaN NaN NaN NaN VALID True 3\n",
"4 367.31 363.25 72.27 67.01 tv 0 train 4 coco_minitrain_25k/images/train2017/000000131075.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000131075.jpg_367_363_72_67.jpg NaN NaN NaN NaN VALID True 4\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
":3: SettingWithCopyWarning:\n",
"\n",
"\n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
"\n",
":4: SettingWithCopyWarning:\n",
"\n",
"\n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
"\n"
]
}
],
"source": [
"annot = fd.annotations()\n",
"print(annot.head())\n",
"annot['area'] = annot['width_x'] * annot['height_x']\n",
"annot['aspect'] = annot['width_x'] / annot['height_x']"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "3298e003",
"metadata": {
"id": "3298e003"
},
"outputs": [],
"source": [
"# Smallest 5% of objects:\n",
"smallest_objects = annot[annot['area'] < annot['area'].quantile(0.05)].sort_values(by=['area'])\n",
"\n",
"# 5% of extreme aspect ratios\n",
"aspect_ratio_objects = annot[(annot['aspect'] < annot['aspect'].quantile(0.05))\n",
" | (annot['aspect'] > annot['aspect'].quantile(0.95))].sort_values(by=['aspect'])\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "a4470f45",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 207
},
"id": "a4470f45",
"outputId": "51e772c2-9306-4510-defb-71f21a98757a"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" "
],
"text/plain": [
" col_x_x row_y_x width_x height_x label ext split index filename crop_filename col_x_y row_y_y width_y height_y error_code is_valid fd_index area aspect\n",
"6006 89.05 212.44 486.91 24.63 train 0 train 6006 coco_minitrain_25k/images/train2017/000000397173.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000397173.jpg_89_212_486_24.jpg NaN NaN NaN NaN VALID True 6006 11992.5933 19.768981\n",
"2021 221.00 180.00 305.00 15.00 car 0 train 2021 coco_minitrain_25k/images/train2017/000000001408.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000001408.jpg_221_180_305_15.jpg NaN NaN NaN NaN VALID True 2021 4575.0000 20.333333\n",
"4261 33.00 216.00 602.00 18.00 boat 0 train 4261 coco_minitrain_25k/images/train2017/000000527098.jpg fastdup_minicoco/crops/coco_minitrain_25kimagestrain2017000000527098.jpg_33_216_602_18.jpg NaN NaN NaN NaN VALID True 4261 10836.0000 33.444444"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"aspect_ratio_objects.tail(3)"
]
},
{
"cell_type": "markdown",
"id": "9af6979b",
"metadata": {
"id": "9af6979b"
},
"source": [
"Look at that! The slices reveal many items that are either tiny (10x10 pixels) or have extreme aspect ratios - as extreme at 1:45 - an object 601 pixels wide by only 13 pixels high. "
]
},
{
"cell_type": "markdown",
"id": "5f4d7cc1",
"metadata": {
"id": "5f4d7cc1"
},
"source": [
"## Objects that didn't make the cut:\n",
"Let's look at objects deemed invalid by fastdup. These are either objects that are too small to be useful in our analysis (smaller than 10px), have bouding boxes with illeagal values (negative or beyond image boundaries), or are part of images that are missing. We can tell which is which by the `error_code` column in our dataframe."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "6b030732",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"id": "6b030732",
"outputId": "04f80f66-aaeb-4563-e428-87d5dbcb818c"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
col_x_x
\n",
"
row_y_x
\n",
"
width_x
\n",
"
height_x
\n",
"
label
\n",
"
ext
\n",
"
split
\n",
"
index
\n",
"
filename
\n",
"
crop_filename
\n",
"
col_x_y
\n",
"
row_y_y
\n",
"
width_y
\n",
"
height_y
\n",
"
error_code
\n",
"
is_valid
\n",
"
fd_index
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
437.17
\n",
"
244.79
\n",
"
19.52
\n",
"
9.93
\n",
"
mouse
\n",
"
0
\n",
"
train
\n",
"
16
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
ERROR_BAD_BOUNDING_BOX
\n",
"
False
\n",
"
16
\n",
"
\n",
"
\n",
"
1
\n",
"
137.84
\n",
"
332.22
\n",
"
8.92
\n",
"
11.50
\n",
"
person
\n",
"
0
\n",
"
train
\n",
"
60
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
ERROR_BAD_BOUNDING_BOX
\n",
"
False
\n",
"
60
\n",
"
\n",
"
\n",
"
2
\n",
"
177.35
\n",
"
294.13
\n",
"
5.32
\n",
"
11.92
\n",
"
person
\n",
"
0
\n",
"
train
\n",
"
65
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
ERROR_BAD_BOUNDING_BOX
\n",
"
False
\n",
"
65
\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" col_x_x row_y_x width_x height_x label ext split index filename crop_filename col_x_y row_y_y width_y height_y error_code is_valid fd_index\n",
"0 437.17 244.79 19.52 9.93 mouse 0 train 16 NaN NaN NaN NaN NaN NaN ERROR_BAD_BOUNDING_BOX False 16\n",
"1 137.84 332.22 8.92 11.50 person 0 train 60 NaN NaN NaN NaN NaN NaN ERROR_BAD_BOUNDING_BOX False 60\n",
"2 177.35 294.13 5.32 11.92 person 0 train 65 NaN NaN NaN NaN NaN NaN ERROR_BAD_BOUNDING_BOX False 65"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fd.invalid_instances().head(3)"
]
},
{
"cell_type": "markdown",
"id": "6d1196e3",
"metadata": {
"id": "6d1196e3"
},
"source": [
"## Distribution of error codes:\n",
"A simple `value_counts` will tell us the distribution of the errors. We have found 18,592 (!) bounding boxes that are either too small or go beyond image boundaries. This is 10% of the data! Filtering them would both save us grusome debugging of training errors and failures and help up provide the model with useful size objects. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "3d5350cf",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3d5350cf",
"outputId": "5b8b41f4-3227-4ed5-aaba-5624f4c3f433"
},
"outputs": [
{
"data": {
"text/plain": [
"ERROR_MISSING_FILE 173544\n",
"ERROR_BAD_BOUNDING_BOX 933\n",
"Name: error_code, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fd.invalid_instances()['error_code'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "39e4ee9b",
"metadata": {
"id": "39e4ee9b"
},
"source": [
"## Find possible mislabels\n",
"The fastdup similarity search and gallery is a strong tool for finding objects that are possibly mislabeled. By finding each object's nearest neighbors and their classes, we can find objects with classes contradicting their neighbors' - a strong sign for mislabels."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "f5dea401",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "f5dea401",
"outputId": "50d8eb14-1cbd-4ae4-a0aa-fcbd360936dc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"laptop\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 25/25 [00:00<00:00, 77.16it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Finished OK. Components are stored as image files fastdup_minicoco/galleries/components_[index].jpg\n",
"Stored components visual view in fastdup_minicoco/galleries/components.html\n",
"Execution time in seconds 1.9\n"
]
},
{
"data": {
"text/html": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" Components Report\n",
" \n",
" \n",
"\n",
"\n",
"\n",
" \n",
" \n",
"
Showing groups of similar images, from different classes
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
4244
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
"
\n",
"
pizza
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7021
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bowl
\n",
"
1
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
5016
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
couch
\n",
"
1
\n",
"
\n",
"
\n",
"
dog
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
8428
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
"
\n",
"
knife
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
2495
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bus
\n",
"
1
\n",
"
\n",
"
\n",
"
truck
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
6210
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
chair
\n",
"
1
\n",
"
\n",
"
\n",
"
couch
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
6395
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
person
\n",
"
1
\n",
"
\n",
"
\n",
"
sandwich
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
6632
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
person
\n",
"
1
\n",
"
\n",
"
\n",
"
skis
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7228
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bed
\n",
"
1
\n",
"
\n",
"
\n",
"
person
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
3191
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
remote
\n",
"
1
\n",
"
\n",
"
\n",
"
wine glass
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7112
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
1.0
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
"
\n",
"
pizza
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
4647
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9908
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
fork
\n",
"
1
\n",
"
\n",
"
\n",
"
spoon
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
1245
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9906
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
cat
\n",
"
1
\n",
"
\n",
"
\n",
"
sink
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
9578
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9889
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bed
\n",
"
1
\n",
"
\n",
"
\n",
"
dog
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7706
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.987
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
cow
\n",
"
1
\n",
"
\n",
"
\n",
"
horse
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7129
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9863
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
hot dog
\n",
"
1
\n",
"
\n",
"
\n",
"
sandwich
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
7399
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.986
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
cat
\n",
"
1
\n",
"
\n",
"
\n",
"
tv
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
891
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9853
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
hot dog
\n",
"
1
\n",
"
\n",
"
\n",
"
sandwich
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
8309
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9846
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
hot dog
\n",
"
1
\n",
"
\n",
"
\n",
"
sandwich
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
526
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9831
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bottle
\n",
"
1
\n",
"
\n",
"
\n",
"
refrigerator
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
348
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9829
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bench
\n",
"
1
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
5727
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9827
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
cake
\n",
"
1
\n",
"
\n",
"
\n",
"
dining table
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
6040
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9817
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bowl
\n",
"
1
\n",
"
\n",
"
\n",
"
cup
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
4834
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9794
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
apple
\n",
"
1
\n",
"
\n",
"
\n",
"
orange
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Info
\n",
"
\n",
"
\n",
"
component
\n",
"
1041
\n",
"
\n",
"
\n",
"
num_images
\n",
"
2
\n",
"
\n",
"
\n",
"
mean_distance
\n",
"
0.9794
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
"
Label
\n",
"
\n",
"
\n",
"
bed
\n",
"
1
\n",
"
\n",
"
\n",
"
couch
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fd.vis.component_gallery(num_images=25, slice='diff')"
]
},
{
"cell_type": "markdown",
"id": "0bd821f1",
"metadata": {},
"source": [
"## Wrap Up\n",
"\n",
"Next, feel free to check out other tutorials -\n",
"\n",
"+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!\n",
"+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n",
"+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!\n",
"+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. "
]
},
{
"cell_type": "markdown",
"id": "HERGxWSMSDh0",
"metadata": {
"id": "HERGxWSMSDh0"
},
"source": [
"\n",
"## VL Profiler\n",
"If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com) - VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser. \n",
"\n",
"[Sign up](https://app.visual-layer.com) now, it's free.\n",
"\n",
"[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)\n",
"\n",
"As usual, feedback is welcome! \n",
"\n",
"Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues)."
]
},
{
"cell_type": "markdown",
"id": "25d5e96f",
"metadata": {},
"source": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"vscode": {
"interpreter": {
"hash": "5b6e8fba36db23bc4d54e0302cd75fdd75c29d9edcbab68d6cfc74e7e4b30305"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}