[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# Analyzing Image Classification Dataset

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)

This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:

+ Duplicates
+ Outliers
+ Wrong labels
+ Image clusters


> **Note** - No GPU needed! You can run this notebook on a CPU-only instance.



## Installation

First let's install [fastdup](https://github.com/visual-layer/fastdup) from PyPI with:

In [1]:
!pip install -Uq fastdup

Now, test the installation. If there's no error message, we are ready to go.

In [2]:
import fastdup
fastdup.__version__


'1.26'

## Download Dataset

We will analyze the [Imagenette](https://github.com/fastai/imagenette) dataset, a subset of 10 easily classified classes from ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).

In [None]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
!tar -xf imagenette2-160.tgz

## Load and Format Annotations

In [3]:
import pandas as pd

In [4]:
data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'

In [5]:
label_map = {
    'n02979186': 'cassette_player', 
    'n03417042': 'garbage_truck', 
    'n01440764': 'tench', 
    'n02102040': 'English_springer', 
    'n03028079': 'church',
    'n03888257': 'parachute', 
    'n03394916': 'French_horn', 
    'n03000684': 'chain_saw', 
    'n03445777': 'golf_ball', 
    'n03425413': 'gas_pump'
}

Load the annotations provided with the dataset.

In [6]:
df_annot = pd.read_csv(csv_path)
df_annot.head(3)

Unnamed: 0,path,noisy_labels_0,noisy_labels_1,noisy_labels_5,noisy_labels_25,noisy_labels_50,is_valid
0,train/n02979186/n02979186_9036.JPEG,n02979186,n02979186,n02979186,n02979186,n02979186,False
1,train/n02979186/n02979186_11957.JPEG,n02979186,n02979186,n02979186,n02979186,n03000684,False
2,train/n02979186/n02979186_9715.JPEG,n02979186,n02979186,n02979186,n03417042,n03000684,False


Transform the annotations into the format fastdup supports.

fastdup expects an annotation `DataFrame` that contains the following columns:

+ filename - contains the path to the image file
+ label - contains the label of the image
+ split - indicates whether the image belongs to the training, validation, or test set

In [7]:
# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'filename'}, axis='columns')

# append datadir
df_annot['filename'] = df_annot['filename'].apply(lambda x: data_dir + x)

# create split column
df_annot['split'] = df_annot['filename'].apply(lambda x: x.split("/")[1])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formatted annotations
df_annot

Unnamed: 0,filename,label,split
0,imagenette2-160/train/n02979186/n02979186_9036.JPEG,cassette_player,train
1,imagenette2-160/train/n02979186/n02979186_11957.JPEG,cassette_player,train
2,imagenette2-160/train/n02979186/n02979186_9715.JPEG,cassette_player,train
3,imagenette2-160/train/n02979186/n02979186_21736.JPEG,cassette_player,train
4,imagenette2-160/train/n02979186/ILSVRC2012_val_00046953.JPEG,cassette_player,train
...,...,...,...
13389,imagenette2-160/val/n03425413/n03425413_17521.JPEG,gas_pump,val
13390,imagenette2-160/val/n03425413/n03425413_20711.JPEG,gas_pump,val
13391,imagenette2-160/val/n03425413/n03425413_19050.JPEG,gas_pump,val
13392,imagenette2-160/val/n03425413/n03425413_13831.JPEG,gas_pump,val
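
The `split` column above is derived with `x.split("/")[1]`, which works because `data_dir` is a single folder name. If the dataset lives under a deeper path, the index `1` no longer points at `train` or `val`. Here is a minimal sketch of a more path-agnostic alternative; `split_from_path` is a hypothetical helper written for this example, not part of fastdup.

```python
from pathlib import PurePosixPath


def split_from_path(filename, data_dir="imagenette2-160/"):
    """Return the first folder below data_dir ('train' or 'val' for Imagenette)."""
    relative = PurePosixPath(filename).relative_to(PurePosixPath(data_dir))
    return relative.parts[0]


# Works for the layout used in this notebook...
print(split_from_path("imagenette2-160/train/n02979186/n02979186_9036.JPEG"))

# ...and for a dataset nested under another (hypothetical) folder.
print(split_from_path(
    "data/imagenette2-160/val/n03425413/n03425413_13831.JPEG",
    data_dir="data/imagenette2-160/",
))
```

It could be applied with `df_annot['filename'].apply(split_from_path)` in place of the lambda.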


## Run fastdup

With the images and annotations ready, we can proceed with running an analysis on the data.

+ `input_dir` is the path to the downloaded images
+ `work_dir` is the path to store the artifacts from the analysis (optional)

In [8]:
fd = fastdup.create(input_dir=data_dir) 
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-07-13 19:22:31 [INFO] Going to loop over dir /tmp/tmpqm6imqyr.csv
2023-07-13 19:22:31 [INFO] Found total 13394 images to run on, 13394 train, 0 test, name list 13394, counter 13394 
2023-07-13 19:23:04 [INFO] Found total 13394 images to run on
Finished histogram 3.121
Finished bucket sort 3.151
2023-07-13 19:23:04 [INFO] 544) Finished write_index() NN model
2023-07-13 19:23:04 [INFO] Stored nn model index file work_dir/nnf.index
2023-07-13 19:23:05 [INFO] Total time took 34024 ms
2023-07-13 19:23:05 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %
2023-07-13 19:23:05 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %
2023-07-13 19:23:05 [INFO] Found a total of 16764 above threshold images (d>0.800), which are 62.58 %
2023-07-13 19:23:05 [INFO] Found a total of 1339 outlier images         (d<0.050), which are 5.00 %

0

## Outliers

Visualize outliers from the dataset.

In [9]:
fd.vis.outliers_gallery()


Stored outliers visual view in  work_dir/galleries/outliers.html





Distance,Path,label
0.523752,/train/n03445777/n03445777_5218.JPEG,golf_ball
0.57066,/train/n03888257/n03888257_34639.JPEG,parachute
0.578252,/train/n03445777/n03445777_3254.JPEG,golf_ball
0.58389,/val/n03445777/n03445777_5932.JPEG,golf_ball
0.599957,/train/n03888257/n03888257_79145.JPEG,parachute
0.605961,/train/n01440764/n01440764_5638.JPEG,tench
0.608525,/train/n03394916/n03394916_33663.JPEG,French_horn
0.609527,/train/n03888257/n03888257_7793.JPEG,parachute
0.611143,/val/n01440764/n01440764_4962.JPEG,tench
0.61373,/train/n03445777/n03445777_6033.JPEG,golf_ball
0.61618,/train/n03394916/n03394916_37544.JPEG,French_horn
0.616704,/val/n03888257/n03888257_11450.JPEG,parachute
0.616785,/val/n03445777/n03445777_9292.JPEG,golf_ball
0.617952,/train/n03888257/n03888257_16223.JPEG,parachute
0.619739,/train/n03028079/n03028079_24708.JPEG,church
0.619787,/train/n01440764/ILSVRC2012_val_00037834.JPEG,tench
0.620815,/train/n03888257/n03888257_5703.JPEG,parachute
0.626412,/train/n03445777/n03445777_9199.JPEG,golf_ball
0.628011,/train/n03888257/n03888257_32518.JPEG,parachute
0.630812,/train/n02979186/n02979186_10289.JPEG,cassette_player


0

Show outliers image data.

In [10]:
fd.outliers().head(5)

Unnamed: 0,outlier,nearest,distance,filename_outlier,label_outlier,split_outlier,index_x,error_code_outlier,is_valid_outlier,fd_index_outlier,filename_nearest,label_nearest,split_nearest,index_y,error_code_nearest,is_valid_nearest,fd_index_nearest
0,8293,13217,0.51903,imagenette2-160/train/n03445777/n03445777_5218.JPEG,golf_ball,train,8293,VALID,True,8293,imagenette2-160/val/n03425413/n03425413_11460.JPEG,gas_pump,val,13217,VALID,True,13217
1,5457,5500,0.544795,imagenette2-160/train/n03888257/n03888257_34639.JPEG,parachute,train,5457,VALID,True,5457,imagenette2-160/train/n03888257/n03888257_12053.JPEG,parachute,train,5500,VALID,True,5500
2,8076,3016,0.555266,imagenette2-160/train/n03445777/n03445777_3254.JPEG,golf_ball,train,8076,VALID,True,8076,imagenette2-160/train/n02102040/n02102040_585.JPEG,English_springer,train,3016,VALID,True,3016
3,2790,4510,0.568702,imagenette2-160/train/n01440764/n01440764_5638.JPEG,tench,train,2790,VALID,True,2790,imagenette2-160/train/n03028079/n03028079_6607.JPEG,church,train,4510,VALID,True,4510
4,5478,11775,0.582118,imagenette2-160/train/n03888257/n03888257_79145.JPEG,parachute,train,5478,VALID,True,5478,imagenette2-160/val/n03888257/n03888257_8080.JPEG,parachute,val,11775,VALID,True,11775
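
One way to use this table when hunting for wrong labels is to look at outliers whose nearest neighbor carries a different label. The sketch below rebuilds three rows from the output above (relevant columns only) so it runs without fastdup; in the notebook you would start from `fd.outliers()` instead. A differing label is only a hint, since an outlier is by definition far from every other image.

```python
import pandas as pd

# Three rows copied from the fd.outliers() output above (relevant columns only).
outliers = pd.DataFrame({
    "distance": [0.51903, 0.544795, 0.555266],
    "filename_outlier": [
        "imagenette2-160/train/n03445777/n03445777_5218.JPEG",
        "imagenette2-160/train/n03888257/n03888257_34639.JPEG",
        "imagenette2-160/train/n03445777/n03445777_3254.JPEG",
    ],
    "label_outlier": ["golf_ball", "parachute", "golf_ball"],
    "label_nearest": ["gas_pump", "parachute", "English_springer"],
})

# Keep outliers whose closest image has a different label, lowest distance first.
label_differs = outliers["label_outlier"] != outliers["label_nearest"]
to_review = outliers[label_differs].sort_values("distance")

print(to_review[["filename_outlier", "label_outlier", "label_nearest", "distance"]])
```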


## Comparing Labels of Similar Images
Find possible mislabels by comparing a query image to other images in the dataset.

In [11]:
fd.vis.similarity_gallery() 



Stored similar images visual view in  work_dir/galleries/similarity.html


from,from_label,to,to_label,distance
/val/n03028079/n03028079_13002.JPEG,church,/train/n03028079/n03028079_3839.JPEG,church,0.800002
/train/n03394916/n03394916_32478.JPEG,French_horn,/train/n03394916/n03394916_35573.JPEG,French_horn,0.800012
/train/n02979186/n02979186_14524.JPEG,cassette_player,/train/n02979186/n02979186_213.JPEG,cassette_player,0.806502
/train/n02979186/n02979186_14524.JPEG,cassette_player,/val/n02979186/n02979186_11000.JPEG,cassette_player,0.800015
/val/n02979186/n02979186_11000.JPEG,cassette_player,/train/n02979186/n02979186_10095.JPEG,cassette_player,0.820827
/val/n02979186/n02979186_11000.JPEG,cassette_player,/train/n02979186/n02979186_14524.JPEG,cassette_player,0.800015
/train/n01440764/n01440764_44.JPEG,tench,/train/n01440764/n01440764_14249.JPEG,tench,0.803563
/train/n01440764/n01440764_44.JPEG,tench,/val/n01440764/n01440764_5490.JPEG,tench,0.800023
/train/n03417042/n03417042_3236.JPEG,garbage_truck,/train/n03417042/n03417042_12297.JPEG,garbage_truck,0.800025
/train/n03888257/n03888257_20704.JPEG,parachute,/train/n03888257/n03888257_20473.JPEG,parachute,0.804987
/train/n03888257/n03888257_20704.JPEG,parachute,/train/n03888257/n03888257_8614.JPEG,parachute,0.800034
/train/n03425413/n03425413_14249.JPEG,gas_pump,/val/n03425413/n03425413_20360.JPEG,gas_pump,0.810811
/train/n03425413/n03425413_14249.JPEG,gas_pump,/train/n03425413/n03425413_719.JPEG,gas_pump,0.800035
/val/n03888257/n03888257_31790.JPEG,parachute,/train/n03888257/n03888257_17326.JPEG,parachute,0.810816
/val/n03888257/n03888257_31790.JPEG,parachute,/train/n03888257/n03888257_8199.JPEG,parachute,0.800036
/train/n03888257/n03888257_8199.JPEG,parachute,/train/n03888257/n03888257_17326.JPEG,parachute,0.834109
/train/n03888257/n03888257_8199.JPEG,parachute,/val/n03888257/n03888257_31790.JPEG,parachute,0.800036
/val/n03000684/n03000684_24542.JPEG,chain_saw,/val/n03000684/n03000684_2610.JPEG,chain_saw,0.803641
/val/n03000684/n03000684_24542.JPEG,chain_saw,/train/n03000684/n03000684_26357.JPEG,chain_saw,0.80004
/val/n03000684/n03000684_17431.JPEG,chain_saw,/train/n03000684/n03000684_1034.JPEG,chain_saw,0.807598
/val/n03000684/n03000684_17431.JPEG,chain_saw,/train/n03000684/n03000684_807.JPEG,chain_saw,0.800068
/train/n03000684/n03000684_807.JPEG,chain_saw,/val/n03000684/n03000684_18140.JPEG,chain_saw,0.811944
/train/n03000684/n03000684_807.JPEG,chain_saw,/val/n03000684/n03000684_17431.JPEG,chain_saw,0.800068
/train/n02102040/n02102040_139.JPEG,English_springer,/train/n02102040/n02102040_2528.JPEG,English_springer,0.841122
/train/n02102040/n02102040_139.JPEG,English_springer,/val/n02102040/n02102040_1121.JPEG,English_springer,0.800071
/train/n03888257/n03888257_38633.JPEG,parachute,/train/n03888257/n03888257_12816.JPEG,parachute,0.800073
/train/n03888257/n03888257_12816.JPEG,parachute,/train/n03888257/n03888257_38633.JPEG,parachute,0.800073
/val/n03888257/n03888257_66961.JPEG,parachute,/val/n03888257/n03888257_13410.JPEG,parachute,0.805559
/val/n03888257/n03888257_66961.JPEG,parachute,/val/n03888257/n03888257_3142.JPEG,parachute,0.800073
/train/n03028079/n03028079_17175.JPEG,church,/train/n03028079/n03028079_12685.JPEG,church,0.806021
/train/n03028079/n03028079_17175.JPEG,church,/train/n03028079/n03028079_23514.JPEG,church,0.800076
/val/n03445777/n03445777_6350.JPEG,golf_ball,/train/n03445777/n03445777_2468.JPEG,golf_ball,0.806152
/val/n03445777/n03445777_6350.JPEG,golf_ball,/val/n03445777/n03445777_7480.JPEG,golf_ball,0.800086
/train/n02979186/n02979186_10666.JPEG,cassette_player,/train/n02979186/n02979186_2383.JPEG,cassette_player,0.800088


Unnamed: 0,from,to,label,label2,distance
7505,imagenette2-160/val/n03028079/n03028079_13002.JPEG,[imagenette2-160/train/n03028079/n03028079_3839.JPEG],[church],[church],[0.800002]
3429,imagenette2-160/train/n03394916/n03394916_32478.JPEG,[imagenette2-160/train/n03394916/n03394916_35573.JPEG],[French_horn],[French_horn],[0.800012]
1700,imagenette2-160/train/n02979186/n02979186_14524.JPEG,"[imagenette2-160/val/n02979186/n02979186_11000.JPEG, imagenette2-160/train/n02979186/n02979186_213.JPEG]","[cassette_player, cassette_player]","[cassette_player, cassette_player]","[0.800015, 0.806502]"
7055,imagenette2-160/val/n02979186/n02979186_11000.JPEG,"[imagenette2-160/train/n02979186/n02979186_14524.JPEG, imagenette2-160/train/n02979186/n02979186_10095.JPEG]","[cassette_player, cassette_player]","[cassette_player, cassette_player]","[0.800015, 0.820827]"
471,imagenette2-160/train/n01440764/n01440764_44.JPEG,"[imagenette2-160/val/n01440764/n01440764_5490.JPEG, imagenette2-160/train/n01440764/n01440764_14249.JPEG]","[tench, tench]","[tench, tench]","[0.800023, 0.803563]"
...,...,...,...,...,...
870,imagenette2-160/train/n02102040/n02102040_1306.JPEG,"[imagenette2-160/train/n02102040/n02102040_876.JPEG, imagenette2-160/train/n02102040/n02102040_3114.JPEG]","[English_springer, English_springer]","[English_springer, English_springer]","[0.936799, 0.949252]"
1050,imagenette2-160/train/n02102040/n02102040_3114.JPEG,"[imagenette2-160/train/n02102040/n02102040_1055.JPEG, imagenette2-160/train/n02102040/n02102040_1306.JPEG]","[English_springer, English_springer]","[English_springer, English_springer]","[0.941953, 0.949252]"
231,imagenette2-160/train/n01440764/n01440764_13978.JPEG,"[imagenette2-160/val/n01440764/n01440764_6341.JPEG, imagenette2-160/val/n01440764/n01440764_8210.JPEG]","[tench, tench]","[tench, tench]","[0.943767, 0.945909]"
6846,imagenette2-160/val/n02102040/n02102040_350.JPEG,"[imagenette2-160/val/n02102040/n02102040_312.JPEG, imagenette2-160/train/n02102040/n02102040_6313.JPEG]","[English_springer, English_springer]","[English_springer, English_springer]","[0.945413, 0.947323]"
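
Every row shown above has matching entries in `label` and `label2`, so nothing looks mislabeled in this sample. To scan a longer table, one option is to score each row by how often the two label lists agree and review the rows that score below 1. The sketch below uses a small frame shaped like the one above; its second row is made up purely to show a disagreement and does not come from the Imagenette results, and `label_agreement` is a hypothetical helper.

```python
import pandas as pd

# Two rows shaped like the DataFrame above. The second row is invented
# for illustration and is NOT part of the Imagenette results.
sim = pd.DataFrame({
    "from": [
        "imagenette2-160/val/n03028079/n03028079_13002.JPEG",
        "example/query.JPEG",
    ],
    "label": [["church"], ["church", "church"]],
    "label2": [["church"], ["gas_pump", "gas_pump"]],
})


def label_agreement(row):
    """Fraction of positions where the `label` and `label2` lists match."""
    pairs = list(zip(row["label"], row["label2"]))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


sim["agreement"] = sim.apply(label_agreement, axis=1)

# Rows where at least one similar image disagrees are worth a manual look.
suspects = sim[sim["agreement"] < 1.0]
print(suspects[["from", "label", "label2", "agreement"]])
```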


## Similar Image Pairs

Find similar image pairs within and across the train and validation subfolders. Pairs may include train-train, train-val, val-train, and val-val.

In [12]:
fd.vis.duplicates_gallery()


Stored similarity visual view in  work_dir/galleries/duplicates.html


Distance,From,To,From_Label,To_Label
0.968786,/val/n03394916/n03394916_30631.JPEG,/train/n03394916/n03394916_44127.JPEG,French_horn,French_horn
0.962458,/train/n03445777/n03445777_13918.JPEG,/val/n03445777/n03445777_6882.JPEG,golf_ball,golf_ball
0.953837,/train/n02102040/n02102040_1564.JPEG,/train/n02102040/n02102040_3837.JPEG,English_springer,English_springer
0.953413,/train/n01440764/n01440764_7457.JPEG,/train/n01440764/n01440764_11339.JPEG,tench,tench
0.952239,/train/n03417042/n03417042_12906.JPEG,/train/n03417042/n03417042_1578.JPEG,garbage_truck,garbage_truck
0.951679,/val/n03394916/n03394916_6830.JPEG,/val/n03394916/n03394916_21092.JPEG,French_horn,French_horn
0.950477,/train/n03888257/n03888257_21027.JPEG,/val/n03888257/n03888257_11210.JPEG,parachute,parachute
0.950174,/train/n02102040/n02102040_3767.JPEG,/train/n02102040/n02102040_6313.JPEG,English_springer,English_springer
0.949877,/train/n02102040/ILSVRC2012_val_00032959.JPEG,/val/n02102040/n02102040_662.JPEG,English_springer,English_springer
0.949252,/train/n02102040/n02102040_3114.JPEG,/train/n02102040/n02102040_1306.JPEG,English_springer,English_springer


0

Show similar image pairs.

In [13]:
fd.similarity().head(5)

Unnamed: 0,from,to,distance,filename_from,label_from,split_from,index_x,error_code_from,is_valid_from,fd_index_from,filename_to,label_to,split_to,index_y,error_code_to,is_valid_to,fd_index_to
0,11960,5925,0.968786,imagenette2-160/val/n03394916/n03394916_30631.JPEG,French_horn,val,11960,VALID,True,11960,imagenette2-160/train/n03394916/n03394916_44127.JPEG,French_horn,train,5925,VALID,True,5925
1,5925,11960,0.968786,imagenette2-160/train/n03394916/n03394916_44127.JPEG,French_horn,train,5925,VALID,True,5925,imagenette2-160/val/n03394916/n03394916_30631.JPEG,French_horn,val,11960,VALID,True,11960
2,12613,7916,0.962458,imagenette2-160/val/n03445777/n03445777_6882.JPEG,golf_ball,val,12613,VALID,True,12613,imagenette2-160/train/n03445777/n03445777_13918.JPEG,golf_ball,train,7916,VALID,True,7916
3,7916,12613,0.962458,imagenette2-160/train/n03445777/n03445777_13918.JPEG,golf_ball,train,7916,VALID,True,7916,imagenette2-160/val/n03445777/n03445777_6882.JPEG,golf_ball,val,12613,VALID,True,12613
4,3464,3486,0.953837,imagenette2-160/train/n02102040/n02102040_3837.JPEG,English_springer,train,3464,VALID,True,3464,imagenette2-160/train/n02102040/n02102040_1564.JPEG,English_springer,train,3486,VALID,True,3486
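
Two things stand out in this table: each pair is listed in both directions, and some pairs straddle the train and validation splits, which can leak near-identical images into evaluation. The sketch below rebuilds four rows from the output above (relevant columns only) so it runs without fastdup; in the notebook you would start from `fd.similarity()`. The `pair_key` column is a name introduced for this example.

```python
import pandas as pd

# Four rows copied from the fd.similarity() output above (relevant columns only).
pairs = pd.DataFrame({
    "from": [11960, 5925, 12613, 3464],
    "to": [5925, 11960, 7916, 3486],
    "distance": [0.968786, 0.968786, 0.962458, 0.953837],
    "split_from": ["val", "train", "val", "train"],
    "split_to": ["train", "val", "train", "train"],
})

# Each pair appears as (a, b) and (b, a); build an order-independent key
# and keep a single copy of every pair.
pairs["pair_key"] = [tuple(sorted(p)) for p in zip(pairs["from"], pairs["to"])]
unique_pairs = pairs.drop_duplicates("pair_key")

# Pairs whose two images sit in different splits are leakage candidates.
cross_split = unique_pairs[unique_pairs["split_from"] != unique_pairs["split_to"]]
print(cross_split[["from", "to", "distance", "split_from", "split_to"]])
```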


## Image Clusters

In [14]:
fd.vis.component_gallery()

cassette_player




Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg
Stored components visual view in  work_dir/galleries/components.html
Execution time in seconds 1.5


component,num_images,mean_distance,label,label_count
1894.0,161.0,0.9001,tench,54
2812.0,70.0,0.9004,English_springer,54
7313.0,69.0,0.9001,golf_ball,54
1072.0,21.0,0.9001,garbage_truck,21
5498.0,13.0,0.9004,French_horn,13
994.0,12.0,0.9025,garbage_truck,12
1391.0,10.0,0.9,garbage_truck,10
5644.0,8.0,0.902,French_horn,8
1315.0,8.0,0.9041,garbage_truck,8
2781.0,8.0,0.9062,English_springer,8
984.0,7.0,0.9064,garbage_truck,7
3034.0,6.0,0.903,English_springer,6
5639.0,6.0,0.902,French_horn,6
1951.0,5.0,0.9,tench,5
7294.0,5.0,0.9019,golf_ball,5
4921.0,5.0,0.9004,parachute,5
5548.0,5.0,0.9043,French_horn,5
100.0,5.0,0.9011,cassette_player,5
7292.0,4.0,0.9021,golf_ball,4
2143.0,4.0,0.9001,tench,4


0
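
The gallery groups images into "components". A common way to form such groups is to treat images as graph nodes, connect two images when their similarity passes a threshold, and take the connected components of that graph. The sketch below illustrates that idea with a small union-find. It is a conceptual illustration under that assumption, not fastdup's implementation; the `mean_distance` values of 0.9 and above are consistent with the `ccthreshold=0.9` passed to `fd.run`, but the notebook does not show the internals. The function and the toy edges are made up for this example.

```python
def connected_components(num_nodes, edges, threshold):
    """Group nodes linked by edges whose similarity is >= threshold."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in edges:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)


# Toy similarity edges: (image_a, image_b, similarity).
edges = [(0, 1, 0.95), (1, 2, 0.91), (2, 3, 0.85), (3, 4, 0.93)]

print(connected_components(5, edges, threshold=0.9))
print(connected_components(5, edges, threshold=0.8))
```

With the higher threshold the 0.85 edge is dropped and the images fall into two groups; lowering the threshold merges them into one, which is why a looser threshold tends to produce fewer, larger clusters.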

You can also visualize clusters with specific labels using the `slice` parameter. For example, let's visualize clusters with the `chain_saw` label.

In [15]:
fd.vis.component_gallery(slice='chain_saw')

chain_saw



Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg
Stored components visual view in  work_dir/galleries/components.html
Execution time in seconds 0.2





component,num_images,mean_distance,label,label_count
6981.0,3.0,0.9064,chain_saw,3
6421.0,2.0,0.9222,chain_saw,2
6478.0,2.0,0.9355,chain_saw,2
6621.0,2.0,0.9029,chain_saw,2
6766.0,2.0,0.9208,chain_saw,2
6831.0,2.0,0.9198,chain_saw,2
6862.0,2.0,0.9139,chain_saw,2
6901.0,2.0,0.9073,chain_saw,2
7033.0,2.0,0.9345,chain_saw,2
7067.0,2.0,0.9192,chain_saw,2
11637.0,2.0,0.9039,chain_saw,2


0

## Connected Components

In [16]:
cc_df, _ = fd.connected_components()
cc_df.sort_values('count', ascending=False).head(5)

Unnamed: 0,index,component_id,sum,count,mean_distance,min_distance,max_distance,filename,label,split,error_code,is_valid,fd_index
179,2355,1894,513.6729,562.0,0.914,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_8673.JPEG,tench,train,VALID,True,2355
143,2147,1894,513.6729,562.0,0.914,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_5658.JPEG,tench,train,VALID,True,2147
145,2150,1894,513.6729,562.0,0.914,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_10726.JPEG,tench,train,VALID,True,2150
146,2174,1894,513.6729,562.0,0.914,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_6974.JPEG,tench,train,VALID,True,2174
147,2177,1894,513.6729,562.0,0.914,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_14294.JPEG,tench,train,VALID,True,2177
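
Because `cc_df` has one row per image with its `component_id` and `label`, it can be summarized per component to spot clusters that mix several labels, which are natural candidates when looking for wrong labels. The sketch below uses a toy frame with the same two column names; the rows and component IDs are made up for illustration, and in the notebook you would group `cc_df` itself.

```python
import pandas as pd

# Toy frame with the same `component_id` and `label` columns as cc_df above.
# The rows are invented for illustration.
cc_toy = pd.DataFrame({
    "component_id": [101, 101, 101, 102, 102],
    "label": ["tench", "tench", "church", "English_springer", "English_springer"],
})

# Per component: number of images, number of distinct labels, most common label.
summary = cc_toy.groupby("component_id")["label"].agg(
    size="count",
    n_labels="nunique",
    top_label=lambda labels: labels.value_counts().idxmax(),
)

# Components holding more than one label deserve a closer look.
mixed = summary[summary["n_labels"] > 1]
print(mixed)
```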


We can also get the metadata of an individual image by indexing `fd` with the image's `fastdup_id`, available in `fd.annotations()` (it appears as `fd_index` in the outputs above).

In [17]:
fd[349]

{'filename': 'imagenette2-160/train/n02979186/n02979186_2819.JPEG',
 'label': 'cassette_player',
 'split': 'train',
 'index': 349,
 'error_code': 'VALID',
 'is_valid': True,
 'fd_index': 349}

## Wrap Up

Next, feel free to check out other tutorials:

+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled dataset with an ImageNet-style folder structure, have a go!
+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 


## VL Profiler
If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product; it lets you visualize and inspect your dataset in your browser.

[Sign up](https://app.visual-layer.com) now. It's free.

[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)

As usual, feedback is welcome! 

Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).