# Analyzing Image Classification Dataset

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)

This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to analyze an image classification dataset for:

+ Duplicates.
+ Outliers.
+ Wrong labels.
+ Image clusters.

If you're new, run the notebook in Google Colab or Kaggle for free.

> **Note** - No GPU needed! You can run this notebook on a CPU-only instance.



## Installation

First let's install [fastdup](https://github.com/visual-layer/fastdup) from PyPI with:

In [1]:
!pip install -Uqq fastdup

Now, test the installation. If there's no error message, we are ready to go.

In [2]:
import fastdup
fastdup.__version__

'0.930'

## Download Dataset

We will analyze the [Imagenette](https://github.com/fastai/imagenette) dataset, a subset of 10 easily classified classes from ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).

In [3]:
!wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
!tar -xf imagenette2-160.tgz

--2023-05-16 08:53:02--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.251.158, 52.217.83.238, 52.217.96.150, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.251.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99003388 (94M) [application/x-tar]
Saving to: ‘imagenette2-160.tgz.1’


2023-05-16 08:53:04 (46.0 MB/s) - ‘imagenette2-160.tgz.1’ saved [99003388/99003388]
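
After extraction, it can help to confirm that the archive unpacked into the expected split folders. The sketch below counts `.JPEG` files per top-level folder using only the standard library. The helper name `count_images_per_split` is hypothetical, and the `train`/`val` layout is assumed from the file paths shown later in this notebook.

```python
from collections import Counter
from pathlib import Path


def count_images_per_split(root, suffix=".JPEG"):
    """Count image files under each top-level folder of root (e.g. train/, val/)."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob(f"*{suffix}"):
        # The first path component below the root is treated as the split name.
        counts[path.relative_to(root).parts[0]] += 1
    return dict(counts)


data_root = Path("imagenette2-160")
if data_root.is_dir():
    print(count_images_per_split(data_root))
else:
    print(f"{data_root} not found; run the download cell first")
```

If a split is missing or the counts look far too small, the download or extraction step is worth repeating before continuing.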



## Load and Format Annotations

In [4]:
import pandas as pd

In [5]:
data_dir = 'imagenette2-160/'
csv_path = 'imagenette2-160/noisy_imagenette.csv'

In [6]:
label_map = {
    'n02979186': 'cassette_player', 
    'n03417042': 'garbage_truck', 
    'n01440764': 'tench', 
    'n02102040': 'English_springer', 
    'n03028079': 'church',
    'n03888257': 'parachute', 
    'n03394916': 'French_horn', 
    'n03000684': 'chain_saw', 
    'n03445777': 'golf_ball', 
    'n03425413': 'gas_pump'
}

Load the annotations provided with the dataset.

In [7]:
df_annot = pd.read_csv(csv_path)
df_annot.head(3)

Unnamed: 0,path,noisy_labels_0,noisy_labels_1,noisy_labels_5,noisy_labels_25,noisy_labels_50,is_valid
0,train/n02979186/n02979186_9036.JPEG,n02979186,n02979186,n02979186,n02979186,n02979186,False
1,train/n02979186/n02979186_11957.JPEG,n02979186,n02979186,n02979186,n02979186,n03000684,False
2,train/n02979186/n02979186_9715.JPEG,n02979186,n02979186,n02979186,n03417042,n03000684,False


Transform the annotations into the format fastdup supports.

fastdup expects an annotation `DataFrame` that contains the following columns:

+ `filename` - the path to the image file.
+ `label` - the label of the image.
+ `split` - whether the image belongs to the training, validation, or test subset.

In [8]:
# take relevant columns
df_annot = df_annot[['path', 'noisy_labels_0']]

# rename columns to fastdup's column names
df_annot = df_annot.rename({'noisy_labels_0': 'label', 'path': 'filename'}, axis='columns')

# prepend data_dir to each relative path
df_annot['filename'] = df_annot['filename'].apply(lambda x: data_dir + x)

# create split column
df_annot['split'] = df_annot['filename'].apply(lambda x: x.split("/")[1])

# map label ids to regular labels
df_annot['label'] = df_annot['label'].map(label_map)

# show formatted annotations
df_annot

Unnamed: 0,filename,label,split
0,imagenette2-160/train/n02979186/n02979186_9036.JPEG,cassette_player,train
1,imagenette2-160/train/n02979186/n02979186_11957.JPEG,cassette_player,train
2,imagenette2-160/train/n02979186/n02979186_9715.JPEG,cassette_player,train
3,imagenette2-160/train/n02979186/n02979186_21736.JPEG,cassette_player,train
4,imagenette2-160/train/n02979186/ILSVRC2012_val_00046953.JPEG,cassette_player,train
...,...,...,...
13389,imagenette2-160/val/n03425413/n03425413_17521.JPEG,gas_pump,val
13390,imagenette2-160/val/n03425413/n03425413_20711.JPEG,gas_pump,val
13391,imagenette2-160/val/n03425413/n03425413_19050.JPEG,gas_pump,val
13392,imagenette2-160/val/n03425413/n03425413_13831.JPEG,gas_pump,val
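
Because `Series.map` leaves a missing value for any label id that is absent from `label_map`, a quick check before running fastdup can catch gaps early. The sketch below is an illustrative check, not part of fastdup. The helper name `check_annotations` is hypothetical, and it works on plain records such as those returned by `df_annot.to_dict('records')`. The second sample row simulates an unmapped label and is not taken from the dataset.

```python
REQUIRED_COLUMNS = ("filename", "label", "split")


def check_annotations(records, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in annotation records."""
    problems = []
    for i, row in enumerate(records):
        for col in required:
            value = row.get(col)
            # None, empty strings, and NaN (which is not equal to itself) count as missing.
            if value is None or value == "" or value != value:
                problems.append(f"row {i}: missing '{col}'")
    return problems


rows = [
    {"filename": "imagenette2-160/train/n02979186/n02979186_9036.JPEG",
     "label": "cassette_player", "split": "train"},
    # Simulated problem row: NaN stands in for a label id missing from label_map.
    {"filename": "imagenette2-160/val/n03425413/n03425413_17521.JPEG",
     "label": float("nan"), "split": "val"},
]
print(check_annotations(rows))
```

An empty list means every record has all three columns filled in.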


## Run fastdup

With the images and annotations, we are now ready to run an analysis.

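+ `work_dir` is the path where fastdup stores the artifacts from the analysis.

+ `input_dir` is the path to the downloaded images.

+ `threshold` is the similarity cutoff above which two images are reported as similar (0.8 in the cell that follows).

+ `ccthreshold` is the similarity cutoff used when grouping images into connected components (0.9 in the cell that follows).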

In [9]:
work_dir = 'fastdup_imagenette'

fd = fastdup.create(work_dir=work_dir, input_dir=data_dir) 
fd.run(annotations=df_annot, ccthreshold=0.9, threshold=0.8)

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-05-16 08:53:06 [INFO] Going to loop over dir imagenette2-160
2023-05-16 08:53:06 [INFO] Found total 13394 images to run on, 13394 train, 0 test, name list 13394, counter 13394 
2023-05-16 08:53:20 [INFO] Found total 13394 images to run onimated: 0 Minutes
Finished histogram 7.122
Finished bucket sort 7.177
2023-05-16 08:53:20 [INFO] 309) Finished write_index() NN model
2023-05-16 08:53:20 [INFO] Stored nn model index file fastdup_imagenette/nnf.index
2023-05-16 08:53:21 [INFO] Total time took 14601 ms
2023-05-16 08:53:21 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %
2023-05-16 08:53:21 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %
2023-05-16 08:53:21 [INFO] Found a total of 16757 above threshold images (d>0.800), which are 62.55 %
2023-05-16 08:53:21 [INFO] Found a total of 1339 outlier images         (d<0.050), which are 5.00 %
2023-05-16 08:53:
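
The percentages in the log appear to be computed over nearest-neighbor pairs rather than over images. With 13394 images, the reported figures match a denominator of 2 × 13394 = 26788, which would be consistent with two neighbor pairs per image. This reading is an inference from the numbers in the log, not something the log states. The short check below reproduces the arithmetic.

```python
num_images = 13394
num_pairs = 2 * num_images  # assumed denominator: two neighbor pairs per image

above_threshold = 16757  # "above threshold images (d>0.800)" from the log
outliers = 1339          # "outlier images (d<0.050)" from the log

print(f"above threshold: {100 * above_threshold / num_pairs:.2f} %")
print(f"outliers:        {100 * outliers / num_pairs:.2f} %")
```

Dividing by the image count instead would give roughly 10 % for the outliers, which does not match the 5.00 % in the log.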

## Outliers

Visualize outliers from the dataset.

In [10]:
fd.vis.outliers_gallery()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 9642.08it/s]

Stored outliers visual view in  fastdup_imagenette/galleries/outliers.html





Distance,Path,label
0.489022,/train/n02979186/n02979186_3967.JPEG,cassette_player
0.51468,/train/n03445777/n03445777_5218.JPEG,golf_ball
0.541967,/val/n03417042/n03417042_5301.JPEG,garbage_truck
0.57066,/train/n03888257/n03888257_34639.JPEG,parachute
0.578252,/train/n03445777/n03445777_3254.JPEG,golf_ball
0.58389,/val/n03445777/n03445777_5932.JPEG,golf_ball
0.590838,/val/n02102040/n02102040_7670.JPEG,English_springer
0.609527,/train/n03888257/n03888257_7793.JPEG,parachute
0.611143,/val/n01440764/n01440764_4962.JPEG,tench
0.61373,/train/n03445777/n03445777_6033.JPEG,golf_ball
0.61618,/train/n03394916/n03394916_37544.JPEG,French_horn
0.616785,/val/n03445777/n03445777_9292.JPEG,golf_ball
0.617952,/train/n03888257/n03888257_16223.JPEG,parachute
0.619739,/train/n03028079/n03028079_24708.JPEG,church
0.619768,/train/n03888257/n03888257_79145.JPEG,parachute
0.620815,/train/n03888257/n03888257_5703.JPEG,parachute
0.625504,/train/n03394916/n03394916_33663.JPEG,French_horn
0.626412,/train/n03445777/n03445777_9199.JPEG,golf_ball
0.630812,/train/n02979186/n02979186_10289.JPEG,cassette_player
0.631131,/train/n03888257/n03888257_75495.JPEG,parachute


Show the outlier data as a `DataFrame`.

In [11]:
fd.outliers().head(5)

Unnamed: 0,outlier,nearest,distance,filename_outlier,label_outlier,split_outlier,index_x,error_code_outlier,is_valid_outlier,fd_index_outlier,filename_nearest,label_nearest,split_nearest,index_y,error_code_nearest,is_valid_nearest,fd_index_nearest
0,2664,9763,0.476124,imagenette2-160/train/n02979186/n02979186_3967.JPEG,cassette_player,train,2664,VALID,True,2664,imagenette2-160/val/n01440764/n01440764_710.JPEG,tench,val,9763,VALID,True,9763
1,8150,7831,0.51468,imagenette2-160/train/n03445777/n03445777_5218.JPEG,golf_ball,train,8150,VALID,True,8150,imagenette2-160/train/n03445777/n03445777_18756.JPEG,golf_ball,train,7831,VALID,True,7831
2,12076,956,0.539276,imagenette2-160/val/n03417042/n03417042_5301.JPEG,garbage_truck,val,12076,VALID,True,12076,imagenette2-160/train/n01440764/n01440764_9898.JPEG,tench,train,956,VALID,True,956
3,9087,8628,0.544795,imagenette2-160/train/n03888257/n03888257_34639.JPEG,parachute,train,9087,VALID,True,9087,imagenette2-160/train/n03888257/n03888257_12053.JPEG,parachute,train,8628,VALID,True,8628
4,7966,1630,0.555266,imagenette2-160/train/n03445777/n03445777_3254.JPEG,golf_ball,train,7966,VALID,True,7966,imagenette2-160/train/n02102040/n02102040_585.JPEG,English_springer,train,1630,VALID,True,1630
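
One way to triage this table is to look first at outliers whose nearest neighbor carries a different label. The sketch below is a minimal illustration on plain records (for example from `fd.outliers().to_dict('records')`), using a few columns from the five rows shown above. The helper name `label_disagreements` and the trimmed record layout are assumptions for illustration. A disagreement here is only a hint, since an outlier's nearest neighbor is by definition not very similar to it.

```python
def label_disagreements(rows):
    """Keep outlier rows whose nearest neighbor carries a different label."""
    return [r for r in rows if r["label_outlier"] != r["label_nearest"]]


# Selected columns from the first five rows of the outliers table.
outlier_rows = [
    {"outlier": 2664, "distance": 0.476124,
     "label_outlier": "cassette_player", "label_nearest": "tench"},
    {"outlier": 8150, "distance": 0.51468,
     "label_outlier": "golf_ball", "label_nearest": "golf_ball"},
    {"outlier": 12076, "distance": 0.539276,
     "label_outlier": "garbage_truck", "label_nearest": "tench"},
    {"outlier": 9087, "distance": 0.544795,
     "label_outlier": "parachute", "label_nearest": "parachute"},
    {"outlier": 7966, "distance": 0.555266,
     "label_outlier": "golf_ball", "label_nearest": "English_springer"},
]

for row in label_disagreements(outlier_rows):
    print(row["outlier"], row["label_outlier"], "vs", row["label_nearest"])
```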


## Comparing Labels of Similar Images
Find potentially mislabeled images by comparing the label of each query image with the labels of its most similar images in the dataset.

In [12]:
fd.vis.similarity_gallery() 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 106.60it/s]


Stored similar images visual view in  fastdup_imagenette/galleries/similarity.html


from,from_label,to,distance,to_label
/train/n03394916/n03394916_44127.JPEG,French_horn,/val/n03394916/n03394916_30631.JPEG,0.968786,French_horn
/train/n03394916/n03394916_44127.JPEG,French_horn,/train/n03394916/n03394916_36016.JPEG,0.918324,French_horn
/val/n03394916/n03394916_30631.JPEG,French_horn,/train/n03394916/n03394916_44127.JPEG,0.968786,French_horn
/val/n03394916/n03394916_30631.JPEG,French_horn,/train/n03394916/n03394916_29969.JPEG,0.903753,French_horn
/val/n03445777/n03445777_6882.JPEG,golf_ball,/train/n03445777/n03445777_13918.JPEG,0.962458,golf_ball
/val/n03445777/n03445777_6882.JPEG,golf_ball,/val/n03445777/n03445777_5912.JPEG,0.918005,golf_ball
/train/n03445777/n03445777_13918.JPEG,golf_ball,/val/n03445777/n03445777_6882.JPEG,0.962458,golf_ball
/train/n03445777/n03445777_13918.JPEG,golf_ball,/val/n03445777/n03445777_8820.JPEG,0.917039,golf_ball
/train/n02102040/n02102040_1564.JPEG,English_springer,/train/n02102040/n02102040_3837.JPEG,0.953837,English_springer
/train/n02102040/n02102040_1564.JPEG,English_springer,/train/n02102040/n02102040_3586.JPEG,0.908732,English_springer
/train/n02102040/n02102040_3837.JPEG,English_springer,/train/n02102040/n02102040_1564.JPEG,0.953837,English_springer
/train/n02102040/n02102040_3837.JPEG,English_springer,/train/n02102040/n02102040_3027.JPEG,0.893944,English_springer
/train/n01440764/n01440764_7457.JPEG,tench,/train/n01440764/n01440764_11339.JPEG,0.953413,tench
/train/n01440764/n01440764_7457.JPEG,tench,/train/n01440764/n01440764_9315.JPEG,0.918778,tench
/train/n01440764/n01440764_11339.JPEG,tench,/train/n01440764/n01440764_7457.JPEG,0.953413,tench
/train/n01440764/n01440764_11339.JPEG,tench,/train/n01440764/n01440764_12279.JPEG,0.889166,tench
/train/n03417042/n03417042_1578.JPEG,garbage_truck,/train/n03417042/n03417042_12906.JPEG,0.952239,garbage_truck
/train/n03417042/n03417042_1578.JPEG,garbage_truck,/val/n03417042/n03417042_9610.JPEG,0.837864,garbage_truck
/train/n03417042/n03417042_12906.JPEG,garbage_truck,/train/n03417042/n03417042_1578.JPEG,0.952239,garbage_truck
/train/n03417042/n03417042_12906.JPEG,garbage_truck,/train/n03417042/n03417042_27686.JPEG,0.828749,garbage_truck
/val/n03394916/n03394916_6830.JPEG,French_horn,/val/n03394916/n03394916_21092.JPEG,0.951679,French_horn
/val/n03394916/n03394916_6830.JPEG,French_horn,/train/n03394916/n03394916_35469.JPEG,0.89308,French_horn
/val/n03394916/n03394916_21092.JPEG,French_horn,/val/n03394916/n03394916_6830.JPEG,0.951679,French_horn
/val/n03394916/n03394916_21092.JPEG,French_horn,/train/n03394916/n03394916_35469.JPEG,0.865771,French_horn
/train/n03888257/n03888257_21027.JPEG,parachute,/val/n03888257/n03888257_11210.JPEG,0.950477,parachute
/train/n03888257/n03888257_21027.JPEG,parachute,/val/n03888257/n03888257_12491.JPEG,0.92043,parachute
/val/n03888257/n03888257_11210.JPEG,parachute,/train/n03888257/n03888257_21027.JPEG,0.950477,parachute
/val/n03888257/n03888257_11210.JPEG,parachute,/val/n03888257/n03888257_12491.JPEG,0.865155,parachute
/train/n02102040/n02102040_6313.JPEG,English_springer,/train/n02102040/n02102040_3767.JPEG,0.950174,English_springer
/train/n02102040/n02102040_6313.JPEG,English_springer,/val/n02102040/n02102040_350.JPEG,0.947323,English_springer
/train/n02102040/n02102040_3767.JPEG,English_springer,/train/n02102040/n02102040_6313.JPEG,0.950174,English_springer
/train/n02102040/n02102040_3767.JPEG,English_springer,/val/n02102040/n02102040_350.JPEG,0.914057,English_springer
/train/n02102040/ILSVRC2012_val_00032959.JPEG,English_springer,/val/n02102040/n02102040_662.JPEG,0.949877,English_springer
/train/n02102040/ILSVRC2012_val_00032959.JPEG,English_springer,/train/n02102040/n02102040_3114.JPEG,0.933114,English_springer
/val/n02102040/n02102040_662.JPEG,English_springer,/train/n02102040/ILSVRC2012_val_00032959.JPEG,0.949877,English_springer
/val/n02102040/n02102040_662.JPEG,English_springer,/val/n02102040/n02102040_3502.JPEG,0.927345,English_springer
/train/n02102040/n02102040_3114.JPEG,English_springer,/train/n02102040/n02102040_1306.JPEG,0.949252,English_springer
/train/n02102040/n02102040_3114.JPEG,English_springer,/train/n02102040/n02102040_1055.JPEG,0.941953,English_springer
/train/n02102040/n02102040_1306.JPEG,English_springer,/train/n02102040/n02102040_3114.JPEG,0.949252,English_springer
/train/n02102040/n02102040_1306.JPEG,English_springer,/train/n02102040/n02102040_876.JPEG,0.936799,English_springer


Unnamed: 0,from,to,label,label2,distance
3630,imagenette2-160/train/n03394916/n03394916_44127.JPEG,"[imagenette2-160/val/n03394916/n03394916_30631.JPEG, imagenette2-160/train/n03394916/n03394916_36016.JPEG]","[French_horn, French_horn]","[French_horn, French_horn]","[0.968786, 0.918324]"
7823,imagenette2-160/val/n03394916/n03394916_30631.JPEG,"[imagenette2-160/train/n03394916/n03394916_44127.JPEG, imagenette2-160/train/n03394916/n03394916_29969.JPEG]","[French_horn, French_horn]","[French_horn, French_horn]","[0.968786, 0.903753]"
8758,imagenette2-160/val/n03445777/n03445777_6882.JPEG,"[imagenette2-160/train/n03445777/n03445777_13918.JPEG, imagenette2-160/val/n03445777/n03445777_5912.JPEG]","[golf_ball, golf_ball]","[golf_ball, golf_ball]","[0.962458, 0.918005]"
5363,imagenette2-160/train/n03445777/n03445777_13918.JPEG,"[imagenette2-160/val/n03445777/n03445777_6882.JPEG, imagenette2-160/val/n03445777/n03445777_8820.JPEG]","[golf_ball, golf_ball]","[golf_ball, golf_ball]","[0.962458, 0.917039]"
896,imagenette2-160/train/n02102040/n02102040_1564.JPEG,"[imagenette2-160/train/n02102040/n02102040_3837.JPEG, imagenette2-160/train/n02102040/n02102040_3586.JPEG]","[English_springer, English_springer]","[English_springer, English_springer]","[0.953837, 0.908732]"
...,...,...,...,...,...
6224,imagenette2-160/train/n03888257/n03888257_38633.JPEG,[imagenette2-160/train/n03888257/n03888257_12816.JPEG],[parachute],[parachute],[0.800073]
5917,imagenette2-160/train/n03888257/n03888257_12816.JPEG,[imagenette2-160/train/n03888257/n03888257_38633.JPEG],[parachute],[parachute],[0.800073]
4324,imagenette2-160/train/n03417042/n03417042_3236.JPEG,[imagenette2-160/train/n03417042/n03417042_12297.JPEG],[garbage_truck],[garbage_truck],[0.800025]
3429,imagenette2-160/train/n03394916/n03394916_32478.JPEG,[imagenette2-160/train/n03394916/n03394916_35573.JPEG],[French_horn],[French_horn],[0.800012]


## Similar Image Pairs

Find similar image pairs within and across the train and validation subfolders. Pairs may include train-train, train-val, val-train, and val-val.

In [13]:
fd.vis.duplicates_gallery()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[out_col] = df[in_col].apply(lambda x: get_label_func.get(x, MISSING_LABEL))
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 188.62it/s]


Stored similarity visual view in  fastdup_imagenette/galleries/duplicates.html


Distance,From,To,From_Label,To_Label
0.968786,/val/n03394916/n03394916_30631.JPEG,/train/n03394916/n03394916_44127.JPEG,French_horn,French_horn
0.962458,/train/n03445777/n03445777_13918.JPEG,/val/n03445777/n03445777_6882.JPEG,golf_ball,golf_ball
0.953837,/train/n02102040/n02102040_3837.JPEG,/train/n02102040/n02102040_1564.JPEG,English_springer,English_springer
0.953413,/train/n01440764/n01440764_7457.JPEG,/train/n01440764/n01440764_11339.JPEG,tench,tench
0.952239,/train/n03417042/n03417042_1578.JPEG,/train/n03417042/n03417042_12906.JPEG,garbage_truck,garbage_truck
0.951679,/val/n03394916/n03394916_6830.JPEG,/val/n03394916/n03394916_21092.JPEG,French_horn,French_horn
0.950477,/val/n03888257/n03888257_11210.JPEG,/train/n03888257/n03888257_21027.JPEG,parachute,parachute
0.950174,/train/n02102040/n02102040_6313.JPEG,/train/n02102040/n02102040_3767.JPEG,English_springer,English_springer
0.949877,/train/n02102040/ILSVRC2012_val_00032959.JPEG,/val/n02102040/n02102040_662.JPEG,English_springer,English_springer
0.949252,/train/n02102040/n02102040_1306.JPEG,/train/n02102040/n02102040_3114.JPEG,English_springer,English_springer


Show the similar image pairs as a `DataFrame`.

In [14]:
fd.similarity().head(5)

Unnamed: 0,from,to,distance,filename_from,label_from,split_from,index_x,error_code_from,is_valid_from,fd_index_from,filename_to,label_to,split_to,index_y,error_code_to,is_valid_to,fd_index_to
0,11521,5390,0.968786,imagenette2-160/val/n03394916/n03394916_30631.JPEG,French_horn,val,11521,VALID,True,11521,imagenette2-160/train/n03394916/n03394916_44127.JPEG,French_horn,train,5390,VALID,True,5390
1,5390,11521,0.968786,imagenette2-160/train/n03394916/n03394916_44127.JPEG,French_horn,train,5390,VALID,True,5390,imagenette2-160/val/n03394916/n03394916_30631.JPEG,French_horn,val,11521,VALID,True,11521
2,12914,7715,0.962458,imagenette2-160/val/n03445777/n03445777_6882.JPEG,golf_ball,val,12914,VALID,True,12914,imagenette2-160/train/n03445777/n03445777_13918.JPEG,golf_ball,train,7715,VALID,True,7715
3,7715,12914,0.962458,imagenette2-160/train/n03445777/n03445777_13918.JPEG,golf_ball,train,7715,VALID,True,7715,imagenette2-160/val/n03445777/n03445777_6882.JPEG,golf_ball,val,12914,VALID,True,12914
4,1404,1117,0.953837,imagenette2-160/train/n02102040/n02102040_3837.JPEG,English_springer,train,1404,VALID,True,1404,imagenette2-160/train/n02102040/n02102040_1564.JPEG,English_springer,train,1117,VALID,True,1117
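
The table lists each pair twice, once in each direction, and pairs that span `train` and `val` are worth a closer look because near-identical images in both splits can inflate validation scores. The sketch below shows one way to collapse symmetric rows and keep only cross-split pairs. It works on plain records (for example from `fd.similarity().to_dict('records')`) with a few columns copied from the five rows above. The helper names `unique_pairs` and `cross_split` are hypothetical.

```python
def unique_pairs(rows):
    """Collapse A->B and B->A rows into one entry per unordered pair."""
    seen = set()
    unique = []
    for r in rows:
        key = frozenset((r["from"], r["to"]))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def cross_split(rows):
    """Keep pairs whose two images come from different splits."""
    return [r for r in rows if r["split_from"] != r["split_to"]]


# Selected columns from the first five rows of the similarity table.
pairs = [
    {"from": 11521, "to": 5390, "distance": 0.968786, "split_from": "val", "split_to": "train"},
    {"from": 5390, "to": 11521, "distance": 0.968786, "split_from": "train", "split_to": "val"},
    {"from": 12914, "to": 7715, "distance": 0.962458, "split_from": "val", "split_to": "train"},
    {"from": 7715, "to": 12914, "distance": 0.962458, "split_from": "train", "split_to": "val"},
    {"from": 1404, "to": 1117, "distance": 0.953837, "split_from": "train", "split_to": "train"},
]

deduped = unique_pairs(pairs)
leaks = cross_split(deduped)
print(f"{len(deduped)} unique pairs, {len(leaks)} across splits")
```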


## Image Clusters

In [15]:
fd.vis.component_gallery()

tench


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 36.72it/s]


Finished OK. Components are stored as image files fastdup_imagenette/galleries/components_[index].jpg
Stored components visual view in  fastdup_imagenette/galleries/components.html
Execution time in seconds 3.0


component,num_images,mean_distance,Label,Label count
6.0,162.0,0.9001,tench,54
850.0,70.0,0.9004,English_springer,54
7240.0,69.0,0.9001,golf_ball,54
5410.0,21.0,0.9001,garbage_truck,21
4512.0,13.0,0.9004,French_horn,13
5397.0,12.0,0.9025,garbage_truck,12
5539.0,10.0,0.9,garbage_truck,10
1139.0,8.0,0.9062,English_springer,8
5632.0,8.0,0.9041,garbage_truck,8
4494.0,8.0,0.902,French_horn,8
1239.0,6.0,0.903,English_springer,6
4531.0,6.0,0.902,French_horn,6
5678.0,6.0,0.9064,garbage_truck,6
8335.0,5.0,0.9004,parachute,5
199.0,5.0,0.9,tench,5
2174.0,5.0,0.9011,cassette_player,5
7386.0,5.0,0.9019,golf_ball,5
4616.0,5.0,0.9043,French_horn,5
8979.0,4.0,0.9013,parachute,4
4764.0,4.0,0.9032,French_horn,4


You can also visualize clusters for a specific label using the `slice` parameter. For example, let's visualize clusters with the `chain_saw` label:

In [16]:
fd.vis.component_gallery(slice='chain_saw')

chain_saw


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 250.94it/s]

Finished OK. Components are stored as image files fastdup_imagenette/galleries/components_[index].jpg
Stored components visual view in  fastdup_imagenette/galleries/components.html
Execution time in seconds 0.3





component,num_images,mean_distance,Label,Label count
2876.0,3.0,0.9064,chain_saw,3
2798.0,2.0,0.9029,chain_saw,2
2815.0,2.0,0.9208,chain_saw,2
2862.0,2.0,0.9222,chain_saw,2
2989.0,2.0,0.9139,chain_saw,2
2992.0,2.0,0.9198,chain_saw,2
3001.0,2.0,0.9073,chain_saw,2
3002.0,2.0,0.9192,chain_saw,2
3077.0,2.0,0.9355,chain_saw,2
3305.0,2.0,0.9345,chain_saw,2
10204.0,2.0,0.9039,chain_saw,2


## Connected Components

In [17]:
cc_df, _ = fd.connected_components()
cc_df.sort_values('count', ascending=False).head(5)

Unnamed: 0,index,component_id,sum,count,mean_distance,min_distance,max_distance,filename,label,split,error_code,is_valid,fd_index
235,235,6,517.2897,566.0,0.9139,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_13304.JPEG,tench,train,VALID,True,235
121,121,6,517.2897,566.0,0.9139,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_11486.JPEG,tench,train,VALID,True,121
685,685,6,517.2897,566.0,0.9139,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_6174.JPEG,tench,train,VALID,True,685
689,689,6,517.2897,566.0,0.9139,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_6249.JPEG,tench,train,VALID,True,689
706,706,6,517.2897,566.0,0.9139,0.9001,0.9534,imagenette2-160/train/n01440764/n01440764_6494.JPEG,tench,train,VALID,True,706
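
Since every row carries both a `component_id` and a `label`, the table can also be used to look for clusters that mix several labels, which may point at labeling problems. The sketch below is a minimal illustration on plain records (for example from `cc_df.to_dict('records')`). The helper names are hypothetical, and the sample rows are made up for illustration: only the pairing of component 6 with `tench` comes from the output above.

```python
from collections import Counter, defaultdict


def labels_per_component(rows):
    """Map each component_id to a Counter of the labels of its member images."""
    summary = defaultdict(Counter)
    for r in rows:
        summary[r["component_id"]][r["label"]] += 1
    return dict(summary)


def mixed_components(summary):
    """Return the ids of components that contain more than one distinct label."""
    return sorted(cid for cid, labels in summary.items() if len(labels) > 1)


# Hypothetical rows; component 42 and its labels are invented for the example.
members = [
    {"component_id": 6, "label": "tench"},
    {"component_id": 6, "label": "tench"},
    {"component_id": 42, "label": "church"},
    {"component_id": 42, "label": "gas_pump"},
]

summary = labels_per_component(members)
print(mixed_components(summary))
```

Components that come back from `mixed_components` are candidates for manual review rather than confirmed label errors.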


We can also get metadata for an individual image using its `fd_index`, which is available in `fd.annotations()`:

In [18]:
fd[349]

{'filename': 'imagenette2-160/train/n01440764/n01440764_1778.JPEG',
 'label': 'tench',
 'split': 'train',
 'index': 349,
 'error_code': 'VALID',
 'is_valid': True,
 'fd_index': 349}