# Quick Dataset Analysis
This notebook shows how to quickly analyze an image dataset for potential issues using fastdup. It gives a high-level tour of fastdup's core functions in as little time as possible.

## Installation & Setting Up

This notebook is written to be run on [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb). If you're running fastdup locally, view the installation instructions for your operating system [here](https://visual-layer.readme.io/docs/installation).

In [None]:
!pip install pip -U
!pip install fastdup matplotlib

## Download Oxford Pets Dataset

For demonstration, we will use the Oxford Pets dataset, which is widely available and well curated, so we might not find many issues in it. Feel free to swap in your own dataset.

In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz
!tar xf images.tar.gz
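
Before running fastdup, it can help to confirm that the archive extracted as expected. The sketch below counts files per extension in the `images/` folder using only the standard library. The `count_images` helper and the extension set are illustrative assumptions and are not part of fastdup.

```python
from collections import Counter
from pathlib import Path

# Extensions treated as images here; this set is an assumption, not fastdup's own list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def count_images(folder):
    """Count files per image extension directly inside `folder` (non-recursive)."""
    counts = Counter()
    for path in Path(folder).iterdir():
        suffix = path.suffix.lower()
        if path.is_file() and suffix in IMAGE_EXTENSIONS:
            counts[suffix] += 1
    return counts


# Only runs when the extracted folder is present in the working directory.
if Path("images").is_dir():
    print(count_images("images"))
```

The total can later be compared with the image count that fastdup reports in its log.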

## Import and Run fastdup

In [1]:
import fastdup
fastdup.__version__

'0.903'

Let's start by creating a `Fastdup` object with two arguments:

+ `work_dir` - path where fastdup stores the artifacts from the run.

+ `input_dir` - path to your images folder.

In [None]:
fd = fastdup.create(work_dir="fastdup_work_dir/", input_dir="images/")

In [2]:
fd.run()

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-03-15 18:49:07 [INFO] Going to loop over dir images
2023-03-15 18:49:07 [INFO] Found total 7390 images to run on
2023-03-15 18:49:07 [ERROR] Failed to read image images/Abyssinian_34.jpg
2023-03-15 18:49:13 [ERROR] Failed to read image images/Egyptian_Mau_139.jpg
2023-03-15 18:49:13 [ERROR] Failed to read image images/Egyptian_Mau_145.jpg
2023-03-15 18:49:13 [ERROR] Failed to read image images/Egyptian_Mau_167.jpg
2023-03-15 18:49:13 [ERROR] Failed to read image images/Egyptian_Mau_177.jpg
2023-03-15 18:49:13 [ERROR] Failed to read image images/Egyptian_Mau_191.jpg
2023-03-15 18:49:27 [INFO] Found total 7390 images to run on
2023-03-15 18:49:28 [INFO] 1039) Finished write_index() NN model
2023-03-15 18:49:28 [INFO] Stored nn model index file fastdup_work_dir/nnf.index
2023-03-15 18:49:29 [INFO] Total time took 21607 ms
2023-03-15

## View Run Summary

In [3]:
fd.summary()


 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 7390 images
    Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data
    For a detailed analysis, use `.invalid_instances()`.

    Similarity:  1.00% (74) belong to 3 similarity clusters (components).
    99.00% (7,316) images do not belong to any similarity cluster.
    Largest cluster has 6 (0.08%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).

    Outliers: 6.13% (453) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.


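
The percentages in the summary follow directly from the counts it reports. As a quick arithmetic check, the sketch below recomputes them from the numbers printed above; the `share` helper is a hypothetical name used only for illustration.

```python
def share(count, total):
    """Return `count` as a percentage of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


total_images = 7390  # "Dataset contains 7390 images"

print(share(7384, total_images))  # valid images
print(share(6, total_images))     # invalid images
print(share(74, total_images))    # images in similarity clusters
print(share(453, total_images))   # possible outliers
```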

## Invalid Images

Get a list of the images that fastdup could not read. In this run, all six are zero-size files.

In [4]:
fd.invalid_instances()

| | filename | fastdup_id | error_code | is_valid |
|---|---|---|---|---|
| 0 | Abyssinian_34.jpg | 135 | ERROR_ZERO_SIZE_FILE | False |
| 1 | Egyptian_Mau_139.jpg | 2240 | ERROR_ZERO_SIZE_FILE | False |
| 2 | Egyptian_Mau_145.jpg | 2247 | ERROR_ZERO_SIZE_FILE | False |
| 3 | Egyptian_Mau_167.jpg | 2268 | ERROR_ZERO_SIZE_FILE | False |
| 4 | Egyptian_Mau_177.jpg | 2278 | ERROR_ZERO_SIZE_FILE | False |
| 5 | Egyptian_Mau_191.jpg | 2293 | ERROR_ZERO_SIZE_FILE | False |
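
Every invalid file above carries the `ERROR_ZERO_SIZE_FILE` code, which can be cross-checked without fastdup. The following is a minimal sketch using only the standard library; `find_zero_size_files` is a hypothetical helper, and it assumes the images sit directly inside the `images/` folder.

```python
from pathlib import Path


def find_zero_size_files(folder):
    """Return sorted names of files directly inside `folder` that contain 0 bytes."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.stat().st_size == 0
    )


# Only runs when the extracted folder is present in the working directory.
if Path("images").is_dir():
    print(find_zero_size_files("images"))
```

If the list matches the filenames reported by `fd.invalid_instances()`, the files were most likely empty in the archive rather than damaged during extraction.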


## Duplicate Image Pairs

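Duplicate image pairs are ranked by the cosine similarity of each pair's embeddings. The gallery reports this score in the `Distance` field, where 1.0 indicates an identical or near-identical pair. View the docs [here](https://visual-layer.readme.io/docs/v1-api#duplicates_gallery).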

In [5]:
fd.vis.duplicates_gallery()



Stored similarity visual view in  fastdup_work_dir/galleries/duplicates.html


| Distance | From | To |
|---|---|---|
| 1.0 | Bombay_109.jpg | Bombay_206.jpg |
| 1.0 | Bombay_11.jpg | Bombay_192.jpg |
| 1.0 | Egyptian_Mau_131.jpg | Egyptian_Mau_202.jpg |
| 1.0 | Bombay_126.jpg | Bombay_220.jpg |
| 1.0 | boxer_114.jpg | boxer_82.jpg |
| 1.0 | Bombay_201.jpg | Bombay_92.jpg |
| 1.0 | newfoundland_147.jpg | newfoundland_152.jpg |
| 1.0 | english_cocker_spaniel_151.jpg | english_cocker_spaniel_162.jpg |
| 1.0 | Bombay_193.jpg | Bombay_22.jpg |
| 1.0 | english_cocker_spaniel_154.jpg | english_cocker_spaniel_164.jpg |
| 1.0 | Egyptian_Mau_10.jpg | Egyptian_Mau_183.jpg |
| 1.0 | keeshond_54.jpg | keeshond_59.jpg |
| 1.0 | Egyptian_Mau_210.jpg | Egyptian_Mau_41.jpg |
| 1.0 | Bombay_100.jpg | Bombay_192.jpg |
| 1.0 | Bombay_164.jpg | Bombay_189.jpg |
| 1.0 | english_cocker_spaniel_152.jpg | english_cocker_spaniel_163.jpg |
| 1.0 | Bombay_200.jpg | Bombay_85.jpg |
| 1.0 | Bombay_102.jpg | Bombay_203.jpg |
| 1.0 | english_cocker_spaniel_176.jpg | english_cocker_spaniel_179.jpg |
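
Every pair above reports a `Distance` of 1.0. Assuming this score behaves like cosine similarity (as the 0.9 similarity threshold in the run summary suggests), a value of 1.0 means the two feature vectors point in the same direction. The sketch below illustrates that with toy vectors standing in for image embeddings; it is not fastdup's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, not real embeddings).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, two vectors that differ by a constant scale factor still score (approximately) 1.0.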


## Outliers

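Outliers are images whose similarity to the other images in the dataset is unusually low; the run summary above flags images that fall in the bottom 5% of similarity values. The gallery lists the lowest-scoring images first. View the docs [here](https://visual-layer.readme.io/docs/v1-api#outliers_gallery).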

In [6]:
fd.vis.outliers_gallery()



Stored outliers visual view in  fastdup_work_dir/galleries/outliers.html


| Distance | Path |
|---|---|
| 0.59692 | Bengal_105.jpg |
| 0.611524 | Bengal_131.jpg |
| 0.617132 | staffordshire_bull_terrier_51.jpg |
| 0.621796 | miniature_pinscher_76.jpg |
| 0.622757 | Sphynx_128.jpg |
| 0.62428 | beagle_142.jpg |
| 0.627605 | american_pit_bull_terrier_72.jpg |
| 0.630928 | german_shorthaired_173.jpg |
| 0.635179 | Bombay_36.jpg |
| 0.636152 | chihuahua_6.jpg |
| 0.636669 | staffordshire_bull_terrier_76.jpg |
| 0.641191 | basset_hound_197.jpg |
| 0.642425 | Bombay_204.jpg |
| 0.642967 | Bengal_30.jpg |
| 0.643354 | boxer_149.jpg |
| 0.643533 | beagle_147.jpg |
| 0.644183 | german_shorthaired_121.jpg |
| 0.64548 | Bombay_188.jpg |
| 0.646996 | chihuahua_164.jpg |
| 0.653168 | Abyssinian_226.jpg |
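
The idea of flagging the lowest-scoring fraction of images can be sketched in a few lines. The example below is an illustration only: `lowest_scoring` is a hypothetical helper, the sample uses four `Distance` values copied from the output above, and fastdup's own selection logic may differ.

```python
import math


def lowest_scoring(scores, fraction=0.05):
    """Return the `fraction` of (name, score) items with the lowest score, lowest first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = math.ceil(len(scores) * fraction)
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return ranked[:keep]


# Four Distance values copied from the outliers gallery output above.
sample = {
    "Bengal_105.jpg": 0.59692,
    "chihuahua_6.jpg": 0.636152,
    "boxer_149.jpg": 0.643354,
    "Abyssinian_226.jpg": 0.653168,
}
print(lowest_scoring(sample, fraction=0.25))
```

With only four entries, a fraction of 0.25 keeps the single lowest-scoring file; on the full dataset the default of 0.05 would mirror the "bottom 5%" wording of the summary.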


## Dark, Bright and Blurry Images

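You can also visualize the dataset sorted by a specific metric. The `dark` and `bright` galleries below rank images by their `mean` value (lowest first and highest first, respectively), and the `blur` gallery ranks them by their `blur` value, lowest first. View the docs [here](https://visual-layer.readme.io/docs/v1-api#stats_gallery).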

In [7]:
fd.vis.stats_gallery(metric='dark')



Stored mean visual view in  fastdup_work_dir/galleries/mean.html


| mean | filename |
|---|---|
| 19.567 | images/Abyssinian_4.jpg |
| 22.0709 | images/Bombay_33.jpg |
| 25.2039 | images/Bombay_108.jpg |
| 25.5381 | images/Bombay_191.jpg |
| 26.5806 | images/Abyssinian_114.jpg |
| 28.0547 | images/Abyssinian_18.jpg |
| 28.2537 | images/Maine_Coon_8.jpg |
| 28.6222 | images/scottish_terrier_171.jpg |
| 30.6038 | images/boxer_189.jpg |
| 31.0021 | images/Egyptian_Mau_119.jpg |
| 31.8424 | images/shiba_inu_33.jpg |
| 32.1091 | images/Egyptian_Mau_46.jpg |
| 32.1753 | images/Russian_Blue_13.jpg |
| 33.3259 | images/Sphynx_93.jpg |
| 33.7525 | images/japanese_chin_175.jpg |
| 33.889 | images/Egyptian_Mau_186.jpg |
| 34.379 | images/shiba_inu_137.jpg |
| 34.5139 | images/Egyptian_Mau_6.jpg |
| 35.7243 | images/chihuahua_78.jpg |
| 36.3198 | images/Egyptian_Mau_59.jpg |
| 36.6248 | images/Sphynx_46.jpg |
| 36.9849 | images/american_bulldog_150.jpg |
| 37.3306 | images/japanese_chin_40.jpg |
| 37.5096 | images/Abyssinian_62.jpg |
| 37.5354 | images/Sphynx_119.jpg |


In [8]:
fd.vis.stats_gallery(metric='bright')



Stored mean visual view in  fastdup_work_dir/galleries/mean.html


| mean | filename |
|---|---|
| 242.6047 | images/saint_bernard_183.jpg |
| 239.4395 | images/saint_bernard_188.jpg |
| 238.5204 | images/saint_bernard_186.jpg |
| 237.767 | images/boxer_162.jpg |
| 235.5402 | images/Egyptian_Mau_99.jpg |
| 234.968 | images/Abyssinian_127.jpg |
| 232.9795 | images/saint_bernard_187.jpg |
| 231.1052 | images/British_Shorthair_274.jpg |
| 230.8341 | images/staffordshire_bull_terrier_25.jpg |
| 228.7601 | images/great_pyrenees_88.jpg |
| 228.4892 | images/Egyptian_Mau_110.jpg |
| 225.4876 | images/Egyptian_Mau_1.jpg |
| 224.8601 | images/Maine_Coon_267.jpg |
| 224.1675 | images/Bombay_182.jpg |
| 220.5146 | images/Egyptian_Mau_39.jpg |
| 219.7123 | images/pug_76.jpg |
| 219.2608 | images/Abyssinian_66.jpg |
| 218.1195 | images/Birman_136.jpg |
| 217.4757 | images/chihuahua_97.jpg |
| 217.39 | images/Maine_Coon_239.jpg |
| 217.1701 | images/Birman_61.jpg |
| 217.0271 | images/pug_96.jpg |
| 216.7833 | images/saint_bernard_14.jpg |
| 216.6674 | images/saint_bernard_189.jpg |
| 216.2317 | images/basset_hound_24.jpg |


In [9]:
fd.vis.stats_gallery(metric='blur')


Stored blur visual view in  fastdup_work_dir/galleries/blur.html





| blur | filename |
|---|---|
| 63.8531 | images/Ragdoll_255.jpg |
| 64.6984 | images/Ragdoll_254.jpg |
| 69.4447 | images/pomeranian_170.jpg |
| 72.8116 | images/pomeranian_183.jpg |
| 73.0642 | images/pug_166.jpg |
| 74.5024 | images/pomeranian_166.jpg |
| 78.083 | images/yorkshire_terrier_123.jpg |
| 83.1843 | images/Persian_228.jpg |
| 85.962 | images/chihuahua_124.jpg |
| 89.3777 | images/pomeranian_123.jpg |
| 92.4174 | images/chihuahua_161.jpg |
| 96.3646 | images/pomeranian_117.jpg |
| 99.7468 | images/pomeranian_176.jpg |
| 105.4029 | images/chihuahua_187.jpg |
| 106.3722 | images/Siamese_250.jpg |
| 106.7894 | images/pomeranian_173.jpg |
| 107.9866 | images/Persian_260.jpg |
| 114.2109 | images/pomeranian_172.jpg |
| 117.353 | images/samoyed_159.jpg |
| 118.0553 | images/Bombay_200.jpg |
| 118.0553 | images/Bombay_85.jpg |
| 122.016 | images/chihuahua_180.jpg |
| 122.1545 | images/Russian_Blue_222.jpg |
| 128.0107 | images/basset_hound_151.jpg |
| 128.2121 | images/miniature_pinscher_155.jpg |
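
The galleries above are sorted views over one per-image statistic. The sketch below shows the same idea on three `mean` values copied from the dark and bright outputs. It assumes that `mean` is the average pixel intensity on a 0-255 scale, which the source does not state explicitly; `mean_intensity` and `sort_by_metric` are hypothetical helpers, not fastdup functions.

```python
def mean_intensity(pixels):
    """Average value of a grayscale image given as rows of 0-255 integers."""
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


def sort_by_metric(rows, metric, descending=False):
    """Return `rows` (a list of dicts) ordered by the given metric column."""
    return sorted(rows, key=lambda row: row[metric], reverse=descending)


# Values copied from the dark and bright gallery outputs above.
rows = [
    {"filename": "images/Abyssinian_4.jpg", "mean": 19.567},
    {"filename": "images/saint_bernard_183.jpg", "mean": 242.6047},
    {"filename": "images/Bombay_33.jpg", "mean": 22.0709},
]

darkest_first = sort_by_metric(rows, "mean")
brightest_first = sort_by_metric(rows, "mean", descending=True)
print(darkest_first[0]["filename"])
print(brightest_first[0]["filename"])

# A toy 2x2 image: half black, half white.
print(mean_intensity([[0, 0], [255, 255]]))
```

Sorting ascending surfaces the darkest images first, while sorting descending surfaces the brightest, matching the order of the two `mean` galleries.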


## Image Clusters

Visualize similar-looking images as clusters (connected components). View the docs [here](https://visual-layer.readme.io/docs/v1-api#component_gallery).

In [10]:
fd.vis.component_gallery()



Finished OK. Components are stored as image files fastdup_work_dir/galleries/components_[index].jpg
Stored components visual view in  fastdup_work_dir/galleries/components.html
Execution time in seconds 0.8


| component | num_images | mean_distance |
|---|---|---|
| 1397.0 | 3.0 | 1.0 |
| 1404.0 | 3.0 | 0.9658 |
| 21.0 | 2.0 | 0.9681 |
| 3437.0 | 2.0 | 0.9999 |
| 3551.0 | 2.0 | 0.9998 |
| 3550.0 | 2.0 | 1.0 |
| 3548.0 | 2.0 | 1.0 |
| 3457.0 | 2.0 | 0.9999 |
| 3450.0 | 2.0 | 1.0 |
| 3449.0 | 2.0 | 1.0 |
| 3398.0 | 2.0 | 1.0 |
| 3417.0 | 2.0 | 1.0 |
| 3592.0 | 2.0 | 1.0 |
| 3394.0 | 2.0 | 1.0 |
| 3036.0 | 2.0 | 0.9999 |
| 2959.0 | 2.0 | 0.9604 |
| 2847.0 | 2.0 | 1.0 |
| 2845.0 | 2.0 | 1.0 |
| 3591.0 | 2.0 | 0.9997 |
| 3626.0 | 2.0 | 1.0 |
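
To see how pairwise scores turn into clusters, the sketch below groups images with a generic union-find: any two images linked by a pair scoring at least the threshold end up in the same component. This is a textbook approach shown for illustration and is not necessarily how fastdup computes its components. The 0.96 default mirrors the connected component threshold printed in the run summary, the first three pairs are copied from the duplicates gallery, and the last pair is hypothetical.

```python
def connected_components(pairs, threshold=0.96):
    """Group items linked by (a, b, score) pairs whose score is at least `threshold`."""
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b, score in pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    # Largest components first, names sorted inside each component.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


pairs = [
    ("Bombay_11.jpg", "Bombay_192.jpg", 1.0),
    ("Bombay_100.jpg", "Bombay_192.jpg", 1.0),
    ("Bombay_200.jpg", "Bombay_85.jpg", 1.0),
    ("example_a.jpg", "example_b.jpg", 0.90),  # hypothetical pair below the threshold
]
components = connected_components(pairs)
print(components)
```

`Bombay_11.jpg` and `Bombay_100.jpg` never appear together in a pair, yet they land in the same component because both are linked to `Bombay_192.jpg`. The hypothetical pair scores below the threshold and is left out.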
