[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# Quick Dataset Analysis

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb)

This notebook shows how to quickly analyze an image dataset for potential issues using fastdup. We'll take you on a high-level tour of fastdup's core functions in the shortest time possible.

By the end of this notebook you will learn how to find out if your dataset has issues such as:

+ Broken images.
+ Duplicates/near-duplicates.
+ Outliers.
+ Dark/bright/blurry images.

We'll also visualize clusters of visually similar images to give you a bird's-eye view of your dataset.

## Installation

If you're new, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) for the best experience. If you'd just like to skim through the notebook, we recommend viewing it with [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb).

Let's start with the installation:

In [1]:
!pip install fastdup -Uq

Now, test the installation by printing out the version. If there's no error message, we are ready to go!

In [2]:
import fastdup
fastdup.__version__

'1.23'

## Download Dataset

For demonstration, we will use the widely available and well-curated [Oxford IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/). Because it is well curated, we might not find many issues here, but feel free to swap in your own dataset.

The dataset consists of images and annotations for 37 pet categories, with roughly 200 images per class. For now, we are only interested in finding issues in the images and not the annotations. More on annotations in the upcoming [notebook](./analyzing-image-classification-dataset.ipynb).

Let's download only the images from the dataset and extract them into the local directory:

In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz
!tar xf images.tar.gz
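
Before running the analysis, it can help to confirm that the archive extracted as expected. The following is a minimal sketch using only the Python standard library. `count_images` is a hypothetical helper (not part of fastdup), and it assumes the archive unpacks into an `images/` folder, which is the path used in the rest of this notebook.

```python
from pathlib import Path


def count_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Count files under `folder` whose suffix matches one of `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sum(
        1
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


if __name__ == "__main__":
    folder = Path("images")
    if folder.is_dir():
        print(f"{count_images(folder)} image files found in {folder}/")
    else:
        print(f"{folder}/ not found; run the download cell first")
```

The count can then be compared with the number of images fastdup reports when it scans the folder.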

## Run fastdup

Once the extraction completes, we can run fastdup on the images.

To do so, let's create a `fastdup` object and point `input_dir` at the folder of images.

In [3]:
fd = fastdup.create(input_dir="images/")



The `.create` method also takes an optional `work_dir` parameter, which specifies the directory where artifacts from the run are stored. For example, `fastdup.create(work_dir="my_work_dir/", input_dir="images/")` stores the artifacts in `my_work_dir/`.

In [4]:
fd.run()

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-07-11 13:16:29 [INFO] Going to loop over dir images
2023-07-11 13:16:29 [INFO] Found total 7390 images to run on, 7390 train, 0 test, name list 7390, counter 7390 
2023-07-11 13:16:29 [ERROR] Failed to read image images/Abyssinian_34.jpg
2023-07-11 13:16:34 [ERROR] Failed to read image images/Egyptian_Mau_139.jpg
2023-07-11 13:16:34 [ERROR] Failed to read image images/Egyptian_Mau_145.jpg
2023-07-11 13:16:34 [ERROR] Failed to read image images/Egyptian_Mau_167.jpg
2023-07-11 13:16:34 [ERROR] Failed to read image images/Egyptian_Mau_177.jpg
2023-07-11 13:16:34 [ERROR] Failed to read image images/Egyptian_Mau_191.jpg
2023-07-11 13:16:45 [INFO] Found total 7390 images to run on
Finished histogram 1.707
Finished bucket sort 1.726
2023-07-11 13:16:45 [INFO] 138) Finished write_index() NN model
2023-07-11 13:16:45 [INFO] Stored nn model index file work_dir/nnf.index

0

## View Run Summary

In [5]:
fd.summary()


 ########################################################################################

Dataset Analysis Summary: 

    Dataset contains 7390 images
    Valid images are 99.92% (7,384) of the data, invalid are 0.08% (6) of the data
    For a detailed analysis, use `.invalid_instances()`.

    Similarity:  1.00% (74) belong to 3 similarity clusters (components).
    99.00% (7,316) images do not belong to any similarity cluster.
    Largest cluster has 12 (0.16%) images.
    For a detailed analysis, use `.connected_components()`
(similarity threshold used is 0.9, connected component threshold used is 0.96).

    Outliers: 6.14% (454) of images are possible outliers, and fall in the bottom 5.00% of similarity values.
    For a detailed list of outliers, use `.outliers()`.
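
The percentages in the summary follow directly from the reported counts. As a quick arithmetic check in plain Python (no fastdup calls; the `percent` helper is only for illustration):

```python
total = 7390

# Counts reported by fd.summary() for this run.
counts = {
    "valid": 7384,
    "invalid": 6,
    "in similarity clusters": 74,
    "not in any cluster": 7316,
    "largest cluster": 12,
    "possible outliers": 454,
}


def percent(count, total):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {label: percent(count, total) for label, count in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2f}%")
```

Note that the valid and invalid counts add up to the total, as do the clustered and unclustered counts.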



## Invalid Images
From the logs printed above, we see there are a few invalid images. These are broken images that cannot be read.

You can get a list of broken images with:

In [6]:
fd.invalid_instances()

| | filename | index | error_code | is_valid | fd_index |
|---|---|---|---|---|---|
| 0 | images/Abyssinian_34.jpg | 135 | ERROR_CORRUPT_IMAGE | False | 135 |
| 1 | images/Egyptian_Mau_139.jpg | 2240 | ERROR_CORRUPT_IMAGE | False | 2240 |
| 2 | images/Egyptian_Mau_145.jpg | 2247 | ERROR_CORRUPT_IMAGE | False | 2247 |
| 3 | images/Egyptian_Mau_167.jpg | 2268 | ERROR_CORRUPT_IMAGE | False | 2268 |
| 4 | images/Egyptian_Mau_177.jpg | 2278 | ERROR_CORRUPT_IMAGE | False | 2278 |
| 5 | images/Egyptian_Mau_191.jpg | 2293 | ERROR_CORRUPT_IMAGE | False | 2293 |
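
One way to act on this list is to move the flagged files out of the dataset folder rather than deleting them. The sketch below uses only the standard library. `quarantine` is a hypothetical helper (not part of fastdup), and it assumes the values of the `filename` column have been collected into an iterable of paths.

```python
import shutil
from pathlib import Path


def quarantine(filenames, dest_dir):
    """Move each existing file in `filenames` into `dest_dir`; return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in filenames:
        src = Path(name)
        if src.is_file():  # skip entries that are missing or already moved
            target = dest / src.name
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved
```

Assuming `fd.invalid_instances()` returns a pandas DataFrame, as the output above suggests, a call such as `quarantine(fd.invalid_instances()["filename"], "broken_images")` would move the six corrupt files into a `broken_images/` folder (the folder name is arbitrary).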


## Duplicate/Near-duplicates

One of the lowest-hanging fruits in cleaning a dataset is finding and eliminating duplicates.

fastdup provides a handy way of visualizing duplicates/near-duplicates using the `duplicates_gallery` method. The `Distance` value indicates how visually similar the two images in each pair are: `1.0` indicates an exact copy, and lower values indicate less similar pairs.

In [7]:
fd.vis.duplicates_gallery()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 197.91it/s]


Stored similarity visual view in  work_dir/galleries/duplicates.html


| Distance | From | To |
|---|---|---|
| 1.0 | /Bombay_190.jpg | /Bombay_185.jpg |
| 1.0 | /newfoundland_154.jpg | /newfoundland_138.jpg |
| 1.0 | /Bombay_201.jpg | /Bombay_92.jpg |
| 1.0 | /Bombay_32.jpg | /Bombay_194.jpg |
| 1.0 | /Bombay_202.jpg | /Bombay_99.jpg |
| 1.0 | /Egyptian_Mau_210.jpg | /Egyptian_Mau_41.jpg |
| 1.0 | /Bombay_189.jpg | /Bombay_164.jpg |
| 1.0 | /english_cocker_spaniel_176.jpg | /english_cocker_spaniel_179.jpg |
| 1.0 | /Bombay_198.jpg | /Bombay_69.jpg |
| 1.0 | /english_cocker_spaniel_163.jpg | /english_cocker_spaniel_152.jpg |
| 1.0 | /english_cocker_spaniel_151.jpg | /english_cocker_spaniel_162.jpg |
| 1.0 | /english_cocker_spaniel_164.jpg | /english_cocker_spaniel_154.jpg |
| 1.0 | /Bombay_109.jpg | /Bombay_206.jpg |
| 1.0 | /Bombay_126.jpg | /Bombay_220.jpg |


0
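
A `Distance` of `1.0` means the two images look the same to fastdup, but the files themselves can still differ in bytes (for example after re-encoding). To check whether a flagged pair is byte-for-byte identical, you can compare file hashes. This is an independent check with Python's standard `hashlib`, not something fastdup does for you, and `file_digest` and `byte_identical` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(path_a, path_b):
    """True when both files contain exactly the same bytes."""
    a, b = Path(path_a), Path(path_b)
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes can never be identical
    return file_digest(a) == file_digest(b)
```

For example, `byte_identical("images/Bombay_190.jpg", "images/Bombay_185.jpg")` would test the first pair in the gallery, assuming the gallery paths are relative to the `images/` folder.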

## Outliers

Similar to duplicate pairs, you can visualize potential outliers in your dataset with:

In [8]:
fd.vis.outliers_gallery() 

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 24556.81it/s]


Stored outliers visual view in  work_dir/galleries/outliers.html


| Distance | Path |
|---|---|
| 0.597075 | /Bengal_105.jpg |
| 0.624279 | /beagle_142.jpg |
| 0.629087 | /staffordshire_bull_terrier_51.jpg |
| 0.629917 | /american_pit_bull_terrier_72.jpg |
| 0.633318 | /german_shorthaired_173.jpg |
| 0.633533 | /miniature_pinscher_76.jpg |
| 0.634925 | /Bengal_131.jpg |
| 0.636669 | /staffordshire_bull_terrier_76.jpg |
| 0.639585 | /chihuahua_6.jpg |
| 0.642 | /basset_hound_197.jpg |
| 0.643355 | /boxer_149.jpg |
| 0.643534 | /beagle_147.jpg |
| 0.64548 | /Bombay_188.jpg |
| 0.645831 | /Bombay_204.jpg |
| 0.653168 | /Bombay_36.jpg |
| 0.6535 | /Abyssinian_226.jpg |
| 0.654307 | /miniature_pinscher_191.jpg |
| 0.660908 | /chihuahua_164.jpg |
| 0.661223 | /german_shorthaired_121.jpg |
| 0.668792 | /Maine_Coon_194.jpg |


0
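
The gallery lists images from the lowest `Distance` upward. If you want to shortlist only the most unusual images for manual review, one simple approach is to apply your own cutoff. This is a minimal sketch in plain Python: it assumes a lower `Distance` marks a more unusual image, the `below_cutoff` helper is hypothetical, and the `0.63` cutoff is chosen purely for illustration.

```python
def below_cutoff(rows, cutoff):
    """Return the (path, distance) rows strictly below `cutoff`, lowest first."""
    return sorted((row for row in rows if row[1] < cutoff), key=lambda row: row[1])


# (path, distance) pairs copied from the first rows of the gallery above.
rows = [
    ("/Bengal_105.jpg", 0.597075),
    ("/beagle_142.jpg", 0.624279),
    ("/staffordshire_bull_terrier_51.jpg", 0.629087),
    ("/american_pit_bull_terrier_72.jpg", 0.629917),
    ("/german_shorthaired_173.jpg", 0.633318),
    ("/miniature_pinscher_76.jpg", 0.633533),
]

flagged = below_cutoff(rows, cutoff=0.63)
for path, distance in flagged:
    print(f"{distance:.6f}  {path}")
```

A suitable cutoff depends on the dataset, so it is worth looking at the images on both sides of it before removing anything.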

## Dark, Bright and Blurry Images

fastdup also lets you visualize images from your dataset using statistical metrics.

For example, with `metric='dark'` we can visualize the darkest images from the dataset.

In [9]:
fd.vis.stats_gallery(metric='dark')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 288.94it/s]


Stored mean visual view in  work_dir/galleries/mean.html


| mean | filename |
|---|---|
| 15.7118 | images/Abyssinian_4.jpg |
| 18.7883 | images/Abyssinian_114.jpg |
| 19.5741 | images/Abyssinian_18.jpg |
| 19.8396 | images/Bombay_191.jpg |
| 26.7209 | images/Bombay_108.jpg |
| 27.4072 | images/Abyssinian_62.jpg |
| 28.5051 | images/scottish_terrier_171.jpg |
| 29.4029 | images/Sphynx_119.jpg |
| 29.9286 | images/Maine_Coon_134.jpg |
| 31.4749 | images/shiba_inu_137.jpg |
| 31.599 | images/chihuahua_78.jpg |
| 32.7848 | images/shiba_inu_27.jpg |
| 33.2283 | images/Egyptian_Mau_59.jpg |
| 33.7525 | images/japanese_chin_175.jpg |
| 33.7692 | images/beagle_180.jpg |
| 33.9768 | images/Abyssinian_30.jpg |
| 34.0113 | images/american_bulldog_150.jpg |
| 34.3895 | images/Abyssinian_46.jpg |
| 34.8092 | images/Sphynx_46.jpg |
| 35.634 | images/japanese_chin_40.jpg |


0

In [10]:
fd.vis.stats_gallery(metric='bright')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 333.70it/s]


Stored mean visual view in  work_dir/galleries/mean.html


| mean | filename |
|---|---|
| 235.6992 | images/saint_bernard_183.jpg |
| 234.3785 | images/saint_bernard_188.jpg |
| 233.4722 | images/Egyptian_Mau_99.jpg |
| 232.2554 | images/saint_bernard_186.jpg |
| 230.1848 | images/Abyssinian_127.jpg |
| 226.9057 | images/saint_bernard_187.jpg |
| 226.3688 | images/British_Shorthair_274.jpg |
| 223.6878 | images/Egyptian_Mau_1.jpg |
| 223.2687 | images/great_pyrenees_88.jpg |
| 220.246 | images/Bengal_20.jpg |
| 218.5597 | images/pug_76.jpg |
| 217.9169 | images/Egyptian_Mau_39.jpg |
| 216.7688 | images/Maine_Coon_267.jpg |
| 214.4495 | images/staffordshire_bull_terrier_25.jpg |
| 213.1254 | images/Birman_136.jpg |
| 212.3259 | images/basset_hound_24.jpg |
| 211.3064 | images/boxer_172.jpg |
| 211.2815 | images/saint_bernard_14.jpg |
| 211.1101 | images/pug_96.jpg |
| 210.7337 | images/Egyptian_Mau_45.jpg |


0
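
Both the dark and bright galleries rank images by the `mean` column. To build intuition for that number, here is a small sketch that assumes `mean` is the average pixel intensity on a 0-255 scale (an assumption based on the value ranges above, not a statement about fastdup's internals). The helper names and the thresholds of `40` and `210` are hypothetical, picked loosely from the values shown in the two galleries.

```python
from statistics import mean


def mean_intensity(pixels):
    """Average of all values in a 2-D grid of grayscale pixels (0-255)."""
    return mean(value for row in pixels for value in row)


def brightness_label(value, dark_below=40.0, bright_above=210.0):
    """Label a mean intensity using illustrative (not fastdup) thresholds."""
    if value < dark_below:
        return "dark"
    if value > bright_above:
        return "bright"
    return "normal"


# Tiny made-up grayscale grids standing in for real images.
night = [[10, 20], [15, 18]]            # mostly low values
overexposed = [[250, 240], [235, 245]]  # mostly high values

for name, grid in [("night", night), ("overexposed", overexposed)]:
    value = mean_intensity(grid)
    print(name, value, brightness_label(value))
```

In practice the right thresholds depend on the dataset: a collection of night-time photos will naturally have low means without being faulty.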

In [11]:
fd.vis.stats_gallery(metric='blur')

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 805.61it/s]


Stored blur visual view in  work_dir/galleries/blur.html


| blur | filename |
|---|---|
| 65.1586 | images/Persian_228.jpg |
| 68.6347 | images/Ragdoll_254.jpg |
| 71.8926 | images/pomeranian_170.jpg |
| 76.9661 | images/pomeranian_183.jpg |
| 77.3129 | images/pug_166.jpg |
| 77.8375 | images/Ragdoll_255.jpg |
| 79.21 | images/yorkshire_terrier_123.jpg |
| 83.2725 | images/pomeranian_166.jpg |
| 88.556 | images/pomeranian_123.jpg |
| 91.0464 | images/chihuahua_124.jpg |
| 93.68 | images/chihuahua_161.jpg |
| 96.0024 | images/pomeranian_117.jpg |
| 99.3509 | images/pomeranian_176.jpg |
| 104.3721 | images/chihuahua_187.jpg |
| 105.5227 | images/Siamese_250.jpg |
| 108.3876 | images/Persian_260.jpg |
| 111.6988 | images/pomeranian_173.jpg |
| 115.5061 | images/Bombay_85.jpg |
| 115.5061 | images/Bombay_200.jpg |
| 115.8939 | images/pomeranian_172.jpg |


0
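
The blur gallery lists the lowest `blur` values first, which suggests that lower scores mean blurrier images. A common sharpness measure that behaves this way is the variance of the Laplacian: sharp images have strong edges and a high variance, while blurry ones have a low variance. Whether fastdup computes its `blur` column exactly like this is an assumption here. The sketch below is a plain-Python illustration of the technique on tiny made-up grids, with a hypothetical `laplacian_variance` helper.

```python
from statistics import pvariance


def laplacian_variance(pixels):
    """Variance of the 4-neighbour Laplacian over the interior of a 2-D grid.

    `pixels` must be at least 3x3 so that it has interior points.
    """
    height, width = len(pixels), len(pixels[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                pixels[y - 1][x] + pixels[y + 1][x]
                + pixels[y][x - 1] + pixels[y][x + 1]
                - 4 * pixels[y][x]
            )
    return pvariance(responses)


# A uniform patch has no edges; a checkerboard has an edge at every pixel.
flat = [[100] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]

print("flat:", laplacian_variance(flat))
print("checker:", laplacian_variance(checker))
```

The uniform patch scores zero and the checkerboard scores far higher, mirroring how blurry and sharp images would rank.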

## Image Clusters

One of fastdup's coolest features is visualizing image clusters. In the previous sections we saw how to visualize similar image pairs. In this section we group similar-looking images (or even duplicates) into clusters and visualize them in a gallery.

> **Note**: fastdup uses default parameter values when creating image clusters. Depending on your data and use case, the best values may vary. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup) on how to change parameter values to cluster images.

To view the clusters, simply run:

In [12]:
fd.vis.component_gallery()

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 113.13it/s]


Finished OK. Components are stored as image files work_dir/galleries/components_[index].jpg
Stored components visual view in  work_dir/galleries/components.html
Execution time in seconds 0.7


| component | num_images | mean_distance |
|---|---|---|
| 1397.0 | 3.0 | 1.0 |
| 1404.0 | 3.0 | 0.9658 |
| 21.0 | 2.0 | 0.9681 |
| 3437.0 | 2.0 | 0.9999 |
| 3551.0 | 2.0 | 0.9998 |
| 3550.0 | 2.0 | 1.0 |
| 3548.0 | 2.0 | 1.0 |
| 3457.0 | 2.0 | 0.9999 |
| 3450.0 | 2.0 | 1.0 |
| 3449.0 | 2.0 | 1.0 |
| 3398.0 | 2.0 | 1.0 |
| 3417.0 | 2.0 | 1.0 |
| 3592.0 | 2.0 | 1.0 |
| 3394.0 | 2.0 | 1.0 |
| 3036.0 | 2.0 | 0.9999 |
| 2959.0 | 2.0 | 0.9604 |
| 2847.0 | 2.0 | 1.0 |
| 2845.0 | 2.0 | 1.0 |
| 3591.0 | 2.0 | 0.9997 |
| 3626.0 | 2.0 | 1.0 |


0
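
The run summary refers to these clusters as connected components and reports the threshold used (`0.96`). The general idea can be sketched as follows: treat each image as a node, link two images when their similarity reaches the threshold, and call every linked group a component. The code below illustrates that idea with a small union-find structure. It is not fastdup's implementation, and the file names and similarity scores are made up.

```python
def connected_components(pairs, threshold):
    """Group items linked by pairs whose similarity is at least `threshold`.

    `pairs` is an iterable of (item_a, item_b, similarity) tuples.
    Returns groups of two or more items, largest group first.
    """
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b, similarity in pairs:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(
        (sorted(group) for group in groups.values() if len(group) > 1),
        key=lambda group: (-len(group), group),
    )


# Hypothetical similarity pairs; only links at or above 0.96 are kept.
pairs = [
    ("a.jpg", "b.jpg", 0.99),
    ("b.jpg", "c.jpg", 0.97),
    ("d.jpg", "e.jpg", 0.98),
    ("e.jpg", "f.jpg", 0.90),
]
components = connected_components(pairs, threshold=0.96)
for group in components:
    print(len(group), group)
```

Note that `a.jpg` and `c.jpg` end up in the same component even though they were never compared directly, because both are linked to `b.jpg`. Lowering the threshold merges more images into fewer, larger components.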

## Wrap Up

That's a wrap! In this notebook we showed how you can run fastdup on a dataset or any folder of images. 

We've seen how to use fastdup to find:

+ Broken images.
+ Duplicates/near-duplicates.
+ Outliers.
+ Dark, bright and blurry images.
+ Image clusters.

Next, feel free to check out our other tutorials:

+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 



## VL Profiler
If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.

[Sign up](https://app.visual-layer.com) now, it's free.

[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)

As usual, feedback is welcome! 

Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).