[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)

# Image Captioning & Visual Question Answering (VQA) With fastdup

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/caption_generation.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/caption_generation.ipynb)


This notebook shows how you can use [fastdup](https://github.com/visual-layer/fastdup) to generate image captions. Caption generation has many use cases, including zero-shot classification and accessibility features.
The notebook also covers visual question answering (VQA), which supports applications such as image retrieval.

The captioning and VQA models used in this example can generally run on a CPU; no GPU is required. The smallest model takes about 0.5s per caption on a CPU, so captioning 100,000 images takes roughly 14 hours.
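As a rough sanity check on that throughput figure, the arithmetic can be sketched as below. The helper name `estimate_caption_hours` is made up for this illustration; the 0.5 s value is the per-image time quoted above, and real throughput will depend on your hardware.

```python
def estimate_caption_hours(num_images: int, seconds_per_image: float) -> float:
    """Return the estimated wall-clock hours to caption num_images one after another."""
    return num_images * seconds_per_image / 3600


# 100,000 images at roughly 0.5 s each (the smallest model on a CPU).
hours = estimate_caption_hours(100_000, 0.5)
print(f"{hours:.1f} hours")
```

The same function can be reused with the slower per-image times quoted later for the larger models to decide whether a CPU-only run is practical.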

# Install fastdup

First, install fastdup and verify the installation.

In [1]:
!pip install fastdup -Uq

Now, test the installation. If there's no error message, we are ready to go.

In [1]:
import fastdup
fastdup.__version__

'1.39'

# Load Dataset

In this example we will be using the [COCO Minitrain Dataset](https://github.com/giddyyupp/coco-minitrain), which is a curated mini training set of about 25,000 images (20% of the original COCO dataset).
We will download the dataset into our local drive.

In [None]:
!pip install gdown

In [3]:
# Download coco minitrain dataset.
!gdown --fuzzy https://drive.google.com/file/d/1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK/view
!unzip -qq coco_minitrain_25k.zip

# Download csv annotations
!cd coco_minitrain_25k/annotations && gdown --fuzzy https://drive.google.com/file/d/1i12p23cXlqp1QrXjAD_vu467r4q67Mq9/view

Downloading...
From (uriginal): https://drive.google.com/uc?id=1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK
From (redirected): https://drive.google.com/uc?id=1iSXVTlkV1_DhdYpVDqsjlT4NJFQ7OkyK&confirm=t&uuid=8ace7c5d-ec8e-4bba-a8d3-2cb89d555188
To: /Users/guysinger/Desktop/fastdup/examples/coco_minitrain_25k.zip
100%|██████████████████████████████████████| 4.90G/4.90G [03:26<00:00, 23.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1i12p23cXlqp1QrXjAD_vu467r4q67Mq9
To: /Users/guysinger/Desktop/fastdup/examples/coco_minitrain_25k/annotations/coco_minitrain2017.csv
100%|██████████████████████████████████████| 9.43M/9.43M [00:00<00:00, 12.1MB/s]
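Before running fastdup, it can be worth confirming that the archive unpacked as expected. The sketch below counts image files under the dataset folder; `count_images` and the set of extensions are illustrative choices, not part of fastdup, and the folder name is the one produced by the unzip step above.

```python
from pathlib import Path

# Extensions treated as images; an assumption made for this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root: str) -> int:
    """Count files with an image extension anywhere under root."""
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    # Folder name taken from the unzip step above.
    print(count_images("coco_minitrain_25k"))
```

A count far from the expected 25,000 suggests the download or unzip step did not complete.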


# Run fastdup

Run fastdup on the dataset. Here, we set `num_images` to limit the run to 1000 images.

In [8]:
fd = fastdup.create(input_dir='./coco_minitrain_25k')
fd.run(ccthreshold=0.9, num_images=1000, overwrite=True)

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
2023-09-14 16:39:55 [INFO] Going to loop over dir coco_minitrain_25k
2023-09-14 16:39:55 [INFO] Found total 1000 images to run on, 1000 train, 0 test, name list 1000, counter 1000 
2023-09-14 16:39:57 [INFO] Found total 1000 images to run ontimated: 0 Minutes
2023-09-14 16:39:57 [INFO] 97) Finished write_index() NN model
2023-09-14 16:39:57 [INFO] Stored nn model index file work_dir/nnf.index
2023-09-14 16:39:57 [INFO] Total time took 2157 ms
2023-09-14 16:39:57 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 % of total graph edges
2023-09-14 16:39:57 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 % of total graph edges
2023-09-14 16:39:57 [INFO] Found a total of 0 above threshold images (d>0.900), which are 0.00 % of total graph edges
2023-09-14 16:39:57 [INFO] Found a total of 100 outlier images         (d<0.050), which are 5.00 % of total graph edges



0

# Generate Captions

fastdup currently supports several captioning and VQA models, each with its own trade-offs. Larger models are slower but may produce better results for images far outside their training distribution; smaller models are faster, but their results may be less useful for such images. The models currently available for captioning are:
- ViT-GPT2 : `'vitgpt2'` : a lightweight and fast model trained on COCO images. This model takes about 0.5s per image caption (on a CPU), but may provide less useful results for images that are very different from COCO-like images.
- BLIP-2 : `'blip2'` : a more heavyweight model. This model may provide more robust answers for images that differ from COCO images, but can take upwards of 10s per image caption.
- BLIP : `'blip'` : a middleweight model that provides a middle-way approach between ViT-GPT2 and BLIP-2.

Available models for VQA are:
- Vilt-b32: `'vqa'` : a fairly lightweight model used for question answering.
- ViT-Age: `'age'` : a lightweight model used to classify the age of people in a photo (person age VQA).

If no model is specified, captioning defaults to ViT-GPT2.
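The model names above are plain strings, so a typo is easy to make. The sketch below collects the names listed in this notebook and checks a candidate against them. `check_model_name` and both dictionaries are hypothetical helpers written for this example, not part of the fastdup API, and they only know the names listed above, so the `'automatic'` value used in the next cell would need to be added separately.

```python
# Model identifiers as listed above; the descriptions paraphrase this notebook.
CAPTION_MODELS = {
    "vitgpt2": "lightweight, about 0.5 s per caption on a CPU",
    "blip": "middleweight, between ViT-GPT2 and BLIP-2",
    "blip2": "heavyweight, can exceed 10 s per caption",
}
VQA_MODELS = {
    "vqa": "Vilt-b32, general question answering",
    "age": "ViT-Age, age classification for people",
}


def check_model_name(model_name: str, task: str = "caption") -> str:
    """Return model_name if it is listed for the task, else raise ValueError."""
    known = CAPTION_MODELS if task == "caption" else VQA_MODELS
    if model_name not in known:
        raise ValueError(f"{model_name!r} is not a listed {task} model: {sorted(known)}")
    return model_name


print(check_model_name("blip"))
```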

**Selecting GPU/CPU and batch sizes:**
<br> The captioning method in fastdup lets you choose between GPU and CPU computation and set your preferred batch size. By default, computation runs on the CPU with a batch size of 8. On GPUs with large memory (40GB), a batch size of 256 enables captioning in under 0.05 seconds per image.
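One way to pick these two settings automatically is sketched below. It relies on `torch.cuda.is_available()` when PyTorch is installed and falls back to the CPU defaults otherwise. The helper name is made up for this example, and the `'gpu'` device string is an assumption; only `'cpu'` appears in this notebook, so check the fastdup documentation for the exact value.

```python
def pick_device_and_batch_size(gpu_batch_size: int = 256, cpu_batch_size: int = 8):
    """Return a (device, batch_size) pair based on whether CUDA is usable."""
    try:
        import torch  # optional dependency; only used for the CUDA check

        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False
    # 'gpu' as the device string is an assumption; 'cpu' is the value used in this notebook.
    return ("gpu", gpu_batch_size) if has_gpu else ("cpu", cpu_batch_size)


device, batch_size = pick_device_and_batch_size()
print(device, batch_size)
```

The result could then be passed along as `fd.caption(device=device, batch_size=batch_size)`. A batch size of 256 assumes a large-memory GPU, so smaller cards will likely need a lower value.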

In [None]:
captions_df = fd.caption(model_name='automatic', device='cpu', batch_size=8)

In [10]:
captions_sample = captions_df.sample(n=5)
captions_sample.loc[:,['filename', 'caption']].head(5)

| | filename | caption |
|---|---|---|
| 230 | coco_minitrain_25k/images/train2017/000000005811.jpg | a red bus is driving down the street |
| 398 | coco_minitrain_25k/images/train2017/000000009845.jpg | a man is standing next to a bus |
| 803 | coco_minitrain_25k/images/train2017/000000019167.jpg | a bowl of oranges in a metal bowl |
| 493 | coco_minitrain_25k/images/train2017/000000012315.jpg | a cat is sitting on a toilet seat |
| 162 | coco_minitrain_25k/images/train2017/000000004404.jpg | a woman walking down a street holding an umbrella |
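One simple way to turn captions like these into coarse labels, in the spirit of the zero-shot classification use case mentioned at the start, is keyword matching. The sketch below is a crude illustration rather than a real classifier: `label_from_caption` and the keyword mapping are hypothetical, and the captions are copied from the sample output above.

```python
import re


def label_from_caption(caption: str, keywords: dict) -> str:
    """Return the first label whose keyword appears as a whole word in the caption."""
    text = caption.lower()
    for label, words in keywords.items():
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in words):
            return label
    return "other"


# Hypothetical label-to-keyword mapping, chosen only for this illustration.
KEYWORDS = {"vehicle": ["bus", "car"], "animal": ["cat", "dog"], "food": ["oranges"]}

# Captions copied from the sample output above.
captions = [
    "a red bus is driving down the street",
    "a cat is sitting on a toilet seat",
    "a bowl of oranges in a metal bowl",
    "a woman walking down a street holding an umbrella",
]
for text in captions:
    print(label_from_caption(text, KEYWORDS), "|", text)
```

With the real DataFrame, the same function could be applied as `captions_df['caption'].apply(lambda c: label_from_caption(c, KEYWORDS))`.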


# Visualizing the Dataset's Outlier Images With Their Captions

Use fastdup's built-in gallery methods to visualize the captioned images. Here, a random sample of captioned images is passed to the outliers gallery, which serves as a simple image viewer.
Captions can also be generated for a gallery by setting the `label_col` argument to one of the available model names listed above.

In [14]:
import pandas as pd
# Pair each sampled image with itself so the outliers gallery acts as a plain image viewer.
captions_to_show = captions_df.sample(20)
visualization_df = pd.DataFrame({'from': captions_to_show['filename'], 'to': captions_to_show['filename'],
                                 'label': captions_to_show['caption'], 'distance': 0})  # scalar 0 is broadcast to every row
fastdup.create_outliers_gallery(visualization_df, save_path='.', num_images=10)

100%|██████████| 10/10 [00:00<00:00, 23643.20it/s]

Stored outliers visual view in  ./outliers.html





0

In [15]:
from IPython.display import HTML
HTML('outliers.html')

| Distance | Path | label |
|---|---|---|
| 0 | coco_minitrain_25k/images/train2017/000000001737.jpg | two polar bears are sitting on rocks in the wilderness |
| 0 | coco_minitrain_25k/images/train2017/000000003538.jpg | a person on skis is going down a hill |
| 0 | coco_minitrain_25k/images/train2017/000000011305.jpg | a living room with a couch, a table and a window |
| 0 | coco_minitrain_25k/images/train2017/000000004820.jpg | a cat laying on a bed with a blanket |
| 0 | coco_minitrain_25k/images/train2017/000000020438.jpg | a dog laying on a bed with a blanket |
| 0 | coco_minitrain_25k/images/train2017/000000010489.jpg | a bathroom with a toilet, sink, and tub |
| 0 | coco_minitrain_25k/images/train2017/000000016716.jpg | two zebras standing in a field of grass |
| 0 | coco_minitrain_25k/images/train2017/000000023194.jpg | a horse drawn carriage on a street |
| 0 | coco_minitrain_25k/images/train2017/000000010130.jpg | a woman is looking at her cell phone |
| 0 | coco_minitrain_25k/images/train2017/000000018633.jpg | two dogs are standing outside of a door |


# Visual Question Answering

Visual question answering in fastdup allows you to use open-ended questions to learn more about the images in your dataset. These can be questions such as "is this photo taken indoors or outdoors?", "is there a dog in the photo?", "is this a photo of an animal or an object?", or any other questions that come to mind. The output from these queries can, in turn, be used for image retrieval, to aid the visually impaired, and many other interesting use cases.

In [None]:
vqa_df = fd.caption(model_name='vqa', vqa_prompt='is this photo taken indoors or outdoors?')

In [17]:
vqa_sample = vqa_df.sample(n=5)
vqa_sample.loc[:,['filename', 'caption']].head(5)

| | filename | caption |
|---|---|---|
| 702 | coco_minitrain_25k/images/train2017/000000017004.jpg | indoors |
| 768 | coco_minitrain_25k/images/train2017/000000018464.jpg | outdoors |
| 338 | coco_minitrain_25k/images/train2017/000000008458.jpg | outdoors |
| 164 | coco_minitrain_25k/images/train2017/000000004502.jpg | outdoors |
| 587 | coco_minitrain_25k/images/train2017/000000014271.jpg | outdoors |
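To use answers like these for retrieval, one option is to group filenames by the answer they received. The sketch below does this with the standard library; `group_by_answer` is a hypothetical helper, and the rows are copied from the sample output above.

```python
from collections import defaultdict


def group_by_answer(rows):
    """Map each normalized answer to the list of filenames that received it."""
    groups = defaultdict(list)
    for filename, answer in rows:
        groups[answer.strip().lower()].append(filename)
    return dict(groups)


# (filename, answer) pairs copied from the sample output above.
rows = [
    ("coco_minitrain_25k/images/train2017/000000017004.jpg", "indoors"),
    ("coco_minitrain_25k/images/train2017/000000018464.jpg", "outdoors"),
    ("coco_minitrain_25k/images/train2017/000000008458.jpg", "outdoors"),
]
groups = group_by_answer(rows)
print({answer: len(files) for answer, files in groups.items()})
```

On the full result, the pairs could come from `zip(vqa_df['filename'], vqa_df['caption'])`, after which `groups['indoors']` would hold every image the model judged to be indoors.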


# Visualize the Results of VQA on the Dataset's Outliers

Once again, we will use fastdup's built-in gallery methods to visualize the results of our VQA prompt.

In [18]:
# Pair each sampled image with itself so the outliers gallery acts as a plain image viewer.
vqa_to_show = vqa_df.sample(20)
vis_vqa_df = pd.DataFrame({'from': vqa_to_show['filename'], 'to': vqa_to_show['filename'],
                           'label': vqa_to_show['caption'], 'distance': 0})  # scalar 0 is broadcast to every row
fastdup.create_outliers_gallery(vis_vqa_df, save_path='.', num_images=10)

100%|██████████| 10/10 [00:00<00:00, 21597.86it/s]

Stored outliers visual view in  ./outliers.html





0

In [19]:
from IPython.display import HTML
HTML('outliers.html')

| Distance | Path | label |
|---|---|---|
| 0 | coco_minitrain_25k/images/train2017/000000012475.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000005344.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000010324.jpg | indoors |
| 0 | coco_minitrain_25k/images/train2017/000000005139.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000002614.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000005540.jpg | indoors |
| 0 | coco_minitrain_25k/images/train2017/000000005107.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000023811.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000001625.jpg | outdoors |
| 0 | coco_minitrain_25k/images/train2017/000000012754.jpg | indoors |


## Wrap Up

That's a wrap! In this notebook we showed how to generate image captions and run visual question answering on the COCO minitrain dataset using fastdup. You can use similar methods on other datasets, such as those on [Roboflow Universe](https://universe.roboflow.com/).

Try it out and let us know what issues you find.

Next, feel free to check out other tutorials:

+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 


## VL Profiler - A faster and easier way to diagnose and visualize dataset issues

If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com). VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser.

VL Profiler is free to get started. Upload up to 1,000,000 images for analysis at zero cost!

[Sign up](https://app.visual-layer.com) now.

[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/github_banner_profiler.gif)](https://app.visual-layer.com)

As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).

<center> 
    <a href="https://visual-layer.com/" target="_blank" style="text-decoration: none;"> <img style="width:200px" alt="logo" src="https://d2iycffepdu1yp.cloudfront.net/design-assets/VL_horizontal_logo.png"> </a><br>
    <a href="https://github.com/visual-layer/fastdup" target="_blank" style="text-decoration: none;"> GitHub </a> •
    <a href="https://visual-layer.slack.com/" target="_blank" style="text-decoration: none;"> Join Slack Community </a> •
    <a href="https://visual-layer.readme.io/discuss" target="_blank" style="text-decoration: none;"> Discussion Forum </a>
</center>

<center> 
    <a href="https://medium.com/visual-layer" target="_blank" style="text-decoration: none;"> Blog </a> •
    <a href="https://visual-layer.readme.io/" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://visual-layer.com/about" target="_blank" style="text-decoration: none;"> About Us </a> 
</center>

<center> 
    <a href="https://www.linkedin.com/company/visual-layer/" target="_blank" style="text-decoration: none;"> LinkedIn </a> •
    <a href="https://twitter.com/visual_layer" target="_blank" style="text-decoration: none;"> Twitter </a>
</center>