{ "cells": [ { "cell_type": "markdown", "id": "221f5c62-51ac-447e-91b0-c42c7b602af9", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "id": "7aad46c3-7e0a-463f-9064-0b5751501039", "metadata": {}, "source": [ "# Analyze Datasets from Labelbox\n", "\n", "[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-labelbox-datasets.ipynb)\n", "[![Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-labelbox-datasets.ipynb)\n", "[![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/analyzing-labelbox-datasets)\n", "\n", "\n", "This notebook shows how to download a dataset from your Labelbox account and analyze it for issues with fastdup." ] }, { "cell_type": "markdown", "id": "55b99f27-269c-49d6-8f51-b2af6d2019bb", "metadata": {}, "source": [ "## Installation\n", "\n", "First, let's install the necessary packages." ] }, { "cell_type": "code", "execution_count": 1, "id": "9b81d6ca-a91f-46c5-bc91-7bc3b36b01b5", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", "accelerate 0.21.0 requires torch>=1.10.0, which is not installed.\n", "auto-gptq 0.2.2 requires datasets, which is not installed.\n", "auto-gptq 0.2.2 requires torch>=1.13.0, which is not installed.\n", "peft 0.4.0 requires torch>=1.13.0, which is not installed.\n", "posthog 3.0.1 requires monotonic>=1.5, which is not installed.\n", "scikit-learn 1.3.0 requires scipy>=1.5.0, which is not installed.\n", "sentence-transformers 2.2.2 requires scipy, which is not installed.\n", "sentence-transformers 2.2.2 requires torch>=1.6.0, which is not installed.\n", "sentence-transformers 2.2.2 requires torchvision, which is not installed.\u001b[0m\u001b[31m\n", "\u001b[0m" ] } ], "source": [ "!pip install -Uq fastdup labelbox" ] }, { "cell_type": "markdown", "id": "e6722adf-0f74-4aae-8e67-76107456a91b", "metadata": {}, "source": [ "Now, test the installation. If there's no error message, we are ready to go." ] }, { "cell_type": "code", "execution_count": 2, "id": "efc6af00-4688-454d-b84b-05e15c95fb86", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'1.43'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "fastdup.__version__" ] }, { "cell_type": "markdown", "id": "ce198f1f-00b3-4836-968e-cc266b7eee24", "metadata": {}, "source": [ "## Labelbox Python SDK\n", "The [Labelbox Python API](https://github.com/Labelbox/labelbox-python) offers a simple, user-friendly way to interact with the Labelbox back-end." 
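, "

Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads the key from an environment variable and fails early when it is missing. The helper `read_api_key` and the variable name `LABELBOX_API_KEY` are assumptions made for this illustration, not names this notebook defines.

```python
import os


def read_api_key(env=None, name='LABELBOX_API_KEY'):
    '''Return the API key stored under name, or raise if it is unset or blank.'''
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, '').strip()
    if not key:
        raise RuntimeError(f'Set the {name} environment variable first.')
    return key


# Hypothetical usage in place of a hard-coded placeholder:
# API_KEY = read_api_key()
```

Passing a plain dict as `env` makes the helper easy to check without touching the real environment.
"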
] }, { "cell_type": "code", "execution_count": 3, "id": "29d1e0f2-aeba-4b1f-8646-3e191ac16846", "metadata": { "tags": [] }, "outputs": [], "source": [ "import labelbox\n", "\n", "# Replace the placeholder with your own Labelbox API key.\n", "API_KEY = \"YOUR_API_KEY\"\n", "labelbox_client = labelbox.Client(api_key=API_KEY)" ] }, { "cell_type": "markdown", "id": "532e6aac-2677-4e50-a8d6-a7a607955dca", "metadata": {}, "source": [ "In this example, we have already uploaded the Oxford Pets Dataset to our Labelbox account. To download it locally, first fetch the dataset by its ID." ] }, { "cell_type": "code", "execution_count": 4, "id": "877e8514-c085-4b49-bbf2-bc01114090ab", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Replace the placeholder with the ID of your dataset.\n", "dataset = labelbox_client.get_dataset(\"DATASET_ID\")" ] }, { "cell_type": "markdown", "id": "390f9cd9-08cf-42bb-947b-546f7a520e5b", "metadata": {}, "source": [ "## Download Dataset\n", "The Labelbox SDK does not provide a way to bulk download a dataset, so we download it by iterating over its images one at a time." 
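, "

The per-image iteration can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `download_images` and its `fetch` parameter are hypothetical names, and the commented usage assumes a Labelbox SDK version in which `dataset.data_rows()` yields rows exposing `external_id` and `row_data` (the image URL).

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_images(rows, folder_path, fetch=None):
    '''Save every (name, url) pair in rows to folder_path; return the saved paths.'''
    os.makedirs(folder_path, exist_ok=True)
    if fetch is None:
        # Default fetcher: plain GET through the standard library.
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read()
    saved = []
    for name, url in rows:
        # Fall back to the last URL path segment when the row has no name.
        filename = os.path.basename(name or urlparse(url).path)
        target = os.path.join(folder_path, filename)
        if not os.path.exists(target):  # skip files kept from an earlier run
            with open(target, 'wb') as handle:
                handle.write(fetch(url))
        saved.append(target)
    return saved


# Hypothetical usage with the dataset object fetched earlier (assumed attributes):
# rows = ((row.external_id, row.row_data) for row in dataset.data_rows())
# folder_path = 'labelbox_images'
# download_images(rows, folder_path)
```

Keeping the fetch step injectable lets the loop be exercised without network access, and skipping files that already exist makes an interrupted download cheap to resume.
"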
] }, { "cell_type": "code", "execution_count": 5, "id": "958fba0b-9be7-40e5-8d3f-ad5829fc2883", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8658e64f420942ffa04552cb39b54613", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading images: 0%| | 0/7378 [00:000.990), which are 0.81 % of total graph edges\n", "2023-09-21 14:55:28 [INFO] Found a total of 8 nearly identical images(d>0.980), which are 0.05 % of total graph edges\n", "2023-09-21 14:55:28 [INFO] Found a total of 1006 above threshold images (d>0.900), which are 6.82 % of total graph edges\n", "2023-09-21 14:55:28 [INFO] Found a total of 739 outlier images (d<0.050), which are 5.01 % of total graph edges\n", "2023-09-21 14:55:28 [INFO] Min distance found 0.597 max distance 1.000\n", "2023-09-21 14:55:28 [INFO] Running connected components for ccthreshold 0.960000 \n", ".0\n", " ########################################################################################\n", "\n", "Dataset Analysis Summary: \n", "\n", " Dataset contains 7378 images\n", " Valid images are 100.00% (7,378) of the data, invalid are 0.00% (0) of the data\n", " Similarity: 2.01% (148) belong to 3 similarity clusters (components).\n", " 97.99% (7,230) images do not belong to any similarity cluster.\n", " Largest cluster has 12 (0.16%) images.\n", " For a detailed analysis, use `.connected_components()`\n", "(similarity threshold used is 0.9, connected component threshold used is 0.96).\n", "\n", " Outliers: 6.18% (456) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n", " For a detailed list of outliers, use `.outliers()`.\n", "\n", "########################################################################################\n", "Would you like to see awesome visualizations for some of the most popular academic datasets?\n", "Click here to see and learn more: https://app.visual-layer.com/vl-datasets?utm_source=fastdup\n", 
"########################################################################################\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd = fastdup.create(input_dir=folder_path)\n", "fd.run()" ] }, { "cell_type": "markdown", "id": "3caa9e1a-5cb5-47d3-baa0-948d879b78b3", "metadata": {}, "source": [ "## View Galleries\n", "\n", "You can use all of fastdup gallery methods to view duplicates, clusters, etc." ] }, { "cell_type": "code", "execution_count": 7, "id": "0dbea899-8560-4e0d-9c25-7ba882dd06e0", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0e1e78c31b0e43aabdfe0eca49e34550", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating gallery: 0%| | 0/20 [00:00\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Components Report\n", " \n", " \n", "\n", "\n", "\n", "
Components Report: showing groups of similar images (image thumbnails not shown)\n", "
component | num_images | mean_distance\n", "
1394 | 3 | 1.0\n", "
1401 | 3 | 0.9658\n", "
21 | 2 | 0.9681\n", "
3428 | 2 | 0.9999\n", "
3542 | 2 | 0.9998\n", "
3541 | 2 | 1.0\n", "
3539 | 2 | 1.0\n", "
3448 | 2 | 0.9999\n", "
3441 | 2 | 1.0\n", "
3440 | 2 | 1.0\n", "
3389 | 2 | 1.0\n", "
3408 | 2 | 1.0\n", "
3583 | 2 | 1.0\n", "
3385 | 2 | 1.0\n", "
3028 | 2 | 0.9999\n", "
2951 | 2 | 0.9604\n", "
2839 | 2 | 1.0\n", "
2837 | 2 | 1.0\n", "
3582 | 2 | 0.9997\n", "
3617 | 2 | 1.0
\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.vis.component_gallery()" ] }, { "cell_type": "markdown", "id": "bc8a3ce2", "metadata": {}, "source": [ "## Wrap Up\n", "In this tutorial, we showed how to download a dataset from your Labelbox account and analyze it with fastdup.\n", "\n", "Next, feel free to check out our other tutorials:\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset, analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze a folder of images for potential issues, clean it, and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try." 
] }, { "cell_type": "markdown", "id": "44acb813-730b-4513-9266-17e0348f8584", "metadata": {}, "source": [ "\n", "## VL Profiler - A faster and easier way to diagnose and visualize dataset issues\n", "\n", "If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com) - VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser. \n", "\n", "VL Profiler is free to get started. Upload up to 1,000,000 images for analysis at zero cost!\n", "\n", "[Sign up](https://app.visual-layer.com) now.\n", "\n", "[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/github_banner_profiler.gif)](https://app.visual-layer.com)\n", "\n", "As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues)." ] }, { "cell_type": "markdown", "id": "75e95b2c-5354-46b6-8f5a-23d3c20e1864", "metadata": {}, "source": [ "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }