{ "cells": [ { "cell_type": "markdown", "id": "221f5c62-51ac-447e-91b0-c42c7b602af9", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "id": "7aad46c3-7e0a-463f-9064-0b5751501039", "metadata": {}, "source": [ "# Run fastdup with TIMM Embeddings\n", "\n", "[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-blue?style=for-the-badge&logo=google-colab&labelColor=gray)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb)\n", "[![Open in Kaggle](https://img.shields.io/badge/Open%20in%20Kaggle-blue?style=for-the-badge&logo=kaggle&labelColor=gray)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/embeddings-timm.ipynb)\n", "[![Explore the Docs](https://img.shields.io/badge/Explore%20the%20Docs-blue?style=for-the-badge&labelColor=gray&logo=read-the-docs)](https://visual-layer.readme.io/docs/embeddings-timm)" ] }, { "cell_type": "markdown", "id": "bae6d61b-3beb-46ad-b53a-895e78d3cf5f", "metadata": {}, "source": [ "This notebook shows an end-to-end example of how to pre-compute embeddings with any model from TIMM and then run fastdup on top of those embeddings to surface dataset issues." ] }, { "cell_type": "markdown", "id": "55b99f27-269c-49d6-8f51-b2af6d2019bb", "metadata": {}, "source": [ "## Installation\n", "\n", "First, let's install the necessary packages:\n", "\n", "- [fastdup](https://github.com/visual-layer/fastdup) - To analyze issues in the dataset.\n", "- [TIMM (PyTorch Image Models)](https://github.com/huggingface/pytorch-image-models) - To acquire pre-trained models." 
] }, { "cell_type": "code", "execution_count": 1, "id": "fc42cae3-4659-4060-b781-48e2983411fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.5.40 nvidia-nvtx-cu12-12.1.105 safetensors-0.4.3 timm-1.0.3 torch-2.3.1 torchvision-0.18.1 triton-2.3.1\n" ] } ], "source": [ "!pip install -Uq fastdup timm" ] }, { "cell_type": "markdown", "id": "e6722adf-0f74-4aae-8e67-76107456a91b", "metadata": {}, "source": [ "Now, test the installation. 
If there's no error message, we are ready to go." ] }, { "cell_type": "code", "execution_count": 2, "id": "efc6af00-4688-454d-b84b-05e15c95fb86", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/dnth/anaconda3/envs/fastdup2021/lib/python3.10/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.2.1) or chardet (5.2.0)/charset_normalizer (2.0.4) doesn't match a supported version!\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "'2.0.21'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "fastdup.__version__" ] }, { "cell_type": "markdown", "id": "c5ba0892-e860-4f3c-84a9-8c0fc189b77d", "metadata": {}, "source": [ "## Download Dataset\n", "\n", "In this notebook, we will use the [Price Match Guarantee Dataset](https://www.kaggle.com/competitions/shopee-product-matching/) from Shopee, hosted on Kaggle. \n", "The dataset consists of images from users who sell products on the Shopee online platform.\n", "\n", "Download the dataset [here](https://www.kaggle.com/competitions/shopee-product-matching/data), unzip it, and place it in the current directory.\n", "\n", "Here's a snapshot showing some of the images from the dataset.\n", "![img](https://files.readme.io/09f6849-download.png)" ] }, { "cell_type": "markdown", "id": "91910747-6be2-4283-959e-c931e45f1f2c", "metadata": {}, "source": [ "## List TIMM Models\n", "There are currently 1212 computer vision models on TIMM. Pick a model of your choice to compute the embeddings with.\n", "\n", "For demonstration, we will go with a relatively new model, `vit_small_patch14_dinov2.lvd142m`, from Meta AI. \n", "\n", "Let's list the models that match the keyword `dino`." 
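, "\n", "
The `*dino*` argument used in the next cell is a shell-style wildcard. As a rough illustration of how such a pattern selects names, here is a minimal sketch using only the standard library `fnmatch` module. It illustrates the matching idea and is not timm's own code; the helper `filter_names` and the entry `resnet50.example` are made up for contrast, while the other names are copied from the listing in this notebook.

```python
from fnmatch import fnmatch

# Three names copied from the model listing in this notebook, plus one
# hypothetical non-matching entry ('resnet50.example') added for contrast.
names = [
    'resmlp_12_224.fb_dino',
    'vit_small_patch14_dinov2.lvd142m',
    'vit_small_patch16_224.dino',
    'resnet50.example',  # hypothetical, not a real model name
]


def filter_names(candidates, pattern):
    # Keep only the names that match the shell-style wildcard pattern,
    # preserving the original order of the candidates.
    return [name for name in candidates if fnmatch(name, pattern)]


matches = filter_names(names, '*dino*')
print(matches)
```

A narrower pattern such as `vit_*dinov2*` would keep only the DINOv2 vision transformers, which can help when the full listing is long.
"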
] }, { "cell_type": "code", "execution_count": 3, "id": "dd166b91-38a1-4dfe-9e9e-590e3f550242", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['resmlp_12_224.fb_dino',\n", " 'resmlp_24_224.fb_dino',\n", " 'vit_base_patch8_224.dino',\n", " 'vit_base_patch14_dinov2.lvd142m',\n", " 'vit_base_patch14_reg4_dinov2.lvd142m',\n", " 'vit_base_patch16_224.dino',\n", " 'vit_giant_patch14_dinov2.lvd142m',\n", " 'vit_giant_patch14_reg4_dinov2.lvd142m',\n", " 'vit_large_patch14_dinov2.lvd142m',\n", " 'vit_large_patch14_reg4_dinov2.lvd142m',\n", " 'vit_small_patch8_224.dino',\n", " 'vit_small_patch14_dinov2.lvd142m',\n", " 'vit_small_patch14_reg4_dinov2.lvd142m',\n", " 'vit_small_patch16_224.dino']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import timm\n", "timm.list_models(\"*dino*\", pretrained=True)" ] }, { "cell_type": "markdown", "id": "633dce0c-47eb-4039-8cd4-a36874c49b8a", "metadata": {}, "source": [ "DINOv2 models produce high-performance visual features that can be directly employed with classifiers as simple as linear layers on a variety of computer vision tasks; these visual features are robust and perform well across domains without any requirement for fine-tuning. Read more about DINOv2 [here](https://github.com/facebookresearch/dinov2).\n", "\n", "Because these features work well without fine-tuning, DINOv2 is a reasonable choice for creating embeddings of the dataset." ] }, { "cell_type": "markdown", "id": "9a1e56b1-d8cd-4457-9b2b-83c1aa3ccaaf", "metadata": {}, "source": [ "## Compute Embeddings using TIMM\n", "\n", "Loading TIMM models in fastdup is seamless with the `TimmEncoder` wrapper class. The wrapper lets any TIMM model be used in fastdup to compute the embeddings of your dataset. \n", "Under the hood, the wrapper class loads the model from TIMM without its final classification layer.\n", "\n", "Next, let's load the DINOv2 model using the `TimmEncoder` wrapper." 
] }, { "cell_type": "code", "execution_count": 4, "id": "e3a9e5f2-a92e-4536-be12-19ccc47a7ca4", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Initializing model - vit_small_patch14_dinov2.lvd142m.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "78353a430a854fd8b26ddc11acab33b7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "model.safetensors: 0%| | 0.00/88.2M [00:00\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Components Report\n", " \n", " \n", "\n", "\n", "\n", "
\n",
"Components Report\n",
"Showing groups of similar images\n",
"\n",
"| component | num_images | mean_distance |\n",
"| --- | --- | --- |\n",
"| 1086 | 22 | 0.962451 |\n",
"| 285 | 20 | 0.962573 |\n",
"| 563 | 20 | 0.960016 |\n",
"| 274 | 18 | 0.96256 |\n",
"| 1383 | 18 | 0.967508 |\n",
"| 3564 | 16 | 0.96001 |\n",
"| 1306 | 16 | 0.980212 |\n",
"| 2931 | 16 | 0.97116 |\n",
"| 1475 | 15 | 0.973674 |\n",
"| 4939 | 15 | 0.960707 |\n",
"| 1324 | 14 | 0.9687 |\n",
"| 554 | 13 | 0.987988 |\n",
"| 328 | 13 | 0.983095 |\n",
"| 1958 | 13 | 0.970053 |\n",
"| 1912 | 13 | 0.975593 |\n",
"| 3678 | 12 | 0.986679 |\n",
"| 1490 | 12 | 0.975063 |\n",
"| 339 | 12 | 0.994324 |\n",
"| 1114 | 12 | 0.960803 |\n",
"| 2775 | 11 | 0.9902 |\n",
"
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.vis.component_gallery()" ] }, { "cell_type": "markdown", "id": "c9c58233-e12a-4608-bcac-b311a98eedd4", "metadata": {}, "source": [ "Next, view the duplicates gallery." ] }, { "cell_type": "code", "execution_count": 8, "id": "06f822db-959a-4d37-8fdc-508233179ddf", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a0534359e3ff4a8698568e2a521132f7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating gallery: 0%| | 0/20 [00:00\n", " Duplicates Report\n", "\n", "
\n", "
\n", "
\n", " \n", " \"logo\"\n", " \n", "
\n", " \n", "\n", "
\n", "
\n", "
\n", " For the new and interactive data exploration\n", " \n", " Read more \n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", " fastdup.explore()\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "

Duplicates Report

\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//f4da907524bf9cb6b4f2588ba2134af3.jpg
To/home/dnth/Desktop/fastdup/examples//e76626814516c662e78fcf71858379f3.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//eb66d39daaf7ef264d427e1d0e670eff.jpg
To/home/dnth/Desktop/fastdup/examples//b78206278f2ca82501b83ced2c77c844.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//d95f57c80178d86166a25c7b8679f9ba.jpg
To/home/dnth/Desktop/fastdup/examples//4812e9df89c8ac537a50b818a7236033.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//664d02d07a08338ad0d3eb07456d13d9.jpg
To/home/dnth/Desktop/fastdup/examples//bd59cad80ec6c0339c822fbb45b6a1d1.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//84e8eedb8f7a783bbf78d547bea1fcdb.jpg
To/home/dnth/Desktop/fastdup/examples//14d84ed1a6c78da17f560e16073cff24.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/home/dnth/Desktop/fastdup/examples//6a44bfb5f0d61dd6ac57b495c54f0dc3.jpg
To/home/dnth/Desktop/fastdup/examples//8f8f3bb971e994cd5cd525b3a2571081.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//5d64bbe94eec36f5d148c979d9515216.jpg
To/home/dnth/Desktop/fastdup/examples//52c5565e067a1558fea331b4563b989d.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//65af7ae970f86e283d8b8e45337d5f17.jpg
To/home/dnth/Desktop/fastdup/examples//38c75058bcbe6777bbf4b0e057f36954.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//11244cae08d34a160374ed7c3dc19b37.jpg
To/home/dnth/Desktop/fastdup/examples//ff13d1b54edbe8cc25ae51f4d4ecd936.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//8ce6b836a84ccaba663dbe0882d74b57.jpg
To/home/dnth/Desktop/fastdup/examples//893bc2e791b83d6dc4e31a02b5caa3d6.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//28cd408859e1a811c6cae6fb56fcc434.jpg
To/home/dnth/Desktop/fastdup/examples//7d109a5c2d9caa09e6bc71c35f9ac830.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//ead0a1c2b629822c5bce8a42ffc0572d.jpg
To/home/dnth/Desktop/fastdup/examples//8f4e5aa11720e61c64d01d6641eca9f6.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//f6cb8da672134a9ddb6defa2896f6d81.jpg
To/home/dnth/Desktop/fastdup/examples//9656ee5a0d11af425ec5ef32281ff5c2.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.999999
From/home/dnth/Desktop/fastdup/examples//484e5855a1e43221bf52b14d7f49c763.jpg
To/home/dnth/Desktop/fastdup/examples//c6ff1773f3d071162290814c44f27962.jpg
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", "
\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.vis.duplicates_gallery()" ] }, { "cell_type": "markdown", "id": "cd666306-f2ca-46e9-89b6-ca9c1a5f5d5d", "metadata": {}, "source": [ "## Interactive Exploration\n", "In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.\n", "\n", "To explore the dataset and its issues interactively in a browser, run:" ] }, { "cell_type": "code", "execution_count": null, "id": "762b6511-2038-4e5a-b9a5-95b22c03eec3", "metadata": {}, "outputs": [], "source": [ "fd.explore()" ] }, { "cell_type": "markdown", "id": "37a65097-72eb-46ec-8c02-c328554818d8", "metadata": {}, "source": [ "> 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize fastdup results in a non-interactive way using fastdup's built-in galleries shown in the cells above.\n", "\n", "You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.\n", "\n", "\n", "![image.png](https://vl-blog.s3.us-east-2.amazonaws.com/fastdup_assets/cloud_preview.gif)" ] }, { "cell_type": "markdown", "id": "bc8a3ce2", "metadata": {}, "source": [ "## Wrap Up\n", "In this tutorial, we showed how you can compute embeddings on your dataset using TIMM and run fastdup on top of them to surface dataset issues.\n", "\n", "Questions about this tutorial? 
Reach out to us on our [Slack channel](https://visuallayer.slack.com/)!\n", "\n", "\n", "\n", "Next, feel free to check out our other tutorials:\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and how to view visually similar image clusters. If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze a folder of images for potential issues, clean it, and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try." ] }, { "cell_type": "markdown", "id": "44acb813-730b-4513-9266-17e0348f8584", "metadata": {}, "source": [ "
\n", "
\n", " \n", " \"site\"\n", " \n", " \"blog\"\n", " \n", " \"github\"\n", " \n", " \"slack\"\n", " \n", " \"linkedin\"\n", " \n", " \"youtube\"\n", " \n", " \"twitter\"\n", "
\n", "
\n", "
\n", " \"logo\"\n", "
Copyright © 2024 Visual Layer. All rights reserved.
\n", "
\n", "\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }