{ "cells": [ { "cell_type": "markdown", "id": "87da8fac", "metadata": {}, "source": [ "[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)" ] }, { "cell_type": "markdown", "id": "a6445d60-47e5-4f91-93cb-3da1a37bc205", "metadata": {}, "source": [ "# Tutorial for working directly with feature vectors\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/feature_vectors.ipynb)\n", "[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/feature_vectors.ipynb)" ] }, { "cell_type": "markdown", "id": "9ece0f6e-518a-49b6-b06a-959d50bef991", "metadata": {}, "source": [ "## Use case 1: compute feature vectors with fastdup and load them with numpy for further processing" ] }, { "cell_type": "code", "execution_count": 4, "id": "fd4a0e45-55d9-4544-877f-9dfcce74a0ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: fastdup in /Users/dannybickson/homebrew/lib/python3.8/site-packages (0.926)\n", "Collecting fastdup\n", " Downloading fastdup-1.0-cp38-cp38-macosx_11_0_arm64.whl (32.8 MB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m32.8/32.8 MB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0mm eta \u001b[36m0:00:01\u001b[0m[36m0:00:01\u001b[0m\n", "\u001b[?25hRequirement already satisfied: numpy in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (1.24.3)\n", "Requirement already satisfied: tqdm in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (4.65.0)\n", "Requirement already satisfied: pillow in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (9.5.0)\n", "Requirement already satisfied: pyyaml in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (6.0)\n", "Requirement already satisfied: pandas in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (2.0.1)\n", "Requirement already satisfied: requests==2.28.1 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (2.28.1)\n", "Requirement already satisfied: sentry-sdk in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (1.21.1)\n", "Requirement already satisfied: packaging in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (23.1)\n", "Requirement already satisfied: certifi in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (2022.12.7)\n", "Requirement already satisfied: opencv-python-headless<=4.5.5.64 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from fastdup) (4.5.5.64)\n", "Requirement already satisfied: charset-normalizer<3,>=2 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from requests==2.28.1->fastdup) (2.1.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from requests==2.28.1->fastdup) (3.4)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from requests==2.28.1->fastdup) (1.26.15)\n", "Requirement already satisfied: tzdata>=2022.1 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from pandas->fastdup) (2023.3)\n", "Requirement already satisfied: pytz>=2020.1 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from pandas->fastdup) (2023.3)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from pandas->fastdup) (2.8.2)\n", "Requirement already satisfied: six>=1.5 in /Users/dannybickson/homebrew/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas->fastdup) (1.16.0)\n", "Installing collected packages: fastdup\n", " Attempting uninstall: fastdup\n", " Found existing installation: fastdup 0.926\n", " Uninstalling fastdup-0.926:\n", " Successfully uninstalled fastdup-0.926\n", "\u001b[33m DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621\u001b[0m\u001b[33m\n", "\u001b[0mSuccessfully installed fastdup-1.0\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -U fastdup" ] }, { "cell_type": "code", "execution_count": 1, "id": "d6f3966d-e596-494f-91c6-282a3222919e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n", "2023-05-17 15:11:02 [INFO] Going to loop over dir /Users/dannybickson/visual_database/cxx/unittests/two_images\n", "2023-05-17 15:11:02 [INFO] Found total 2 images to run on, 2 train, 0 test, name list 2, counter 2 \n", "2023-05-17 15:11:03 [INFO] Found total 2 images to run onEstimated: 0 Minutess\n", "2023-05-17 15:11:03 [INFO] 19) Finished write_index() NN model\n", "2023-05-17 15:11:03 [INFO] Stored nn model index file out/nnf.index\n", "2023-05-17 15:11:03 [INFO] Total time took 1030 ms\n", "2023-05-17 15:11:03 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %\n", "2023-05-17 15:11:03 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %\n", "2023-05-17 15:11:03 [INFO] Found a total of 0 above threshold images (d>0.900), which are 0.00 %\n", "2023-05-17 15:11:03 [INFO] Found a total of 1 outlier images (d<0.050), which are 50.00 %\n", "2023-05-17 15:11:03 [INFO] Min distance found 0.805 max distance 0.805\n", "2023-05-17 15:11:03 [INFO] Running connected components for ccthreshold 0.960000 \n", ".0Read a total of 2 images\n", "Read embedding matrix of shape (2, 576)\n", "Image filenames are\n", "['/Users/dannybickson/visual_database/cxx/unittests/two_images/test_1234.jpg', '/Users/dannybickson/visual_database/cxx/unittests/two_images/train_1274.jpg']\n" ] } ], "source": [ "import fastdup\n", "import numpy as np\n", "\n", "#chnage to your image folder\n", "input_dir = '/Users/dannybickson/visual_database/cxx/unittests/two_images/'\n", "\n", "# Run fastup on an input image folder to create embeddings\n", "fd = fastdup.create(input_dir=input_dir, work_dir='out')\n", "fd.run(overwrite=True, print_summary=False)\n", "\n", "# Read the embeddings to use them in python\n", "# There are two images in the input_dir, so the embedding matrix is 2x 576. \n", "# Each row in the embedding matrix is an image.\n", "flist, embedding_matrix = fastdup.load_binary_feature(filename='./out/atrain_features.dat')\n", "print('Read embedding matrix of shape', embedding_matrix.shape)\n", "print('Image filenames are')\n", "print(flist)\n" ] }, { "cell_type": "markdown", "id": "4c22f533-a30b-404b-abff-9812a9dedba6", "metadata": {}, "source": [ "# Use case 2: Save your own binary features to work with fastdup" ] }, { "cell_type": "markdown", "id": "dea40128-eac2-4e3c-8002-f89d3528fd1b", "metadata": {}, "source": [ "## Version 0.2" ] }, { "cell_type": "code", "execution_count": 2, "id": "d2617508-a261-41d7-919f-41f7e533891b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n", "2023-05-17 15:11:12 [INFO] Found total 2 images to run on\n", "2023-05-17 15:11:12 [INFO] 0) Finished write_index() NN model\n", "2023-05-17 15:11:12 [INFO] Stored nn model index file embedding_input/nnf.index\n", "2023-05-17 15:11:12 [INFO] Total time took 64 ms\n", "2023-05-17 15:11:12 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %\n", "2023-05-17 15:11:12 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %\n", "2023-05-17 15:11:12 [INFO] Found a total of 0 above threshold images (d>0.900), which are 0.00 %\n", "2023-05-17 15:11:12 [INFO] Found a total of 1 outlier images (d<0.050), which are 50.00 %\n", "2023-05-17 15:11:12 [INFO] Min distance found 0.733 max distance 0.733\n", "2023-05-17 15:11:12 [INFO] Running connected components for ccthreshold 0.960000 \n", ".0" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "import numpy as np\n", "import os\n", "input_dir = '/Users/dannybickson/visual_database/cxx/unittests/two_images/'\n", "flist = os.listdir(input_dir)\n", "flist = [os.path.join(input_dir, f) for f in flist]\n", "\n", "# replace the below code with computation of your own features\n", "matrix = np.random.rand(2, 576).astype('float32')\n", "\n", "# save the embedding along the filenames into a working folder\n", "!mkdir -p embedding_input\n", "fastdup.save_binary_feature('embedding_input', flist, matrix)\n", "fastdup.run('~/visual_database/cxx/unittests/two_images/', run_mode=2, work_dir='embedding_input')" ] }, { "cell_type": "markdown", "id": "b04faace-8fdd-4d4b-8c6b-24f55ac89f89", "metadata": {}, "source": [ "## Version 1.0" ] }, { "cell_type": "code", "execution_count": 7, "id": "2ce1e4aa-c57f-4a26-bef2-0b398da19d1e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n", "2023-05-17 15:13:40 [INFO] Found total 2 images to run on\n", "2023-05-17 15:13:40 [INFO] 0) Finished write_index() NN model\n", "2023-05-17 15:13:40 [INFO] Stored nn model index file out3/nnf.index\n", "2023-05-17 15:13:40 [INFO] Total time took 65 ms\n", "2023-05-17 15:13:40 [INFO] Found a total of 0 fully identical images (d>0.990), which are 0.00 %\n", "2023-05-17 15:13:40 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %\n", "2023-05-17 15:13:40 [INFO] Found a total of 0 above threshold images (d>0.900), which are 0.00 %\n", "2023-05-17 15:13:40 [INFO] Found a total of 1 outlier images (d<0.050), which are 50.00 %\n", "2023-05-17 15:13:40 [INFO] Min distance found 0.733 max distance 0.733\n", "2023-05-17 15:13:40 [INFO] Running connected components for ccthreshold 0.960000 \n", ".0" ] } ], "source": [ "# Note: files should contain absolute path and not relative path\n", "import fastdup\n", "import numpy as np\n", "import os\n", "input_dir = '/Users/dannybickson/visual_database/cxx/unittests/two_images/'\n", "flist = os.listdir(input_dir)\n", "flist = [os.path.join(input_dir, f) for f in flist]\n", "\n", "# replace the below code with computation of your own features\n", "matrix = np.random.rand(2, 576).astype('float32')\n", "\n", "fd2 = fastdup.create(input_dir=input_dir, work_dir='output2')\n", "fd2.run(annotations=flist, embeddings=matrix, print_summary=False, overwrite=True)" ] }, { "cell_type": "markdown", "id": "bc8a3ce2", "metadata": {}, "source": [ "## Wrap Up\n", "\n", "Next, feel free to check out other tutorials -\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try." ] }, { "cell_type": "markdown", "id": "44acb813-730b-4513-9266-17e0348f8584", "metadata": {}, "source": [ "\n", "## VL Profiler\n", "If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com) - VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser. \n", "\n", "[Sign up](https://app.visual-layer.com) now, it's free.\n", "\n", "[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)\n", "\n", "As usual, feedback is welcome! \n", "\n", "Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues)." ] }, { "cell_type": "markdown", "id": "1c1dbab7", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 5 }