{ "cells": [ { "cell_type": "markdown", "id": "c26264f7", "metadata": {}, "source": [ "# Nearest Neighbor Embeddings Classification with Qdrant\n", "\n", "FiftyOne provides [powerful workflows](https://voxel51.com/docs/fiftyone/user_guide/brain.html) centered around embeddings, including pre-annotation, finding annotation mistakes, finding hard samples, and visual similarity searches. However, performing nearest neighbor searches on large datasets requires the right infrastructure.\n", "\n", "Vector search engines are built to store, search, and manage embedding vectors efficiently. [Qdrant](https://qdrant.tech/) is a vector database designed to perform approximate nearest neighbor (ANN) search on dense neural embeddings, a key part of any production-ready system that is expected to scale to large amounts of data. And best of all, it's open source!\n", "\n", "In this tutorial, we'll load the MNIST dataset into FiftyOne and then use Qdrant to perform ANN-based classification, where each data point is classified by selecting the most common ground truth label among its **K** nearest points from our training dataset. In other words, for each test example, we'll select the K nearest neighbors in embedding space and assign the most common label among them by voting. We'll then evaluate the results of this classification strategy using FiftyOne.\n", "\n", "**So, what's the takeaway?**\n", "\n", "FiftyOne and Qdrant can be used together to easily perform approximate nearest neighbor search on the embeddings of your datasets and kickstart pre-annotation workflows." 
] }, { "cell_type": "markdown", "id": "274f46cd", "metadata": {}, "source": [ "## Setup\n", "\n", "If you haven’t already, install FiftyOne:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5e542006", "metadata": {}, "outputs": [], "source": [ "!pip install fiftyone" ] }, { "cell_type": "markdown", "id": "5c4f77c0", "metadata": {}, "source": [ "We'll also need the Qdrant Python client:" ] }, { "cell_type": "code", "execution_count": null, "id": "2cea7b23", "metadata": {}, "outputs": [], "source": [ "!pip install qdrant_client" ] }, { "cell_type": "markdown", "id": "2c9abce5", "metadata": {}, "source": [ "In this example, we'll also use torchvision models from the [FiftyOne Model Zoo](https://voxel51.com/docs/fiftyone/user_guide/model_zoo/index.html#)." ] }, { "cell_type": "code", "execution_count": null, "id": "c3cc1dd3", "metadata": {}, "outputs": [], "source": [ "!pip install torchvision" ] }, { "cell_type": "markdown", "id": "d8075bb7", "metadata": {}, "source": [ "### Qdrant installation\n", "\n", "Qdrant works in a client-server manner, so before you can use it for search you need a running instance. The easiest way to get one is to use the official Docker image and start Qdrant with a single command:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5fc204a5", "metadata": {}, "outputs": [], "source": [ "!docker run -p \"6333:6333\" -p \"6334:6334\" -d qdrant/qdrant" ] }, { "cell_type": "markdown", "id": "2d959395", "metadata": {}, "source": [ "After running this command, the Qdrant server will be up, with its HTTP API exposed on port 6333 and its gRPC interface on port 6334." ] }, { "cell_type": "markdown", "id": "89e0b5e0", "metadata": {}, "source": [ "## Loading the dataset\n", "\n", "There are several steps we need to take to get things running smoothly. 
First, we need to load the MNIST dataset and extract the training examples from it, as we're going to use them in our search operations. To keep things fast, we won't use all the examples, just 2,500 samples from each split. We can use the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_zoo/index.html) to load the subset of MNIST we want in just one line of code." ] }, { "cell_type": "code", "execution_count": 1, "id": "57c18ce6", "metadata": {}, "outputs": [], "source": [ "import fiftyone as fo\n", "import fiftyone.zoo as foz\n", "import fiftyone.brain as fob" ] }, { "cell_type": "code", "execution_count": 3, "id": "00a6c2ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Split 'train' already downloaded\n", "Split 'test' already downloaded\n", "Loading 'mnist' split 'train'\n", " 100% |███████████████| 2500/2500 [4.1s elapsed, 0s remaining, 685.3 samples/s] \n", "Loading 'mnist' split 'test'\n", " 100% |███████████████| 2500/2500 [4.0s elapsed, 0s remaining, 644.8 samples/s] \n", "Dataset 'mnist-2500' created\n" ] } ], "source": [ "# Load at most 2500 samples from each split of MNIST\n", "dataset = foz.load_zoo_dataset(\"mnist\", max_samples=2500)\n", "\n", "# Keep only the samples tagged as belonging to the train split\n", "train_view = dataset.match_tags(tags=[\"train\"])" ] }, { "cell_type": "markdown", "id": "8a7195b0", "metadata": {}, "source": [ "Let's start by taking a look at the dataset in the [FiftyOne App](https://voxel51.com/docs/fiftyone/user_guide/app.html)." ] }, { "cell_type": "code", "execution_count": 34, "id": "1bc4ed66", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "