{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Qdrant as a vector database for OpenAI embeddings\n", "\n", "This notebook walks you step by step through using **`Qdrant`** as a vector database for OpenAI embeddings. [Qdrant](https://qdrant.tech) is a high-performance vector search database written in Rust. It offers RESTful and gRPC APIs to manage your embeddings. There is an official Python [qdrant-client](https://github.com/qdrant/qdrant_client) that eases the integration with your apps.\n", "\n", "This notebook presents an end-to-end process of:\n", "1. Using precomputed embeddings created by the OpenAI API.\n", "2. Storing the embeddings in a local instance of Qdrant.\n", "3. Converting a raw text query to an embedding with the OpenAI API.\n", "4. Using Qdrant to perform the nearest neighbour search in the created collection.\n", "\n", "### What is Qdrant\n", "\n", "[Qdrant](https://qdrant.tech) is an open-source vector database that stores neural embeddings along with their metadata, a.k.a. [payload](https://qdrant.tech/documentation/payload/). Payloads not only keep additional attributes of a particular point, but can also be used for filtering. [Qdrant](https://qdrant.tech) offers a unique filtering mechanism that is built into the vector search phase, which makes it efficient.\n", "\n", "### Deployment options\n", "\n", "[Qdrant](https://qdrant.tech) can be launched in various ways. Depending on the target load of the application, it can be hosted:\n", "\n", "- Locally or on premise, with Docker containers\n", "- On a Kubernetes cluster, with the [Helm chart](https://github.com/qdrant/qdrant-helm)\n", "- Using [Qdrant Cloud](https://cloud.qdrant.io/)\n", "\n", "### Integration\n", "\n", "[Qdrant](https://qdrant.tech) provides both RESTful and gRPC APIs which makes integration easy, no matter the programming language you use. 
Official clients are also available for the most popular languages; if you use Python, the [Python Qdrant client library](https://github.com/qdrant/qdrant_client) is likely the best choice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "For the purposes of this exercise we need to prepare a few things:\n", "\n", "1. Qdrant server instance. In our case a local Docker container.\n", "2. The [qdrant-client](https://github.com/qdrant/qdrant_client) library to interact with the vector database.\n", "3. An [OpenAI API key](https://platform.openai.com/settings/organization/api-keys).\n", "\n", "### Start Qdrant server\n", "\n", "We're going to use a local Qdrant instance running in a Docker container. The easiest way to launch it is to use the attached [docker-compose.yaml] file and run the following command:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:04:28.198036Z", "start_time": "2023-02-16T12:04:26.950014Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1A\u001b[1B\u001b[0G\u001b[?25l[+] Running 1/0\n", " \u001b[32m✔\u001b[0m Container qdrant-qdrant-1 \u001b[32mRunning\u001b[0m \u001b[34m0.0s \u001b[0m\n", "\u001b[?25h" ] } ], "source": [ "! docker compose up -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can verify that the server started successfully by running a simple curl command:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:04:29.301651Z", "start_time": "2023-02-16T12:04:29.173153Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"title\":\"qdrant - vector search engine\",\"version\":\"1.3.0\"}" ] } ], "source": [ "! 
curl http://localhost:6333" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install requirements\n", "\n", "This notebook requires the `openai` and `qdrant-client` packages, plus a few additional libraries. The following command installs them all:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:05:05.718972Z", "start_time": "2023-02-16T12:04:30.434820Z" } }, "outputs": [], "source": [ "! pip install openai qdrant-client pandas wget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare your OpenAI API key\n", "\n", "The OpenAI API key is used for vectorization of the documents and queries.\n", "\n", "If you don't have an OpenAI API key, you can get one from [https://platform.openai.com/settings/organization/api-keys](https://platform.openai.com/settings/organization/api-keys).\n", "\n", "Once you have your key, add it to your environment variables as `OPENAI_API_KEY` by running the following command. The `%env` magic sets the variable for the notebook kernel itself, whereas a `! export` shell command would only affect a short-lived subshell:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%env OPENAI_API_KEY=your-api-key" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:05:05.730338Z", "start_time": "2023-02-16T12:05:05.723351Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OPENAI_API_KEY is ready\n" ] } ], "source": [ "# Test that your OpenAI API key is correctly set as an environment variable\n", "# Note. if you set the variable in your terminal instead, you will need to reload your terminal and the notebook for the env variables to be live.\n", "import os\n", "\n", "# Note. 
alternatively you can set a temporary env variable like this:\n", "# os.environ[\"OPENAI_API_KEY\"] = \"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"\n", "\n", "if os.getenv(\"OPENAI_API_KEY\") is not None:\n", "    print(\"OPENAI_API_KEY is ready\")\n", "else:\n", "    print(\"OPENAI_API_KEY environment variable not found\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to Qdrant\n", "\n", "Connecting to a running instance of Qdrant server is easy with the official Python library:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:05:06.827143Z", "start_time": "2023-02-16T12:05:05.733771Z" } }, "outputs": [], "source": [ "import qdrant_client\n", "\n", "client = qdrant_client.QdrantClient(\n", "    host=\"localhost\",\n", "    prefer_grpc=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test the connection by running any available method:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:05:06.848488Z", "start_time": "2023-02-16T12:05:06.832612Z" } }, "outputs": [ { "data": { "text/plain": [ "CollectionsResponse(collections=[])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_collections()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n", "\n", "In this section we load data prepared in advance, so you don't have to recompute the embeddings of Wikipedia articles with your own credits." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:05:37.371951Z", "start_time": "2023-02-16T12:05:06.851634Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [......................................................................] 
698933052 / 698933052" ] }, { "data": { "text/plain": [ "'vector_database_wikipedia_articles_embedded (9).zip'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import wget\n", "\n", "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n", "\n", "# The file is ~700 MB so this will take some time\n", "wget.download(embeddings_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The downloaded file then has to be extracted:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:06:01.538851Z", "start_time": "2023-02-16T12:05:37.376042Z" } }, "outputs": [], "source": [ "import zipfile\n", "\n", "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\", \"r\") as zip_ref:\n", "    zip_ref.extractall(\"../data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can load it from the provided CSV file:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:17:35.483972Z", "start_time": "2023-02-16T12:06:01.540172Z" } }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>id</th><th>url</th><th>title</th><th>text</th><th>title_vector</th><th>content_vector</th><th>vector_id</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>1</td><td>https://simple.wikipedia.org/wiki/April</td><td>April</td><td>April is the fourth month of the year in the J...</td><td>[0.001009464613161981, -0.020700545981526375, ...</td><td>[-0.011253940872848034, -0.013491976074874401,...</td><td>0</td></tr>\n",
" <tr><th>1</th><td>2</td><td>https://simple.wikipedia.org/wiki/August</td><td>August</td><td>August (Aug.) is the eighth month of the year ...</td><td>[0.0009286514250561595, 0.000820168002974242, ...</td><td>[0.0003609954728744924, 0.007262262050062418, ...</td><td>1</td></tr>\n",
" <tr><th>2</th><td>6</td><td>https://simple.wikipedia.org/wiki/Art</td><td>Art</td><td>Art is a creative activity that expresses imag...</td><td>[0.003393713850528002, 0.0061537534929811954, ...</td><td>[-0.004959689453244209, 0.015772193670272827, ...</td><td>2</td></tr>\n",
" <tr><th>3</th><td>8</td><td>https://simple.wikipedia.org/wiki/A</td><td>A</td><td>A or a is the first letter of the English alph...</td><td>[0.0153952119871974, -0.013759135268628597, 0....</td><td>[0.024894846603274345, -0.022186409682035446, ...</td><td>3</td></tr>\n",
" <tr><th>4</th><td>9</td><td>https://simple.wikipedia.org/wiki/Air</td><td>Air</td><td>Air refers to the Earth's atmosphere. Air is a...</td><td>[0.02224554680287838, -0.02044147066771984, -0...</td><td>[0.021524671465158463, 0.018522677943110466, -...</td><td>4</td></tr>\n",
" </tbody>\n", "</table>\n", "
" ], "text/plain": [ " id url title \\\n", "0 1 https://simple.wikipedia.org/wiki/April April \n", "1 2 https://simple.wikipedia.org/wiki/August August \n", "2 6 https://simple.wikipedia.org/wiki/Art Art \n", "3 8 https://simple.wikipedia.org/wiki/A A \n", "4 9 https://simple.wikipedia.org/wiki/Air Air \n", "\n", " text \\\n", "0 April is the fourth month of the year in the J... \n", "1 August (Aug.) is the eighth month of the year ... \n", "2 Art is a creative activity that expresses imag... \n", "3 A or a is the first letter of the English alph... \n", "4 Air refers to the Earth's atmosphere. Air is a... \n", "\n", " title_vector \\\n", "0 [0.001009464613161981, -0.020700545981526375, ... \n", "1 [0.0009286514250561595, 0.000820168002974242, ... \n", "2 [0.003393713850528002, 0.0061537534929811954, ... \n", "3 [0.0153952119871974, -0.013759135268628597, 0.... \n", "4 [0.02224554680287838, -0.02044147066771984, -0... \n", "\n", " content_vector vector_id \n", "0 [-0.011253940872848034, -0.013491976074874401,... 0 \n", "1 [0.0003609954728744924, 0.007262262050062418, ... 1 \n", "2 [-0.004959689453244209, 0.015772193670272827, ... 2 \n", "3 [0.024894846603274345, -0.022186409682035446, ... 3 \n", "4 [0.021524671465158463, 0.018522677943110466, -... 4 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "from ast import literal_eval\n", "\n", "article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')\n", "# Read vectors from strings back into a list\n", "article_df[\"title_vector\"] = article_df.title_vector.apply(literal_eval)\n", "article_df[\"content_vector\"] = article_df.content_vector.apply(literal_eval)\n", "article_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Index data\n", "\n", "Qdrant stores data in __collections__ where each object is described by at least one vector and may contain an additional metadata called __payload__. 
Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors. Qdrant does not require you to define a payload schema beforehand, so once the collection and its vector configuration exist you can freely add points to it.\n", "\n", "We will start by creating a collection, and then we will fill it with our precomputed embeddings." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:17:36.366066Z", "start_time": "2023-02-16T12:17:35.486872Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from qdrant_client.http import models as rest\n", "\n", "vector_size = len(article_df[\"content_vector\"][0])\n", "\n", "client.create_collection(\n", "    collection_name=\"Articles\",\n", "    vectors_config={\n", "        \"title\": rest.VectorParams(\n", "            distance=rest.Distance.COSINE,\n", "            size=vector_size,\n", "        ),\n", "        \"content\": rest.VectorParams(\n", "            distance=rest.Distance.COSINE,\n", "            size=vector_size,\n", "        ),\n", "    }\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:30:37.518210Z", "start_time": "2023-02-16T12:17:36.368564Z" } }, "outputs": [ { "data": { "text/plain": [ "UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.upsert(\n", "    collection_name=\"Articles\",\n", "    points=[\n", "        rest.PointStruct(\n", "            id=k,\n", "            vector={\n", "                \"title\": v[\"title_vector\"],\n", "                \"content\": v[\"content_vector\"],\n", "            },\n", "            payload=v.to_dict(),\n", "        )\n", "        for k, v in article_df.iterrows()\n", "    ],\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:30:40.675202Z", "start_time": "2023-02-16T12:30:40.655654Z" } }, "outputs": [ { "data": { "text/plain": [ 
"CountResult(count=25000)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the collection size to make sure all the points have been stored\n", "client.count(collection_name=\"Articles\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search data\n", "\n", "Once the data is put into Qdrant we will start querying the collection for the closest vectors. We may provide an additional parameter `vector_name` to switch from title to content based search. Since the precomputed embeddings were created with `text-embedding-ada-002` OpenAI model we also have to use it during search.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:30:38.024370Z", "start_time": "2023-02-16T12:30:37.712816Z" } }, "outputs": [], "source": [ "from openai import OpenAI\n", "\n", "openai_client = OpenAI()\n", "\n", "def query_qdrant(query, collection_name, vector_name=\"title\", top_k=20):\n", " # Creates embedding vector from user query\n", " embedded_query = openai_client.embeddings.create(\n", " input=query,\n", " model=\"text-embedding-ada-002\",\n", " ).data[0].embedding\n", "\n", " query_results = client.search(\n", " collection_name=collection_name,\n", " query_vector=(\n", " vector_name, embedded_query\n", " ),\n", " limit=top_k,\n", " )\n", "\n", " return query_results" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:30:39.379566Z", "start_time": "2023-02-16T12:30:38.031041Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1. Museum of Modern Art (Score: 0.875)\n", "2. Western Europe (Score: 0.867)\n", "3. Renaissance art (Score: 0.864)\n", "4. Pop art (Score: 0.86)\n", "5. Northern Europe (Score: 0.855)\n", "6. Hellenistic art (Score: 0.853)\n", "7. Modernist literature (Score: 0.847)\n", "8. Art film (Score: 0.843)\n", "9. Central Europe (Score: 0.843)\n", "10. 
European (Score: 0.841)\n", "11. Art (Score: 0.841)\n", "12. Byzantine art (Score: 0.841)\n", "13. Postmodernism (Score: 0.84)\n", "14. Eastern Europe (Score: 0.839)\n", "15. Cubism (Score: 0.839)\n", "16. Europe (Score: 0.839)\n", "17. Impressionism (Score: 0.838)\n", "18. Bauhaus (Score: 0.838)\n", "19. Surrealism (Score: 0.837)\n", "20. Expressionism (Score: 0.837)\n" ] } ], "source": [ "query_results = query_qdrant(\"modern art in Europe\", \"Articles\")\n", "for i, article in enumerate(query_results):\n", "    print(f\"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2023-02-16T12:30:40.652676Z", "start_time": "2023-02-16T12:30:39.382555Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1. Battle of Bannockburn (Score: 0.869)\n", "2. Wars of Scottish Independence (Score: 0.861)\n", "3. 1651 (Score: 0.852)\n", "4. First War of Scottish Independence (Score: 0.85)\n", "5. Robert I of Scotland (Score: 0.846)\n", "6. 841 (Score: 0.844)\n", "7. 1716 (Score: 0.844)\n", "8. 1314 (Score: 0.837)\n", "9. 1263 (Score: 0.836)\n", "10. William Wallace (Score: 0.835)\n", "11. Stirling (Score: 0.831)\n", "12. 1306 (Score: 0.831)\n", "13. 1746 (Score: 0.83)\n", "14. 1040s (Score: 0.828)\n", "15. 1106 (Score: 0.827)\n", "16. 1304 (Score: 0.826)\n", "17. David II of Scotland (Score: 0.825)\n", "18. Braveheart (Score: 0.824)\n", "19. 1124 (Score: 0.824)\n", "20. Second War of Scottish Independence (Score: 0.823)\n" ] } ], "source": [ "# This time we'll query using content vector\n", "query_results = query_qdrant(\"Famous battles in Scottish history\", \"Articles\", \"content\")\n", "for i, article in enumerate(query_results):\n", "    print(f\"{i + 1}. 
{article.payload['title']} (Score: {round(article.score, 3)})\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 4 }