{ "cells": [ { "cell_type": "markdown", "id": "f1036493-e181-41cc-8788-8c4012c40ae5", "metadata": {}, "source": [ "# So How Does Retrieval Augmented Generation Work?\n", "\n", "\n", "\n", "RAG, or Retrieval Augmented Generation, can be understood most simply as providing relevant, up-to-date information alongside a question or query to get an accurate answer or action back from a large language model.\n", "\n", "Why is this important? LLMs are incredibly capable and knowledgeable systems already, but they do not have access to up-to-date, domain-specific, or proprietary information. RAG-based systems build on top of LLMs' intrinsic knowledge by providing the right context at the right time to enrich and improve responses. This often leads to more accurate and \"correct\" responses when building systems for niche or esoteric data.\n", "\n", "In this notebook we'll cover an intuitive approach to understanding how RAG systems work, for the curious yet daunted reader.\n", "\n", "*Note: Some IFrames and graphs do not render on GitHub*" ] }, { "cell_type": "markdown", "id": "014369a4-0405-4d4c-a0f3-b1f92e5eeedd", "metadata": {}, "source": [ "---\n", "\n", "## Setup Functions & Imports\n", "\n", "These imports and functions set up the examples below."
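,
"\n",
"If you want to run this notebook yourself, the examples assume the following packages are installed (these are the standard PyPI names for the libraries imported below):\n",
"\n",
"```shell\n",
"pip install openai chromadb tiktoken sentence-transformers langchain-text-splitters\n",
"```\n",
"\n",
"An `OPENAI_API_KEY` environment variable is also assumed, since the OpenAI client reads its key from there by default."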
] }, { "cell_type": "code", "execution_count": null, "id": "13ebd1c5-3fac-4245-a2c9-2dee22452776", "metadata": {}, "outputs": [], "source": [ "# ========= Vector Database Setup =========\n", "\n", "import chromadb\n", "from langchain_text_splitters import MarkdownTextSplitter\n", "\n", "# Instantiate the Chroma Client\n", "chroma_client = chromadb.PersistentClient(path=\"./vector_database\")\n", "\n", "# Create a Collection\n", "collection = chroma_client.get_or_create_collection(name=\"BMV080\")\n", "\n", "# Instantiate Splitter\n", "splitter = MarkdownTextSplitter.from_tiktoken_encoder(\n", " encoding_name=\"cl100k_base\", # Encoder used by the GPT-4 model family\n", " chunk_size=1200,\n", " chunk_overlap=400,\n", " strip_whitespace=True\n", ")\n", "\n", "# Load Markdown File\n", "with open(\"./documents/bmv080-ds.md\", 'r', encoding='utf-8') as file:\n", " text = file.read()\n", "\n", "# Split text\n", "chunks = splitter.split_text(text)\n", "\n", "# Embed Chunks to the Collection\n", "collection.add(\n", " documents=chunks,\n", " ids=[str(i) for i in range(len(chunks))]\n", ")" ] }, { "cell_type": "code", "execution_count": 1, "id": "23f24661-f606-4798-b4a9-a5993e9e599a", "metadata": {}, "outputs": [], "source": [ "# ========= Notebook Helper Functions ==========\n", "\n", "import openai\n", "import chromadb\n", "import tiktoken\n", "from IPython.display import display, Markdown, HTML, IFrame\n", "from sentence_transformers import SentenceTransformer\n", "from langchain_text_splitters import MarkdownTextSplitter\n", "\n", "embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n", "chroma_client = chromadb.PersistentClient(path=\"./vector_database\")\n", "collection = chroma_client.get_or_create_collection(name=\"BMV080\")\n", "openai_client = openai.OpenAI()\n", "\n", "# Simple OpenAI API Caller\n", "def query_openai(prompt):\n", "\n", " response = openai_client.chat.completions.create(\n", " model=\"gpt-4o\",\n", " messages=[\n", " {\n", " \"role\": 
\"system\",\n", " \"content\": \"You are a helpful Bosch assistant. Answer questions fully but succinctly and accurately\"\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"text\", \"text\": prompt},\n", " ]\n", " }\n", " ],\n", " max_tokens=4000,\n", " temperature=0.1\n", " )\n", " \n", " return response.choices[0].message.content.strip()\n", "\n", "def retrieve_docs(query, collection=\"BMV080\", n=5):\n", " # Load Chroma Collection\n", " collection = chroma_client.get_or_create_collection(name=collection)\n", "\n", " # Perform semantic search\n", " results = collection.query(\n", " query_texts=[query],\n", " n_results=n\n", " )\n", "\n", " # Zip documents and distances together into dicts\n", " docs = results[\"documents\"][0]\n", " scores = results[\"distances\"][0]\n", "\n", " # Combine into list of dicts\n", " return [{\"document\": doc, \"score\": score} for doc, score in zip(docs, scores)]\n", "\n", "def rag_response(query):\n", "\n", " context = retrieve_docs(query)\n", "\n", " prompt = f\"\"\"Use the provided up-to-date context to answer the question\n", "\n", "Retrieved Context:\n", "{context}\n", "\n", "Question: {query}\n", "\"\"\"\n", "\n", " response = query_openai(prompt)\n", "\n", " return response\n", "\n", "def pprint(text):\n", " display(Markdown(text))" ] }, { "cell_type": "markdown", "id": "efbd748c-fd97-4f3d-a347-a302ca5312e1", "metadata": {}, "source": [ "---\n", "## The Importance of Retrieval Augmented Generation\n", "\n", "\n", "\n", "To demonstrate the importance of RAG systems in AI systems, let's see how a large language model handles a domain specific question both with and without RAG. Our scenario will be specific questions about the Bosch particulate matter sensor BMV080, a highly specialized air quality sensor with plenty of specific specs. This is a perfect example not only because it's niche, but because it was released in January 2025. 
We'll be using [gpt-4o](https://platform.openai.com/docs/models/gpt-4o), the primary model behind ChatGPT, as our example LLM. It has a knowledge cutoff of October 1st, 2023, so the base model (without access to web searching capabilities) should have no clue that this product even exists, let alone its technical specs.\n", "\n", "We'll be asking: *What is the maximum power consumption of the BMV080 in continuous measurement mode?*" ] }, { "cell_type": "code", "execution_count": 2, "id": "2c490cac-88be-4dc2-8e67-dd994efac6fc", "metadata": {}, "outputs": [], "source": [ "question = \"What is the maximum power consumption of the BMV080 in continuous measurement mode?\"" ] }, { "cell_type": "markdown", "id": "0955dc59-d269-4983-a574-5beb32bed23d", "metadata": {}, "source": [ "### *Without* Retrieval Augmented Generation (RAG)" ] }, { "cell_type": "code", "execution_count": 3, "id": "a0d25a58-7885-4bf1-a5e7-6d0366691148", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "**Query:** What is the maximum power consumption of the BMV080 in continuous measurement mode?\n", "\n", "**Response:** The maximum power consumption of the BMV080 in continuous measurement mode is 1.3 mA.\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "response = query_openai(question)\n", "\n", "pprint(f\"\"\"\n", "**Query:** {question}\n", "\n", "**Response:** {response}\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "ff388d99-f8a4-4f65-8477-e132a2f888ee", "metadata": {}, "source": [ "### *With* Retrieval Augmented Generation (RAG)" ] }, { "cell_type": "code", "execution_count": 4, "id": "997cad8d-f85e-4db3-bd2d-82ad3cb31cc2", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "**Query:** What is the maximum power consumption of the BMV080 in continuous measurement mode?\n", "\n", "**Response:** The maximum power consumption of the BMV080 in continuous measurement mode is 181.9 mW.\n" ], "text/plain": [ "" ] },
"metadata": {}, "output_type": "display_data" } ], "source": [ "response = rag_response(question)\n", "\n", "pprint(f\"\"\"\n", "**Query:** {question}\n", "\n", "**Response:** {response}\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "048fe745-6833-472c-a11f-5e459b7d1cdb", "metadata": {}, "source": [ "---\n", "\n", "Now let's compare to the answer from the [documentation](https://www.bosch-sensortec.com/media/boschsensortec/downloads/datasheets/bst-bmv080-ds000.pdf):\n", "\n", "\n", "\n", "RAG wins! " ] }, { "cell_type": "markdown", "id": "f3611c19-7c93-4eac-8dbd-4ab977d6bcbf", "metadata": {}, "source": [ "---\n", "## Knowledgebase Preparation\n", "\n", "All of the relevant domain-specific knowledge that you want to be able to use and provide as context is what's referred to as the **knowledgebase**. It is essentially a specially prepared collection of all of your unstructured data. Unstructured data is what's found in the files we use and create day to day, i.e. PowerPoints, Word documents, emails, Excel files, images, recordings, etc. All of this needs to be formatted for efficient LLM ingestion and retrieval. But first, some context:\n", "\n", "### Text Based Processing\n", "\n", "Large language models' primary method of processing is raw text. There are some conversions back and forth between text and numbers for the actual processing (called tokenization!) but the idea remains consistent.\n", "\n", "Tokenization in action, via [OpenAI's tokenization visualizer](https://platform.openai.com/tokenizer):\n", "\n", "\n", "\n", "\n", "While this text rule largely holds, some more modern models are starting to accept **multimodal** inputs like text, video, audio, images, and more directly:\n", "\n", "\n", "\n", "But for the most part, we need to convert our unstructured data **into text-based formats** as a first step. Let's take a look at what that looks like for the prior datasheet example.
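,
"\n",
"As a quick sketch of that tokenization step, here is roughly what the conversion between text and token IDs looks like using `tiktoken` (the same tokenizer library used later in this notebook; `cl100k_base` is the encoding used by the GPT-4 model family):\n",
"\n",
"```python\n",
"import tiktoken\n",
"\n",
"# Encode a sentence into integer token IDs, then decode back to text\n",
"enc = tiktoken.get_encoding(\"cl100k_base\")\n",
"tokens = enc.encode(\"The BMV080 is a particulate matter sensor.\")\n",
"\n",
"print(tokens)              # a list of integer token IDs\n",
"print(enc.decode(tokens))  # losslessly round-trips back to the original text\n",
"```"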
] }, { "cell_type": "code", "execution_count": 5, "id": "caf6c140-2e97-49ae-b935-d0e63fe4155e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "IFrame(\"documents/bmv080-ds.pdf\", width=1200, height=600)" ] }, { "cell_type": "markdown", "id": "5c085e68-6f40-44ca-a2bb-6d13b78ef62e", "metadata": {}, "source": [ "We can already take the [website](https://www.bosch-sensortec.com/products/environmental-sensors/particulate-matter-sensor/bmv080/#documents) and download the datasheet as a [PDF](https://www.bosch-sensortec.com/media/boschsensortec/downloads/datasheets/bst-bmv080-ds000.pdf) via the provided link. But we still need this in an ingestible text format! We can scrape the text pretty easily, but there are also pictures, tables, and other elements that would be nice to capture.\n", "\n", "For this toy example I wrote a quick vision language model based OCR script to convert the PDF into formatted raw text known as Markdown. After running that we get the following output:" ] }, { "cell_type": "code", "execution_count": 6, "id": "ec42978c-6cb3-4927-a08e-64f943e26c1c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "IFrame(\"documents/bmv080-ds.md\", width=1200, height=600)" ] }, { "cell_type": "markdown", "id": "72dcdf4d-3dd1-47f6-b87c-0e718fe96bb0", "metadata": {}, "source": [ "This data processing step is usually (somewhat) customized to the specific type of data, and it would be repeated across whatever file types you're working with. 
General text scraping is usually the initial approach, but more specialized approaches like my vision language model transformation can enrich or convert data into more effective formats.\n", "\n", "But once you have your data in an LLM-ready format, there's still one slight issue. LLMs have what's called a [context window](https://www.ibm.com/think/topics/context-window), or an upper limit to the amount of text you can actually pass in as context. While these are increasing as new technology is developed (1 million+ tokens in some!), when working with enterprise-scale knowledgebases it is not feasible or cost effective to provide all context at all times for every question. So what we need to do is break our text up into bite-sized pieces, otherwise known as **chunking**.\n", "\n", "### Chunking\n", "\n", "\n", "\n", "There are many different approaches to chunking, but the most popular is a token-based limit. Often this involves initially splitting the text by common separators like periods, line breaks, paragraph starts, etc. and then combining these splits into chunks that respect a specific token limit.\n", "\n", "I used a prebuilt Markdown-based splitter that uses Markdown headers to do the initial splitting, then respects a token-based chunk size. 
Let's see this in action:" ] }, { "cell_type": "code", "execution_count": 7, "id": "888db53f-176b-4f5e-b92c-be3456fca493", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Token Count**: 22081" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with open(\"./documents/bmv080-ds.md\", 'r', encoding='utf-8') as file:\n", " text = file.read()\n", "\n", "# Check how many tokens\n", "encoder = tiktoken.get_encoding(\"cl100k_base\")\n", "tokens = encoder.encode(text)\n", "pprint(f\"**Token Count**: {len(tokens)}\")" ] }, { "cell_type": "markdown", "id": "6215b158-984d-45fa-b9f5-ed26e1d43668", "metadata": {}, "source": [ "Given our document size of ~22k tokens, let's split it into 1200-token chunks" ] }, { "cell_type": "code", "execution_count": 8, "id": "a8e89d8d-008e-445a-bcc0-1de1f27738e5", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Chunk Count**: 28" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load splitter\n", "splitter = MarkdownTextSplitter.from_tiktoken_encoder(\n", " encoding_name=\"cl100k_base\",\n", " chunk_size=1200,\n", " chunk_overlap=400,\n", " strip_whitespace=True\n", ")\n", "\n", "# Chunk the text\n", "chunks = splitter.split_text(text)\n", "\n", "# Check how many chunks made\n", "pprint(f\"**Chunk Count**: {len(chunks)}\")" ] }, { "cell_type": "markdown", "id": "f5e4fdce-97c7-4cdd-86d5-53e7759bd0d4", "metadata": {}, "source": [ "We end up with 28 total chunks of the datasheet! 
Let's look at what one looks like" ] }, { "cell_type": "code", "execution_count": 10, "id": "ddeda203-c482-4478-8e38-7beed50441f1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**Summary** \n", "Close the sensor unit.\n", "\n", "**Precondition** \n", "Must be called last to destroy the handle created by bmv080_open.\n", "\n", "**Postcondition** \n", "N/A\n", "\n", "**Arguments**\n", "\n", "| Argument | Description |\n", "|----------|---------------------------------|\n", "| * handle | Unique handle for a sensor unit |\n", "\n", "**Return Value** \n", "E_BMV080_OK if successful. Otherwise, the return value is a BMV080 status code.\n", "\n", "\n", "---\n", "\n", "\n", "# 5.2.4 Sensor Identification\n", "\n", "## 5.2.4.1 bmv080_get_sensor_id\n", "\n", "**Function**\n", "\n", "\n", "bmv080_status_code_t bmv080_get_sensor_id\n", "(\n", " const bmv080_handle_t handle,\n", " char id[13]\n", ");\n", "\n", "\n", "**Summary** \n", "Get the sensor ID of a sensor unit.\n", "\n", "**Precondition** \n", "A valid handle generated by `bmv080_open` is required. The application must have allocated the char array id with a size of 13 elements.\n", "\n", "**Postcondition** \n", "N/A\n", "\n", "**Arguments**\n", "\n", "| Argument | Description |\n", "|----------|---------------------------------------|\n", "| handle | Unique handle for a sensor unit |\n", "| id | Character array of 13 elements |\n", "\n", "**Return Value** \n", "E_BMV080_OK if successful. Otherwise, the return value is a BMV080 status code.\n", "\n", "# 5.2.5 Particulate Matter Measurement\n", "\n", "## 5.2.5.1 User Application Flows\n", "\n", "This section introduces two typical types of application flow by means of examples.\n", "\n", "### 5.2.5.1.1 Continuous Measurement\n", "\n", "[Figure 37 presents an activity diagram illustrating the process of conducting a continuous measurement, which is an unlimited duration measurement. 
The application is required to execute the sub-programs (highlighted in purple) from the BMV080 library.]\n", "\n", "The measurement process begins by establishing a connection with the BMV080. Following this, measurement parameters can be set using the `bmv080_set_parameter()` function.\n", "\n", "Initially, the sensor operates in sleep mode, drawing only standby current (as detailed in Table 12). A measurement is initiated by calling `bmv080_start_continuous_measurement()`, which activates the continuous measurement of particle density.\n", "\n", "The service function `bmv080_serve_interrupt()` fetches and processes data from the BMV080. Sensor output is provided through the `callback function()`, which is triggered every second and is implemented on application level.\n", "\n", "The `bmv080_serve_interrupt()` function can be invoked either at regular intervals (at least once per second) or event-driven, based on an external interrupt.\n", "\n", "The measurement process can be halted by calling `bmv080_stop_measurement()`. This action puts the BMV080 back into sleep mode, reducing the current consumption to standby levels.\n", "\n", "This setup allows for the implementation of various end-user applications. For instance, PM data can be logged into a database, displayed in real-time, or streamed directly to the cloud.\n", "\n", "\n", "---\n", "\n", "\n", "# Bosch Sensortec | BMV080 Ultra-mini Particulate Matter Sensor – Datasheet\n", "\n", "## Application BMV080 API\n", "\n", "[Flow diagram of continuous measurement]\n", "\n", "**Figure 37: Flow diagram of continuous measurement**\n", "\n", "A detailed explanation of each sub-program definition is in the API specification above.\n", "\n", "### Timing sequence during continuous measurement\n", "\n", "Figure 38 depicts the timing sequence for initiating a measurement. 
After the measurement process is initiated by calling `bmv080_start_continuous_measurement()`, the first measurement will be ready after a delay of 1.9 seconds. The sensor will then provide a new measurement every 1.03 seconds. If the IRQ line is used to trigger the `bmv080_serve_interrupt()` function, the first measurement will be ready after a longer delay of 2.9 seconds. However, subsequent measurements will continue to be available at regular intervals of 1.03 seconds.\n", "\n", "---\n", "\n", "© Bosch Sensortec GmbH 2025 | All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights\n", "\n", "Document number: BST-BMV080-DS000-11\n", "\n", "\n", "---\n", "\n", "\n", "## 5.2.5.1.2 Duty Cycling Measurement\n", "\n", "Figure 39 is an activity diagram that shows how to perform a duty cycling measurement – repeating numerous measurements separated by a pause. The main difference from continuous measurement is the duty cycling measurement will pause before repeating the next measurement cycle.\n", "\n", "For duty cycling measurement, the service function `bmv080_serve_interrupt()` must be invoked at regular intervals, at least once per second.\n", "\n", "**Note:** The event-driven approach using an external interrupt is not compatible with duty cycling measurement.\n", "\n", "The period at which new data becomes available is determined by the `duty_cycling_period` parameter (refer to the `bmv080_set_parameter()` function for more details). During the sleep time of the duty cycling period, the sensor will be in sleep mode, limiting current consumption to standby levels. For more details, see Section 2.4.2.\n", "\n", "Data availability is signaled through the callback function `bmv080_data_ready_callback()`, which is triggered once every `duty_cycling_period` has elapsed.\n", "\n", "This setup allows for the implementation of various end-user applications. 
For instance, PM data can be logged into a database, displayed in real-time, or streamed directly to the cloud.\n", "\n", "![Figure 38: Start-up sequence in case of Continuous Measurement Mode](#)\n", "\n", "![Figure 39: Flow diagram of a duty cycling measurement](#)\n", "\n", "\n", "---\n", "\n", "\n", "# Timing Sequence During Duty Cycling Measurement\n" ] } ], "source": [ "print(chunks[20])" ] }, { "cell_type": "markdown", "id": "1fc15c10-afea-43e4-bb47-f69af556a573", "metadata": {}, "source": [ "Great! But now that our knowledge is split across chunks, we face a different challenge. Since we can't pass all context in at all times, we need a way to find the most relevant chunk(s) based on the questions or inputs into the system." ] }, { "cell_type": "markdown", "id": "fcb07fe7-7db8-4bf8-b488-d532db7fc14d", "metadata": {}, "source": [ "## Retrieval\n", "\n", "So how do you determine what's relevant to answer a question? We need a system that can do database-style retrieval of these chunks, but based on relevancy!\n", "\n", "That's where **embeddings** come in: a more advanced concept, but crucial to the core of *Retrieval* in retrieval augmented generation.\n", "\n", "### Embeddings\n", "\n", "The goal of relevancy- or similarity-based retrieval is to surface the information needed to answer the question as accurately as possible. To do this we need to find which chunks are relevant (or similar) to the query being input. For example, when asking *What is the maximum power consumption of the BMV080 in continuous measurement mode?*, we want to surface chunk ID **6**, which contains the answer to this question.\n", "\n", "\n", "\n", "The first step is to **encode** the text into a numerical representation known as a **text embedding**. This is done with the help of a separate language model known as a sentence transformer. These are smaller models that have been trained on fill-in-the-blank style language prediction tasks.\n", "\n", "" ] }, { "cell_type": "markdown", "id": "3932b26a-cba5-460e-ad30-414f618b08be", "metadata": {}, "source": [ "Through scaled deep learning, relying on the core transformer architecture and attention mechanisms, these models gain the ability to create a representation of sentences that captures the underlying semantics of the text, conditional on the entire sentence.\n", "\n", "\n", "\n", "Let's see what this looks like real quick, using one of the most popular embedding models available, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)" ] }, { "cell_type": "code", "execution_count": 11, "id": "a6bb1167-a1ce-4a8e-bef5-1affcb28a1fe", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**Length of Representation**: 384 Dimensions" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "**First 10 Dimensions**: [ 0.00887027 0.06133682 -0.06260985 0.03094546 -0.06726976 0.00943975\n", " -0.00361226 0.04340531 -0.0748492 -0.00466658] ..." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"What is the maximum power consumption of the BMV080 in continuous measurement mode?\"\n", "\n", "representation = embedding_model.encode(query)\n", "\n", "pprint(f\"**Length of Representation**: {len(representation)} Dimensions\")\n", "pprint(f\"**First 10 Dimensions**: {representation[:10]} ...\")" ] }, { "cell_type": "markdown", "id": "2b7c1ce2-905f-4951-b4c2-7a2a7ee16451", "metadata": {}, "source": [ "Now, why is having a semantically rich numerical representation useful? Let's dig a little further into the \"dimensionality\" of this. 
Intuitively, these dimensions are similar to the dimensions we already understand, i.e. 1D, 2D, 3D, except this time we're capturing many dimensions of the concepts, ideas, and meanings within the text sequence through the machine learning model. The interesting part is that, in a way, these can be projected down into lower dimensions that we can \"see.\"\n", "\n", "Let's take some different categories of words, embed them, and see what that looks like when reduced to 3 dimensions:" ] }, { "cell_type": "code", "execution_count": 12, "id": "552ec7d3-20cf-48c7-af36-27fd33eb670a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] }, { "data": { "text/html": [ " \n", " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hoverinfo": "text", "hovertext": [ "X: 48.34, Y: -14.55, Z: -50.81", "X: 43.14, Y: -26.78, Z: -68.24", "X: 6.68, Y: -85.68, Z: -43.61", "X: 28.05, Y: -37.99, Z: -32.72", "X: 27.41, Y: -62.79, Z: -34.83", "X: 14.80, Y: -98.60, Z: -60.90", "X: 33.25, Y: -93.30, Z: -68.77", "X: 39.23, Y: -85.56, Z: -7.81", "X: 47.79, Y: -125.56, Z: -42.26", "X: 50.24, Y: -107.65, Z: -38.79", "X: 14.40, Y: -100.28, Z: -11.53", "X: 1.34, Y: -98.60, Z: -25.39", "X: 76.63, Y: -3.37, Z: -24.59", "X: 27.94, Y: -20.75, Z: -22.90", "X: 28.76, Y: -4.90, Z: -3.05", "X: 66.85, Y: -30.15, Z: -10.24", "X: 64.84, Y: -17.19, Z: -36.86", "X: 63.56, Y: -81.66, Z: -6.38", "X: 35.76, Y: 4.82, Z: -39.97", "X: 3.33, Y: -54.67, Z: -45.64", "X: 71.28, Y: -55.62, Z: -26.90", "X: 64.18, Y: -64.98, Z: 
-47.85", "X: 53.95, Y: -45.17, Z: -16.50", "X: 44.21, Y: -33.54, Z: 2.26", "X: 25.54, Y: -65.30, Z: 4.14" ], "marker": { "color": "red", "opacity": 0.7, "size": 5 }, "mode": "markers+text", "name": "Animal", "text": [ "dog", "cat", "elephant", "lion", "tiger", "giraffe", "zebra", "penguin", "koala", "kangaroo", "dolphin", "whale", "bear", "wolf", "fox", "rabbit", "deer", "squirrel", "owl", "eagle", "snake", "crocodile", "turtle", "frog", "butterfly" ], "textfont": { "size": 12 }, "textposition": "top center", "type": "scatter3d", "x": { "bdata": "ll9BQjCMLEKLudVA3GTgQXpJ20Hgy2xBUfsEQhXwHEKBKD9CYfJIQvRRZkGqp6s/TECZQhCN30EiGeZBn7GFQpasgUJrQX5CAg8PQhVUVUCjj45CSFuAQgDNV0IF1TBChFHMQQ==", "dtype": "f4" }, "y": { "bdata": "+8RowUE11sG1W6vCavkXwqMse8KDMcXCSpm6wu4dq8I0HvvCTk/XwuWNyMJBMsXCKvBXwIH5pcFzt5zAijvxwceJicGNUKPCekOaQAWpWsJFel7CP/OBwoytNMIELQbCjZqCwg==", "dtype": "f4" }, "z": { "bdata": "10FLwhB5iMIfcy7CstwCwptRC8LpmnPCVYyJwpDs+cBZCSnC3icbwkWEOMEiGMvB87vEwRgqt8H+90LADuQjwT91E8J8QMzAYuIfwnOONsK5LdfBJWs/wqoDhMGQcxBARImEQA==", "dtype": "f4" } }, { "hoverinfo": "text", "hovertext": [ "X: 65.01, Y: 31.05, Z: -2.62", "X: 77.72, Y: 16.58, Z: 12.95", "X: 42.64, Y: 22.10, Z: -0.79", "X: 84.04, Y: 61.45, Z: -16.30", "X: 108.85, Y: 46.56, Z: -6.12", "X: 102.95, Y: 36.91, Z: -23.77", "X: 80.72, Y: 21.97, Z: -26.19", "X: 105.80, Y: 68.77, Z: -29.34", "X: 122.21, Y: 55.60, Z: -37.26", "X: 130.61, Y: 60.34, Z: -7.13", "X: 105.37, Y: 67.55, Z: 29.70", "X: 110.50, Y: 69.84, Z: 59.12", "X: 97.08, Y: 81.56, Z: 14.94", "X: 78.40, Y: 71.29, Z: 29.24", "X: 125.88, Y: 65.41, Z: 13.54", "X: 133.44, Y: 85.10, Z: 36.45", "X: 112.25, Y: 86.72, Z: 44.82", "X: 84.69, Y: -11.89, Z: 10.08", "X: 81.24, Y: 94.09, Z: 34.90", "X: 68.15, Y: 117.27, Z: 37.25", "X: 68.07, Y: 104.04, Z: 63.37", "X: 63.59, Y: 80.29, Z: 58.29", "X: 72.90, Y: 79.13, Z: 75.11", "X: 59.44, Y: -31.37, Z: 21.03", "X: 59.82, Y: 22.73, Z: -45.97" ], "marker": { "color": "green", "opacity": 0.7, "size": 5 }, "mode": "markers+text", 
"name": "Food", "text": [ "apple", "banana", "orange", "grape", "strawberry", "peach", "pear", "mango", "pineapple", "watermelon", "carrot", "broccoli", "tomato", "potato", "cucumber", "lettuce", "spinach", "rice", "pasta", "bread", "cheese", "chicken", "beef", "fish", "egg" ], "textfont": { "size": 12 }, "textposition": "top center", "type": "scatter3d", "x": { "bdata": "kgaCQgdwm0KgkipC3hOoQgay2UKK581CmXChQuqb00LqbPRCEZ0CQza90kKn/txCOifCQi/PnEKbxPtCxXAFQx5/4EK8XqlC6nyiQsFMiEK4I4hCg1d+QnfLkUKnwW1CNkZvQg==", "dtype": "f4" }, "y": { "bdata": "WW/4QYWghEE60bBB3851QgRAOkKjnxNCfbuvQVWKiUKqY15CdV1xQmcZh0IYrotC1SCjQoiWjkIB04JC2TSqQuRyrULZUD7BDC68QjGL6kJ+E9BCLJKgQrJBnkLN8PrB09i1QQ==", "dtype": "f4" }, "z": { "bdata": "23MnwCMlT0FU/Uq/JWSCwWuzw8C8L77BnYrRwdS36sH6DBXCnizkwOii7UHDemxC5BRvQYrq6UGrpFhBOdERQt9HM0LsWSFB0pcLQmD7FEIDeH1CHydpQs81lkKNRahBmuI3wg==", "dtype": "f4" } }, { "hoverinfo": "text", "hovertext": [ "X: -17.12, Y: 35.94, Z: -121.78", "X: -1.16, Y: 25.83, Z: -91.08", "X: -79.83, Y: 71.08, Z: -46.60", "X: -35.88, Y: 41.12, Z: -110.37", "X: -21.71, Y: 61.79, Z: -56.00", "X: -22.95, Y: 75.32, Z: -40.74", "X: -16.00, Y: 69.75, Z: -79.58", "X: -89.21, Y: 66.68, Z: -102.16", "X: 8.28, Y: 37.07, Z: -106.13", "X: -85.48, Y: 42.28, Z: -93.57", "X: -85.77, Y: 29.49, Z: -78.93", "X: -62.24, Y: 65.00, Z: -31.52", "X: -15.87, Y: 39.55, Z: -54.45", "X: -28.43, Y: 47.45, Z: -73.14", "X: -35.17, Y: 99.56, Z: -54.78", "X: -73.37, Y: 97.89, Z: -61.53", "X: -50.37, Y: 30.48, Z: -81.42", "X: -13.19, Y: 92.14, Z: -25.25", "X: -92.96, Y: 78.62, Z: -32.34", "X: -96.91, Y: 55.17, Z: -48.93", "X: -107.90, Y: 66.84, Z: -63.56", "X: -67.81, Y: 74.20, Z: -108.86", "X: -55.11, Y: 91.64, Z: -62.32", "X: -56.87, Y: 70.36, Z: -74.26", "X: -42.15, Y: 57.16, Z: -101.37" ], "marker": { "color": "blue", "opacity": 0.7, "size": 5 }, "mode": "markers+text", "name": "Occupation", "text": [ "teacher", "doctor", "engineer", "chef", "artist", "musician", "writer", "lawyer", "nurse", "policeman", 
"firefighter", "scientist", "athlete", "actor", "photographer", "architect", "pilot", "farmer", "mechanic", "electrician", "plumber", "accountant", "designer", "programmer", "manager" ], "textfont": { "size": 12 }, "textposition": "top center", "type": "scatter3d", "x": { "bdata": "gfiIwRSklL+mpp/CPoMPwuOnrcEHmrfB7ACAwVVtssLZiwRBpfSqwr+Iq8Jc+HjCj/F9wVxu48HurwzCHb2Sws15ScIkFFPBseq5wl3SwcIiy9fCUKCHwkp0XMJXf2PCM5kowg==", "dtype": "f4" }, "y": { "bdata": "uL8PQtmmzkEtJ45C330kQp8rd0KqopZCU4KLQqxahUIlRhRCexopQoLj60GsAIJC/i4eQn7QPUKOHMdCNMXDQizh80HRSbhCsj+dQq6uXEJ5rYVCnGSUQo9Gt0LZuIxCuKBkQg==", "dtype": "f4" }, "z": { "bdata": "RpDzwucqtsIaZTrCU7/cwoD8X8Ks8yLCaCmfws1RzML+QtTCiSK7wnjancITI/zBYstZwhFGksIeHlvCfx52wvfWosKAAMrBQl0BwtC0Q8JQPn7CpLjZwhlMecIkhpTCP77Kwg==", "dtype": "f4" } }, { "hoverinfo": "text", "hovertext": [ "X: -25.19, Y: -85.58, Z: 99.50", "X: -43.19, Y: -72.97, Z: 112.07", "X: -33.85, Y: -58.26, Z: 93.45", "X: -49.66, Y: -77.38, Z: 80.20", "X: -61.90, Y: -62.88, Z: 102.51", "X: -64.57, Y: -91.33, Z: 105.58", "X: -29.18, Y: -43.54, Z: 79.81", "X: -56.57, Y: -47.13, Z: 54.07", "X: -76.87, Y: -60.24, Z: 38.48", "X: -93.66, Y: -70.79, Z: 59.17", "X: -97.74, Y: -42.04, Z: 68.45", "X: -108.95, Y: -58.85, Z: 62.86", "X: -87.67, Y: -88.31, Z: 75.20", "X: -97.86, Y: -107.08, Z: 98.47", "X: -35.56, Y: -90.71, Z: 70.11", "X: -73.85, Y: -45.74, Z: 118.78", "X: -77.11, Y: -73.96, Z: 154.63", "X: -66.28, Y: -88.10, Z: 140.78", "X: -45.84, Y: -76.85, Z: 136.40", "X: -91.40, Y: -41.40, Z: 138.28", "X: -73.14, Y: -30.66, Z: 132.57", "X: -81.60, Y: -104.79, Z: 124.35", "X: -97.38, Y: -36.63, Z: 94.75", "X: -92.82, Y: -88.53, Z: 102.41", "X: -51.41, Y: -38.05, Z: 71.10" ], "marker": { "color": "purple", "opacity": 0.7, "size": 5 }, "mode": "markers+text", "name": "Weather", "text": [ "sunny", "rainy", "cloudy", "windy", "stormy", "snowy", "foggy", "humid", "dry", "cold", "hot", "warm", "chilly", "freezing", "breezy", "thunderstorm", "hail", "sleet", "drizzle", "hurricane", 
"tornado", "blizzard", "heatwave", "frosty", "muggy" ], "textfont": { "size": 12 }, "textposition": "top center", "type": "scatter3d", "x": { "bdata": "korJwRbFLMJTaAfCRahGwoaZd8K0JYHCJXDpwVVMYsLmupnCzVC7wul8w8Kr5dnCPVivwmy6w8LvPA7CJrOTwiU5msInj4TCMVo3wl7NtsInR5LCAzOjwh7AwsKQo7nCk6BNwg==", "dtype": "f4" }, "y": { "bdata": "NCirws7wkcIGB2nCOMKawkOGe8INqLbCqC0uwp6FPMJh93DCTZKNwhwoKMJFZmvC656wwpEp1sLLa7XCcvM2wsvtk8IvMbDClrSZwmucJcItRfXBW5bRwmuFEsJFDrHCzDEYwg==", "dtype": "f4" }, "z": { "bdata": "4P/GQs4h4EJC5LpCFmigQpkGzUJPKdNC8Z6fQiBKWEK47hlCyLBsQvTjiEL9bntCk2eWQvzyxEIwOIxCnY7tQh6hGkMzxwxDnmcIQ4FICkPqkARDpLP4QrR+vULn08xCqDWOQg==", "dtype": "f4" } } ], "layout": { "height": 800, "hovermode": "closest", "legend": { "bgcolor": "rgba(255, 255, 255, 0.5)", "font": { "size": 12 }, "traceorder": "normal", "x": 0.9, "y": 0.9 }, "margin": { "b": 0, "l": 0, "r": 0, "t": 40 }, "scene": { "aspectmode": "cube", "camera": { "center": { "x": 0, "y": 0, "z": 0 }, "eye": { "x": 1.5, "y": 1.5, "z": 1.5 }, "up": { "x": 0, "y": 0, "z": 1 } }, "xaxis": { "title": { "text": "X" } }, "yaxis": { "title": { "text": "Y" } }, "zaxis": { "title": { "text": "Z" } } }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { 
"colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], 
"parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, 
"arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": 
"white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Word Embeddings Visualized by Category using t-SNE (3D)" }, "width": 1000 } }, "text/html": [ "
\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from plotly.offline import init_notebook_mode, iplot\n", "from sklearn.manifold import TSNE\n", "import matplotlib.pyplot as plt\n", "import plotly.graph_objs as go\n", "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('./documents/100_embeddings.csv')\n", "\n", "# Convert string representations of lists to numpy arrays\n", "matrix = np.array(df['embedding'].apply(eval).tolist())\n", "\n", "# Create a t-SNE model and transform the data\n", "tsne = TSNE(\n", " n_components=3,\n", " perplexity=10,\n", " max_iter=5000,\n", " learning_rate='auto',\n", " init='pca',\n", " random_state=3\n", ")\n", "vis_dims = tsne.fit_transform(matrix)\n", "\n", "category_colors = {\n", " 'Animal': 'red',\n", " 'Food': 'green',\n", " 'Occupation': 'blue',\n", " 'Weather': 'purple'\n", "}\n", "\n", "# Create traces for each category\n", "traces = []\n", "for category, color in category_colors.items():\n", " category_mask = df['Category'] == category\n", " category_data = vis_dims[category_mask]\n", " words = df['Word'][category_mask]\n", " \n", " # Create hover text with only coordinates\n", " hovertext = [f\"X: {x:.2f}, Y: {y:.2f}, Z: {z:.2f}\" \n", " for x, y, z in category_data]\n", " \n", " trace = go.Scatter3d(\n", " x=category_data[:, 0],\n", " y=category_data[:, 1],\n", " z=category_data[:, 2],\n", " mode='markers+text',\n", " name=category,\n", " marker=dict(\n", " size=5,\n", " color=color,\n", " opacity=0.7\n", " ),\n", " text=words,\n", " textposition=\"top center\",\n", " hovertext=hovertext,\n", " hoverinfo='text',\n", " textfont=dict(size=12)\n", " )\n", " traces.append(trace)\n", "\n", "# Create the layout\n", "layout = go.Layout(\n", " title=\"Word Embeddings Visualized by Category using t-SNE (3D)\",\n", " scene=dict(\n", " xaxis_title='X',\n", " yaxis_title='Y',\n", " zaxis_title='Z',\n", " aspectmode='cube',\n", " camera=dict(\n", " up=dict(x=0, y=0, z=1),\n", " 
center=dict(x=0, y=0, z=0),\n", " eye=dict(x=1.5, y=1.5, z=1.5)\n", " ),\n", " ),\n", " width=1000, \n", " height=800,\n", " margin=dict(l=0, r=0, b=0, t=40),\n", " legend=dict(\n", " x=0.9,\n", " y=0.9,\n", " traceorder=\"normal\",\n", " font=dict(size=12),\n", " bgcolor=\"rgba(255, 255, 255, 0.5)\"\n", " ),\n", " hovermode='closest'\n", ")\n", "\n", "# Create the figure and display\n", "fig = go.Figure(data=traces, layout=layout)\n", "init_notebook_mode(connected=True)\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "26820d6e-60fa-4fe0-a1af-fd4243408ce5", "metadata": {}, "source": [ "From this you can see that items that are similar *conceptually* are located in a **similar space**. This holds beyond 3D space, all the way up to N-dimensional vector space!\n", "\n", "With that understanding, and a numerical representation in hand, we get to the useful part: directly comparing these embedded sentences." ] }, { "cell_type": "markdown", "id": "10524980-0be5-41fb-bdce-2b1e146d7222", "metadata": {}, "source": [ "### Semantic Similarity\n", "\n", "Once you have these embeddings, finding the similarity between them becomes relatively straightforward! We can take some inspiration from our middle school math classes and consider the distance between two points on a Cartesian plane.\n", "\n", "\n", "\n", "Just as the distance formula measures how close two 2D points are, we can (with some nuance) measure how close two high-dimensional representations are.
Modern approaches typically compare the angle between two vectors (cosine similarity) rather than their raw distance, but intuitively this follows the same line of thinking.\n", "\n", "Let's see what this looks like:" ] }, { "cell_type": "code", "execution_count": 13, "id": "e6e0fe87-63b9-4546-99ce-2ed79c783da9", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "**Sentences:** \"Dog\", \"Cat\", \"Toyota Prius\"\n", "\n", "**Similarity Matrix:**\n", "```\n", "tensor([[1.0000, 0.6606, 0.2199],\n", " [0.6606, 1.0000, 0.2156],\n", " [0.2199, 0.2156, 1.0000]])\n", "``` \n", "\n", "**Key Relationships:**\n", "- Dog ↔ Cat: 0.6606 (moderate similarity - both animals)\n", "- Dog ↔ Toyota Prius: 0.2199 (low similarity) \n", "- Cat ↔ Toyota Prius: 0.2156 (low similarity)\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The sentences to encode\n", "sentences = [\n", " \"Dog\",\n", " \"Cat\",\n", " \"Toyota Prius\",\n", "]\n", "\n", "embeddings = embedding_model.encode(sentences)\n", "\n", "similarities = embedding_model.similarity(embeddings, embeddings)\n", "\n", "pprint(f\"\"\"\n", "**Sentences:** {', '.join([f'\"{s}\"' for s in sentences])}\n", "\n", "**Similarity Matrix:**\n", "```\n", "{similarities}\n", "``` \n", "\n", "**Key Relationships:**\n", "- Dog ↔ Cat: {similarities[0][1]:.4f} (moderate similarity - both animals)\n", "- Dog ↔ Toyota Prius: {similarities[0][2]:.4f} (low similarity) \n", "- Cat ↔ Toyota Prius: {similarities[1][2]:.4f} (low similarity)\n", "\"\"\")" ] }, { "cell_type": "markdown", "id": "872d61a6-be25-4d53-bc1e-3bb49ae83e77", "metadata": {}, "source": [ "This can then be extrapolated: compare a query against each of your document chunks to retrieve the most relevant ones. To do that, your chunked documents are embedded and stored in a retrieval system.
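What `embedding_model.similarity` computes above is essentially cosine similarity: the cosine of the angle between two vectors. A minimal sketch of that calculation, using invented 3-dimensional toy vectors (real sentence embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # the dot product divided by the product of their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" invented for illustration only
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(dog, cat))  # high: conceptually similar
print(cosine_similarity(dog, car))  # low: conceptually unrelated
```

A vector compared with itself scores exactly 1.0, which is why the diagonal of the similarity matrix above is all ones.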
Thankfully, purpose-built systems called **vector databases** handle this large-scale embedding, similarity calculation, and retrieval efficiently.\n", "\n", "\n", "\n", "Before we introduce our vector database, let's check what the similarity between our question and chunk ID 6 is:" ] }, { "cell_type": "code", "execution_count": 14, "id": "fa795d98-fc02-40d4-b166-0474afebb647", "metadata": {}, "outputs": [], "source": [ "# The sentences to encode\n", "sentences = [\n", " query,\n", " chunks[6],\n", "]\n", "\n", "embeddings = embedding_model.encode(sentences)\n", "\n", "similarities = embedding_model.similarity(embeddings, embeddings)" ] }, { "cell_type": "code", "execution_count": 15, "id": "15b268ed-02f7-446e-aab4-3c1c054d3c3a", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "*What is the maximum power consumption of the BMV080 in continuous measurement mode?* and **Chunk ID 6**:\n", "\n", "**Score**: 0.5966" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pprint(f\"\"\"*What is the maximum power consumption of the BMV080 in continuous measurement mode?* and **Chunk ID 6**:\n", "\n", "**Score**: {similarities[0][1]:.4f}\"\"\")" ] }, { "cell_type": "markdown", "id": "adf1b84b-adc0-4811-9bab-084677d452cf", "metadata": {}, "source": [ "### Retrieval\n", "\n", "With that, we can put all of our chunks into a vector database to cover the retrieval part of retrieval augmented generation! Often we retrieve a set of top-scoring items, as it's not guaranteed that all the relevant information lives in a single chunk.
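Conceptually, a top-k query like this boils down to scoring the query embedding against every stored chunk embedding and keeping the k best. A brute-force sketch with invented vectors (`top_k` and these 4-dimensional embeddings are illustrative only; a real vector database adds indexing so this scales well beyond brute force):

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=2):
    # Normalize everything, score each chunk by cosine similarity
    # against the query, then keep the k best-scoring chunk indices.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

# Invented chunk embeddings for illustration
chunk_embs = np.array([
    [0.1, 0.9, 0.2, 0.1],  # chunk 0: off-topic
    [0.8, 0.1, 0.1, 0.3],  # chunk 1: close to the query
    [0.7, 0.2, 0.1, 0.4],  # chunk 2: also relevant
])
query_emb = np.array([0.9, 0.1, 0.0, 0.3])

for idx, score in top_k(query_emb, chunk_embs):
    print(f"chunk {idx}: score {score:.4f}")
```

Note that our database reports *distances* (smaller is closer), while this sketch reports similarities (larger is closer); both orderings convey the same ranking.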
" ] }, { "cell_type": "code", "execution_count": 16, "id": "121b9968-55c2-465e-8e80-f1f65defbefb", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "**Retrieved Document**: 1\n", "\n", "**Database ID**: 21\n", "\n", "**Distance**: 0.7646\n", "\n", "**Chunk Snippet**:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "---\n", "\n", "© Bosch Sensortec GmbH 2025 | All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights\n", "\n", "Document number: BST-BMV080-DS000-11\n", "\n", "\n", "---\n", "\n", "\n", "## 5.2.5.1.2 Duty Cycling Measurement\n", "\n", "Figure 39 is an activity diagram that shows how to perform a duty cycling measurement – repeating numerous measurements separated by a pause. The main difference from continuous measurement is the duty cyclin...\n" ] }, { "data": { "text/markdown": [ "---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "**Retrieved Document**: 2\n", "\n", "**Database ID**: 6\n", "\n", "**Distance**: 0.8068\n", "\n", "**Chunk Snippet**:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "---\n", "\n", "11 Supply pins are described in Table 11.\n", "\n", "12 Given self heating during operation resulting in sensor internal temperature increase of ~15 K in continuous measurement mode with the Power Optimized Configuration (Chapter 4), BMV080 is capable to operate at ambient temperatures <15 °C depending on thermal integration design. 
For more details, refer to Section 3.3 on thermal integration best practices in BMV080 integration guideline (BST-BMV080-AN000).\n", "\n", "13 No c...\n" ] }, { "data": { "text/markdown": [ "---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "**Retrieved Document**: 3\n", "\n", "**Database ID**: 12\n", "\n", "**Distance**: 0.8801\n", "\n", "**Chunk Snippet**:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "## 4.4.1.2 Proposal for Filtering Signal Errors\n", "\n", "Strong disturbing signals on the SCK pin may influence the measurement results of the BMV080. In an environment where strong disturbance signals are present, the SCK pin could be protected with a suitable low pass filter, which filters out the disturbance but allows normal communication.\n", "\n", "---\n", "\n", "16P = power supply, DI = digital in, DO = digital out, GND = ground.\n", "\n", "\n", "---\n", "\n", "\n", "### 4.4.1.3 Power domains\n", "\n", "The BMV080 has four power domains, listed in Table 1...\n" ] }, { "data": { "text/markdown": [ "---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "**Retrieved Document**: 4\n", "\n", "**Database ID**: 23\n", "\n", "**Distance**: 0.9441\n", "\n", "**Chunk Snippet**:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "# 5.2.5.5 bmv080_stop_measurement\n", "\n", "## Function\n", "c\n", "bmv080_status_code_t bmv080_stop_measurement\n", "(\n", " const bmv080_handle_t handle\n", ");\n", "\n", "\n", "## Summary\n", "Stop particle measurement.\n", "\n", "## Precondition\n", "A valid handle generated by `bmv080_open` is required, and the sensor unit entered measurement mode via `bmv080_start_continuous_measurement` or `bmv080_start_duty_cycling_measurement`. 
Must be called at the end of a data acquisition cycle to ensure that the sensor unit is ready for the next measurement cycle....\n" ] }, { "data": { "text/markdown": [ "---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "\n", "**Retrieved Document**: 5\n", "\n", "**Database ID**: 22\n", "\n", "**Distance**: 0.9487\n", "\n", "**Chunk Snippet**:" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "---\n", "\n", "\n", "# 5.2.5.4 bmv080_serve_interrupt\n", "\n", "## Function\n", "c\n", "bmv080_status_code_t bmv080_serve_interrupt\n", "(\n", " const bmv080_handle_t handle,\n", " bmv080_callback_data_ready_t data_ready,\n", " void* callback_parameters\n", ");\n", "\n", "\n", "## Summary\n", "Serve an interrupt using a callback function.\n", "\n", "## Precondition\n", "A valid handle generated by `bmv080_open` is required with the sensor unit currently in measurement mode via `bmv080_start_continuous_measurement` or `bmv080_start_duty_cycling_measurement`.\n", "\n", "The application can ...\n" ] }, { "data": { "text/markdown": [ "---" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "results = collection.query(\n", " query_texts=[\"What is the maximum power consumption of the BMV080 in continuous measurement mode?\"],\n", " n_results=5\n", ")\n", "\n", "for i, doc in enumerate(results['documents'][0]):\n", " pprint(f\"\"\"\n", "**Retrieved Document**: {i+1}\n", "\n", "**Database ID**: {results['ids'][0][i]}\n", "\n", "**Distance**: {results['distances'][0][i]:.4f}\n", "\n", "**Chunk Snippet**:\"\"\")\n", " print(f\"\"\"{doc[:500]}...\"\"\")\n", " pprint(\"---\")" ] }, { "cell_type": "markdown", "id": "4884b6be-93ea-4b87-9c68-0b066f980feb", "metadata": {}, "source": [ "We see that Chunk ID 6 is within the top few results! 
This matches our expectation, and it's backed up by the database result: the distance from the query to that positive chunk is among the smallest returned." ] }, { "cell_type": "markdown", "id": "8a19e0a3-ba27-4eac-9991-ae5ac4520ee0", "metadata": {}, "source": [ "## RAG\n", "\n", "\n", "\n", "So now we have:\n", "- Our unstructured data converted into an LLM-ingestible form\n", "- Chunked into manageable and processable pieces\n", "- Embedded and ready for semantic-similarity-based search\n", "\n", "Now we just need to put it all together! Rather than passing the user query directly to the LLM, we first pass it to our vector database, retrieve our relevant context, then pass both the question and the context to the LLM to generate a response:\n", "\n", "```python\n", "def rag_response(query):\n", "\n", " context = retrieve_docs(query)\n", "\n", " prompt = f\"\"\"Use the provided up-to-date context to answer the question\n", "\n", "Retrieved Context:\n", "{context}\n", "\n", "Question: {query}\n", "\"\"\"\n", "\n", " response = query_openai(prompt)\n", "\n", " return response\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "id": "194932e1-f2e6-49e3-8976-07b26dbf37f0", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "**AI RAG Response**: The maximum power consumption of the BMV080 in continuous measurement mode is 181.9 mW." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "response = rag_response(\"What is the maximum power consumption of the BMV080 in continuous measurement mode?\")\n", "\n", "pprint(f\"**AI RAG Response**: {response}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 5 }
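The `rag_response` sketch can also be made runnable end to end. The `retrieve_docs` and `query_openai` below are illustrative stand-ins (a keyword-overlap scorer over a tiny invented document list, and a canned reply) so the flow executes without a database or API key; in the notebook, those roles are played by the Chroma collection and the real `query_openai` helper:

```python
# Minimal end-to-end RAG flow. retrieve_docs and query_openai here are
# stand-in stubs for demonstration, not the notebook's real helpers.

DOCUMENTS = [
    "The BMV080 supports continuous and duty cycling measurement modes.",
    "Power consumption figures are listed in the electrical characteristics.",
    "The sensor communicates over SPI or I2C.",
]

def retrieve_docs(query, k=2):
    # Stand-in retriever: rank documents by word overlap with the query.
    # A real system uses embeddings and a vector database instead.
    q_words = set(query.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: -len(q_words & set(d.lower().split())))
    return "\n".join(ranked[:k])

def query_openai(prompt):
    # Stand-in generator: a real implementation would call an LLM here.
    return "Stubbed model answer grounded in the retrieved context."

def rag_response(query):
    context = retrieve_docs(query)
    prompt = f"""Use the provided up-to-date context to answer the question

Retrieved Context:
{context}

Question: {query}
"""
    return query_openai(prompt)

print(rag_response("What measurement modes does the BMV080 support?"))
```

Swapping the stub retriever for `collection.query` and the stub generator for a real LLM call recovers the pipeline this notebook builds.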