{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In order to run the following noteboooks, if you haven't done yet, you need to deploy a model that uses `text-embedding-ada-002` as base model and set the deployment name inside .env file as `AZURE_OPENAI_EMBEDDINGS_ENDPOINT`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from openai import AzureOpenAI\n", "from dotenv import load_dotenv\n", "\n", "load_dotenv()\n", "\n", "client = AzureOpenAI(\n", " api_key=os.environ['AZURE_OPENAI_API_KEY'], # this is also the default, it can be omitted\n", " api_version = \"2023-05-15\"\n", " )\n", "\n", "model = os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT']\n", "\n", "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n", "DATASET_NAME = \"../embedding_index_3m.json\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to load the Embedding Index into a Pandas Dataframe. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. The Embedding Index contains the Embeddings for each of the YouTube transcripts up until late Oct 2023." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_dataset(source: str) -> pd.core.frame.DataFrame:\n", " # Load the video session index\n", " pd_vectors = pd.read_json(source)\n", " return pd_vectors.drop(columns=[\"text\"], errors=\"ignore\").fillna(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to create a function called `get_videos` that will search the Embedding Index for the query. The function will return the top 5 videos that are most similar to the query. The function works as follows:\n", "\n", "1. First, a copy of the Embedding Index is created.\n", "2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.\n", "3. 
Then a new column is created in the Embedding Index called `similarity`. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding for each video segment.\n", "4. Next, the Embedding Index is filtered by the `similarity` column, keeping only videos with a cosine similarity greater than or equal to 0.75.\n", "5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cosine_similarity(a, b):\n", "    # defensively zero-pad the shorter vector so the dot product is defined\n", "    if len(a) > len(b):\n", "        b = np.pad(b, (0, len(a) - len(b)), 'constant')\n", "    elif len(b) > len(a):\n", "        a = np.pad(a, (0, len(b) - len(a)), 'constant')\n", "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n", "\n", "def get_videos(\n", "    query: str, dataset: pd.DataFrame, rows: int\n", ") -> pd.DataFrame:\n", "    # create a copy of the dataset\n", "    video_vectors = dataset.copy()\n", "\n", "    # get the embedding for the query\n", "    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding\n", "\n", "    # create a new column with the calculated similarity for each row\n", "    video_vectors[\"similarity\"] = video_vectors[\"ada_v2\"].apply(\n", "        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))\n", "    )\n", "\n", "    # filter the videos by similarity\n", "    mask = video_vectors[\"similarity\"] >= SIMILARITIES_RESULTS_THRESHOLD\n", "    video_vectors = video_vectors[mask].copy()\n", "\n", "    # sort the videos by similarity and return the top rows\n", "    video_vectors = video_vectors.sort_values(by=\"similarity\", ascending=False)\n", "    return video_vectors.head(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next function, `display_results`, is very simple: it just prints out the results of the search query."
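, "\n", "Before moving on to `display_results`, the next cell quickly sanity-checks the `cosine_similarity` helper defined above on two small vectors (a minimal sketch using only NumPy; it makes no API calls):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# identical vectors should give a similarity of (approximately) 1.0\n", "print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))\n", "\n", "# orthogonal vectors give a similarity of 0.0\n", "print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))"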
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def display_results(videos: pd.core.frame.DataFrame, query: str):\n", " def _gen_yt_url(video_id: str, seconds: int) -> str:\n", " \"\"\"convert time in format 00:00:00 to seconds\"\"\"\n", " return f\"https://youtu.be/{video_id}?t={seconds}\"\n", "\n", " print(f\"\\nVideos similar to '{query}':\")\n", " for _, row in videos.iterrows():\n", " youtube_url = _gen_yt_url(row[\"videoId\"], row[\"seconds\"])\n", " print(f\" - {row['title']}\")\n", " print(f\" Summary: {' '.join(row['summary'].split()[:15])}...\")\n", " print(f\" YouTube: {youtube_url}\")\n", " print(f\" Similarity: {row['similarity']}\")\n", " print(f\" Speakers: {row['speaker']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. First, the Embedding Index is loaded into a Pandas Dataframe.\n", "2. Next, the user is prompted to enter a query.\n", "3. Then the `get_videos` function is called to search the Embedding Index for the query.\n", "4. Finally, the `display_results` function is called to display the results to the user.\n", "5. The user is then prompted to enter another query. This process continues until the user enters `exit`.\n", "\n", "![](../images/notebook-search.png?WT.mc_id=academic-105485-koreyst)\n", "\n", "You will be prompted to enter a query. Enter a query and press enter. The application will return a list of videos that are relevant to the query. The application will also return a link to the place in the video where the answer to the question is located.\n", "\n", "Here are some queries to try out:\n", "\n", "- What is Azure Machine Learning?\n", "- How do convolutional neural networks work?\n", "- What is a neural network?\n", "- Can I use Jupyter Notebooks with Azure Machine Learning?\n", "- What is ONNX?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd_vectors = load_dataset(DATASET_NAME)\n", "\n", "# get user query from imput\n", "while True:\n", " query = input(\"Enter a query: \")\n", " if query == \"exit\":\n", " break\n", " videos = get_videos(query, pd_vectors, 5)\n", " display_results(videos, query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }