{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to import the libraries and set the Azure OpenAI API key and endpoint. You'll need to set the following environment variables:\n", "\n", "- `AZURE_OPENAI_API_KEY` - Your Azure OpenAI API key\n", "- `AZURE_OPENAI_ENDPOINT` - Your Azure OpenAI endpoint\n", "- `AZURE_OPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME` - The name of your embedding model deployment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import openai\n", "from openai.embeddings_utils import cosine_similarity, get_embedding\n", "\n", "OPENAI_EMBEDDING_ENGINE = \"text-embedding-ada-002\"\n", "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n", "DATASET_NAME = \"embedding_index_3m.json\"\n", "\n", "openai.api_type = \"azure\"\n", "openai.api_key = os.environ[\"AZURE_OPENAI_API_KEY\"]\n", "openai.api_base = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\n", "openai.api_version = \"2023-07-01-preview\"\n", "\n", "OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ[\n", "    \"AZURE_OPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to load the Embedding Index into a Pandas DataFrame. The Embedding Index is stored in a JSON file called `embedding_index_3m.json` and contains the Embeddings for each of the YouTube transcripts up until late Oct 2023." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_dataset(source: str) -> pd.DataFrame:\n", "    # Load the video session index\n", "    pd_vectors = pd.read_json(source)\n", "    return pd_vectors.drop(columns=[\"text\"], errors=\"ignore\").fillna(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to create a function called `get_videos` that will search the Embedding Index for the query and return the top 5 videos that are most similar to it. The function works as follows:\n", "\n", "1. 
First, a copy of the Embedding Index is created.\n", "2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.\n", "3. Then a new column called `similarity` is created in the Embedding Index. It contains the cosine similarity between the query Embedding and the Embedding for each video segment.\n", "4. Next, the Embedding Index is filtered to include only videos with a cosine similarity greater than or equal to 0.75 (the `SIMILARITIES_RESULTS_THRESHOLD`).\n", "5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_videos(\n", "    query: str, dataset: pd.DataFrame, rows: int\n", ") -> pd.DataFrame:\n", "    # create a copy of the dataset\n", "    video_vectors = dataset.copy()\n", "\n", "    # get the embeddings for the query\n", "    query_embeddings = get_embedding(query, OPENAI_EMBEDDING_ENGINE)\n", "\n", "    # create a new column with the calculated similarity for each row\n", "    video_vectors[\"similarity\"] = video_vectors[\"ada_v2\"].apply(\n", "        lambda x: cosine_similarity(query_embeddings, x)\n", "    )\n", "\n", "    # filter the videos by similarity\n", "    mask = video_vectors[\"similarity\"] >= SIMILARITIES_RESULTS_THRESHOLD\n", "    video_vectors = video_vectors[mask].copy()\n", "\n", "    # sort the videos by similarity, most similar first\n", "    video_vectors = video_vectors.sort_values(by=\"similarity\", ascending=False)\n", "\n", "    # return the top rows\n", "    return video_vectors.head(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is straightforward: it prints out the results of the search query." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def display_results(videos: pd.DataFrame, query: str):\n", "    def _gen_yt_url(video_id: str, seconds: int) -> str:\n", "        \"\"\"Build a YouTube URL that starts playback at the given offset in seconds\"\"\"\n", "        return f\"https://youtu.be/{video_id}?t={seconds}\"\n", "\n", "    print(f\"\\nVideos similar to '{query}':\")\n", "    for _, row in videos.iterrows():\n", "        youtube_url = _gen_yt_url(row[\"videoId\"], row[\"seconds\"])\n", "        print(f\" - {row['title']}\")\n", "        print(f\"   Summary: {' '.join(row['summary'].split()[:15])}...\")\n", "        print(f\"   YouTube: {youtube_url}\")\n", "        print(f\"   Similarity: {row['similarity']}\")\n", "        print(f\"   Speakers: {row['speaker']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the main loop of the application works as follows:\n", "\n", "1. First, the Embedding Index is loaded into a Pandas DataFrame.\n", "2. Next, the user is prompted to enter a query.\n", "3. Then the `get_videos` function is called to search the Embedding Index for the query.\n", "4. Finally, the `display_results` function is called to display the results to the user.\n", "5. The user is then prompted to enter another query. This process continues until the user enters `exit`.\n", "\n", "![](media/notebook_search.png)\n", "\n", "You will be prompted to enter a query. Enter a query and press Enter. The application returns a list of videos relevant to the query, along with a link to the point in each video where the answer to the question is located.\n", "\n", "Here are some queries to try out:\n", "\n", "- What is Azure Machine Learning?\n", "- How do convolutional neural networks work?\n", "- What is a neural network?\n", "- Can I use Jupyter Notebooks with Azure Machine Learning?\n", "- What is ONNX?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd_vectors = load_dataset(DATASET_NAME)\n", "\n", "# prompt the user for queries until they enter \"exit\"\n", "while True:\n", "    query = input(\"Enter a query: \")\n", "    if query == \"exit\":\n", "        break\n", "    videos = get_videos(query, pd_vectors, 5)\n", "    display_results(videos, query)" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }