{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to import the libraries and set the Azure OpenAI API key and endpoint. You'll need to set the following environment variables:\n", "\n", "- `AZURE_OPENAI_API_KEY` - Your Azure OpenAI API key\n", "- `AZURE_OPENAI_ENDPOINT` - Your Azure OpenAI endpoint\n", "- `AZURE_OPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME` - The name of your embedding model deployment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import openai\n", "from openai.embeddings_utils import cosine_similarity, get_embedding\n", "\n", "OPENAI_EMBEDDING_ENGINE = \"text-embedding-ada-002\"\n", "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n", "DATASET_NAME = \"embedding_index_3m.json\"\n", "\n", "openai.api_type = \"azure\"\n", "openai.api_key = os.environ[\"AZURE_OPENAI_API_KEY\"]\n", "openai.api_base = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\n", "openai.api_version = \"2023-07-01-preview\"\n", "\n", "OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ[\n", "    \"AZURE_OPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to load the Embedding Index into a Pandas DataFrame. The Embedding Index is stored in a JSON file called `embedding_index_3m.json` and contains the Embeddings for each of the YouTube transcripts up until late Oct 2023." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_dataset(source: str) -> pd.DataFrame:\n", "    # Load the video session index\n", "    pd_vectors = pd.read_json(source)\n", "    return pd_vectors.drop(columns=[\"text\"], errors=\"ignore\").fillna(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to create a function called `get_videos` that will search the Embedding Index for the query and return the top 5 videos that are most similar to it. The function works as follows:\n", "\n", "1. 
First, a copy of the Embedding Index is created.\n", "2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.\n", "3. Then a new column called `similarity` is created in the Embedding Index. It contains the cosine similarity between the query Embedding and the Embedding for each video segment.\n", "4. Next, the Embedding Index is filtered to include only videos with a cosine similarity greater than or equal to 0.75 (the `SIMILARITIES_RESULTS_THRESHOLD`).\n", "5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_videos(\n", "    query: str, dataset: pd.DataFrame, rows: int\n", ") -> pd.DataFrame:\n", "    # create a copy of the dataset\n", "    video_vectors = dataset.copy()\n", "\n", "    # get the embeddings for the query\n", "    query_embeddings = get_embedding(query, OPENAI_EMBEDDING_ENGINE)\n", "\n", "    # create a new column with the calculated similarity for each row\n", "    video_vectors[\"similarity\"] = video_vectors[\"ada_v2\"].apply(\n", "        lambda x: cosine_similarity(query_embeddings, x)\n", "    )\n", "\n", "    # filter the videos by similarity\n", "    mask = video_vectors[\"similarity\"] >= SIMILARITIES_RESULTS_THRESHOLD\n", "    video_vectors = video_vectors[mask].copy()\n", "\n", "    # sort the videos by similarity, most similar first\n", "    video_vectors = video_vectors.sort_values(by=\"similarity\", ascending=False)\n", "\n", "    # return the top rows\n", "    return video_vectors.head(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is straightforward: it prints out the results of the search query." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def display_results(videos: pd.DataFrame, query: str):\n", "    def _gen_yt_url(video_id: str, seconds: int) -> str:\n", "        \"\"\"Build a YouTube URL that starts playback at the given offset in seconds\"\"\"\n", "        return f\"https://youtu.be/{video_id}?t={seconds}\"\n", "\n", "    print(f\"\\nVideos similar to '{query}':\")\n", "    for _, row in videos.iterrows():\n", "        youtube_url = _gen_yt_url(row[\"videoId\"], row[\"seconds\"])\n", "        print(f\" - {row['title']}\")\n", "        print(f\"   Summary: {' '.join(row['summary'].split()[:15])}...\")\n", "        print(f\"   YouTube: {youtube_url}\")\n", "        print(f\"   Similarity: {row['similarity']}\")\n", "        print(f\"   Speakers: {row['speaker']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the main loop of the application works as follows:\n", "\n", "1. First, the Embedding Index is loaded into a Pandas DataFrame.\n", "2. Next, the user is prompted to enter a query.\n", "3. Then the `get_videos` function is called to search the Embedding Index for the query.\n", "4. Finally, the `display_results` function is called to display the results to the user.\n", "5. The user is then prompted to enter another query. This process continues until the user enters `exit`.\n", "\n", "![](media/notebook_search.png)\n", "\n", "You will be prompted to enter a query. Enter a query and press Enter. The application returns a list of videos relevant to the query, along with a link to the point in each video where the answer to the question is located.\n", "\n", "Here are some queries to try out:\n", "\n", "- What is Azure Machine Learning?\n", "- How do convolutional neural networks work?\n", "- What is a neural network?\n", "- Can I use Jupyter Notebooks with Azure Machine Learning?\n", "- What is ONNX?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd_vectors = load_dataset(DATASET_NAME)\n", "\n", "# prompt the user for queries until they enter \"exit\"\n", "while True:\n", "    query = input(\"Enter a query: \")\n", "    if query == \"exit\":\n", "        break\n", "    videos = get_videos(query, pd_vectors, 5)\n", "    display_results(videos, query)" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }