{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In order to run the following noteboooks, if you haven't done yet, you need to deploy a model that uses `text-embedding-ada-002` as base model and set the deployment name inside .env file as `AZURE_OPENAI_EMBEDDINGS_ENDPOINT`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from openai import AzureOpenAI\n", "from dotenv import load_dotenv\n", "\n", "load_dotenv()\n", "\n", "client = AzureOpenAI(\n", " api_key=os.environ['AZURE_OPENAI_API_KEY'], # this is also the default, it can be omitted\n", " api_version = \"2023-05-15\"\n", " )\n", "\n", "model = os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT']\n", "\n", "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n", "DATASET_NAME = \"../embedding_index_3m.json\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to load the Embedding Index into a Pandas Dataframe. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. The Embedding Index contains the Embeddings for each of the YouTube transcripts up until late Oct 2023." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_dataset(source: str) -> pd.core.frame.DataFrame:\n", " # Load the video session index\n", " pd_vectors = pd.read_json(source)\n", " return pd_vectors.drop(columns=[\"text\"], errors=\"ignore\").fillna(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we are going to create a function called `get_videos` that will search the Embedding Index for the query. The function will return the top 5 videos that are most similar to the query. The function works as follows:\n", "\n", "1. First, a copy of the Embedding Index is created.\n", "2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.\n", "3. 
Then a new column is created in the Embedding Index called `similarity`. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding for each video segment.\n", "4. Next, the Embedding Index is filtered by the `similarity` column, keeping only videos with a cosine similarity greater than or equal to 0.75.\n", "5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cosine_similarity(a, b):\n", "    # defensively zero-pad the shorter vector so the dot product is defined\n", "    if len(a) > len(b):\n", "        b = np.pad(b, (0, len(a) - len(b)), 'constant')\n", "    elif len(b) > len(a):\n", "        a = np.pad(a, (0, len(b) - len(a)), 'constant')\n", "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n", "\n", "def get_videos(\n", "    query: str, dataset: pd.DataFrame, rows: int\n", ") -> pd.DataFrame:\n", "    # create a copy of the dataset\n", "    video_vectors = dataset.copy()\n", "\n", "    # get the embedding for the query\n", "    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding\n", "\n", "    # create a new column with the calculated similarity for each row\n", "    video_vectors[\"similarity\"] = video_vectors[\"ada_v2\"].apply(\n", "        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))\n", "    )\n", "\n", "    # filter the videos by similarity\n", "    mask = video_vectors[\"similarity\"] >= SIMILARITIES_RESULTS_THRESHOLD\n", "    video_vectors = video_vectors[mask].copy()\n", "\n", "    # sort the videos by similarity and return the top rows\n", "    video_vectors = video_vectors.sort_values(by=\"similarity\", ascending=False)\n", "    return video_vectors.head(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next function, `display_results`, is very simple: it just prints out the results of the search query."
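, "\n", "Before moving on to `display_results`, the next cell quickly sanity-checks the `cosine_similarity` helper defined above on two small vectors (a minimal sketch using only NumPy; it makes no API calls):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# identical vectors should give a similarity of (approximately) 1.0\n", "print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))\n", "\n", "# orthogonal vectors give a similarity of 0.0\n", "print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))"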
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def display_results(videos: pd.core.frame.DataFrame, query: str):\n", " def _gen_yt_url(video_id: str, seconds: int) -> str:\n", " \"\"\"convert time in format 00:00:00 to seconds\"\"\"\n", " return f\"https://youtu.be/{video_id}?t={seconds}\"\n", "\n", " print(f\"\\nVideos similar to '{query}':\")\n", " for _, row in videos.iterrows():\n", " youtube_url = _gen_yt_url(row[\"videoId\"], row[\"seconds\"])\n", " print(f\" - {row['title']}\")\n", " print(f\" Summary: {' '.join(row['summary'].split()[:15])}...\")\n", " print(f\" YouTube: {youtube_url}\")\n", " print(f\" Similarity: {row['similarity']}\")\n", " print(f\" Speakers: {row['speaker']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. First, the Embedding Index is loaded into a Pandas Dataframe.\n", "2. Next, the user is prompted to enter a query.\n", "3. Then the `get_videos` function is called to search the Embedding Index for the query.\n", "4. Finally, the `display_results` function is called to display the results to the user.\n", "5. The user is then prompted to enter another query. This process continues until the user enters `exit`.\n", "\n", "![](../images/notebook-search.png?WT.mc_id=academic-105485-koreyst)\n", "\n", "You will be prompted to enter a query. Enter a query and press enter. The application will return a list of videos that are relevant to the query. The application will also return a link to the place in the video where the answer to the question is located.\n", "\n", "Here are some queries to try out:\n", "\n", "- What is Azure Machine Learning?\n", "- How do convolutional neural networks work?\n", "- What is a neural network?\n", "- Can I use Jupyter Notebooks with Azure Machine Learning?\n", "- What is ONNX?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd_vectors = load_dataset(DATASET_NAME)\n", "\n", "# get user query from imput\n", "while True:\n", " query = input(\"Enter a query: \")\n", " if query == \"exit\":\n", " break\n", " videos = get_videos(query, pd_vectors, 5)\n", " display_results(videos, query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }