{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Setting Up a Local Database in TinyDB for Fast RAG on Personal Questions\n",
    "In testing and receiving feedback from some of my initial users, one thing I found was that most of my users expected the AI to be able to answer basic questions about me in a way that it wasn't. To solve this problem, I'm going to create a local RAG system of questions and answers to common interview questions. This database is built with [TinyDB](https://tinydb.readthedocs.io/en/latest/index.html), which offers efficient and lightweight algorithms for keeping a small, local DB. I will continue adding my answers to common questions to this dataset to improve the quality of answers."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "b2b770e8b87a23df"
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "initial_id",
   "metadata": {
    "collapsed": true,
    "ExecuteTime": {
     "end_time": "2024-08-19T14:59:53.357237400Z",
     "start_time": "2024-08-19T14:59:52.924425300Z"
    }
   },
   "outputs": [],
   "source": [
    "from tinydb import TinyDB, Query\n",
    "from GoogleEmbeddings import Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "This code splits the text read in from the file, and embeds it using the [GoogleEmbeddings](https://github.com/mocboch/website/blob/master/GoogleEmbeddings.py) class."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "3d52ebd9086d7d59"
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [],
   "source": [
    "with open('personal_info.txt') as f:\n",
    "    text = f.read()\n",
    "text = text.split('\\n* ~')\n",
    "ls = [t.strip('~* ') for t in text]\n",
    "dicts = []\n",
    "embeddings = Embeddings(api_key=open('google_api_key.txt').read())\n",
    "for i in range(len(ls)): \n",
    "    if i%2 == 0:\n",
    "        dicts.append({'question': ls[i],'answer': ls[i+1], \n",
    "                      'question-embedded': embeddings.embed_query(ls[i]),\n",
    "                      'answer-embedded': embeddings.embed_query(ls[i+1])})\n",
    "\n",
    "    "
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T14:59:59.052896800Z",
     "start_time": "2024-08-19T14:59:53.361979Z"
    }
   },
   "id": "e01b88d76c2653d"
  },
  {
   "cell_type": "markdown",
   "source": [
    "Here is an example of a question and answer that I'll be inserting into the database."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "de5bc58378d2e742"
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "outputs": [
    {
     "data": {
      "text/plain": "'Tell me about Mark'"
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dicts[0]['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:33:29.134204300Z",
     "start_time": "2024-08-19T15:33:29.127179500Z"
    }
   },
   "id": "f1b0109514ac1595"
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "outputs": [
    {
     "data": {
      "text/plain": "'Mark grew up in New Jersey, and first got interested in code when he learned to program using DarkBASIC in middle school. He attended an IT-focused high school program where he continued to develop his skills, adding programming in Python and Java to his skillset along with certifications in SQL, Microsoft Excel, and Comptia A+. He went on to attend Alfred University for Environmental Science. After that, Mark had a couple of jobs in that field, including working as a Park Ranger, and as an intern working with critically endangered birds. Following that, he discovered a passion for cooking and pursued it for years, eventually becoming a kitchen manager. When he decided it was time for a change, he went back to school for a Master’s degree in data science, which he is now finishing. Mark is very interested in Natural Language Processing and Bayesian Statistics.'"
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dicts[0]['answer']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:33:41.668691600Z",
     "start_time": "2024-08-19T15:33:41.657295700Z"
    }
   },
   "id": "fde6651a48ac89d9"
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Creating the DB with TinyDB\n",
    "This cell creates a DB and a query object in TinyDB, and the next inserts the dictionaries created in the previous step as documents. That function returns a list of the doc_ids associated with the inserts. "
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "481394ca0deb8047"
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [],
   "source": [
    "db = TinyDB('personal-info.json')\n",
    "User = Query()"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T14:59:59.061374700Z",
     "start_time": "2024-08-19T14:59:59.058702100Z"
    }
   },
   "id": "2ffb82eb8b38e0e1"
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [
    {
     "data": {
      "text/plain": "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]"
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "db.insert_multiple(dicts)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T14:59:59.078649200Z",
     "start_time": "2024-08-19T14:59:59.064295Z"
    }
   },
   "id": "20138f789c051e32"
  },
  {
   "cell_type": "markdown",
   "source": [
    "The next two cells create a table object with the [default table](https://tinydb.readthedocs.io/en/latest/usage.html#default-table) created by TinyDB, and then a vector search object with a pair of vector and scalar indices for each entry. For now, I have a small enough dataset that I can hold all of these vectors in memory, and use a naive algorithm to find the best match, but I will revisit this in the near future to create a more efficient search algorithm that considers time and space complexity. "
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "84c2b3f122794faa"
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [],
   "source": [
    "table = db.table('_default')"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:00:23.897624300Z",
     "start_time": "2024-08-19T15:00:23.892831200Z"
    }
   },
   "id": "531df61aae794307"
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "outputs": [],
   "source": [
    "vec_search = [tuple((t.doc_id, t['question-embedded'])) for t in table.all()]"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:04:17.612415300Z",
     "start_time": "2024-08-19T15:04:17.607338100Z"
    }
   },
   "id": "cd13af40f7cb5067"
  },
  {
   "cell_type": "markdown",
   "source": [
    "This cell defines a function to provide cosine similarity scores. Cosine similarity scores are a commonly used metric of comparison between vector embeddings."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "20b305a3e0ea2f4b"
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:\n",
    "    num = np.dot(vec1, vec2)\n",
    "    denom = np.sqrt(np.dot(vec1, vec1) * np.dot(vec2, vec2))\n",
    "    return num/denom"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T16:39:07.058515700Z",
     "start_time": "2024-08-19T16:39:07.049119900Z"
    }
   },
   "id": "c4f9f21897697037"
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now, I can test it out. Below are some test queries, the resulting cosine similarity scores, and the highest scoring questions in the dataset."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "6a0945399565dbd6"
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "outputs": [
    {
     "data": {
      "text/plain": "       score\n0           \n4   0.952122\n1   0.890730\n7   0.884191\n13  0.851925\n6   0.841705\n8   0.833763\n3   0.822064\n5   0.821628\n9   0.819010\n12  0.804606\n2   0.762136\n10  0.757881\n14  0.748708\n11  0.746675",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>score</th>\n    </tr>\n    <tr>\n      <th>0</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>4</th>\n      <td>0.952122</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.890730</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>0.884191</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>0.851925</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>0.841705</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>0.833763</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0.822064</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0.821628</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>0.819010</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>0.804606</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.762136</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>0.757881</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>0.748708</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>0.746675</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "query = 'Give me a reason I should hire Mark'\n",
    "embedded_query = embeddings.embed_query(query)\n",
    "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n",
    "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T16:39:08.343296500Z",
     "start_time": "2024-08-19T16:39:08.121755200Z"
    }
   },
   "id": "62f91d670d14cc32"
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "outputs": [
    {
     "data": {
      "text/plain": "'Why should we hire Mark?'"
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.get(doc_id=4)['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:14:11.188336900Z",
     "start_time": "2024-08-19T15:14:11.183074900Z"
    }
   },
   "id": "60af12d17533b426"
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "outputs": [
    {
     "data": {
      "text/plain": "       score\n0           \n1   0.951057\n8   0.932492\n9   0.925750\n7   0.915157\n4   0.913031\n3   0.898955\n6   0.893020\n5   0.884995\n10  0.855615\n2   0.839801\n13  0.837229\n11  0.804068\n14  0.799615\n12  0.763757",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>score</th>\n    </tr>\n    <tr>\n      <th>0</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>1</th>\n      <td>0.951057</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>0.932492</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>0.925750</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>0.915157</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0.913031</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0.898955</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>0.893020</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0.884995</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>0.855615</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.839801</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>0.837229</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>0.804068</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>0.799615</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>0.763757</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = 'What should I know about Mark?'\n",
    "embedded_query = embeddings.embed_query(query)\n",
    "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n",
    "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:39:04.816833200Z",
     "start_time": "2024-08-19T15:39:04.592391900Z"
    }
   },
   "id": "6ad5c0ad655b5ea1"
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "outputs": [
    {
     "data": {
      "text/plain": "'Tell me about Mark'"
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.get(doc_id=1)['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:39:45.438422Z",
     "start_time": "2024-08-19T15:39:45.429631900Z"
    }
   },
   "id": "934673808a6b39bc"
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "outputs": [
    {
     "data": {
      "text/plain": "'What are Mark’s goals?'"
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.get(doc_id=8)['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:40:10.045216500Z",
     "start_time": "2024-08-19T15:40:10.038473100Z"
    }
   },
   "id": "80018c9924a516a"
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "outputs": [
    {
     "data": {
      "text/plain": "'What are Mark’s interests?'"
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.get(doc_id=9)['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:16:08.698275900Z",
     "start_time": "2024-08-19T15:16:08.679138Z"
    }
   },
   "id": "c93f2057b2ff1725"
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "outputs": [
    {
     "data": {
      "text/plain": "       score\n0           \n3   0.929328\n9   0.897411\n1   0.879798\n5   0.871701\n8   0.868609\n4   0.844823\n7   0.840330\n13  0.838112\n2   0.817903\n6   0.814839\n11  0.814703\n10  0.804551\n12  0.760181\n14  0.753478",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>score</th>\n    </tr>\n    <tr>\n      <th>0</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>3</th>\n      <td>0.929328</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>0.897411</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.879798</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0.871701</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>0.868609</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0.844823</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>0.840330</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>0.838112</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.817903</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>0.814839</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>0.814703</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>0.804551</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>0.760181</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>0.753478</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = 'What does Mark like to do on the weekends?'\n",
    "embedded_query = embeddings.embed_query(query)\n",
    "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n",
    "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:40:35.580401900Z",
     "start_time": "2024-08-19T15:40:35.361670200Z"
    }
   },
   "id": "799fb0ba8e61848"
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "outputs": [
    {
     "data": {
      "text/plain": "'What are Mark’s hobbies?'"
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.get(doc_id=3)['question']"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-08-19T15:40:37.975078500Z",
     "start_time": "2024-08-19T15:40:37.961517900Z"
    }
   },
   "id": "e4028b5cc5a78551"
  },
  {
   "cell_type": "markdown",
   "source": [
    "In addition to implementing a more efficient search algorithm, and adding more questions and answers to the dataset, in the near future I would like to create amn training set of some queries paired with 'golden documents.' This will allow me to get a concrete assessment of the model's accuracy, which in turn will allow me to investigate reducing the dimensionality of the vector index further with a reliable approach. I would also like to implement a 'skills' section of this database, where I will write up paragraphs about each of my skills, pulled from job descriptions I'm looking at, and then allow the model to search using string matching TinyDB's own search algorithms. \n",
    "\n",
    "The final system will allow the machine to respond to users with context from: \n",
    "- My answers to interview questions, based on their similarity to the user's query\n",
    "- My skills, based on information the user has provided about the job they are interested in hiring for and searched by keyword\n",
    "\n",
    "This system can also be integrated into the personalized resume creation process by generating skill descriptions that match keywords in the job description context"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "c7c3a7621241fbdd"
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "collapsed": false
   },
   "id": "71ef41baa81d2159"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}