{ "cells": [ { "cell_type": "markdown", "source": [ "# Setting Up a Local Database in TinyDB for Fast RAG on Personal Questions\n", "In testing and receiving feedback from some of my initial users, one thing I found was that most of my users expected the AI to be able to answer basic questions about me in a way that it wasn't. To solve this problem, I'm going to create a local RAG system of questions and answers to common interview questions. This database is built with [TinyDB](https://tinydb.readthedocs.io/en/latest/index.html), which offers efficient and lightweight algorithms for keeping a small, local DB. I will continue adding my answers to common questions to this dataset to improve the quality of answers." ], "metadata": { "collapsed": false }, "id": "b2b770e8b87a23df" }, { "cell_type": "code", "execution_count": 1, "id": "initial_id", "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2024-08-19T14:59:53.357237400Z", "start_time": "2024-08-19T14:59:52.924425300Z" } }, "outputs": [], "source": [ "from tinydb import TinyDB, Query\n", "from GoogleEmbeddings import Embeddings" ] }, { "cell_type": "markdown", "source": [ "This code splits the text read in from the file, and embeds it using the [GoogleEmbeddings](https://github.com/mocboch/website/blob/master/GoogleEmbeddings.py) class." ], "metadata": { "collapsed": false }, "id": "3d52ebd9086d7d59" }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "with open('personal_info.txt') as f:\n", " text = f.read()\n", "text = text.split('\\n* ~')\n", "ls = [t.strip('~* ') for t in text]\n", "dicts = []\n", "embeddings = Embeddings(api_key=open('google_api_key.txt').read())\n", "for i in range(len(ls)): \n", " if i%2 == 0:\n", " dicts.append({'question': ls[i],'answer': ls[i+1], \n", " 'question-embedded': embeddings.embed_query(ls[i]),\n", " 'answer-embedded': embeddings.embed_query(ls[i+1])})\n", "\n", " " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T14:59:59.052896800Z", "start_time": "2024-08-19T14:59:53.361979Z" } }, "id": "e01b88d76c2653d" }, { "cell_type": "markdown", "source": [ "Here is an example of a question and answer that I'll be inserting into the database." ], "metadata": { "collapsed": false }, "id": "de5bc58378d2e742" }, { "cell_type": "code", "execution_count": 39, "outputs": [ { "data": { "text/plain": "'Tell me about Mark'" }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dicts[0]['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:33:29.134204300Z", "start_time": "2024-08-19T15:33:29.127179500Z" } }, "id": "f1b0109514ac1595" }, { "cell_type": "code", "execution_count": 40, "outputs": [ { "data": { "text/plain": "'Mark grew up in New Jersey, and first got interested in code when he learned to program using DarkBASIC in middle school. He attended an IT-focused high school program where he continued to develop his skills, adding programming in Python and Java to his skillset along with certifications in SQL, Microsoft Excel, and Comptia A+. He went on to attend Alfred University for Environmental Science. After that, Mark had a couple of jobs in that field, including working as a Park Ranger, and as an intern working with critically endangered birds. Following that, he discovered a passion for cooking and pursued it for years, eventually becoming a kitchen manager. When he decided it was time for a change, he went back to school for a Master’s degree in data science, which he is now finishing. Mark is very interested in Natural Language Processing and Bayesian Statistics.'" }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dicts[0]['answer']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:33:41.668691600Z", "start_time": "2024-08-19T15:33:41.657295700Z" } }, "id": "fde6651a48ac89d9" }, { "cell_type": "markdown", "source": [ "## Creating the DB with TinyDB\n", "This cell creates a DB and a query object in TinyDB, and the next inserts the dictionaries created in the previous step as documents. That function returns a list of the doc_ids associated with the inserts. " ], "metadata": { "collapsed": false }, "id": "481394ca0deb8047" }, { "cell_type": "code", "execution_count": 3, "outputs": [], "source": [ "db = TinyDB('personal-info.json')\n", "User = Query()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T14:59:59.061374700Z", "start_time": "2024-08-19T14:59:59.058702100Z" } }, "id": "2ffb82eb8b38e0e1" }, { "cell_type": "code", "execution_count": 4, "outputs": [ { "data": { "text/plain": "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]" }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.insert_multiple(dicts)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T14:59:59.078649200Z", "start_time": "2024-08-19T14:59:59.064295Z" } }, "id": "20138f789c051e32" }, { "cell_type": "markdown", "source": [ "The next two cells create a table object with the [default table](https://tinydb.readthedocs.io/en/latest/usage.html#default-table) created by TinyDB, and then a vector search object with a pair of vector and scalar indices for each entry. For now, I have a small enough dataset that I can hold all of these vectors in memory, and use a naive algorithm to find the best match, but I will revisit this in the near future to create a more efficient search algorithm that considers time and space complexity. " ], "metadata": { "collapsed": false }, "id": "84c2b3f122794faa" }, { "cell_type": "code", "execution_count": 5, "outputs": [], "source": [ "table = db.table('_default')" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:00:23.897624300Z", "start_time": "2024-08-19T15:00:23.892831200Z" } }, "id": "531df61aae794307" }, { "cell_type": "code", "execution_count": 14, "outputs": [], "source": [ "vec_search = [tuple((t.doc_id, t['question-embedded'])) for t in table.all()]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:04:17.612415300Z", "start_time": "2024-08-19T15:04:17.607338100Z" } }, "id": "cd13af40f7cb5067" }, { "cell_type": "markdown", "source": [ "This cell defines a function to provide cosine similarity scores. Cosine similarity scores are a commonly used metric of comparison between vector embeddings." ], "metadata": { "collapsed": false }, "id": "20b305a3e0ea2f4b" }, { "cell_type": "code", "execution_count": 62, "outputs": [], "source": [ "import numpy as np\n", "def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:\n", " num = np.dot(vec1, vec2)\n", " denom = np.sqrt(np.dot(vec1, vec1) * np.dot(vec2, vec2))\n", " return num/denom" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T16:39:07.058515700Z", "start_time": "2024-08-19T16:39:07.049119900Z" } }, "id": "c4f9f21897697037" }, { "cell_type": "markdown", "source": [ "Now, I can test it out. Below are some test queries, the resulting cosine similarity scores, and the highest scoring questions in the dataset." ], "metadata": { "collapsed": false }, "id": "6a0945399565dbd6" }, { "cell_type": "code", "execution_count": 63, "outputs": [ { "data": { "text/plain": " score\n0 \n4 0.952122\n1 0.890730\n7 0.884191\n13 0.851925\n6 0.841705\n8 0.833763\n3 0.822064\n5 0.821628\n9 0.819010\n12 0.804606\n2 0.762136\n10 0.757881\n14 0.748708\n11 0.746675", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
score
0
40.952122
10.890730
70.884191
130.851925
60.841705
80.833763
30.822064
50.821628
90.819010
120.804606
20.762136
100.757881
140.748708
110.746675
\n
" }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "query = 'Give me a reason I should hire Mark'\n", "embedded_query = embeddings.embed_query(query)\n", "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n", "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T16:39:08.343296500Z", "start_time": "2024-08-19T16:39:08.121755200Z" } }, "id": "62f91d670d14cc32" }, { "cell_type": "code", "execution_count": 26, "outputs": [ { "data": { "text/plain": "'Why should we hire Mark?'" }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.get(doc_id=4)['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:14:11.188336900Z", "start_time": "2024-08-19T15:14:11.183074900Z" } }, "id": "60af12d17533b426" }, { "cell_type": "code", "execution_count": 50, "outputs": [ { "data": { "text/plain": " score\n0 \n1 0.951057\n8 0.932492\n9 0.925750\n7 0.915157\n4 0.913031\n3 0.898955\n6 0.893020\n5 0.884995\n10 0.855615\n2 0.839801\n13 0.837229\n11 0.804068\n14 0.799615\n12 0.763757", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
score
0
10.951057
80.932492
90.925750
70.915157
40.913031
30.898955
60.893020
50.884995
100.855615
20.839801
130.837229
110.804068
140.799615
120.763757
\n
" }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = 'What should I know about Mark?'\n", "embedded_query = embeddings.embed_query(query)\n", "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n", "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:39:04.816833200Z", "start_time": "2024-08-19T15:39:04.592391900Z" } }, "id": "6ad5c0ad655b5ea1" }, { "cell_type": "code", "execution_count": 52, "outputs": [ { "data": { "text/plain": "'Tell me about Mark'" }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.get(doc_id=1)['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:39:45.438422Z", "start_time": "2024-08-19T15:39:45.429631900Z" } }, "id": "934673808a6b39bc" }, { "cell_type": "code", "execution_count": 53, "outputs": [ { "data": { "text/plain": "'What are Mark’s goals?'" }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.get(doc_id=8)['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:40:10.045216500Z", "start_time": "2024-08-19T15:40:10.038473100Z" } }, "id": "80018c9924a516a" }, { "cell_type": "code", "execution_count": 31, "outputs": [ { "data": { "text/plain": "'What are Mark’s interests?'" }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.get(doc_id=9)['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:16:08.698275900Z", "start_time": "2024-08-19T15:16:08.679138Z" } }, "id": "c93f2057b2ff1725" }, { "cell_type": "code", "execution_count": 54, "outputs": [ { "data": { "text/plain": " score\n0 \n3 0.929328\n9 0.897411\n1 0.879798\n5 0.871701\n8 0.868609\n4 0.844823\n7 0.840330\n13 0.838112\n2 0.817903\n6 0.814839\n11 0.814703\n10 0.804551\n12 0.760181\n14 0.753478", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
score
0
30.929328
90.897411
10.879798
50.871701
80.868609
40.844823
70.840330
130.838112
20.817903
60.814839
110.814703
100.804551
120.760181
140.753478
\n
" }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = 'What does Mark like to do on the weekends?'\n", "embedded_query = embeddings.embed_query(query)\n", "scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]\n", "pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:40:35.580401900Z", "start_time": "2024-08-19T15:40:35.361670200Z" } }, "id": "799fb0ba8e61848" }, { "cell_type": "code", "execution_count": 55, "outputs": [ { "data": { "text/plain": "'What are Mark’s hobbies?'" }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.get(doc_id=3)['question']" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-19T15:40:37.975078500Z", "start_time": "2024-08-19T15:40:37.961517900Z" } }, "id": "e4028b5cc5a78551" }, { "cell_type": "markdown", "source": [ "In addition to implementing a more efficient search algorithm, and adding more questions and answers to the dataset, in the near future I would like to create amn training set of some queries paired with 'golden documents.' This will allow me to get a concrete assessment of the model's accuracy, which in turn will allow me to investigate reducing the dimensionality of the vector index further with a reliable approach. I would also like to implement a 'skills' section of this database, where I will write up paragraphs about each of my skills, pulled from job descriptions I'm looking at, and then allow the model to search using string matching TinyDB's own search algorithms. \n", "\n", "The final system will allow the machine to respond to users with context from: \n", "- My answers to interview questions, based on their similarity to the user's query\n", "- My skills, based on information the user has provided about the job they are interested in hiring for and searched by keyword\n", "\n", "This system can also be integrated into the personalized resume creation process by generating skill descriptions that match keywords in the job description context" ], "metadata": { "collapsed": false }, "id": "c7c3a7621241fbdd" }, { "cell_type": "markdown", "source": [], "metadata": { "collapsed": false }, "id": "71ef41baa81d2159" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }