{ "cells": [ { "cell_type": "markdown", "source": [ "# Building a Custom Implementation of the LangChain Embeddings Class\n", "This notebook will document the steps involved in creating a custom implementation of the langchain embeddings class. The idea of this implementation is to be a lightweight alternative to the HuggingFaceEmbeddings class, which I was previously using for this integration, but takes up a ton of disk space during installation. " ], "metadata": { "collapsed": false }, "id": "19a0a430382f8101" }, { "cell_type": "markdown", "source": [ "## Imports" ], "metadata": { "collapsed": false }, "id": "9ea3f2397a718efc" }, { "cell_type": "code", "execution_count": 37, "outputs": [], "source": [ "import google.generativeai as genai\n", "from pymongo import MongoClient\n", "from langchain.vectorstores import MongoDBAtlasVectorSearch\n", "from langchain_google_genai import ChatGoogleGenerativeAI\n", "from langchain import hub\n", "from langchain_core.output_parsers import StrOutputParser\n", "from langchain_core.runnables import RunnablePassthrough\n", "with open('google_api_key.txt') as f:\n", " api_key = f.read()\n", "with open('mongo_info.txt') as f:\n", " (user, password, url) = f.readlines()\n", "mongo_uri = f'mongodb+srv://{user.strip()}:{password.strip()}@{url.strip()}/?retryWrites=true&w=majority&appName=website-database'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T21:03:57.803440700Z", "start_time": "2024-08-14T21:03:57.796105700Z" } }, "id": "885f3a9cfcfd28b1" }, { "cell_type": "markdown", "source": [ "## The Embeddings() Class\n", "This class works by getting embeddings from Google's Gecko model. It follows the abstract methods outlined on LangChain's [github](https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/embeddings/embeddings.py), and will serve my needs just fine. Most importantly, this class accomplishes in just a few lines of code what I was previously unable to fit onto the server space I have available with the AWS free tier. " ], "metadata": { "collapsed": false }, "id": "954de32abadb5e28" }, { "cell_type": "code", "execution_count": 3, "outputs": [], "source": [ "class Embeddings():\n", " def __init__(self, model='models/text-embedding-004', api_key=api_key, dim=64):\n", " self.model, self.dim = model, dim\n", " genai.configure(api_key=api_key)\n", " def embed_documents(self, texts: list[str]) -> list[list[float]]:\n", " embeddings = [genai.embed_content(model=self.model, content=text, \n", " task_type='RETRIEVAL_DOCUMENT', \n", " output_dimensionality=self.dim)['embedding']\n", " for text in texts]\n", " return embeddings\n", " def embed_query(self, text: str) -> list[float]:\n", " return genai.embed_content(model=self.model, content=text, task_type='RETRIEVAL_DOCUMENT', output_dimensionality=self.dim)['embedding']\n", " " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:15:43.634248600Z", "start_time": "2024-08-14T20:15:43.595884200Z" } }, "id": "dc83744cbac4f37f" }, { "cell_type": "markdown", "source": [ "These are a list of webpages with program descriptions and other related pages having to do with my education:" ], "metadata": { "collapsed": false }, "id": "1beaec4bf57f4a12" }, { "cell_type": "code", "execution_count": 27, "outputs": [], "source": [ "ed = [\n", " 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-5',\n", " 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-6',\n", " 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/ms-data-faqs',\n", " 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-10',\n", " 'https://news.asu.edu/20210322-university-news-asu-will-lead-effort-upskill-reskill-workforce-through-8m-grant',\n", " 'https://degrees.apps.asu.edu/minors/major/ASU00/BABDACERT/applied-business-data-analytics?init=false&nopassive=true',\n", " 'https://aznext.pipelineaz.com/static_assets/sites/aznext.pipelineaz.com/AZNext.Brochure.-.ASU.Salesforce.Developer.Academy.participants.pdf',\n", " 'https://www.alfred.edu/academics/undergrad-majors-minors/environmental-studies.cfm',\n", " 'https://www.alfred.edu/about/',\n", " 'https://www.ucvts.org/domain/300'\n", "]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:53:34.307073800Z", "start_time": "2024-08-14T20:53:34.294730300Z" } }, "id": "cce798de4236f357" }, { "cell_type": "markdown", "source": [ "### Loading the Pages" ], "metadata": { "collapsed": false }, "id": "9273487936636616" }, { "cell_type": "code", "execution_count": 28, "outputs": [], "source": [ "from langchain.document_loaders import WebBaseLoader\n", "pages = [WebBaseLoader(url).load() for url in ed]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:53:54.191762900Z", "start_time": "2024-08-14T20:53:36.298383100Z" } }, "id": "739104b1f9ed5070" }, { "cell_type": "markdown", "source": [ "### Splitting the Text into 'Documents' for the LLM" ], "metadata": { "collapsed": false }, "id": "47bc1e252e2cfd9" }, { "cell_type": "code", "execution_count": 30, "outputs": [], "source": [ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=[\n", " \"\\n\\n\", \"\\n\", \"(?<=\\. )\", \" \"], length_function=len)\n", "docs = [text_splitter.split_documents(page) for page in pages]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:55:24.132642200Z", "start_time": "2024-08-14T20:55:24.106925600Z" } }, "id": "fc1263a212e0aee8" }, { "cell_type": "markdown", "source": [ "### Pushing the Documents to Mongo Atlas" ], "metadata": { "collapsed": false }, "id": "bdd980754f17e364" }, { "cell_type": "code", "execution_count": 34, "outputs": [], "source": [ "client = MongoClient(mongo_uri)\n", "collection = client['website-database']['education-v2']\n", "\n", "embeddings = Embeddings()\n", "\n", "docsearches = [MongoDBAtlasVectorSearch.from_documents(\n", " doc, embeddings, collection=collection\n", ") for doc in docs]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:58:36.457952300Z", "start_time": "2024-08-14T20:56:48.100921200Z" } }, "id": "fefa74589b60e6bf" }, { "cell_type": "markdown", "source": [ "### Creating a Vector Search Object\n", "This is an object in the Python code that allows LangChain to connect to MongoDB and search its records" ], "metadata": { "collapsed": false }, "id": "6dcc31c0a65d3e00" }, { "cell_type": "code", "execution_count": 35, "outputs": [], "source": [ "vector_search = MongoDBAtlasVectorSearch.from_connection_string(\n", " mongo_uri,\n", " 'website-database.education-v2', \n", " embeddings,\n", " index_name=\"vector_index\"\n", " )" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T20:58:59.365934300Z", "start_time": "2024-08-14T20:58:59.089685400Z" } }, "id": "5c3a45a940b40f0f" }, { "cell_type": "markdown", "source": [ "### Creating the Pipeline for Retrieval and Generation" ], "metadata": { "collapsed": false }, "id": "1ed818d1592a44ce" }, { "cell_type": "code", "execution_count": 38, "outputs": [], "source": [ "retriever = vector_search.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 15})\n", "model = ChatGoogleGenerativeAI(model='gemini-1.5-flash', api_key=api_key)\n", "prompt = hub.pull('rlm/rag-prompt')\n", "def format_docs(docs):\n", " return \"\\n\\n\".join(doc.page_content for doc in docs)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T21:04:26.931650200Z", "start_time": "2024-08-14T21:04:25.384720500Z" } }, "id": "ed0148bd4f72f4de" }, { "cell_type": "code", "execution_count": 39, "outputs": [], "source": [ "rag_chain = (\n", " {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", " | prompt\n", " | model\n", " | StrOutputParser()\n", ")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T21:04:26.949767600Z", "start_time": "2024-08-14T21:04:26.943041700Z" } }, "id": "dd613cae2b5a5ff3" }, { "cell_type": "markdown", "source": [ "### Some Test Prompts" ], "metadata": { "collapsed": false }, "id": "a4ea19eec0572e82" }, { "cell_type": "code", "execution_count": 47, "outputs": [ { "data": { "text/plain": "\"Eastern University offers a Master's in Data Science program that has been highly ranked by several organizations. The program includes a curriculum that covers various aspects of data science , and the university provides information about admissions requirements and student learning outcomes. You can find more details on the Eastern University website. \\n\"" }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = 'Tell me about Eastern University\\'s Masters in Data Science program'\n", "response = ' '.join([chunk for chunk in rag_chain.stream(query)])\n", "response" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T21:08:41.159523Z", "start_time": "2024-08-14T21:08:39.478238700Z" } }, "id": "45035ce7c6aa744d" }, { "cell_type": "code", "execution_count": 50, "outputs": [ { "data": { "text/plain": "'The Applied Business Data Analytics certificate program at Arizona State University (ASU) is offered by the W. P. Carey School of Business. It is available both online and in person in Tempe. The program focuses on practical applications of computer-based tools for managing and analyzing large datasets, including predictive analytics, big data techniques, and visualization. \\n'" }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = 'Tell me about the Advanced Business Data Analytics program at ASU'\n", "response = ' '.join([chunk for chunk in rag_chain.stream(query)])\n", "response" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-14T21:13:53.219958400Z", "start_time": "2024-08-14T21:13:51.283368Z" } }, "id": "e718ea434ae082df" }, { "cell_type": "markdown", "source": [ "This retriever is lightweight, will fit on my website, and does a pretty good job with only 64-dimensional vectors. I'd call this project a success!" ], "metadata": { "collapsed": false }, "id": "940fa7f304be010" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }