# Building a Custom Implementation of the LangChain Embeddings Class
This notebook documents the steps involved in creating a custom implementation of the LangChain `Embeddings` class. The goal is a lightweight alternative to the `HuggingFaceEmbeddings` class, which I was previously using for this integration but which takes up a ton of disk space during installation.

## Imports

In [37]:
import google.generativeai as genai
from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Read the Google API key and the MongoDB credentials from local text files
with open('google_api_key.txt') as f:
    api_key = f.read().strip()  # drop any trailing newline from the key
with open('mongo_info.txt') as f:
    (user, password, url) = f.readlines()
mongo_uri = f'mongodb+srv://{user.strip()}:{password.strip()}@{url.strip()}/?retryWrites=true&w=majority&appName=website-database'
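
The f-string above interpolates the username and password verbatim. PyMongo's documentation asks for reserved characters in credentials (such as `@`, `:` or `/`) to be percent-escaped, so here is a minimal sketch of how that could be done with the standard library. The helper name `build_mongo_uri` and the sample values are hypothetical and only for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, app_name: str) -> str:
    """Build an Atlas SRV connection string with percent-escaped credentials."""
    safe_user = quote_plus(user.strip())
    safe_password = quote_plus(password.strip())
    return (
        f"mongodb+srv://{safe_user}:{safe_password}@{host.strip()}"
        f"/?retryWrites=true&w=majority&appName={app_name}"
    )


# Placeholder values for illustration only; not real credentials or hosts.
example_uri = build_mongo_uri(
    "site_user\n", "p@ss:word/1\n", "cluster0.example.mongodb.net\n", "website-database"
)
print(example_uri)
```

If the password contains no reserved characters, the escaped and unescaped URIs are identical, so this is only a safeguard.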

## The Embeddings() Class
This class works by getting embeddings from Google's Gecko model. It follows the abstract methods outlined in LangChain's `Embeddings` base class on [GitHub](https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/embeddings/embeddings.py), namely `embed_documents` and `embed_query`, and will serve my needs just fine. Most importantly, this class accomplishes in just a few lines of code what I was previously unable to fit onto the server space I have available with the AWS free tier.

In [3]:
class Embeddings():
    def __init__(self, model='models/text-embedding-004', api_key=api_key, dim=64):
        self.model, self.dim = model, dim
        genai.configure(api_key=api_key)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # One API call per text, embedded for storage in the vector index
        embeddings = [genai.embed_content(model=self.model, content=text,
                                          task_type='RETRIEVAL_DOCUMENT',
                                          output_dimensionality=self.dim)['embedding']
                      for text in texts]
        return embeddings

    def embed_query(self, text: str) -> list[float]:
        # Search queries use the RETRIEVAL_QUERY task type rather than RETRIEVAL_DOCUMENT
        return genai.embed_content(model=self.model, content=text,
                                   task_type='RETRIEVAL_QUERY',
                                   output_dimensionality=self.dim)['embedding']
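
Since the vector store only ever calls `embed_documents` and `embed_query`, any object with those two methods can stand in for the class above. As a rough sketch, here is an offline stand-in that could be used to test the rest of the pipeline without an API key. `SupportsEmbedding`, `HashEmbeddings`, and `check_embedder` are hypothetical names of my own, not part of LangChain or Google's SDK, and the hash-based vectors carry no semantic meaning.

```python
import hashlib
from typing import Protocol


class SupportsEmbedding(Protocol):
    """The two methods a LangChain-style embedder is expected to provide."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...

    def embed_query(self, text: str) -> list[float]: ...


class HashEmbeddings:
    """Deterministic stand-in: derives a fixed-length vector from a hash of the text."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        values: list[float] = []
        counter = 0
        # Each SHA-512 digest gives 64 bytes; keep hashing until dim values exist.
        while len(values) < self.dim:
            digest = hashlib.sha512(f"{counter}:{text}".encode("utf-8")).digest()
            values.extend(byte / 255.0 for byte in digest)
            counter += 1
        return values[: self.dim]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)


def check_embedder(embedder: SupportsEmbedding, dim: int) -> bool:
    """Return True if both methods produce vectors of the expected length."""
    docs = embedder.embed_documents(["first text", "second text"])
    query = embedder.embed_query("first text")
    return len(docs) == 2 and all(len(v) == dim for v in docs) and len(query) == dim


fake = HashEmbeddings(dim=64)
print(check_embedder(fake, 64))
```

The same `check_embedder` call could be pointed at the real `Embeddings()` instance as a quick sanity check that the returned vectors match the dimension the index expects.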
 

This is a list of web pages with program descriptions and other pages related to my education:

In [27]:
ed = [
 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-5',
 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-6',
 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/ms-data-faqs',
 'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-10',
 'https://news.asu.edu/20210322-university-news-asu-will-lead-effort-upskill-reskill-workforce-through-8m-grant',
 'https://degrees.apps.asu.edu/minors/major/ASU00/BABDACERT/applied-business-data-analytics?init=false&nopassive=true',
 'https://aznext.pipelineaz.com/static_assets/sites/aznext.pipelineaz.com/AZNext.Brochure.-.ASU.Salesforce.Developer.Academy.participants.pdf',
 'https://www.alfred.edu/academics/undergrad-majors-minors/environmental-studies.cfm',
 'https://www.alfred.edu/about/',
 'https://www.ucvts.org/domain/300'
]

### Loading the Pages

In [28]:
from langchain.document_loaders import WebBaseLoader
pages = [WebBaseLoader(url).load() for url in ed]

### Splitting the Text into 'Documents' for the LLM

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
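# is_separator_regex=True tells the splitter to treat the separators as regular
# expressions, which the sentence-boundary lookbehind pattern needs in order to match
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=[
    "\n\n", "\n", r"(?<=\. )", " "], is_separator_regex=True, length_function=len)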
docs = [text_splitter.split_documents(page) for page in pages]
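
To show what the splitter is doing conceptually, here is a simplified sketch of recursive splitting: try the coarsest separator first, merge neighbouring pieces while they fit in `chunk_size`, and fall back to the next separator for any piece that is still too long. This is my own stripped-down illustration, not LangChain's implementation. It handles literal separators only, has no overlap, and drops the separator at chunk boundaries.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters where possible."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Still fits: keep merging neighbouring pieces.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
result = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=10)
print(result)
```

With a `chunk_size` of 1000, as in the cell above, most paragraphs would stay intact and only unusually long ones would be broken at finer separators.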

### Pushing the Documents to Mongo Atlas

In [34]:
client = MongoClient(mongo_uri)
collection = client['website-database']['education-v2']

embeddings = Embeddings()

docsearches = [MongoDBAtlasVectorSearch.from_documents(
 doc, embeddings, collection=collection
) for doc in docs]

### Creating a Vector Search Object
This object lets LangChain connect to MongoDB Atlas and run vector searches over the stored records.

In [35]:
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
 mongo_uri,
 'website-database.education-v2', 
 embeddings,
 index_name="vector_index"
 )
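
The `index_name="vector_index"` argument refers to an Atlas Vector Search index that has to exist on the collection already; its definition is not shown in this notebook. As a sketch of what such a definition might look like for this setup, assuming the default `embedding` field name used by `MongoDBAtlasVectorSearch` and cosine similarity (the similarity metric is my assumption):

```python
import json

EMBEDDING_DIM = 64  # must match the dim passed to Embeddings()

# Assumed definition for the Atlas Vector Search index named "vector_index".
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",  # default embedding_key in MongoDBAtlasVectorSearch
            "numDimensions": EMBEDDING_DIM,
            "similarity": "cosine",  # assumption; Atlas also offers "euclidean" and "dotProduct"
        }
    ]
}

# JSON form, as it could be pasted into the Atlas index editor.
print(json.dumps(vector_index_definition, indent=2))
```

The key point is that `numDimensions` has to agree with the length of the vectors the embeddings class returns, which is 64 here.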

### Creating the Pipeline for Retrieval and Generation

In [38]:
retriever = vector_search.as_retriever(search_type="similarity", search_kwargs={"k": 15})
model = ChatGoogleGenerativeAI(model='gemini-1.5-flash', api_key=api_key)
prompt = hub.pull('rlm/rag-prompt')
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [39]:
rag_chain = (
 {"context": retriever | format_docs, "question": RunnablePassthrough()}
 | prompt
 | model
 | StrOutputParser()
)
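
The dict at the head of the chain sends the incoming question down two branches: one retrieves and formats context, the other passes the question through unchanged. The prompt, model, and parser then run in sequence. A plain-Python sketch of that data flow, with stubs in place of the real components (`Doc`, `fake_retriever`, `fake_prompt`, `fake_model`, and `run_chain` are hypothetical stand-ins, not LangChain APIs):

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question: str) -> list[Doc]:
    # Stand-in for the vector store retriever: always returns the same documents.
    return [Doc("Chunk one."), Doc("Chunk two.")]


def fake_prompt(inputs: dict) -> str:
    # Stand-in for the pulled RAG prompt: fills context and question into a template.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_model(prompt_text: str) -> str:
    # Stand-in for the chat model: reports the prompt length instead of answering.
    return f"(model saw {len(prompt_text)} characters)"


def run_chain(question: str) -> str:
    # Step 1: the dict stage feeds the same question to both branches.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Steps 2-4: prompt -> model -> plain string output.
    return fake_model(fake_prompt(inputs))


print(run_chain("What is in the chunks?"))
```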

### Some Test Prompts

In [47]:
query = 'Tell me about Eastern University\'s Masters in Data Science program'
# Streamed chunks already include their own whitespace, so join them without a separator
response = ''.join(rag_chain.stream(query))
response

"Eastern University offers a Master's in Data Science program that has been highly ranked by several organizations. The program includes a curriculum that covers various aspects of data science , and the university provides information about admissions requirements and student learning outcomes. You can find more details on the Eastern University website. \n"

In [50]:
query = 'Tell me about the Advanced Business Data Analytics program at ASU'
# Streamed chunks already include their own whitespace, so join them without a separator
response = ''.join(rag_chain.stream(query))
response

'The Applied Business Data Analytics certificate program at Arizona State University (ASU) is offered by the W. P. Carey School of Business. It is available both online and in person in Tempe. The program focuses on practical applications of computer-based tools for managing and analyzing large datasets, including predictive analytics, big data techniques, and visualization. \n'
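
The stray space before the comma in the first response ("data science , and") is consistent with streamed chunks being joined with a space: each chunk already carries its own whitespace, so a space separator adds an extra one at every chunk boundary. A small illustration with made-up chunks (not actual model output):

```python
# Hypothetical chunks, shaped like what a streaming chain might yield.
chunks = [
    "The program covers data science",
    ", and admissions details",
    " are on the website.",
]

joined_with_space = " ".join(chunks)  # adds a space at every chunk boundary
joined_plain = "".join(chunks)        # keeps the text exactly as streamed

print(joined_with_space)
print(joined_plain)
```

When streaming is not needed, `rag_chain.invoke(query)` returns the complete string in one call.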

This retriever is lightweight, will fit on my website, and does a pretty good job with only 64-dimensional vectors. I'd call this project a success!