{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Postgres as memory\n", "\n", "This notebook shows how to use Postgres as a memory store in Semantic Kernel.\n", "\n", "The code below pulls the most recent papers from [ArviX](https://arxiv.org/), creates embeddings from the paper abstracts, and stores them in a Postgres database.\n", "\n", "In the future, we can use the Postgres vector store to search the database for similar papers based on the embeddings - stay tuned!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import textwrap\n", "import xml.etree.ElementTree as ET\n", "from dataclasses import dataclass\n", "from datetime import datetime\n", "from typing import Annotated, Any\n", "\n", "import numpy as np\n", "import requests\n", "\n", "from semantic_kernel.connectors.ai.open_ai.prompt_execution_settings.open_ai_prompt_execution_settings import (\n", " OpenAIEmbeddingPromptExecutionSettings,\n", ")\n", "from semantic_kernel.connectors.ai.open_ai.services.azure_text_embedding import AzureTextEmbedding\n", "from semantic_kernel.connectors.ai.open_ai.services.open_ai_text_embedding import OpenAITextEmbedding\n", "from semantic_kernel.connectors.memory.postgres.postgres_collection import PostgresCollection\n", "from semantic_kernel.data.const import DistanceFunction, IndexKind\n", "from semantic_kernel.data.vector_store_model_decorator import vectorstoremodel\n", "from semantic_kernel.data.vector_store_record_fields import (\n", " VectorStoreRecordDataField,\n", " VectorStoreRecordKeyField,\n", " VectorStoreRecordVectorField,\n", ")\n", "from semantic_kernel.data.vector_store_record_utils import VectorStoreRecordUtils\n", "from semantic_kernel.kernel import Kernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up your environment\n", "\n", "You'll need to set up your environment to provide connection information to Postgres, as well as OpenAI or Azure OpenAI.\n", "\n", "To do this, copy the `.env.example` file to `.env` and fill in the necessary information.\n", "\n", "### Postgres configuration\n", "\n", "You'll need to provide a connection string to a Postgres database. You can use a local Postgres instance, or a cloud-hosted one.\n", "You can provide a connection string, or provide environment variables with the connection information. See the .env.example file for `POSTGRES_` settings.\n", "\n", "#### Using Docker\n", "\n", "You can also use docker to bring up a Postgres instance by following the steps below:\n", "\n", "Create an `init.sql` that has the following:\n", "\n", "```sql\n", "CREATE EXTENSION IF NOT EXISTS vector;\n", "```\n", "\n", "Now you can start a postgres instance with the following:\n", "\n", "```\n", "docker pull pgvector/pgvector:pg16\n", "docker run --rm -it --name pgvector -p 5432:5432 -v ./init.sql:/docker-entrypoint-initdb.d/init.sql -e POSTGRES_PASSWORD=example pgvector/pgvector:pg16\n", "```\n", "\n", "_Note_: Use `.\\init.sql` on Windows and `./init.sql` on WSL or Linux/Mac.\n", "\n", "Then you could use the connection string:\n", "\n", "```\n", "POSTGRES_CONNECTION_STRING=\"host=localhost port=5432 dbname=postgres user=postgres password=example\"\n", "```\n", "\n", "### OpenAI configuration\n", "\n", "You can either use OpenAI or Azure OpenAI APIs. You provide the API key and other configuration in the `.env` file. Set either the `OPENAI_` or `AZURE_OPENAI_` settings.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Path to the environment file\n", "env_file_path = \".env\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we set some additional configuration." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# -- ArXiv settings --\n", "\n", "# The search term to use when searching for papers on arXiv. All metadata fields for the papers are searched.\n", "SEARCH_TERM = \"generative ai\"\n", "\n", "# The category of papers to search for on arXiv. See https://arxiv.org/category_taxonomy for a list of categories.\n", "ARVIX_CATEGORY = \"cs.AI\"\n", "\n", "# The maximum number of papers to search for on arXiv.\n", "MAX_RESULTS = 10\n", "\n", "# -- OpenAI settings --\n", "\n", "# Set this flag to False to use the OpenAI API instead of Azure OpenAI\n", "USE_AZURE_OPENAI = True\n", "\n", "# The name of the OpenAI model or Azure OpenAI deployment to use\n", "EMBEDDING_MODEL = \"text-embedding-3-small\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we define a vector store model. This model defines the table and column names for storing the embeddings. We use the `@vectorstoremodel` decorator to tell Semantic Kernel to create a vector store definition from the model. The VectorStoreRecordField annotations define the fields that will be stored in the database, including key and vector fields." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "@vectorstoremodel\n", "@dataclass\n", "class ArxivPaper:\n", " id: Annotated[str, VectorStoreRecordKeyField()]\n", " title: Annotated[str, VectorStoreRecordDataField()]\n", " abstract: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name=\"abstract_vector\")]\n", " published: Annotated[datetime, VectorStoreRecordDataField()]\n", " authors: Annotated[list[str], VectorStoreRecordDataField()]\n", " link: Annotated[str | None, VectorStoreRecordDataField()]\n", "\n", " abstract_vector: Annotated[\n", " np.ndarray | None,\n", " VectorStoreRecordVectorField(\n", " embedding_settings={\"embedding\": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},\n", " index_kind=IndexKind.HNSW,\n", " dimensions=1536,\n", " distance_function=DistanceFunction.COSINE,\n", " property_type=\"float\",\n", " serialize_function=np.ndarray.tolist,\n", " deserialize_function=np.array,\n", " ),\n", " ] = None\n", "\n", " @classmethod\n", " def from_arxiv_info(cls, arxiv_info: dict[str, Any]) -> \"ArxivPaper\":\n", " return cls(\n", " id=arxiv_info[\"id\"],\n", " title=arxiv_info[\"title\"].replace(\"\\n \", \" \"),\n", " abstract=arxiv_info[\"abstract\"].replace(\"\\n \", \" \"),\n", " published=arxiv_info[\"published\"],\n", " authors=arxiv_info[\"authors\"],\n", " link=arxiv_info[\"link\"],\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a function that queries the ArviX API for the most recent papers based on our search query and category." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def query_arxiv(search_query: str, category: str = \"cs.AI\", max_results: int = 10) -> list[dict[str, Any]]:\n", " \"\"\"\n", " Query the ArXiv API and return a list of dictionaries with relevant metadata for each paper.\n", "\n", " Args:\n", " search_query: The search term or topic to query for.\n", " category: The category to restrict the search to (default is \"cs.AI\").\n", " See https://arxiv.org/category_taxonomy for a list of categories.\n", " max_results: Maximum number of results to retrieve (default is 10).\n", " \"\"\"\n", " response = requests.get(\n", " \"http://export.arxiv.org/api/query?\"\n", " f\"search_query=all:%22{search_query.replace(' ', '+')}%22\"\n", " f\"+AND+cat:{category}&start=0&max_results={max_results}&sortBy=lastUpdatedDate&sortOrder=descending\"\n", " )\n", "\n", " root = ET.fromstring(response.content)\n", " ns = {\"atom\": \"http://www.w3.org/2005/Atom\"}\n", "\n", " return [\n", " {\n", " \"id\": entry.find(\"atom:id\", ns).text.split(\"/\")[-1],\n", " \"title\": entry.find(\"atom:title\", ns).text,\n", " \"abstract\": entry.find(\"atom:summary\", ns).text,\n", " \"published\": entry.find(\"atom:published\", ns).text,\n", " \"link\": entry.find(\"atom:id\", ns).text,\n", " \"authors\": [author.find(\"atom:name\", ns).text for author in entry.findall(\"atom:author\", ns)],\n", " \"categories\": [category.get(\"term\") for category in entry.findall(\"atom:category\", ns)],\n", " \"pdf_link\": next(\n", " (link_tag.get(\"href\") for link_tag in entry.findall(\"atom:link\", ns) if link_tag.get(\"title\") == \"pdf\"),\n", " None,\n", " ),\n", " }\n", " for entry in root.findall(\"atom:entry\", ns)\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use this function to query papers and store them in memory as our model types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arxiv_papers: list[ArxivPaper] = [\n", " ArxivPaper.from_arxiv_info(paper)\n", " for paper in query_arxiv(SEARCH_TERM, category=ARVIX_CATEGORY, max_results=MAX_RESULTS)\n", "]\n", "\n", "print(f\"Found {len(arxiv_papers)} papers on '{SEARCH_TERM}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a `PostgresCollection`, which represents the table in Postgres where we will store the paper information and embeddings." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "collection = PostgresCollection[str, ArxivPaper](\n", " collection_name=\"arxiv_papers\", data_model_type=ArxivPaper, env_file_path=env_file_path\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Kernel and add the TextEmbedding service, which will be used to generate embeddings of the abstract for each paper." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "kernel = Kernel()\n", "if USE_AZURE_OPENAI:\n", " text_embedding = AzureTextEmbedding(\n", " service_id=\"embedding\", deployment_name=EMBEDDING_MODEL, env_file_path=env_file_path\n", " )\n", "else:\n", " text_embedding = OpenAITextEmbedding(\n", " service_id=\"embedding\", ai_model_id=EMBEDDING_MODEL, env_file_path=env_file_path\n", " )\n", "\n", "kernel.add_service(text_embedding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use VectorStoreRecordUtils to add embeddings to our models." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "records = await VectorStoreRecordUtils(kernel).add_vector_to_records(arxiv_papers, data_model_type=ArxivPaper)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the models have embeddings, we can write them into the Postgres database." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "async with collection:\n", " await collection.create_collection_if_not_exists()\n", " keys = await collection.upsert_batch(records)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we retrieve the first few models from the database and print out their information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "async with collection:\n", " results = await collection.get_batch(keys[:3])\n", " if results:\n", " for result in results:\n", " print(f\"# {result.title}\")\n", " print()\n", " wrapped_abstract = textwrap.fill(result.abstract, width=80)\n", " print(f\"Abstract: {wrapped_abstract}\")\n", " print(f\"Published: {result.published}\")\n", " print(f\"Link: {result.link}\")\n", " print(f\"PDF Link: {result.link}\")\n", " print(f\"Authors: {', '.join(result.authors)}\")\n", " print(f\"Embedding: {result.abstract_vector}\")\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...searching Postgres memory coming soon, to be continued!" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 2 }