{"cells": [{"attachments": {}, "cell_type": "markdown", "id": "bccd47fc", "metadata": {}, "source": ["\"在\n"]}, {"cell_type": "markdown", "id": "db0855d0", "metadata": {}, "source": ["# Postgres向量存储\n", "在这个笔记本中,我们将展示如何使用[Postgresql](https://www.postgresql.org)和[pgvector](https://github.com/pgvector/pgvector)在LlamaIndex中执行向量搜索。\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "e4f33fc9", "metadata": {}, "source": ["如果您在colab上打开这个笔记本,您可能需要安装LlamaIndex 🦙。\n"]}, {"cell_type": "code", "execution_count": null, "id": "d2fc9c18", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-vector-stores-postgres"]}, {"cell_type": "code", "execution_count": null, "id": "712daea5", "metadata": {}, "outputs": [], "source": ["!pip install llama-index"]}, {"cell_type": "markdown", "id": "eadf6b8a", "metadata": {}, "source": ["运行以下单元格将在Colab中安装带有PGVector的Postgres。\n"]}, {"cell_type": "code", "execution_count": null, "id": "9ab46b5b", "metadata": {}, "outputs": [], "source": ["!sudo apt update\n", "!echo | sudo apt install -y postgresql-common\n", "!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh\n", "!echo | sudo apt install postgresql-15-pgvector\n", "!sudo service postgresql start\n", "!sudo -u postgres psql -c \"ALTER USER postgres PASSWORD 'password';\"\n", "!sudo -u postgres psql -c \"CREATE DATABASE vector_db;\""]}, {"cell_type": "code", "execution_count": null, "id": "c2d1c538", "metadata": {}, "outputs": [], "source": ["# import logging\n", "# import sys\n", "\n", "# 取消注释以查看调试日志\n", "# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n", "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n", "\n", "from llama_index.core import SimpleDirectoryReader,StorageContext\n", "from llama_index.core import VectorStoreIndex\n", "from llama_index.vector_stores.postgres import PGVectorStore\n", "import textwrap\n", "import openai"]}, {"cell_type": "markdown", "id": "26c71b6d", "metadata": {}, "source": ["### 设置OpenAI\n", "第一步是配置OpenAI密钥。它将用于为加载到索引中的文档创建嵌入。\n"]}, {"cell_type": "code", "execution_count": null, "id": "67b86621", "metadata": {}, "outputs": [], "source": ["import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = \"\"\n", "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"]}, {"attachments": {}, "cell_type": "markdown", "id": "eecf4bd5", "metadata": {}, "source": ["下载数据\n"]}, {"cell_type": "code", "execution_count": null, "id": "6df9fa89", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2024-03-14 02:56:30-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 75042 (73K) [text/plain]\n", "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n", "\n", "data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.001s \n", "\n", "2024-03-14 02:56:30 (72.2 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n", "\n"]}], "source": ["!mkdir -p 'data/paul_graham/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"]}, {"attachments": {}, "cell_type": "markdown", "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396", "metadata": {}, "source": ["### 加载文档\n", "使用SimpleDirectoryReader加载存储在`data/paul_graham/`中的文档\n"]}, {"cell_type": "code", "execution_count": null, "id": "c154dd4b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Document ID: 1306591e-cc2d-430b-a74c-03ae7105ecab\n"]}], "source": ["documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()\n", "print(\"Document ID:\", documents[0].doc_id)"]}, {"cell_type": "markdown", "id": "7bd24f0a", "metadata": {}, "source": ["### 创建数据库\n", "使用已经在本地运行的postgres,创建我们将要使用的数据库。\n"]}, {"cell_type": "code", "execution_count": null, "id": "e6d61e73", "metadata": {}, "outputs": [], "source": ["import psycopg2\n", "\n", "connection_string = \"postgresql://postgres:password@localhost:5432\"\n", "db_name = \"vector_db\"\n", "conn = psycopg2.connect(connection_string)\n", "conn.autocommit = True\n", "\n", "with conn.cursor() as c:\n", " c.execute(f\"DROP DATABASE IF EXISTS {db_name}\")\n", " c.execute(f\"CREATE DATABASE {db_name}\")"]}, {"cell_type": "markdown", "id": "c0232fd1", "metadata": {}, "source": ["### 创建索引\n", "在这里,我们使用之前加载的文档创建一个由Postgres支持的索引。PGVectorStore需要一些参数。\n"]}, {"cell_type": "code", "execution_count": null, "id": "8731da62", "metadata": {}, "outputs": [{"data": {"application/vnd.jupyter.widget-view+json": {"model_id": "99f8aee7508e4b7681b86bd84ab9a4d6", "version_major": 2, "version_minor": 0}, "text/plain": ["Parsing nodes: 0%| | 0/1 [00:00] 1.67M --.-KB/s in 0.02s \n", "\n", "2024-03-14 02:56:46 (106 MB/s) - ‘data/git_commits/commit_history.csv’ saved [1753902/1753902]\n", "\n"]}], "source": ["!mkdir -p 'data/git_commits/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'"]}, {"cell_type": "code", "execution_count": null, "id": "fef41f44", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}\n", "4167\n"]}], "source": ["import csv\n", "\n", "with open(\"data/git_commits/commit_history.csv\", \"r\") as f:\n", " commits = list(csv.DictReader(f))\n", "\n", "print(commits[0])\n", "print(len(commits))"]}, {"cell_type": "markdown", "id": "3b0d9f47", "metadata": {}, "source": ["#### 添加具有自定义元数据的节点\n"]}, {"cell_type": "code", "execution_count": null, "id": "3920109b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Node ID: 69513543-dee5-4c65-b4b8-39295f11e669\n", "Text: Fix segfault in set_integer_now_func When an invalid function\n", "oid is passed to set_integer_now_func, it finds out that the function\n", "oid is invalid but before throwing the error, it calls ReleaseSysCache\n", "on an invalid tuple causing a segfault. Fixed that by removing the\n", "invalid call to ReleaseSysCache. Fixes #6037\n", "2023-03-22 to 2023-09-05\n", "{'rafia.sabih@gmail.com', 'erik@timescale.com', 'jguthrie@timescale.com', 'sven@timescale.com', '36882414+akuzm@users.noreply.github.com', 'me@noctarius.com', 'satish.8483@gmail.com', 'nikhil@timescale.com', 'konstantina@timescale.com', 'dmitry@timescale.com', 'mats@timescale.com', 'jan@timescale.com', 'lakshmi@timescale.com', 'fabriziomello@gmail.com', 'engel@sero-systems.de'}\n"]}], "source": ["# 为前100次提交创建一个TextNode\n", "from llama_index.core.schema import TextNode\n", "from datetime import datetime\n", "import re\n", "\n", "nodes = []\n", "dates = set()\n", "authors = set()\n", "for commit in commits[:100]:\n", " author_email = commit[\"author\"].split(\"<\")[1][:-1]\n", " commit_date = datetime.strptime(\n", " commit[\"date\"], \"%a %b %d %H:%M:%S %Y %z\"\n", " ).strftime(\"%Y-%m-%d\")\n", " commit_text = commit[\"change summary\"]\n", " if commit[\"change details\"]:\n", " commit_text += \"\\n\\n\" + commit[\"change details\"]\n", " fixes = re.findall(r\"#(\\d+)\", commit_text, re.IGNORECASE)\n", " nodes.append(\n", " TextNode(\n", " text=commit_text,\n", " metadata={\n", " \"commit_date\": commit_date,\n", " \"author\": author_email,\n", " \"fixes\": fixes,\n", " },\n", " )\n", " )\n", " dates.add(commit_date)\n", " authors.add(author_email)\n", "\n", "print(nodes[0])\n", "print(min(dates), \"到\", max(dates))\n", "print(authors)"]}, {"cell_type": "code", "execution_count": null, "id": "a638f76a", "metadata": {}, "outputs": [], "source": ["# 将参数传递给PGVectorStore对象,创建一个vector_store对象\n", "vector_store = PGVectorStore.from_params(\n", " database=db_name,\n", " host=url.host,\n", " password=url.password,\n", " port=url.port,\n", " user=url.username,\n", " table_name=\"metadata_filter_demo3\",\n", " embed_dim=1536, # openai embedding dimension\n", ")\n", "\n", "# 从vector_store对象创建一个VectorStoreIndex对象\n", "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n", "# 将节点插入到索引中\n", "index.insert_nodes(nodes)"]}, {"cell_type": "code", "execution_count": null, "id": "15f7cf45", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.\n"]}], "source": ["print(index.as_query_engine().query(\"How did Lakshmi fix the segfault?\"))"]}, {"cell_type": "markdown", "id": "7ab03ed4", "metadata": {}, "source": ["#### 应用元数据过滤器\n", "\n", "现在我们可以在检索节点时按提交作者或日期进行过滤。\n"]}, {"cell_type": "code", "execution_count": null, "id": "aa6212e7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"]}], "source": ["from llama_index.core.vector_stores.types import (\n", " MetadataFilter,\n", " MetadataFilters,\n", ")\n", "\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n", " MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n", " ],\n", " condition=\"or\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "67c19ec6", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-25\", operator=\"<=\"),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "4f6e9cdf", "metadata": {}, "source": ["#### 应用嵌套过滤器\n", "\n", "在上面的例子中,我们使用AND或OR组合了多个过滤器。我们也可以组合多组过滤器。\n", "\n", "例如,在SQL中:\n", "```sql\n", "WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')\n", "```\n"]}, {"cell_type": "code", "execution_count": null, "id": "94f20be7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilters(\n", " filters=[\n", " MetadataFilter(\n", " key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"\n", " ),\n", " MetadataFilter(\n", " key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"\n", " ),\n", " ],\n", " condition=\"and\",\n", " ),\n", " MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n", " MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n", " ],\n", " condition=\"or\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "737692ce", "metadata": {}, "source": ["上述内容可以通过使用IN运算符来简化。`PGVectorStore`支持`in`、`nin`和`contains`,用于将元素与列表进行比较。\n"]}, {"cell_type": "code", "execution_count": null, "id": "85faf8b3", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n", " MetadataFilter(\n", " key=\"author\",\n", " value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n", " operator=\"in\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "1ab9c333", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n", "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}\n", "{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n"]}], "source": ["# 使用NOT IN 进行相同的操作\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n", " MetadataFilter(\n", " key=\"author\",\n", " value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n", " operator=\"nin\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"这个软件项目是关于什么的?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "a46764cf", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n"]}], "source": ["# 包含\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"fixes\", value=\"5680\", operator=\"contains\"), # 操作符为包含\n", " ]\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"这些提交是如何修复这个问题的?\")\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "2b274ecb", "metadata": {}, "source": ["### PgVector 查询选项\n"]}, {"cell_type": "markdown", "id": "a490a0fa", "metadata": {}, "source": ["#### IVFFlat探针\n", "\n", "指定[IVFFlat探针](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options)的数量(默认为1)\n", "\n", "在从索引中检索时,您可以指定适当数量的IVFFlat探针(召回率越高越好,速度越快越好)。\n"]}, {"cell_type": "code", "execution_count": null, "id": "111a3682", "metadata": {}, "outputs": [], "source": ["retriever = index.as_retriever(\n", " vector_store_query_mode=\"hybrid\",\n", " similarity_top_k=5,\n", " vector_store_kwargs={\"ivfflat_probes\": 10},\n", ")"]}, {"cell_type": "markdown", "id": "6104ef8d", "metadata": {}, "source": ["#### HNSW EF搜索\n", "\n", "指定搜索的动态[候选列表](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options-1)的大小(默认为40)\n"]}, {"cell_type": "code", "execution_count": null, "id": "f3a44758", "metadata": {}, "outputs": [], "source": ["retriever = index.as_retriever(\n", " vector_store_query_mode=\"hybrid\",\n", " similarity_top_k=5,\n", " vector_store_kwargs={\"hnsw_ef_search\": 300},\n", ")"]}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 5}