{"cells": [{"attachments": {}, "cell_type": "markdown", "id": "bccd47fc", "metadata": {}, "source": ["\n"]}, {"cell_type": "markdown", "id": "db0855d0", "metadata": {}, "source": ["# Postgres向量存储\n", "在这个笔记本中,我们将展示如何使用[Postgresql](https://www.postgresql.org)和[pgvector](https://github.com/pgvector/pgvector)在LlamaIndex中执行向量搜索。\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "e4f33fc9", "metadata": {}, "source": ["如果您在colab上打开这个笔记本,您可能需要安装LlamaIndex 🦙。\n"]}, {"cell_type": "code", "execution_count": null, "id": "d2fc9c18", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-vector-stores-postgres"]}, {"cell_type": "code", "execution_count": null, "id": "712daea5", "metadata": {}, "outputs": [], "source": ["!pip install llama-index"]}, {"cell_type": "markdown", "id": "eadf6b8a", "metadata": {}, "source": ["运行以下单元格将在Colab中安装带有PGVector的Postgres。\n"]}, {"cell_type": "code", "execution_count": null, "id": "9ab46b5b", "metadata": {}, "outputs": [], "source": ["!sudo apt update\n", "!echo | sudo apt install -y postgresql-common\n", "!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh\n", "!echo | sudo apt install postgresql-15-pgvector\n", "!sudo service postgresql start\n", "!sudo -u postgres psql -c \"ALTER USER postgres PASSWORD 'password';\"\n", "!sudo -u postgres psql -c \"CREATE DATABASE vector_db;\""]}, {"cell_type": "code", "execution_count": null, "id": "c2d1c538", "metadata": {}, "outputs": [], "source": ["# import logging\n", "# import sys\n", "\n", "# 取消注释以查看调试日志\n", "# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n", "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n", "\n", "from llama_index.core import SimpleDirectoryReader,StorageContext\n", "from llama_index.core import VectorStoreIndex\n", "from llama_index.vector_stores.postgres import PGVectorStore\n", "import textwrap\n", "import openai"]}, {"cell_type": "markdown", "id": "26c71b6d", "metadata": {}, "source": ["### 设置OpenAI\n", "第一步是配置OpenAI密钥。它将用于为加载到索引中的文档创建嵌入。\n"]}, {"cell_type": "code", "execution_count": null, "id": "67b86621", "metadata": {}, "outputs": [], "source": ["import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = \"\"\n", "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"]}, {"attachments": {}, "cell_type": "markdown", "id": "eecf4bd5", "metadata": {}, "source": ["下载数据\n"]}, {"cell_type": "code", "execution_count": null, "id": "6df9fa89", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2024-03-14 02:56:30-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 75042 (73K) [text/plain]\n", "Saving to: ‘data/paul_graham/paul_graham_essay.txt’\n", "\n", "data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.001s \n", "\n", "2024-03-14 02:56:30 (72.2 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]\n", "\n"]}], "source": ["!mkdir -p 'data/paul_graham/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"]}, {"attachments": {}, "cell_type": "markdown", "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396", "metadata": {}, "source": ["### 加载文档\n", "使用SimpleDirectoryReader加载存储在`data/paul_graham/`中的文档\n"]}, {"cell_type": "code", "execution_count": null, "id": "c154dd4b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Document ID: 1306591e-cc2d-430b-a74c-03ae7105ecab\n"]}], "source": ["documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()\n", "print(\"Document ID:\", documents[0].doc_id)"]}, {"cell_type": "markdown", "id": "7bd24f0a", "metadata": {}, "source": ["### 创建数据库\n", "使用已经在本地运行的postgres,创建我们将要使用的数据库。\n"]}, {"cell_type": "code", "execution_count": null, "id": "e6d61e73", "metadata": {}, "outputs": [], "source": ["import psycopg2\n", "\n", "connection_string = \"postgresql://postgres:password@localhost:5432\"\n", "db_name = \"vector_db\"\n", "conn = psycopg2.connect(connection_string)\n", "conn.autocommit = True\n", "\n", "with conn.cursor() as c:\n", " c.execute(f\"DROP DATABASE IF EXISTS {db_name}\")\n", " c.execute(f\"CREATE DATABASE {db_name}\")"]}, {"cell_type": "markdown", "id": "c0232fd1", "metadata": {}, "source": ["### 创建索引\n", "在这里,我们使用之前加载的文档创建一个由Postgres支持的索引。PGVectorStore需要一些参数。\n"]}, {"cell_type": "code", "execution_count": null, "id": "8731da62", "metadata": {}, "outputs": [{"data": {"application/vnd.jupyter.widget-view+json": {"model_id": "99f8aee7508e4b7681b86bd84ab9a4d6", "version_major": 2, "version_minor": 0}, "text/plain": ["Parsing nodes: 0%| | 0/1 [00:00, ?it/s]"]}, "metadata": {}, "output_type": "display_data"}, {"data": {"application/vnd.jupyter.widget-view+json": {"model_id": "6218daeaa5e0475b967ec2d00e16fbe5", "version_major": 2, "version_minor": 0}, "text/plain": ["Generating embeddings: 0%| | 0/22 [00:00, ?it/s]"]}, "metadata": {}, "output_type": "display_data"}], "source": ["from sqlalchemy import make_url\n", "\n", "url = make_url(connection_string)\n", "vector_store = PGVectorStore.from_params(\n", " database=db_name,\n", " host=url.host,\n", " password=url.password,\n", " port=url.port,\n", " user=url.username,\n", " table_name=\"paul_graham_essay\",\n", " embed_dim=1536, # openai embedding dimension\n", ")\n", "\n", "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", "index = VectorStoreIndex.from_documents(\n", " documents, storage_context=storage_context, show_progress=True\n", ")\n", "query_engine = index.as_query_engine()"]}, {"cell_type": "markdown", "id": "8ee4473a-094f-4d0a-a825-e1213db07240", "metadata": {}, "source": ["### 查询索引\n", "现在我们可以使用我们的索引来提出问题。\n"]}, {"cell_type": "code", "execution_count": null, "id": "0a2bcc07", "metadata": {}, "outputs": [], "source": ["response = query_engine.query(\"What did the author do?\")"]}, {"cell_type": "code", "execution_count": null, "id": "8cf55bf7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The author worked on writing and programming before college, initially focusing on writing short\n", "stories and later transitioning to programming on early computers like the IBM 1401 using Fortran.\n", "The author continued programming on microcomputers like the TRS-80, creating simple games and a word\n", "processor. In college, the author initially planned to study philosophy but switched to studying AI\n", "due to a lack of interest in philosophy courses. The author was inspired to work on AI after\n", "encountering works like Heinlein's novel \"The Moon is a Harsh Mistress\" and seeing Terry Winograd\n", "using SHRDLU in a PBS documentary.\n"]}], "source": ["print(textwrap.fill(str(response), 100))"]}, {"cell_type": "code", "execution_count": null, "id": "68cbd239-880e-41a3-98d8-dbb3fab55431", "metadata": {}, "outputs": [], "source": ["response = query_engine.query(\"What happened in the mid 1980s?\")"]}, {"cell_type": "code", "execution_count": null, "id": "fdf5287f", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["AI was in the air in the mid 1980s, with two main influences that sparked interest in working on it:\n", "a novel by Heinlein called The Moon is a Harsh Mistress, featuring an intelligent computer called\n", "Mike, and a PBS documentary showing Terry Winograd using SHRDLU.\n"]}], "source": ["print(textwrap.fill(str(response), 100))"]}, {"cell_type": "markdown", "id": "b3bed9e1", "metadata": {}, "source": ["### 查询现有索引\n"]}, {"cell_type": "code", "execution_count": null, "id": "e6b2634b", "metadata": {}, "outputs": [], "source": ["vector_store = PGVectorStore.from_params(\n", " database=\"vector_db\",\n", " host=\"localhost\",\n", " password=\"password\",\n", " port=5432,\n", " user=\"postgres\",\n", " table_name=\"paul_graham_essay\",\n", " embed_dim=1536, # openai embedding dimension\n", ")\n", "\n", "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n", "query_engine = index.as_query_engine()\n"]}, {"cell_type": "code", "execution_count": null, "id": "e7075af3-156e-4bde-8f76-6d9dee86861f", "metadata": {}, "outputs": [], "source": ["response = query_engine.query(\"What did the author do?\")"]}, {"cell_type": "code", "execution_count": null, "id": "b088c090", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The author worked on writing short stories and programming before college. Initially, the author\n", "wrote short stories and later started programming on an IBM 1401 using an early version of Fortran.\n", "With the introduction of microcomputers, the author's interest in programming grew, leading to\n", "writing simple games, predictive programs, and a word processor. Despite initially planning to study\n", "philosophy in college, the author switched to studying AI due to a lack of interest in philosophy\n", "courses. The author was inspired to work on AI after encountering a novel featuring an intelligent\n", "computer and a PBS documentary showcasing AI technology.\n"]}], "source": ["print(textwrap.fill(str(response), 100))"]}, {"cell_type": "markdown", "id": "55745895-8f01-4275-abaa-b2ebef2cb4c7", "metadata": {}, "source": ["### 混合搜索\n", "\n", "要启用混合搜索,您需要:\n", "1. 在构建`PGVectorStore`时传入`hybrid_search=True`(并可选择使用所需的语言配置`text_search_config`)\n", "2. 在构建查询引擎时,传入`vector_store_query_mode=\"hybrid\"`(此配置会传递给底层的检索器)。您还可以选择设置`sparse_top_k`来配置我们应该从稀疏文本搜索中获取多少结果(默认值与`similarity_top_k`相同)。\n"]}, {"cell_type": "code", "execution_count": null, "id": "65a7e133-39da-40c5-b2c5-7af2c0a3a792", "metadata": {}, "outputs": [], "source": ["from sqlalchemy import make_url\n", "\n", "url = make_url(connection_string)\n", "hybrid_vector_store = PGVectorStore.from_params(\n", " database=db_name,\n", " host=url.host,\n", " password=url.password,\n", " port=url.port,\n", " user=url.username,\n", " table_name=\"paul_graham_essay_hybrid_search\",\n", " embed_dim=1536, # openai embedding dimension\n", " hybrid_search=True,\n", " text_search_config=\"english\",\n", ")\n", "\n", "storage_context = StorageContext.from_defaults(\n", " vector_store=hybrid_vector_store\n", ")\n", "hybrid_index = VectorStoreIndex.from_documents(\n", " documents, storage_context=storage_context\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "6f8edee4-6c19-4d99-b602-110bdc5708e5", "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["/workspaces/llama_index/llama-index-integrations/vector_stores/llama-index-vector-stores-postgres/llama_index/vector_stores/postgres/base.py:571: SAWarning: UserDefinedType REGCONFIG() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this warning at: https://sqlalche.me/e/20/cprf)\n", " res = session.execute(stmt)\n"]}], "source": ["hybrid_query_engine = hybrid_index.as_query_engine(\n", " vector_store_query_mode=\"hybrid\", sparse_top_k=2\n", ")\n", "hybrid_response = hybrid_query_engine.query(\n", " \"Who does Paul Graham think of with the word schtick\"\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "bd454b25-b66c-4733-8ff4-24fb2ee84cec", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Roy Lichtenstein\n"]}], "source": ["print(hybrid_response)"]}, {"cell_type": "markdown", "id": "74ac0f05", "metadata": {}, "source": ["#### 通过QueryFusionRetriever改进混合搜索\n", "\n", "由于文本搜索和向量搜索的得分计算方式不同,仅通过文本搜索找到的节点将具有更低的得分。\n", "\n", "通常可以通过使用`QueryFusionRetriever`来改进混合搜索性能,它更好地利用互信息来对节点进行排名。\n"]}, {"cell_type": "code", "execution_count": null, "id": "345ba0fb", "metadata": {}, "outputs": [], "source": ["from llama_index.core.response_synthesizers import CompactAndRefine\n", "from llama_index.core.retrievers import QueryFusionRetriever\n", "from llama_index.core.query_engine import RetrieverQueryEngine\n", "\n", "vector_retriever = hybrid_index.as_retriever(\n", " vector_store_query_mode=\"default\",\n", " similarity_top_k=5,\n", ")\n", "text_retriever = hybrid_index.as_retriever(\n", " vector_store_query_mode=\"sparse\",\n", " similarity_top_k=5, # 在这个上下文中可与sparse_top_k互换\n", ")\n", "retriever = QueryFusionRetriever(\n", " [vector_retriever, text_retriever],\n", " similarity_top_k=5,\n", " num_queries=1, # 将其设置为1以禁用查询生成\n", " mode=\"relative_score\",\n", " use_async=False,\n", ")\n", "\n", "response_synthesizer = CompactAndRefine()\n", "query_engine = RetrieverQueryEngine(\n", " retriever=retriever,\n", " response_synthesizer=response_synthesizer,\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "1a88c84e", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Paul Graham thinks of Roy Lichtenstein when he uses the word \"schtick\" because he recognizes paintings resembling a specific type of cartoon style as being created by Roy Lichtenstein.\n"]}], "source": ["response = query_engine.query(\n", " \"Who does Paul Graham think of with the word schtick, and why?\"\n", ")\n", "print(response)"]}, {"cell_type": "markdown", "id": "2e5e8083", "metadata": {}, "source": ["### 元数据过滤器\n", "\n", "PGVectorStore支持在节点中存储元数据,并在检索步骤中基于该元数据进行过滤。\n"]}, {"cell_type": "markdown", "id": "2d0ad3fc", "metadata": {}, "source": ["#### 下载git提交数据集\n"]}, {"cell_type": "code", "execution_count": null, "id": "63e90a89", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2024-03-14 02:56:46-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 1753902 (1.7M) [text/plain]\n", "Saving to: ‘data/git_commits/commit_history.csv’\n", "\n", "data/git_commits/co 100%[===================>] 1.67M --.-KB/s in 0.02s \n", "\n", "2024-03-14 02:56:46 (106 MB/s) - ‘data/git_commits/commit_history.csv’ saved [1753902/1753902]\n", "\n"]}], "source": ["!mkdir -p 'data/git_commits/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'"]}, {"cell_type": "code", "execution_count": null, "id": "fef41f44", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}\n", "4167\n"]}], "source": ["import csv\n", "\n", "with open(\"data/git_commits/commit_history.csv\", \"r\") as f:\n", " commits = list(csv.DictReader(f))\n", "\n", "print(commits[0])\n", "print(len(commits))"]}, {"cell_type": "markdown", "id": "3b0d9f47", "metadata": {}, "source": ["#### 添加具有自定义元数据的节点\n"]}, {"cell_type": "code", "execution_count": null, "id": "3920109b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Node ID: 69513543-dee5-4c65-b4b8-39295f11e669\n", "Text: Fix segfault in set_integer_now_func When an invalid function\n", "oid is passed to set_integer_now_func, it finds out that the function\n", "oid is invalid but before throwing the error, it calls ReleaseSysCache\n", "on an invalid tuple causing a segfault. Fixed that by removing the\n", "invalid call to ReleaseSysCache. Fixes #6037\n", "2023-03-22 to 2023-09-05\n", "{'rafia.sabih@gmail.com', 'erik@timescale.com', 'jguthrie@timescale.com', 'sven@timescale.com', '36882414+akuzm@users.noreply.github.com', 'me@noctarius.com', 'satish.8483@gmail.com', 'nikhil@timescale.com', 'konstantina@timescale.com', 'dmitry@timescale.com', 'mats@timescale.com', 'jan@timescale.com', 'lakshmi@timescale.com', 'fabriziomello@gmail.com', 'engel@sero-systems.de'}\n"]}], "source": ["# 为前100次提交创建一个TextNode\n", "from llama_index.core.schema import TextNode\n", "from datetime import datetime\n", "import re\n", "\n", "nodes = []\n", "dates = set()\n", "authors = set()\n", "for commit in commits[:100]:\n", " author_email = commit[\"author\"].split(\"<\")[1][:-1]\n", " commit_date = datetime.strptime(\n", " commit[\"date\"], \"%a %b %d %H:%M:%S %Y %z\"\n", " ).strftime(\"%Y-%m-%d\")\n", " commit_text = commit[\"change summary\"]\n", " if commit[\"change details\"]:\n", " commit_text += \"\\n\\n\" + commit[\"change details\"]\n", " fixes = re.findall(r\"#(\\d+)\", commit_text, re.IGNORECASE)\n", " nodes.append(\n", " TextNode(\n", " text=commit_text,\n", " metadata={\n", " \"commit_date\": commit_date,\n", " \"author\": author_email,\n", " \"fixes\": fixes,\n", " },\n", " )\n", " )\n", " dates.add(commit_date)\n", " authors.add(author_email)\n", "\n", "print(nodes[0])\n", "print(min(dates), \"到\", max(dates))\n", "print(authors)"]}, {"cell_type": "code", "execution_count": null, "id": "a638f76a", "metadata": {}, "outputs": [], "source": ["# 将参数传递给PGVectorStore对象,创建一个vector_store对象\n", "vector_store = PGVectorStore.from_params(\n", " database=db_name,\n", " host=url.host,\n", " password=url.password,\n", " port=url.port,\n", " user=url.username,\n", " table_name=\"metadata_filter_demo3\",\n", " embed_dim=1536, # openai embedding dimension\n", ")\n", "\n", "# 从vector_store对象创建一个VectorStoreIndex对象\n", "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n", "# 将节点插入到索引中\n", "index.insert_nodes(nodes)"]}, {"cell_type": "code", "execution_count": null, "id": "15f7cf45", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.\n"]}], "source": ["print(index.as_query_engine().query(\"How did Lakshmi fix the segfault?\"))"]}, {"cell_type": "markdown", "id": "7ab03ed4", "metadata": {}, "source": ["#### 应用元数据过滤器\n", "\n", "现在我们可以在检索节点时按提交作者或日期进行过滤。\n"]}, {"cell_type": "code", "execution_count": null, "id": "aa6212e7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"]}], "source": ["from llama_index.core.vector_stores.types import (\n", " MetadataFilter,\n", " MetadataFilters,\n", ")\n", "\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n", " MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n", " ],\n", " condition=\"or\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "67c19ec6", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-25\", operator=\"<=\"),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "4f6e9cdf", "metadata": {}, "source": ["#### 应用嵌套过滤器\n", "\n", "在上面的例子中,我们使用AND或OR组合了多个过滤器。我们也可以组合多组过滤器。\n", "\n", "例如,在SQL中:\n", "```sql\n", "WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')\n", "```\n"]}, {"cell_type": "code", "execution_count": null, "id": "94f20be7", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilters(\n", " filters=[\n", " MetadataFilter(\n", " key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"\n", " ),\n", " MetadataFilter(\n", " key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"\n", " ),\n", " ],\n", " condition=\"and\",\n", " ),\n", " MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n", " MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n", " ],\n", " condition=\"or\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "737692ce", "metadata": {}, "source": ["上述内容可以通过使用IN运算符来简化。`PGVectorStore`支持`in`、`nin`和`contains`,用于将元素与列表进行比较。\n"]}, {"cell_type": "code", "execution_count": null, "id": "85faf8b3", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"]}], "source": ["filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n", " MetadataFilter(\n", " key=\"author\",\n", " value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n", " operator=\"in\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "1ab9c333", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n", "{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n", "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}\n", "{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}\n", "{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n"]}], "source": ["# 使用NOT IN 进行相同的操作\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n", " MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n", " MetadataFilter(\n", " key=\"author\",\n", " value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n", " operator=\"nin\",\n", " ),\n", " ],\n", " condition=\"and\",\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"这个软件项目是关于什么的?\")\n", "\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "code", "execution_count": null, "id": "a46764cf", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n"]}], "source": ["# 包含\n", "filters = MetadataFilters(\n", " filters=[\n", " MetadataFilter(key=\"fixes\", value=\"5680\", operator=\"contains\"), # 操作符为包含\n", " ]\n", ")\n", "\n", "retriever = index.as_retriever(\n", " similarity_top_k=10,\n", " filters=filters,\n", ")\n", "\n", "retrieved_nodes = retriever.retrieve(\"这些提交是如何修复这个问题的?\")\n", "for node in retrieved_nodes:\n", " print(node.node.metadata)"]}, {"cell_type": "markdown", "id": "2b274ecb", "metadata": {}, "source": ["### PgVector 查询选项\n"]}, {"cell_type": "markdown", "id": "a490a0fa", "metadata": {}, "source": ["#### IVFFlat探针\n", "\n", "指定[IVFFlat探针](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options)的数量(默认为1)\n", "\n", "在从索引中检索时,您可以指定适当数量的IVFFlat探针(召回率越高越好,速度越快越好)。\n"]}, {"cell_type": "code", "execution_count": null, "id": "111a3682", "metadata": {}, "outputs": [], "source": ["retriever = index.as_retriever(\n", " vector_store_query_mode=\"hybrid\",\n", " similarity_top_k=5,\n", " vector_store_kwargs={\"ivfflat_probes\": 10},\n", ")"]}, {"cell_type": "markdown", "id": "6104ef8d", "metadata": {}, "source": ["#### HNSW EF搜索\n", "\n", "指定搜索的动态[候选列表](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options-1)的大小(默认为40)\n"]}, {"cell_type": "code", "execution_count": null, "id": "f3a44758", "metadata": {}, "outputs": [], "source": ["retriever = index.as_retriever(\n", " vector_store_query_mode=\"hybrid\",\n", " similarity_top_k=5,\n", " vector_store_kwargs={\"hnsw_ef_search\": 300},\n", ")"]}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 5}