{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Property Graph Index\n", "\n", "In this notebook, we demonstrate some basic usage of the `PropertyGraphIndex` in LlamaIndex.\n", "\n", "The property graph index takes unstructured documents, extracts a property graph from them, and provides several ways to query that graph.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["%pip install llama-index"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Setup\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import os\n", "\n", "# Replace the placeholder with your own OpenAI API key\n", "os.environ[\"OPENAI_API_KEY\"] = \"sk-proj-...\""]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["!mkdir -p 'data/paul_graham/'\n", "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import nest_asyncio\n", "\n", "# Patch the running event loop so async calls work inside Jupyter\n", "nest_asyncio.apply()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core import SimpleDirectoryReader\n", "\n", "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Construction\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["/Users/loganmarkewich/Library/Caches/pypoetry/virtualenvs/llama-index-bXUwlEfH-py3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", "  from .autonotebook import tqdm as notebook_tqdm\n", "Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 25.46it/s]\n", "Extracting paths from text: 100%|██████████| 22/22 [00:12<00:00, 1.72it/s]\n", "Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 36186.15it/s]\n", "Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 1.14it/s]\n", "Generating embeddings: 100%|██████████| 5/5 [00:00<00:00, 5.43it/s]\n"]}], "source": ["from llama_index.core import PropertyGraphIndex\n", "from llama_index.embeddings.openai import OpenAIEmbedding\n", "from llama_index.llms.openai import OpenAI\n", "\n", "index = PropertyGraphIndex.from_documents(\n", "    documents,\n", "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3),\n", "    embed_model=OpenAIEmbedding(model_name=\"text-embedding-3-small\"),\n", "    show_progress=True,\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's recap what just happened:\n", "- `PropertyGraphIndex.from_documents()` - we loaded the documents into an index\n", "- `Parsing nodes` - the index parsed the documents into nodes\n", "- `Extracting paths from text` - the nodes were passed to an LLM, which was prompted to generate knowledge graph triples (i.e. paths)\n", "- `Extracting implicit paths` - each `node.relationships` property was used to infer implicit paths\n", "- `Generating embeddings` - embeddings were generated for each text node and each graph node (which is why this step ran twice)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's explore what we created! For debugging purposes, the default `SimplePropertyGraphStore` includes a helper that saves a `networkx` representation of the graph to an `html` file.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["index.property_graph_store.save_networkx_graph(name=\"./kg.html\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Opening the HTML file in a browser, we can see our graph!\n", "\n", "If you zoom in, each densely connected node is actually a source chunk, with the extracted entities and relations branching off from it.\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["![example graph](./kg_screenshot.png)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Customizing Low-Level Construction\n", "\n", "If we wanted, we could do the same ingestion using the low-level API, leveraging `kg_extractors`.\n"]}, {"cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core.indices.property_graph import (\n", "    ImplicitPathExtractor,\n", "    SimpleLLMPathExtractor,\n", ")\n", "\n", "index = PropertyGraphIndex.from_documents(\n", "    documents,\n", "    embed_model=OpenAIEmbedding(model_name=\"text-embedding-3-small\"),\n", "    kg_extractors=[\n", "        ImplicitPathExtractor(),\n", "        SimpleLLMPathExtractor(\n", "            llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3),\n", "            num_workers=4,\n", "            max_paths_per_chunk=10,\n", "        ),\n", "    ],\n", "    show_progress=True,\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["For a full guide on all extractors, see the [detailed usage page](../../module_guides/indexing/lpg_index_guide.md#construction).\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Querying\n", "\n", "Querying a property graph index typically consists of using one or more sub-retrievers and combining the results.\n", "\n", "Graph retrieval can be thought of as\n", "- selecting node(s)\n", "- traversing from those nodes\n", "\n", "By default, two types of retrieval are used in unison\n", "- synonym/keyword expansion - use the LLM to generate synonyms and keywords from the query\n", "- vector retrieval - use embeddings to find nodes in the graph\n", "\n", "Once nodes are found, you can either\n", "- return the paths adjacent to the selected nodes (i.e. triples)\n", "- return the paths plus the original source text of the chunk (if available)\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Interleaf -> Was -> On the way down\n", "Viaweb -> Had -> Code editor\n", "Interleaf -> Built -> Impressive technology\n", "Interleaf -> Added -> Scripting language\n", "Interleaf -> Made -> Scripting language\n", "Viaweb -> Suggested -> Take to hospital\n", "Interleaf -> Had done -> Something bold\n", "Viaweb -> Called -> After\n", "Interleaf -> Made -> Dialect of lisp\n", "Interleaf -> Got crushed by -> Moore's law\n", "Dan giffin -> Worked for -> Viaweb\n", "Interleaf -> Had -> Smart people\n", "Interleaf -> Had -> Few years to live\n", "Interleaf -> Made -> Software\n", "Interleaf -> Made -> Software for creating documents\n", "Paul graham -> Started -> Viaweb\n", "Scripting language -> Was -> Dialect of lisp\n", "Scripting language -> Is -> Dialect of lisp\n", "Software -> Will be affected by -> Rapid 
change\n", "Code editor -> Was -> In viaweb\n", "Software -> Worked via -> Web\n", "Programs -> Typed on -> Punch cards\n", "Computers -> Skipped -> Step\n", "Idea -> Was clear from -> Experience\n", "Apartment -> Wasn't -> Rent-controlled\n"]}], "source": ["# Build a retriever that returns only graph paths, without source text\n", "# (include_text defaults to True)\n", "retriever = index.as_retriever(\n", "    include_text=False,\n", ")\n", "\n", "# Retrieve the paths relevant to the question\n", "nodes = retriever.retrieve(\"What happened at Interleaf and Viaweb?\")\n", "\n", "# Print the text of each retrieved result\n", "for node in nodes:\n", "    print(node.text)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Interleaf had smart people and built impressive technology, including adding a scripting language that was a dialect of Lisp. However, despite their efforts, they were eventually impacted by Moore's Law and faced challenges. Viaweb, on the other hand, was started by Paul Graham and had a code editor where users could define their own page styles using Lisp expressions. 
Viaweb also suggested taking someone to the hospital and called something \"After.\"\n"]}], "source": ["query_engine = index.as_query_engine(\n", "    include_text=True,\n", ")\n", "\n", "response = query_engine.query(\"What happened at Interleaf and Viaweb?\")\n", "\n", "print(str(response))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["For full details on customizing retrieval and querying, see [the docs page](../../module_guides/indexing/lpg_index_guide.md#retrieval-and-querying).\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Storage\n", "\n", "By default, storage uses our simple in-memory abstractions - `SimpleVectorStore` for embeddings and `SimplePropertyGraphStore` for the property graph.\n", "\n", "We can save these to disk and load them back.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Persist the index's stores to disk\n", "index.storage_context.persist(persist_dir=\"./storage\")\n", "\n", "from llama_index.core import StorageContext, load_index_from_storage\n", "\n", "# Reload the index from the persisted storage context\n", "index = load_index_from_storage(\n", "    StorageContext.from_defaults(persist_dir=\"./storage\")\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Vector Stores\n", "\n", "While some graph databases support vectors (like Neo4j), you can still specify a vector store to use on top of your graph, either because your graph store does not support vectors or because you want to override the default.\n", "\n", "Below we combine `ChromaVectorStore` with the default `SimplePropertyGraphStore`.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["%pip install llama-index-vector-stores-chroma"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core.graph_stores import SimplePropertyGraphStore\n", "from llama_index.vector_stores.chroma import ChromaVectorStore\n", "import chromadb\n", "\n", "# Local persistent Chroma client and the collection that holds the embeddings\n", "client = chromadb.PersistentClient(\"./chroma_db\")\n", "collection = client.get_or_create_collection(\"my_graph_vector_db\")\n", "\n", "index = PropertyGraphIndex.from_documents(\n", "    documents,\n", "    embed_model=OpenAIEmbedding(model_name=\"text-embedding-3-small\"),\n", "    graph_store=SimplePropertyGraphStore(),\n", "    vector_store=ChromaVectorStore(collection=collection),\n", "    
show_progress=True,\n", ")\n", "\n", "index.storage_context.persist(persist_dir=\"./storage\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Then to load:\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["index = PropertyGraphIndex.from_existing(\n", "    SimplePropertyGraphStore.from_persist_dir(\"./storage\"),\n", "    vector_store=ChromaVectorStore(collection=collection),\n", "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.3),\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This looks slightly different from purely using the storage context, but the syntax is more concise now that we are mixing different stores together.\n"]}], "metadata": {"kernelspec": {"display_name": "llama-index-bXUwlEfH-py3.11", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 2}