<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/OpensearchDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Opensearch Vector Store

Elasticsearch only supports Lucene indices, so this vector store only supports Opensearch.


**Note on setup**: We set up a local Opensearch instance by following this documentation: https://opensearch.org/docs/1.0/

If you run into SSL issues, try the following `docker run` command instead:
```
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:1.0.1
```

Reference: https://github.com/opensearch-project/OpenSearch/issues/1598


Download Data


In [None]:
%pip install llama-index-readers-elasticsearch
%pip install llama-index-vector-stores-opensearch
%pip install llama-index-embeddings-ollama

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [None]:
from os import getenv
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.opensearch import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)
from llama_index.core import VectorStoreIndex, StorageContext

# http endpoint for your cluster (opensearch is required for vector index usage)
endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index used to demonstrate the VectorStore implementation
idx = getenv("OPENSEARCH_INDEX", "gpt-index-demo")
# load some sample data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()



In [None]:
# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"
# OpensearchVectorClient encapsulates the logic for a single
# opensearch index with vector search enabled.
# 1536 is the embedding dimension; it must match the embedding model in use.
client = OpensearchVectorClient(
    endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field
)
# initialize the vector store
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents, storage_context=storage_context
)

In [None]:
# run a query
query_engine = index.as_query_engine()
res = query_engine.query("What did the author do growing up?")
res.response

INFO:root:> [query] Total LLM token usage: 29628 tokens
INFO:root:> [query] Total embedding token usage: 8 tokens


'\n\nThe author grew up writing short stories, programming on an IBM 1401, and building a computer kit from Heathkit. They also wrote programs for a TRS-80, such as games, a program to predict model rocket flight, and a word processor. After years of nagging, they convinced their father to buy a TRS-80, and they wrote simple games, a program to predict how high their model rockets would fly, and a word processor that their father used to write at least one book. In college, they studied philosophy and AI, and wrote a book about Lisp hacking. They also took art classes and applied to art schools, and experimented with computer graphics and animation, exploring the use of algorithms to create art. Additionally, they experimented with machine learning algorithms, such as using neural networks to generate art, and exploring the use of numerical values to create art. They also took classes in fundamental subjects like drawing, color, and design, and applied to two art schools, RISD in the U

The OpenSearch vector store supports [filter-context queries](https://opensearch.org/docs/latest/query-dsl/query-filter-context/).


In [None]:
from llama_index.core import Document
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
import regex as re

In [None]:
# Split the text into paragraphs.
text_chunks = documents[0].text.split("\n\n")

# Create a document for each footnote
footnotes = [
    Document(
        text=chunk,
        id=documents[0].doc_id,
        metadata={"is_footnote": bool(re.search(r"^\s*\[\d+\]\s*", chunk))},
    )
    for chunk in text_chunks
    if bool(re.search(r"^\s*\[\d+\]\s*", chunk))
]
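
To make the footnote detection easier to follow, here is a minimal standalone sketch of the same pattern. It uses the standard-library `re` module rather than the third-party `regex` package imported above (the two should behave the same for this pattern), and the sample text and the `is_footnote` helper are made up for illustration; they are not part of the notebook or the essay.

```python
import re

# Same pattern as in the cell above: optional leading whitespace,
# followed by a bracketed number such as "[1]" or "[12]".
FOOTNOTE_RE = re.compile(r"^\s*\[\d+\]\s*")


def is_footnote(paragraph: str) -> bool:
    """Return True when the paragraph starts with a marker like "[3]"."""
    return bool(FOOTNOTE_RE.search(paragraph))


# Made-up sample text standing in for the essay.
sample_text = (
    "Body paragraph citing a note [1].\n\n"
    "[1] First note.\n\n"
    "  [12] Indented note.\n\n"
    "[a] Not numeric."
)

paragraphs = sample_text.split("\n\n")
footnote_paragraphs = [p for p in paragraphs if is_footnote(p)]
print(footnote_paragraphs)
```

Because the pattern is anchored with `^` and no multiline flag is set, a marker that appears in the middle of a body paragraph does not count; only paragraphs that begin with a numbered marker are kept.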

In [None]:
# Insert the footnotes into the index
for f in footnotes:
    index.insert(f)

In [None]:
# Create a query engine that only searches certain footnotes.
footnote_query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(
                key="term", value='{"metadata.is_footnote": "true"}'
            ),
            ExactMatchFilter(
                key="query_string",
                value='{"query": "content: space AND content: lisp"}',
            ),
        ]
    )
)

res = footnote_query_engine.query(
    "What did the author say about space aliens and lisp?"
)
res.response

"The author believes that any sufficiently advanced alien civilization would know about the Pythagorean theorem and possibly also about Lisp in McCarthy's 1960 paper."

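The filters above follow a convention that is easy to miss: `key` names an OpenSearch query type (`term`, `query_string`) and `value` is a JSON string holding that query's body. The sketch below illustrates the mapping in plain Python. It is inferred from the query body printed in the hybrid section of this notebook, where a `term` filter appears as `{'term': {...}}` inside a `filter` list. The `to_filter_clause` helper is hypothetical and is not the library's actual code.

```python
import json


def to_filter_clause(key: str, value: str) -> dict:
    """Hypothetical helper: `key` is the query type, `value` its JSON-encoded body."""
    return {key: json.loads(value)}


clauses = [
    to_filter_clause("term", '{"metadata.is_footnote": "true"}'),
    to_filter_clause(
        "query_string", '{"query": "content: space AND content: lisp"}'
    ),
]

# Clauses in filter context restrict which documents match
# without contributing to the relevance score.
bool_query = {"bool": {"filter": clauses}}
print(json.dumps(bool_query, indent=2))
```

Since `value` must parse as JSON, a malformed string would only surface as an error at query time, so it can help to build the value with `json.dumps` instead of writing it by hand.
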
## Use the reader to check what VectorStoreIndex just created in our index

The reader works with Elasticsearch too, because it only uses basic search features.


In [None]:
# create a reader to inspect the index used in the previous section
from llama_index.readers.elasticsearch import ElasticsearchReader

rdr = ElasticsearchReader(endpoint, idx)
# optionally set embedding_field to read embedding data from the elasticsearch index
docs = rdr.load_data(text_field, embedding_field=embedding_field)
# the docs have embeddings in them
print("embedding dimension:", len(docs[0].embedding))
# the full document is stored in metadata
print("all fields in index:", docs[0].metadata.keys())

embedding dimension: 1536
all fields in index: dict_keys(['content', 'embedding'])


In [None]:
# we can check how the text was chunked by the `GPTOpensearchIndex`
print("total number of chunks:", len(docs))

total number of chunks: 10


In [None]:
# search the index using the standard elasticsearch query DSL
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Lisp"}}})
print("chunks that mention Lisp:", len(docs))
docs = rdr.load_data(text_field, {"query": {"match": {text_field: "Yahoo"}}})
print("chunks that mention Yahoo:", len(docs))

chunks that mention Lisp: 10
chunks that mention Yahoo: 8
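
The two searches above each use a single `match` clause. As a small sketch of how the same query DSL composes, the helper below builds a `bool` query whose `must` list requires every term to match. The `match_all_terms` function is a hypothetical convenience written for this example; passing its result as the second argument to `rdr.load_data` is assumed to work the same way as the hand-written queries above.

```python
import json


def match_all_terms(field: str, terms: list) -> dict:
    """Build a query body that requires every term to match in `field`."""
    return {
        "query": {
            "bool": {"must": [{"match": {field: term}} for term in terms]}
        }
    }


query = match_all_terms("content", ["Lisp", "Yahoo"])
print(json.dumps(query, indent=2))

# Possible usage with the reader from the cells above (not executed here):
# docs = rdr.load_data(text_field, query)
```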


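## Hybrid query for the Opensearch vector store
Hybrid queries have been supported since OpenSearch 2.10. A hybrid query combines vector search and text search. It is useful when you want to match specific text and also take vector similarity into account when ranking the results. You can find more details here: https://opensearch.org/docs/latest/query-dsl/compound/hybrid/.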
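
To build some intuition for what "combining" text and vector results means, here is a conceptual sketch in plain Python. It assumes min-max normalization followed by a weighted arithmetic mean, which is one of the combinations OpenSearch's normalization processor offers; the scores, document names, and weights are invented, and the real computation happens server-side and may differ in detail.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to the [0, 1] range so different scorers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(text_scores: dict, vector_scores: dict, weights=(0.5, 0.5)) -> dict:
    """Weighted arithmetic mean of the normalized text and vector scores."""
    t, v = min_max(text_scores), min_max(vector_scores)
    docs = set(t) | set(v)
    return {
        doc: weights[0] * t.get(doc, 0.0) + weights[1] * v.get(doc, 0.0)
        for doc in docs
    }


# Invented raw scores on deliberately different scales.
text_scores = {"doc_a": 4.0, "doc_b": 2.0, "doc_c": 0.0}
vector_scores = {"doc_a": 1.0, "doc_b": 3.0, "doc_c": 2.0}

combined = combine(text_scores, vector_scores)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Here `doc_a` wins on text alone and `doc_b` wins on vectors alone; after normalization and averaging, `doc_b` ranks first because it does reasonably well on both.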


### Initialize an OpenSearch client and vector store supporting hybrid query with search pipeline details


In [None]:
from os import getenv
from llama_index.vector_stores.opensearch import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)

# http endpoint for your cluster (opensearch is required for vector index usage)
endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index used to demonstrate the VectorStore implementation
idx = getenv("OPENSEARCH_INDEX", "auto_retriever_movies")

# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"
# OpensearchVectorClient encapsulates the logic for a single opensearch index
# with vector search enabled, here with a hybrid search pipeline
client = OpensearchVectorClient(
    endpoint,
    idx,
    4096,
    embedding_field=embedding_field,
    text_field=text_field,
    search_pipeline="hybrid-search-pipeline",
)

from llama_index.embeddings.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="llama2")

# initialize the vector store
vector_store = OpensearchVectorStore(client)
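
The client above refers to a search pipeline named `hybrid-search-pipeline`, but the notebook does not show how that pipeline is defined. The sketch below shows one plausible definition using OpenSearch's normalization processor, sent with a `PUT` to `/_search/pipeline/<name>`. The weights, the description text, and the `build_hybrid_pipeline` and `put_pipeline` helpers are assumptions made for illustration; check the body against the OpenSearch documentation for the version you run.

```python
import json
import urllib.request


def build_hybrid_pipeline(text_weight: float = 0.3, vector_weight: float = 0.7) -> dict:
    """Return a search pipeline body that normalizes and combines sub-query scores."""
    return {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Illustrative weights, one per sub-query, in the same
                        # order as the sub-queries of the hybrid query.
                        "parameters": {"weights": [text_weight, vector_weight]},
                    },
                }
            }
        ],
    }


def put_pipeline(endpoint: str, name: str, body: dict) -> dict:
    """Create or replace the search pipeline on a running cluster."""
    req = urllib.request.Request(
        f"{endpoint}/_search/pipeline/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


pipeline_body = build_hybrid_pipeline()
print(json.dumps(pipeline_body, indent=2))

# Requires a reachable cluster, so it is left commented out:
# put_pipeline("http://localhost:9200", "hybrid-search-pipeline", pipeline_body)
```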

### Prepare the index


In [None]:
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex, StorageContext


storage_context = StorageContext.from_defaults(vector_store=vector_store)

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
        },
    ),
]

index = VectorStoreIndex(
    nodes, storage_context=storage_context, embed_model=embed_model
)

LLM is explicitly disabled. Using MockLLM.


### Search the index with a hybrid query by specifying the vector store query mode VectorStoreQueryMode.HYBRID, with filters


In [None]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQueryMode

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(
            key="term", value='{"metadata.theme.keyword": "Mafia"}'
        )
    ]
)

retriever = index.as_retriever(
    filters=filters, vector_store_query_mode=VectorStoreQueryMode.HYBRID
)

result = retriever.retrieve("What is inception about?")

print(result)

query_strWhat is inception about?
query_modehybrid
{'size': 2, 'query': {'hybrid': {'queries': [{'bool': {'must': {'match': {'content': {'query': 'What is inception about?'}}}, 'filter': [{'term': {'metadata.theme.keyword': 'Mafia'}}]}}, {'script_score': {'query': {'bool': {'filter': [{'term': {'metadata.theme.keyword': 'Mafia'}}]}}, 'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))", 'params': {'field': 'embedding', 'query_value': [0.41321834921836853, 0.18020285665988922, 2.5630273818969727, 1.490068793296814, -2.2188172340393066, 0.3613924980163574, 0.036182258278131485, 1.3815258741378784, -0.4603463411331177, 0.9783738851547241, 0.3667166233062744, -0.30677080154418945, -1.2893489599227905, -1.19036865234375, -1.4050743579864502, -2.200796365737915, 0.05992934852838516, 0.30156904458999634, 0.6115846633911133, -0.028691552579402924, 0.5112416744232178, -2.069373846054077, 0.6121743321418762, -0.05102552846074104, 1.8506423234939575, -1.293755292892456