{"cells": [{"attachments": {}, "cell_type": "markdown", "id": "6d2b5335", "metadata": {}, "source": ["\"在\n"]}, {"cell_type": "markdown", "id": "a0b5c122-d577-4045-b980-cab2eae2aa0c", "metadata": {}, "source": ["# 从零开始构建高级融合检索器\n", "\n", "在本教程中,我们将向您展示如何从零开始构建一个高级检索器。\n", "\n", "具体来说,我们将向您展示如何从头开始构建我们的`QueryFusionRetriever`。\n", "\n", "这在很大程度上受到了RAG-fusion仓库的启发,网址为:https://github.com/Raudaschl/rag-fusion。\n"]}, {"cell_type": "markdown", "id": "0d82203e-1aa0-4d85-8a0f-3854dfa81494", "metadata": {}, "source": ["## 设置\n", "\n", "我们加载文档并构建一个简单的向量索引。\n"]}, {"cell_type": "code", "execution_count": null, "id": "9bafe694", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-readers-file pymupdf\n", "%pip install llama-index-llms-openai\n", "%pip install llama-index-retrievers-bm25"]}, {"cell_type": "code", "execution_count": null, "id": "c79e8b40-c963-46ee-9601-6c31e5901568", "metadata": {}, "outputs": [], "source": ["import nest_asyncio\n", "\n", "nest_asyncio.apply()"]}, {"cell_type": "markdown", "id": "41d6148f-3185-4a32-973c-316f23e45804", "metadata": {}, "source": ["#### 加载文档\n"]}, {"cell_type": "code", "execution_count": null, "id": "c054a492-56c9-4dae-bede-06739858ba57", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2024-04-03 09:32:31-- https://arxiv.org/pdf/2307.09288.pdf\n", "Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.131.42, 151.101.67.42, ...\n", "Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 13661300 (13M) [application/pdf]\n", "Saving to: ‘data/llama2.pdf’\n", "\n", "data/llama2.pdf 100%[===================>] 13.03M 7.44MB/s in 1.8s \n", "\n", "2024-04-03 09:32:33 (7.44 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]\n", "\n"]}], "source": ["!mkdir data\n", "!wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""]}, {"attachments": {}, "cell_type": "markdown", "id": "1126a6d3", "metadata": {}, "source": ["如果您在Colab上打开此笔记本,您可能需要安装LlamaIndex 🦙。\n"]}, {"cell_type": "code", "execution_count": null, "id": "0f03cf99", "metadata": {}, "outputs": [], "source": ["!pip install llama-index"]}, {"cell_type": "code", "execution_count": null, "id": "b3b7ec9e-30cf-49ba-9b3b-9beb9a2b6758", "metadata": {}, "outputs": [], "source": ["from pathlib import Path\n", "from llama_index.readers.file import PyMuPDFReader\n", "\n", "loader = PyMuPDFReader()\n", "documents = loader.load(file_path=\"./data/llama2.pdf\")"]}, {"cell_type": "markdown", "id": "46ea385d", "metadata": {}, "source": ["```python\n", "# 设置模型\n", "```\n", "\n", "这里是设置模型的部分。\n"]}, {"cell_type": "code", "execution_count": null, "id": "0bc3bd76", "metadata": {}, "outputs": [], "source": ["import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""]}, {"cell_type": "code", "execution_count": null, "id": "75c07062", "metadata": {}, "outputs": [], "source": ["from llama_index.llms.openai import OpenAI\n", "from llama_index.embeddings.openai import OpenAIEmbedding\n", "\n", "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n", "embed_model = OpenAIEmbedding(\n", " model=\"text-embedding-3-small\", embed_batch_size=256\n", ")"]}, {"cell_type": "markdown", "id": "f59a6cd1-802a-4e69-afd1-de6faf4b064b", "metadata": {}, "source": ["#### 加载到向量存储中\n"]}, {"cell_type": "code", "execution_count": null, "id": "b2423b4b-5c3b-4d36-b338-9bf74f0e6a82", "metadata": {}, "outputs": [], "source": ["from llama_index.core import VectorStoreIndex\n", "from 
{"cell_type": "markdown", "id": "cbc7a9bf-0ef7-45a5-bc76-3c37a8a09c88", "metadata": {}, "source": ["## Define Advanced Retriever\n", "\n", "We define an advanced retriever that performs the following steps:\n", "1. Query generation/rewriting: generate multiple queries given the original user query.\n", "2. Perform retrieval for each query over an ensemble of retrievers.\n", "3. Reranking/fusion: fuse the results from all queries, and apply a reranking step over the top of the \"fused\" results!\n", "\n", "Then in the next section we'll plug this into our response synthesis module.\n"]}, {"cell_type": "markdown", "id": "3586a793-3c5a-4d7c-b401-cd6fb71f87a1", "metadata": {}, "source": ["### Step 1: Query Generation/Rewriting\n", "\n", "The first step is to generate queries from the original query, to better match the query intent and increase the precision/recall of the retrieved results. For instance, we might rewrite the query into smaller queries.\n", "\n", "We can do this by prompting ChatGPT.\n"]}, {"cell_type": "code", "execution_count": null, "id": "0a5183a0-58ce-4cc5-a74b-8428dfe12bb5", "metadata": {}, "outputs": [], "source": ["from llama_index.core import PromptTemplate"]}, {"cell_type": "code", "execution_count": null, "id": "9e745f03-5c06-43b5-ad9d-65c5b8150ba7", "metadata": {}, "outputs": [], "source": ["query_str = \"How do the models developed in this work compare to open-source chat models based on the benchmarks tested?\""]}, {"cell_type": "code", "execution_count": null, "id": "0a17441d-1f14-4de4-a4b3-55177d0a2dee", "metadata": {}, "outputs": [], "source": ["query_gen_prompt_str = (\n", "    \"You are a helpful assistant that generates multiple search queries based on a \"\n", "    \"single input query. Generate {num_queries} search queries, one on each line, \"\n", "    \"related to the following input query:\\n\"\n", "    \"Query: {query}\\n\"\n", "    \"Queries:\\n\"\n", ")\n", "query_gen_prompt = PromptTemplate(query_gen_prompt_str)"]}, {"cell_type": "code", "execution_count": null, "id": "5c3a7b04-c4fb-456c-8584-5ea93bdc7bf0", "metadata": {}, "outputs": [], "source": ["def generate_queries(llm, query_str: str, num_queries: int = 4):\n", "    # The prompt asks for `num_queries - 1` queries so that, together with\n", "    # the original query, we end up with `num_queries` queries in total.\n", "    fmt_prompt = query_gen_prompt.format(\n", "        num_queries=num_queries - 1, query=query_str\n", "    )\n", "    response = llm.complete(fmt_prompt)\n", "    queries = response.text.split(\"\\n\")\n", "    return queries"]}, {"cell_type": "code", "execution_count": null, "id": "2a577a95-2b58-424d-aa7d-bed9aa9ceb98", "metadata": {}, "outputs": [], "source": ["queries = generate_queries(llm, query_str, num_queries=4)"]}, {"cell_type": "code", "execution_count": null, "id": "73fe53d2-c556-44ed-a255-374ae8eca494", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["['1. Comparison of models developed in this work to open-source chat models in benchmark testing', '2. Performance evaluation of models developed in this work versus open-source chat models on tested benchmarks', '3. Analysis of differences between models developed in this work and open-source chat models in benchmark assessments']\n"]}], "source": ["print(queries)"]}, 
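{"cell_type": "markdown", "id": "f2b3c4d5", "metadata": {}, "source": ["Note that the LLM output is split on newlines as-is, so the returned list keeps its `1.`/`2.` numbering and may contain blank lines. If you want clean query strings, a small post-processing step helps. The `clean_queries` helper below is our own illustrative sketch, not part of the original tutorial.\n"]}, {"cell_type": "code", "execution_count": null, "id": "f2b3c4d6", "metadata": {}, "outputs": [], "source": ["import re\n", "\n", "\n", "def clean_queries(raw_queries):\n", "    \"\"\"Illustrative helper (our own addition): strip leading enumeration\n", "    like '1. ' and drop empty lines from the generated queries.\"\"\"\n", "    cleaned = []\n", "    for query in raw_queries:\n", "        query = re.sub(r\"^\\s*\\d+\\.\\s*\", \"\", query).strip()\n", "        if query:\n", "            cleaned.append(query)\n", "    return cleaned\n", "\n", "\n", "print(clean_queries(queries))"]}, 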
{"cell_type": "markdown", "id": "a9876c65-1a90-4606-9104-5df10009d935", "metadata": {}, "source": ["### Step 2: Perform Vector Search for Each Query\n", "\n", "Now we run retrieval for each query. This means that we fetch the top-k most relevant results from each vector store.\n", "\n", "**NOTE**: We can also have multiple retrievers. In that case, the total number of queries we run is N*M, where N is the number of retrievers and M is the number of generated queries, so there will also be N*M retrieved lists.\n", "\n", "Here we'll use the retriever provided by our vector store. If you want to learn how to build this from scratch, see [our tutorial on this](https://docs.llamaindex.ai/en/latest/examples/low_level/retrieval.html#put-this-into-a-retriever).\n"]}, {"cell_type": "code", "execution_count": null, "id": "53114651-0b57-4cbc-a07f-d906c5820cb7", "metadata": {}, "outputs": [], "source": ["from tqdm.asyncio import tqdm\n", "\n", "\n", "async def run_queries(queries, retrievers):\n", "    \"\"\"Run each query against every retriever asynchronously.\"\"\"\n", "    tasks = []\n", "    task_keys = []\n", "    for query in queries:\n", "        for i, retriever in enumerate(retrievers):\n", "            tasks.append(retriever.aretrieve(query))\n", "            task_keys.append((query, i))\n", "\n", "    task_results = await tqdm.gather(*tasks)\n", "\n", "    # Key each result list by (query, retriever index) so that results\n", "    # from different retrievers for the same query stay distinct.\n", "    results_dict = dict(zip(task_keys, task_results))\n", "\n", "    return results_dict"]}, {"cell_type": "code", "execution_count": null, "id": "d046d284-ab9e-4242-b91a-8d06c472dfaf", "metadata": {}, "outputs": [], "source": ["# get retrievers\n", "from llama_index.retrievers.bm25 import BM25Retriever\n", "\n", "\n", "## vector retriever\n", "vector_retriever = index.as_retriever(similarity_top_k=2)\n", "\n", "## bm25 retriever\n", "bm25_retriever = BM25Retriever.from_defaults(\n", "    docstore=index.docstore, similarity_top_k=2\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "6a0ddc59-1602-4078-b3e9-dadb852709fc", "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["  0%|          | 0/6 [00:00<?, ?it/s]"]}], "source": ["results_dict = await run_queries(\n", "    queries, [vector_retriever, bm25_retriever]\n", ")"]}, {"cell_type": "markdown", "id": "8e2b9f31", "metadata": {}, "source": ["### Step 3: Perform Fusion\n", "\n", "The next step here is to perform fusion: combining the results from several retrievers into one, and re-ranking.\n", "\n", "Note that a given node might be retrieved by multiple retrievers, so there needs to be a way to de-dup and rerank the nodes across the multiple retrieved lists.\n", "\n", "We'll show you how to perform \"reciprocal rank fusion\": for each node, add up its reciprocal rank in every list where it's retrieved, then reorder the nodes from highest fused score to lowest.\n", "\n", "Full paper here: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf\n"]}, {"cell_type": "code", "execution_count": null, "id": "9c4d7e52", "metadata": {}, "outputs": [], "source": ["from typing import List\n", "from llama_index.core.schema import NodeWithScore\n", "\n", "\n", "def fuse_results(results_dict, similarity_top_k: int = 2):\n", "    \"\"\"Fuse results.\"\"\"\n", "    k = 60.0  # `k` is a parameter that dampens the impact of outlier rankings\n", "    fused_scores = {}\n", "    text_to_node = {}\n", "\n", "    # compute reciprocal rank scores\n", "    for nodes_with_scores in results_dict.values():\n", "        for rank, node_with_score in enumerate(\n", "            sorted(\n", "                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True\n", "            )\n", "        ):\n", "            text = node_with_score.node.get_content()\n", "            text_to_node[text] = node_with_score\n", "            if text not in fused_scores:\n", "                fused_scores[text] = 0.0\n", "            fused_scores[text] += 1.0 / (rank + k)\n", "\n", "    # sort results by fused score, descending\n", "    reranked_results = dict(\n", "        sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)\n", "    )\n", "\n", "    # adjust node scores to the fused scores\n", "    reranked_nodes: List[NodeWithScore] = []\n", "    for text, score in reranked_results.items():\n", "        reranked_nodes.append(text_to_node[text])\n", "        reranked_nodes[-1].score = score\n", "\n", "    return reranked_nodes[:similarity_top_k]"]}, {"cell_type": "code", "execution_count": null, "id": "b1f6a8d3", "metadata": {}, "outputs": [], "source": ["final_results = fuse_results(results_dict)"]}, {"cell_type": "markdown", "id": "d5e8c2f4", "metadata": {}, "source": ["## Plug into RetrieverQueryEngine\n", "\n", "Now we're ready to define this as a custom retriever subclass, which we can plug into our `RetrieverQueryEngine` (which performs retrieval and synthesis).\n"]}, {"cell_type": "code", "execution_count": null, "id": "f7a3b6c9", "metadata": {}, "outputs": [], "source": ["from llama_index.core import QueryBundle\n", "from llama_index.core.retrievers import BaseRetriever\n", "from llama_index.core.schema import NodeWithScore\n", "from typing import List\n", "import asyncio\n", "\n", "\n", "class FusionRetriever(BaseRetriever):\n", "    \"\"\"Ensemble retriever with fusion.\"\"\"\n", "\n", "    def __init__(\n", "        self,\n", "        llm,\n", "        retrievers: List[BaseRetriever],\n", "        similarity_top_k: int = 2,\n", "    ) -> None:\n", "        \"\"\"Init params.\"\"\"\n", "        self._retrievers = retrievers\n", "        self._similarity_top_k = similarity_top_k\n", "        self._llm = llm\n", "        super().__init__()\n", "\n", "    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:\n", "        \"\"\"Retrieve.\"\"\"\n", "        queries = generate_queries(\n", "            self._llm, query_bundle.query_str, num_queries=4\n", "        )\n", "        results = asyncio.run(run_queries(queries, self._retrievers))\n", "        final_results = fuse_results(\n", "            results, similarity_top_k=self._similarity_top_k\n", "        )\n", "\n", "        return final_results"]}, {"cell_type": "code", "execution_count": null, "id": "b2b641b1-e64e-4ddf-9cc5-88ff5c57b70e", "metadata": {}, "outputs": [], "source": ["from llama_index.core.query_engine import RetrieverQueryEngine\n", "\n", "fusion_retriever = FusionRetriever(\n", "    llm, [vector_retriever, bm25_retriever], similarity_top_k=2\n", ")\n", "\n", "query_engine = RetrieverQueryEngine(fusion_retriever)"]}, {"cell_type": "code", "execution_count": null, "id": "c0d81b5b-39f2-42da-92b9-ce6113fa43d9", "metadata": {}, "outputs": [], "source": ["response = query_engine.query(query_str)"]}, {"cell_type": "code", "execution_count": null, "id": "93daeaaa-ce68-465a-b246-287714b4b370", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The models developed in this work, specifically the Llama 2-Chat models, outperform open-source chat models on most benchmarks that were tested.\n"]}], "source": ["print(str(response))"]}
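, {"cell_type": "markdown", "id": "c6d7e8f9", "metadata": {}, "source": ["One caveat: `_retrieve` above calls `asyncio.run()` from inside the notebook's already-running event loop, which only works because of the `nest_asyncio.apply()` call at the top. As a sketch of how to avoid that workaround, the `AsyncFusionRetriever` subclass below (our own illustrative addition, not part of the original tutorial) overrides the async retrieval path instead.\n"]}, {"cell_type": "code", "execution_count": null, "id": "c6d7e8fa", "metadata": {}, "outputs": [], "source": ["# Our own illustrative sketch (not part of the original tutorial):\n", "# an async-native variant that awaits run_queries() directly instead of\n", "# calling asyncio.run() inside a running event loop.\n", "from typing import List\n", "\n", "from llama_index.core import QueryBundle\n", "from llama_index.core.schema import NodeWithScore\n", "\n", "\n", "class AsyncFusionRetriever(FusionRetriever):\n", "    \"\"\"FusionRetriever with a native async retrieval path.\"\"\"\n", "\n", "    async def _aretrieve(\n", "        self, query_bundle: QueryBundle\n", "    ) -> List[NodeWithScore]:\n", "        \"\"\"Generate queries, run them concurrently, and fuse the results.\"\"\"\n", "        queries = generate_queries(\n", "            self._llm, query_bundle.query_str, num_queries=4\n", "        )\n", "        results = await run_queries(queries, self._retrievers)\n", "        return fuse_results(results, similarity_top_k=self._similarity_top_k)\n", "\n", "\n", "# Usage sketch:\n", "# async_retriever = AsyncFusionRetriever(\n", "#     llm, [vector_retriever, bm25_retriever], similarity_top_k=2\n", "# )\n", "# nodes = await async_retriever.aretrieve(query_str)"]}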
"nbformat": 4, "nbformat_minor": 5}