{"cells": [{"cell_type": "markdown", "id": "0d6aba5a-abde-4e41-8a54-ef7e0de9e8a3", "metadata": {}, "source": ["# 在 Pandas 数据框上构建查询管道\n", "\n", "这是一个简单的示例，构建了一个查询管道，可以对 Pandas 数据框执行结构化操作，以满足用户查询，使用LLMs来推断操作集。\n", "\n", "这可以被视为我们的 `PandasQueryEngine` 的“从头开始”版本。\n", "\n", "警告：此工具提供LLM访问`eval`函数。\n", "在运行此工具的计算机上可能会发生任意代码执行。\n", "不建议在生产环境中使用此工具，并且需要进行严格的沙盒化或虚拟机化。\n"]}, {"cell_type": "code", "execution_count": null, "id": "beea049c", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-llms-openai llama-index-experimental"]}, {"cell_type": "code", "execution_count": null, "id": "e16e7f46-6a3f-4390-b149-51d49a255d54", "metadata": {}, "outputs": [], "source": ["from llama_index.core.query_pipeline import (\n", " QueryPipeline as QP,\n", " Link,\n", " InputComponent,\n", ")\n", "from llama_index.experimental.query_engine.pandas import (\n", " PandasInstructionParser,\n", ")\n", "from llama_index.llms.openai import OpenAI\n", "from llama_index.core import PromptTemplate"]}, {"cell_type": "markdown", "id": "844f1db7-8eaa-4fb2-91f4-a363db66bd5b", "metadata": {}, "source": ["## 下载数据\n", "\n", "这里我们加载泰坦尼克号的CSV数据集。\n"]}, {"cell_type": "code", "execution_count": null, "id": "31c3d6f4-ec01-4992-8dbf-696b6feeda5f", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2024-01-13 18:39:07-- https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 57726 (56K) [text/plain]\n", "Saving to: ‘titanic_train.csv’\n", "\n", "titanic_train.csv 100%[===================>] 56.37K --.-KB/s in 0.007s \n", "\n", "2024-01-13 18:39:07 (7.93 MB/s) - ‘titanic_train.csv’ saved [57726/57726]\n", "\n"]}], "source": ["!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'"]}, {"cell_type": "code", "execution_count": null, "id": "b2f94d45-607c-4f04-a05c-b66ecedfe233", "metadata": {}, "outputs": [], "source": ["import pandas as pd\n", "\n", "df = pd.read_csv(\"./titanic_train.csv\")"]}, {"cell_type": "markdown", "id": "b42ccfe1-fd13-4356-b5b1-c9eabac5f391", "metadata": {}, "source": ["## 定义模块\n", "\n", "在这里，我们定义了一组模块：\n", "1. Pandas提示，用于从用户查询中推断pandas指令\n", "2. Pandas输出解析器，用于在数据框上执行pandas指令，并获取数据框\n", "3. 响应合成提示，用于根据数据框合成最终响应\n", "4. LLM\n", "\n", "特别是，pandas输出解析器专门设计用于安全执行Python代码。它包括许多安全检查，可能会让人从头开始编写时感到烦人。这包括仅从一组批准的模块中导入（例如，不允许会更改文件系统的模块，如`os`），还要确保不调用私有/双下划线方法。\n"]}, {"cell_type": "code", "execution_count": null, "id": "d47ac723-e1fc-47c2-9736-19deb50d87ae", "metadata": {}, "outputs": [], "source": ["instruction_str = (\n", " \"1. Convert the query to executable Python code using Pandas.\\n\"\n", " \"2. The final line of code should be a Python expression that can be called with the `eval()` function.\\n\"\n", " \"3. The code should represent a solution to the query.\\n\"\n", " \"4. PRINT ONLY THE EXPRESSION.\\n\"\n", " \"5. Do not quote the expression.\\n\"\n", ")\n", "\n", "pandas_prompt_str = (\n", " \"You are working with a pandas dataframe in Python.\\n\"\n", " \"The name of the dataframe is `df`.\\n\"\n", " \"This is the result of `print(df.head())`:\\n\"\n", " \"{df_str}\\n\\n\"\n", " \"Follow these instructions:\\n\"\n", " \"{instruction_str}\\n\"\n", " \"Query: {query_str}\\n\\n\"\n", " \"Expression:\"\n", ")\n", "response_synthesis_prompt_str = (\n", " \"Given an input question, synthesize a response from the query results.\\n\"\n", " \"Query: {query_str}\\n\\n\"\n", " \"Pandas Instructions (optional):\\n{pandas_instructions}\\n\\n\"\n", " \"Pandas Output: {pandas_output}\\n\\n\"\n", " \"Response: \"\n", ")\n", "\n", "pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(\n", " instruction_str=instruction_str, df_str=df.head(5)\n", ")\n", "pandas_output_parser = PandasInstructionParser(df)\n", "response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)\n", "llm = OpenAI(model=\"gpt-3.5-turbo\")"]}, {"cell_type": "markdown", "id": "1b7907f6-2926-4c83-9541-d089d0b0af88", "metadata": {}, "source": ["## 构建查询管道\n", "\n", "看起来像这样：\n", "输入查询字符串 -> pandas_prompt -> llm1 -> pandas_output_parser -> response_synthesis_prompt -> llm2\n", "\n", "与response_synthesis_prompt的额外连接：llm1 -> pandas_instructions，以及pandas_output_parser -> pandas_output。\n"]}, {"cell_type": "code", "execution_count": null, "id": "3544a6a8-3866-4806-8108-d908bd5d17e6", "metadata": {}, "outputs": [], "source": ["", "qp = QP(", " modules={", " \"input\": InputComponent(), # 输入组件", " \"pandas_prompt\": pandas_prompt, # pandas提示", " \"llm1\": llm, # llm1", " \"pandas_output_parser\": pandas_output_parser, # pandas输出解析器", " \"response_synthesis_prompt\": response_synthesis_prompt, # 响应合成提示", " \"llm2\": llm, # llm2", " },", " verbose=True, # 详细模式", ")", "qp.add_chain([\"input\", \"pandas_prompt\", \"llm1\", \"pandas_output_parser\"]) # 添加链", "", "qp.add_links( # 添加链接", " [", " Link(\"input\", \"response_synthesis_prompt\", dest_key=\"query_str\"), # 从输入到响应合成提示的链接", " Link(", " \"llm1\", \"response_synthesis_prompt\", dest_key=\"pandas_instructions\" # 从llm1到响应合成提示的链接", " ),", " Link(", " \"pandas_output_parser\",", " \"response_synthesis_prompt\",", " dest_key=\"pandas_output\", # 从pandas输出解析器到响应合成提示的链接", " ),", " ]", ")", "# 添加从响应合成提示到llm2的链接", "qp.add_link(\"response_synthesis_prompt\", \"llm2\")"]}, {"cell_type": "markdown", "id": "dea9cba7-733b-4087-a297-5c2799a8cfe5", "metadata": {}, "source": ["```python\n", "# 运行查询\n", "```\n", "在这个部分，我们将学习如何运行查询。\n"]}, {"cell_type": "code", "execution_count": null, "id": "49237e3f-45f8-4b9b-8e32-0d31555b3539", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\u001b[1;3;38;2;155;135;227m> Running module input with input: \n", "query_str: What is the correlation between survival and age?\n", "\n", "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module pandas_prompt with input: \n", "query_str: What is the correlation between survival and age?\n", "\n", "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module llm1 with input: \n", "messages: You are working with a pandas dataframe in Python.\n", "The name of the dataframe is `df`.\n", "This is the result of `print(df.head())`:\n", " survived pclass name ...\n", "\n", "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module pandas_output_parser with input: \n", "input: assistant: df['survived'].corr(df['age'])\n", "\n", "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module response_synthesis_prompt with input: \n", "query_str: What is the correlation between survival and age?\n", "pandas_instructions: assistant: df['survived'].corr(df['age'])\n", "pandas_output: -0.07722109457217755\n", "\n", "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module llm2 with input: \n", "messages: Given an input question, synthesize a response from the query results.\n", "Query: What is the correlation between survival and age?\n", "\n", "Pandas Instructions (optional):\n", "df['survived'].corr(df['age'])\n", "\n", "Pandas ...\n", "\n", "\u001b[0m"]}], "source": ["response = qp.run(\n", " query_str=\"What is the correlation between survival and age?\",\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "17345c60-65ec-4053-b992-cc1f790f2b55", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The correlation between survival and age is -0.0772. This indicates a weak negative correlation, suggesting that as age increases, the likelihood of survival slightly decreases.\n"]}], "source": ["print(response.message.content)"]}], "metadata": {"kernelspec": {"display_name": "llama_index_v2", "language": "python", "name": "llama_index_v2"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 5}