{"cells": [{"attachments": {}, "cell_type": "markdown", "id": "0d7688a7", "metadata": {}, "source": ["<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/output_parsing/openai_pydantic_program.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"]}, {"cell_type": "markdown", "id": "530c973e-916d-4c9e-9365-e2d5306d7e3d", "metadata": {}, "source": ["# OpenAI Pydantic Program\n"]}, {"cell_type": "markdown", "id": "18461ba1-6978-4b5b-861e-6dceec36857b", "metadata": {}, "source": ["This guide shows you how to generate structured data with the [new OpenAI API](https://openai.com/blog/function-calling-and-other-api-updates) via LlamaIndex. The user only needs to specify a Pydantic object.\n", "\n", "We demonstrate two settings:\n", "- Extraction into an `Album` object (which can contain a list of Song objects)\n", "- Extraction into a `DirectoryTree` object (which can contain recursive Node objects)\n"]}, {"cell_type": "markdown", "id": "8fcefc79-68b4-4481-b1ef-a902fc12e4c8", "metadata": {}, "source": ["## Extraction into `Album`\n", "\n", "This is a simple example of parsing an output into an `Album` schema, which can contain multiple songs.\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "81e5dde0", "metadata": {}, "source": ["If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.\n"]}, {"cell_type": "code", "execution_count": null, "id": "511a8171", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-llms-openai\n", "%pip install llama-index-program-openai"]}, {"cell_type": "code", "execution_count": null, "id": "b2833cea", "metadata": {}, "outputs": [], "source": ["%pip install llama-index"]}, {"cell_type": "code", "execution_count": null, "id": "f7a83b49-5c34-45d5-8cf4-62f348fb1299", "metadata": {}, "outputs": [], "source": ["from pydantic import BaseModel\n", "from typing import List\n", "\n", "from llama_index.program.openai import OpenAIPydanticProgram"]}, {"cell_type": "markdown", "id": "6311f7ae", "metadata": {}, "source": ["### Without docstring in Model\n"]}, {"cell_type": "markdown", "id": "0563f1ba-8086-4dcc-ba35-bfda31c45ae4", "metadata": {}, "source": ["Define the output schema (without docstrings)\n"]}, {"cell_type": "code", 
"execution_count": null, "id": "42053ea8-2580-4639-9dcf-566e8427c44e", "metadata": {}, "outputs": [], "source": ["class Song(BaseModel):\n", "    title: str\n", "    length_seconds: int\n", "\n", "\n", "class Album(BaseModel):\n", "    name: str\n", "    artist: str\n", "    songs: List[Song]"]}, {"cell_type": "markdown", "id": "4afff44e-a746-4b9f-85a9-72058bcdd29f", "metadata": {}, "source": ["OpenAI Pydantic 程序\n"]}, {"cell_type": "code", "execution_count": null, "id": "fe756697-c299-4f9a-a108-944b6693f824", "metadata": {}, "outputs": [], "source": ["prompt_template_str = \"\"\"\\\n", "生成一个示例专辑，包括一个艺术家和一组歌曲。以电影 {movie_name} 为灵感。\\\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True\n", ")"]}, {"cell_type": "markdown", "id": "b7be01dc-433e-4485-bab0-36a04c3afbcb", "metadata": {}, "source": ["把程序运行起来，以便获得结构化的输出。\n"]}, {"cell_type": "code", "execution_count": null, "id": "25d02228-2907-4810-932e-83ec9fc71f6b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\n", "  \"name\": \"The Shining\",\n", "  \"artist\": \"Various Artists\",\n", "  \"songs\": [\n", "    {\n", "      \"title\": \"Main Title\",\n", "      \"length_seconds\": 180\n", "    },\n", "    {\n", "      \"title\": \"Opening Credits\",\n", "      \"length_seconds\": 120\n", "    },\n", "    {\n", "      \"title\": \"The Overlook Hotel\",\n", "      \"length_seconds\": 240\n", "    },\n", "    {\n", "      \"title\": \"Redrum\",\n", "      \"length_seconds\": 150\n", "    },\n", "    {\n", "      \"title\": \"Here's Johnny!\",\n", "      \"length_seconds\": 200\n", "    }\n", "  ]\n", "}\n"]}], "source": ["output = program(\n", "    movie_name=\"The Shining\", description=\"Data model for an album.\"\n", ")"]}, {"cell_type": "markdown", "id": "4c2af9a5", "metadata": {}, "source": ["### 在模型中使用文档字符串\n"]}, {"cell_type": "code", "execution_count": 
null, "id": "35c01bec", "metadata": {}, "outputs": [], "source": ["class Song(BaseModel):\n", "    \"\"\"Data model for a song.\"\"\"\n", "\n", "    title: str\n", "    length_seconds: int\n", "\n", "\n", "class Album(BaseModel):\n", "    \"\"\"Data model for an album.\"\"\"\n", "\n", "    name: str\n", "    artist: str\n", "    songs: List[Song]"]}, {"cell_type": "code", "execution_count": null, "id": "22268e2a", "metadata": {}, "outputs": [], "source": ["prompt_template_str = \"\"\"\\\n", "Generate an example album, with an artist and a list of songs. Use the movie {movie_name} as inspiration.\\\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True\n", ")"]}, {"cell_type": "markdown", "id": "9411d0f1", "metadata": {}, "source": ["Run the program to get structured output.\n"]}, {"cell_type": "code", "execution_count": null, "id": "066e9c2d", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\n", "  \"name\": \"The Shining\",\n", "  \"artist\": \"Various Artists\",\n", "  \"songs\": [\n", "    {\n", "      \"title\": \"Main Title\",\n", "      \"length_seconds\": 180\n", "    },\n", "    {\n", "      \"title\": \"Opening Credits\",\n", "      \"length_seconds\": 120\n", "    },\n", "    {\n", "      \"title\": \"The Overlook Hotel\",\n", "      \"length_seconds\": 240\n", "    },\n", "    {\n", "      \"title\": \"Redrum\",\n", "      \"length_seconds\": 150\n", "    },\n", "    {\n", "      \"title\": \"Here's Johnny\",\n", "      \"length_seconds\": 200\n", "    }\n", "  ]\n", "}\n"]}], "source": ["output = program(movie_name=\"The Shining\")"]}, {"cell_type": "markdown", "id": "27ec0777-28d5-494b-b419-daf6bce2b20e", "metadata": {}, "source": ["The output is a valid Pydantic object that we can then use to call functions/APIs.\n"]}, {"cell_type": "code", "execution_count": null, "id": "3e51bcf4-e7df-47b9-b380-8e5b900a31e1", "metadata": {}, "outputs": [{"data": {"text/plain": ["Album(name='The Shining', artist='Various Artists', songs=[Song(title='Main Title', 
length_seconds=180), Song(title='Opening Credits', length_seconds=120), Song(title='The Overlook Hotel', length_seconds=240), Song(title='Redrum', length_seconds=150), Song(title=\"Here's Johnny\", length_seconds=200)])"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["output"]}, {"cell_type": "markdown", "id": "cbc6cd54", "metadata": {}, "source": ["## Stream partial intermediate Pydantic Objects\n"]}, {"cell_type": "markdown", "id": "9f237a81", "metadata": {}, "source": ["Instead of waiting for the function call to generate the entire JSON, we can use the `stream_partial_objects()` method of the `program` to stream valid intermediate instances of the Pydantic output class as soon as they become available. 🔥\n"]}, {"cell_type": "markdown", "id": "06312c87", "metadata": {}, "source": ["First, let's define the output Pydantic class.\n"]}, {"cell_type": "code", "execution_count": null, "id": "213794ef", "metadata": {}, "outputs": [], "source": ["\n", "from pydantic import BaseModel, Field\n", "\n", "\n", "class CharacterInfo(BaseModel):\n", "    \"\"\"Information about a character.\"\"\"\n", "\n", "    character_name: str\n", "    name: str = Field(..., description=\"Name of the actor/actress\")\n", "    hometown: str\n", "\n", "\n", "class Characters(BaseModel):\n", "    \"\"\"List of characters.\"\"\"\n", "\n", "    characters: list[CharacterInfo] = Field(default_factory=list)"]}, {"cell_type": "markdown", "id": "254b4254", "metadata": {}, "source": ["Now we initialize the program with a prompt template.\n"]}, {"cell_type": "code", "execution_count": null, "id": "4718559f", "metadata": {}, "outputs": [], "source": ["from llama_index.program.openai import OpenAIPydanticProgram\n", "\n", "prompt_template_str = \"Information about 3 characters from the movie: {movie}\"\n", "\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Characters, prompt_template_str=prompt_template_str\n", ")"]}, {"cell_type": "markdown", "id": "935cae78", "metadata": {}, "source": ["Finally, we stream the partial objects using the `stream_partial_objects()` method.\n"]}, {"cell_type": "code", "execution_count": null, "id": "221df824", "metadata": {}, "outputs": [], "source": ["# Iterate over the partial objects as they are streamed\n", "for partial_object in 
program.stream_partial_objects(movie=\"Harry Potter\"):\n", "    # Send the partial object to the frontend for a better user experience\n", "    print(partial_object)"]}, {"cell_type": "markdown", "id": "025c567c", "metadata": {}, "source": ["## Extracting a List of `Album` (with Parallel Function Calling)\n"]}, {"cell_type": "markdown", "id": "273d1716", "metadata": {}, "source": ["With OpenAI's latest [parallel function calling](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling) feature, we can extract multiple structured data objects from a single prompt at once!\n"]}, {"cell_type": "markdown", "id": "a3956a6f", "metadata": {}, "source": ["To do this, we need to:\n", "1. pick one of the latest models (e.g. `gpt-3.5-turbo-1106`), and\n", "2. set `allow_multiple` to True in our `OpenAIPydanticProgram` (otherwise, it will only return the first object and raise a warning).\n"]}, {"cell_type": "code", "execution_count": null, "id": "74b49a50", "metadata": {}, "outputs": [], "source": ["from llama_index.llms.openai import OpenAI\n", "\n", "prompt_template_str = \"\"\"\\\n", "Generate 4 albums about spring, summer, fall, and winter.\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album,\n", "    llm=OpenAI(model=\"gpt-3.5-turbo-1106\"),\n", "    prompt_template_str=prompt_template_str,\n", "    allow_multiple=True,\n", "    verbose=True,\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "23428a9a", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\"name\": \"Spring\", \"artist\": \"Various Artists\", \"songs\": [{\"title\": \"Blossom\", \"length_seconds\": 180}, {\"title\": \"Sunshine\", \"length_seconds\": 240}, {\"title\": \"Renewal\", \"length_seconds\": 200}]}\n", "Function call: Album with args: {\"name\": \"Summer\", \"artist\": \"Beach Boys\", \"songs\": [{\"title\": \"Beach Party\", \"length_seconds\": 220}, {\"title\": \"Heatwave\", \"length_seconds\": 260}, {\"title\": \"Vacation\", \"length_seconds\": 180}]}\n", "Function call: Album with args: {\"name\": \"Fall\", \"artist\": \"Autumn Leaves\", \"songs\": [{\"title\": \"Golden Days\", \"length_seconds\": 210}, {\"title\": \"Harvest Moon\", 
\"length_seconds\": 240}, {\"title\": \"Crisp Air\", \"length_seconds\": 190}]}\n", "Function call: Album with args: {\"name\": \"Winter\", \"artist\": \"Snowflakes\", \"songs\": [{\"title\": \"Frosty Morning\", \"length_seconds\": 190}, {\"title\": \"Snowfall\", \"length_seconds\": 220}, {\"title\": \"Cozy Nights\", \"length_seconds\": 250}]}\n"]}], "source": ["output = program()"]}, {"cell_type": "markdown", "id": "a4c3ad95", "metadata": {}, "source": ["输出是一个有效的Pydantic对象列表。\n"]}, {"cell_type": "code", "execution_count": null, "id": "8a20fdce", "metadata": {}, "outputs": [{"data": {"text/plain": ["[Album(name='Spring', artist='Various Artists', songs=[Song(title='Blossom', length_seconds=180), Song(title='Sunshine', length_seconds=240), Song(title='Renewal', length_seconds=200)]),\n", " Album(name='Summer', artist='Beach Boys', songs=[Song(title='Beach Party', length_seconds=220), Song(title='Heatwave', length_seconds=260), Song(title='Vacation', length_seconds=180)]),\n", " Album(name='Fall', artist='Autumn Leaves', songs=[Song(title='Golden Days', length_seconds=210), Song(title='Harvest Moon', length_seconds=240), Song(title='Crisp Air', length_seconds=190)]),\n", " Album(name='Winter', artist='Snowflakes', songs=[Song(title='Frosty Morning', length_seconds=190), Song(title='Snowfall', length_seconds=220), Song(title='Cozy Nights', length_seconds=250)])]"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["output"]}, {"cell_type": "markdown", "id": "58eb69c1-bafe-49e4-9d14-f676662e74ac", "metadata": {}, "source": ["## 从`Album`中提取（流式处理）\n", "\n", "我们还支持通过我们的`stream_list`函数对对象列表进行流式处理。\n", "\n", "这个想法的全部功劳归功于`openai_function_call`仓库：https://github.com/jxnl/openai_function_call/tree/main/examples/streaming_multitask\n"]}, {"cell_type": "code", "execution_count": null, "id": "0a155e9c-a7e0-46b8-addb-7fbf844c0107", "metadata": {}, "outputs": [], "source": ["prompt_template_str = \"{input_str}\"\n", "program = 
OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album,\n", "    prompt_template_str=prompt_template_str,\n", "    verbose=False,\n", ")\n", "\n", "output = program.stream_list(\n", "    input_str=\"make up 5 random albums\",\n", ")\n", "for obj in output:\n", "    print(obj.json(indent=2))"]}, {"cell_type": "markdown", "id": "2d6159d2-c967-4523-ad3c-82bc17009b2c", "metadata": {}, "source": ["## Extraction into `DirectoryTree` object\n", "\n", "This is directly inspired by jxnl's awesome repo here: https://github.com/jxnl/openai_function_call.\n", "\n", "That repository shows how to use OpenAI's function API to parse recursive Pydantic objects. The main requirement is to \"wrap\" a recursive Pydantic object in a non-recursive one.\n", "\n", "Here we show an example in a \"directory\" setting, where a `DirectoryTree` object wraps recursive `Node` objects, to parse a file structure.\n"]}, {"cell_type": "code", "execution_count": null, "id": "b58f6a12-3f5c-414b-80df-4492f6e18be5", "metadata": {}, "outputs": [], "source": ["# NOTE: defining recursive objects in a notebook causes errors\n", "from directory import DirectoryTree, Node"]}, {"cell_type": "code", "execution_count": null, "id": "bc1c7eeb-f0a3-4d72-86ee-c6b63e01b0ff", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'title': 'DirectoryTree',\n", " 'description': 'Container class representing a directory tree.\\n\\nArgs:\\n    root (Node): The root node of the tree.',\n", " 'type': 'object',\n", " 'properties': {'root': {'title': 'Root',\n", "   'description': 'Root folder of the directory tree',\n", "   'allOf': [{'$ref': '#/definitions/Node'}]}},\n", " 'required': ['root'],\n", " 'definitions': {'NodeType': {'title': 'NodeType',\n", "   'description': 'Enumeration representing the types of nodes in a filesystem.',\n", "   'enum': ['file', 'folder'],\n", "   'type': 'string'},\n", "  'Node': {'title': 'Node',\n", "   'description': 'Class representing a single node in a filesystem. 
Can be either a file or a folder.\\nNote that a file cannot have children, but a folder can.\\n\\nArgs:\\n    name (str): The name of the node.\\n    children (List[Node]): The list of child nodes (if any).\\n    node_type (NodeType): The type of the node, either a file or a folder.',\n", "   'type': 'object',\n", "   'properties': {'name': {'title': 'Name',\n", "     'description': 'Name of the folder',\n", "     'type': 'string'},\n", "    'children': {'title': 'Children',\n", "     'description': 'List of children nodes, only applicable for folders, files cannot have children',\n", "     'type': 'array',\n", "     'items': {'$ref': '#/definitions/Node'}},\n", "    'node_type': {'description': 'Either a file or folder, use the name to determine which it could be',\n", "     'default': 'file',\n", "     'allOf': [{'$ref': '#/definitions/NodeType'}]}},\n", "   'required': ['name']}}}"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["DirectoryTree.schema()"]}, {"cell_type": "code", "execution_count": null, "id": "02c4a7a1-f145-40bc-83b8-4153a531a8eb", "metadata": {}, "outputs": [], "source": ["program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=DirectoryTree,\n", "    prompt_template_str=\"{input_str}\",\n", "    verbose=True,\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "c88cf49f-a52f-4415-bddc-14d70c897629", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: DirectoryTree with args: {\n", "  \"root\": {\n", "    \"name\": \"root\",\n", "    \"children\": [\n", "      {\n", "        \"name\": \"folder1\",\n", "        \"children\": [\n", "          {\n", "            \"name\": \"file1.txt\",\n", "            \"children\": [],\n", "            \"node_type\": \"file\"\n", "          },\n", "          {\n", "            \"name\": \"file2.txt\",\n", "            \"children\": [],\n", "            \"node_type\": \"file\"\n", "          }\n", "        
],\n", "        \"node_type\": \"folder\"\n", "      },\n", "      {\n", "        \"name\": \"folder2\",\n", "        \"children\": [\n", "          {\n", "            \"name\": \"file3.txt\",\n", "            \"children\": [],\n", "            \"node_type\": \"file\"\n", "          },\n", "          {\n", "            \"name\": \"subfolder1\",\n", "            \"children\": [\n", "              {\n", "                \"name\": \"file4.txt\",\n", "                \"children\": [],\n", "                \"node_type\": \"file\"\n", "              }\n", "            ],\n", "            \"node_type\": \"folder\"\n", "          }\n", "        ],\n", "        \"node_type\": \"folder\"\n", "      }\n", "    ],\n", "    \"node_type\": \"folder\"\n", "  }\n", "}\n"]}], "source": ["\n", "input_str = \"\"\"\n", "root\n", "├── folder1\n", "│   ├── file1.txt\n", "│   └── file2.txt\n", "└── folder2\n", "    ├── file3.txt\n", "    └── subfolder1\n", "        └── file4.txt\n", "\"\"\"\n", "\n", "output = program(input_str=input_str)\n"]}, {"cell_type": "markdown", "id": "0bf23c34-2cc7-4d0f-9389-75f7cf6bf9a2", "metadata": {}, "source": ["The output is a full `DirectoryTree` structure with recursive `Node` objects.\n"]}, {"cell_type": "code", "execution_count": null, "id": "3885032f-0f3a-4afb-9157-54851e810843", "metadata": {}, "outputs": [{"data": {"text/plain": ["DirectoryTree(root=Node(name='root', children=[Node(name='folder1', children=[Node(name='file1.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='file2.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>), Node(name='folder2', children=[Node(name='file3.txt', children=[], node_type=<NodeType.FILE: 'file'>), Node(name='subfolder1', children=[Node(name='file4.txt', children=[], node_type=<NodeType.FILE: 'file'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>)], node_type=<NodeType.FOLDER: 'folder'>))"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": 
["output"]}], "metadata": {"kernelspec": {"display_name": "llama-index-vs8PXMh0-py3.11", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 5}