{"cells": [{"cell_type": "markdown", "id": "530c973e-916d-4c9e-9365-e2d5306d7e3d", "metadata": {}, "source": ["# OpenAI Pydantic Program\n"]}, {"cell_type": "markdown", "id": "18461ba1-6978-4b5b-861e-6dceec36857b", "metadata": {}, "source": ["This guide shows you how to generate structured data with the [new OpenAI API](https://openai.com/blog/function-calling-and-other-api-updates) via LlamaIndex. The user just needs to specify a Pydantic object.\n", "\n", "We demonstrate two settings:\n", "- Extraction into an `Album` object (which can contain a list of `Song` objects)\n", "- Extraction into a `DirectoryTree` object (which can contain recursive `Node` objects)\n"]}, {"cell_type": "markdown", "id": "8fcefc79-68b4-4481-b1ef-a902fc12e4c8", "metadata": {}, "source": ["## Extraction into `Album`\n", "\n", "This is a simple example of parsing an output into an `Album` schema, which can contain multiple songs.\n"]}, {"attachments": {}, "cell_type": "markdown", "id": "81e5dde0", "metadata": {}, "source": ["If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.\n"]}, {"cell_type": "code", "execution_count": null, "id": "511a8171", "metadata": {}, "outputs": [], "source": ["%pip install llama-index-llms-openai\n", "%pip install llama-index-program-openai"]}, {"cell_type": "code", "execution_count": null, "id": "b2833cea", "metadata": {}, "outputs": [], "source": ["%pip install llama-index"]}, {"cell_type": "code", "execution_count": null, "id": "f7a83b49-5c34-45d5-8cf4-62f348fb1299", "metadata": {}, "outputs": [], "source": ["from pydantic import BaseModel\n", "from typing import List\n", "\n", "from llama_index.program.openai import OpenAIPydanticProgram"]}, {"cell_type": "markdown", "id": "6311f7ae", "metadata": {}, "source": ["### Without docstring in Model\n"]}, {"cell_type": "markdown", "id": "0563f1ba-8086-4dcc-ba35-bfda31c45ae4", "metadata": {}, "source": ["Define output schema (without docstring)\n"]}, {"cell_type": "code", "execution_count": null, "id": "42053ea8-2580-4639-9dcf-566e8427c44e", "metadata": {}, "outputs": [], "source": ["class Song(BaseModel):\n", "    title: str\n", "    length_seconds: int\n", "\n", "\n", "class Album(BaseModel):\n", "    name: str\n", "    artist: str\n", "    songs: 
List[Song]"]}, {"cell_type": "markdown", "id": "4afff44e-a746-4b9f-85a9-72058bcdd29f", "metadata": {}, "source": ["Define the OpenAI Pydantic program\n"]}, {"cell_type": "code", "execution_count": null, "id": "fe756697-c299-4f9a-a108-944b6693f824", "metadata": {}, "outputs": [], "source": ["prompt_template_str = \"\"\"\\\n", "Generate an example album, with an artist and a list of songs. Use the movie {movie_name} as inspiration.\\\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True\n", ")"]}, {"cell_type": "markdown", "id": "b7be01dc-433e-4485-bab0-36a04c3afbcb", "metadata": {}, "source": ["Run the program to get structured output.\n"]}, {"cell_type": "code", "execution_count": null, "id": "25d02228-2907-4810-932e-83ec9fc71f6b", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\n", " \"name\": \"The Shining\",\n", " \"artist\": \"Various Artists\",\n", " \"songs\": [\n", " {\n", " \"title\": \"Main Title\",\n", " \"length_seconds\": 180\n", " },\n", " {\n", " \"title\": \"Opening Credits\",\n", " \"length_seconds\": 120\n", " },\n", " {\n", " \"title\": \"The Overlook Hotel\",\n", " \"length_seconds\": 240\n", " },\n", " {\n", " \"title\": \"Redrum\",\n", " \"length_seconds\": 150\n", " },\n", " {\n", " \"title\": \"Here's Johnny!\",\n", " \"length_seconds\": 200\n", " }\n", " ]\n", "}\n"]}], "source": ["output = program(\n", "    movie_name=\"The Shining\", description=\"Data model for an album.\"\n", ")"]}, {"cell_type": "markdown", "id": "4c2af9a5", "metadata": {}, "source": ["### With docstring in Model\n"]}, {"cell_type": "code", "execution_count": null, "id": "35c01bec", "metadata": {}, "outputs": [], "source": ["class Song(BaseModel):\n", "    \"\"\"Data model for a song.\"\"\"\n", "\n", "    title: str\n", "    length_seconds: int\n", "\n", "\n", "class Album(BaseModel):\n", "    \"\"\"Data model for an album.\"\"\"\n", "\n", "    name: str\n", "    artist: str\n", "    songs: List[Song]"]}, {"cell_type": "code", "execution_count": null, "id": "22268e2a", 
"metadata": {}, "outputs": [], "source": ["prompt_template_str = \"\"\"\\\n", "生成一个示例专辑,包括一个艺术家和一组歌曲。以电影 {movie_name} 为灵感。\\\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=Album, prompt_template_str=prompt_template_str, verbose=True\n", ")"]}, {"cell_type": "markdown", "id": "9411d0f1", "metadata": {}, "source": ["把程序运行起来,以获得结构化的输出。\n"]}, {"cell_type": "code", "execution_count": null, "id": "066e9c2d", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\n", " \"name\": \"The Shining\",\n", " \"artist\": \"Various Artists\",\n", " \"songs\": [\n", " {\n", " \"title\": \"Main Title\",\n", " \"length_seconds\": 180\n", " },\n", " {\n", " \"title\": \"Opening Credits\",\n", " \"length_seconds\": 120\n", " },\n", " {\n", " \"title\": \"The Overlook Hotel\",\n", " \"length_seconds\": 240\n", " },\n", " {\n", " \"title\": \"Redrum\",\n", " \"length_seconds\": 150\n", " },\n", " {\n", " \"title\": \"Here's Johnny\",\n", " \"length_seconds\": 200\n", " }\n", " ]\n", "}\n"]}], "source": ["output = program(movie_name=\"The Shining\")"]}, {"cell_type": "markdown", "id": "27ec0777-28d5-494b-b419-daf6bce2b20e", "metadata": {}, "source": ["输出是一个有效的Pydantic对象,我们可以使用它来调用函数/API。\n"]}, {"cell_type": "code", "execution_count": null, "id": "3e51bcf4-e7df-47b9-b380-8e5b900a31e1", "metadata": {}, "outputs": [{"data": {"text/plain": ["Album(name='The Shining', artist='Various Artists', songs=[Song(title='Main Title', length_seconds=180), Song(title='Opening Credits', length_seconds=120), Song(title='The Overlook Hotel', length_seconds=240), Song(title='Redrum', length_seconds=150), Song(title=\"Here's Johnny\", length_seconds=200)])"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["output"]}, {"cell_type": "markdown", "id": "cbc6cd54", "metadata": {}, "source": ["## 流式传递部分中间 Pydantic 对象\n"]}, {"cell_type": "markdown", "id": "9f237a81", "metadata": {}, 
"source": ["而不是等待函数调用生成整个JSON,我们可以使用`program`的`stream_partial_objects()`方法,以在有效的Pydantic输出类的中间实例可用时立即进行流式传输。🔥\n"]}, {"cell_type": "markdown", "id": "06312c87", "metadata": {}, "source": ["首先让我们定义输出的Pydantic类。\n"]}, {"cell_type": "code", "execution_count": null, "id": "213794ef", "metadata": {}, "outputs": [], "source": ["\n", "from pydantic import BaseModel, Field\n", "\n", "\n", "class CharacterInfo(BaseModel):\n", " \"\"\"角色信息。\"\"\"\n", "\n", " character_name: str\n", " name: str = Field(..., description=\"演员/女演员的姓名\")\n", " hometown: str\n", "\n", "\n", "class Characters(BaseModel):\n", " \"\"\"角色列表。\"\"\"\n", "\n", " characters: list[CharacterInfo] = Field(default_factory=list)"]}, {"cell_type": "markdown", "id": "254b4254", "metadata": {}, "source": ["现在我们将使用提示模板初始化程序\n"]}, {"cell_type": "code", "execution_count": null, "id": "4718559f", "metadata": {}, "outputs": [], "source": ["from llama_index.program.openai import OpenAIPydanticProgram\n", "\n", "prompt_template_str = \"Information about 3 characters from the movie: {movie}\"\n", "\n", "program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=Characters, prompt_template_str=prompt_template_str\n", ")"]}, {"cell_type": "markdown", "id": "935cae78", "metadata": {}, "source": ["最后,我们使用 `stream_partial_objects()` 方法流式传输部分对象。\n"]}, {"cell_type": "code", "execution_count": null, "id": "221df824", "metadata": {}, "outputs": [], "source": ["# 遍历流式对象中的部分对象\n", "for partial_object in program.stream_partial_objects(movie=\"Harry Potter\"):\n", " # 将部分对象发送到前端以提供更好的用户体验\n", " print(partial_object)"]}, {"cell_type": "markdown", "id": "025c567c", "metadata": {}, "source": ["## 提取`专辑`列表(使用并行函数调用)\n"]}, {"cell_type": "markdown", "id": "273d1716", "metadata": {}, "source": ["使用OpenAI最新的[并行函数调用](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling)功能,我们可以同时从单个提示中提取多个结构化数据!\n"]}, {"cell_type": "markdown", "id": "a3956a6f", "metadata": {}, "source": ["为了做到这一点,我们需要:\n", "1. 
pick one of the latest models (e.g. `gpt-3.5-turbo-1106`), and\n", "2. set `allow_multiple` to True in our `OpenAIPydanticProgram` (otherwise it will only return the first object and raise a warning).\n"]}, {"cell_type": "code", "execution_count": null, "id": "74b49a50", "metadata": {}, "outputs": [], "source": ["from llama_index.llms.openai import OpenAI\n", "\n", "prompt_template_str = \"\"\"\\\n", "Generate 4 albums about spring, summer, fall, and winter.\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", "    output_cls=Album,\n", "    llm=OpenAI(model=\"gpt-3.5-turbo-1106\"),\n", "    prompt_template_str=prompt_template_str,\n", "    allow_multiple=True,\n", "    verbose=True,\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "23428a9a", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: Album with args: {\"name\": \"Spring\", \"artist\": \"Various Artists\", \"songs\": [{\"title\": \"Blossom\", \"length_seconds\": 180}, {\"title\": \"Sunshine\", \"length_seconds\": 240}, {\"title\": \"Renewal\", \"length_seconds\": 200}]}\n", "Function call: Album with args: {\"name\": \"Summer\", \"artist\": \"Beach Boys\", \"songs\": [{\"title\": \"Beach Party\", \"length_seconds\": 220}, {\"title\": \"Heatwave\", \"length_seconds\": 260}, {\"title\": \"Vacation\", \"length_seconds\": 180}]}\n", "Function call: Album with args: {\"name\": \"Fall\", \"artist\": \"Autumn Leaves\", \"songs\": [{\"title\": \"Golden Days\", \"length_seconds\": 210}, {\"title\": \"Harvest Moon\", \"length_seconds\": 240}, {\"title\": \"Crisp Air\", \"length_seconds\": 190}]}\n", "Function call: Album with args: {\"name\": \"Winter\", \"artist\": \"Snowflakes\", \"songs\": [{\"title\": \"Frosty Morning\", \"length_seconds\": 190}, {\"title\": \"Snowfall\", \"length_seconds\": 220}, {\"title\": \"Cozy Nights\", \"length_seconds\": 250}]}\n"]}], "source": ["output = program()"]}, {"cell_type": "markdown", "id": "a4c3ad95", "metadata": {}, "source": ["The output is a list of valid Pydantic objects.\n"]}, {"cell_type": "code", "execution_count": null, "id": "8a20fdce", 
"metadata": {}, "outputs": [{"data": {"text/plain": ["[Album(name='Spring', artist='Various Artists', songs=[Song(title='Blossom', length_seconds=180), Song(title='Sunshine', length_seconds=240), Song(title='Renewal', length_seconds=200)]),\n", " Album(name='Summer', artist='Beach Boys', songs=[Song(title='Beach Party', length_seconds=220), Song(title='Heatwave', length_seconds=260), Song(title='Vacation', length_seconds=180)]),\n", " Album(name='Fall', artist='Autumn Leaves', songs=[Song(title='Golden Days', length_seconds=210), Song(title='Harvest Moon', length_seconds=240), Song(title='Crisp Air', length_seconds=190)]),\n", " Album(name='Winter', artist='Snowflakes', songs=[Song(title='Frosty Morning', length_seconds=190), Song(title='Snowfall', length_seconds=220), Song(title='Cozy Nights', length_seconds=250)])]"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["output"]}, {"cell_type": "markdown", "id": "58eb69c1-bafe-49e4-9d14-f676662e74ac", "metadata": {}, "source": ["## 从`Album`中提取(流式处理)\n", "\n", "我们还支持通过我们的`stream_list`函数对对象列表进行流式处理。\n", "\n", "这个想法的全部功劳归功于`openai_function_call`仓库:https://github.com/jxnl/openai_function_call/tree/main/examples/streaming_multitask\n"]}, {"cell_type": "code", "execution_count": null, "id": "0a155e9c-a7e0-46b8-addb-7fbf844c0107", "metadata": {}, "outputs": [], "source": ["prompt_template_str = \"{input_str}\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=Album,\n", " prompt_template_str=prompt_template_str,\n", " verbose=False,\n", ")\n", "\n", "output = program.stream_list(\n", " input_str=\"make up 5 random albums\",\n", ")\n", "for obj in output:\n", " print(obj.json(indent=2))"]}, {"cell_type": "markdown", "id": "2d6159d2-c967-4523-ad3c-82bc17009b2c", "metadata": {}, "source": ["## 将内容提取到`DirectoryTree`对象中\n", "\n", "这直接受到了jxnl在这里的令人敬畏的存储库的启发:https://github.com/jxnl/openai_function_call。\n", "\n", 
"该存储库展示了如何使用OpenAI的函数API来解析递归的Pydantic对象。主要要求是要将递归的Pydantic对象“包装”到一个非递归对象中。\n", "\n", "在这里,我们展示了一个“目录”设置的示例,其中`DirectoryTree`对象包装了递归的`Node`对象,以解析文件结构。\n"]}, {"cell_type": "code", "execution_count": null, "id": "b58f6a12-3f5c-414b-80df-4492f6e18be5", "metadata": {}, "outputs": [], "source": ["# 注意:在笔记本中定义递归对象会导致错误\n", "from directory import DirectoryTree, Node"]}, {"cell_type": "code", "execution_count": null, "id": "bc1c7eeb-f0a3-4d72-86ee-c6b63e01b0ff", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'title': 'DirectoryTree',\n", " 'description': 'Container class representing a directory tree.\\n\\nArgs:\\n root (Node): The root node of the tree.',\n", " 'type': 'object',\n", " 'properties': {'root': {'title': 'Root',\n", " 'description': 'Root folder of the directory tree',\n", " 'allOf': [{'$ref': '#/definitions/Node'}]}},\n", " 'required': ['root'],\n", " 'definitions': {'NodeType': {'title': 'NodeType',\n", " 'description': 'Enumeration representing the types of nodes in a filesystem.',\n", " 'enum': ['file', 'folder'],\n", " 'type': 'string'},\n", " 'Node': {'title': 'Node',\n", " 'description': 'Class representing a single node in a filesystem. 
Can be either a file or a folder.\\nNote that a file cannot have children, but a folder can.\\n\\nArgs:\\n name (str): The name of the node.\\n children (List[Node]): The list of child nodes (if any).\\n node_type (NodeType): The type of the node, either a file or a folder.',\n", " 'type': 'object',\n", " 'properties': {'name': {'title': 'Name',\n", " 'description': 'Name of the folder',\n", " 'type': 'string'},\n", " 'children': {'title': 'Children',\n", " 'description': 'List of children nodes, only applicable for folders, files cannot have children',\n", " 'type': 'array',\n", " 'items': {'$ref': '#/definitions/Node'}},\n", " 'node_type': {'description': 'Either a file or folder, use the name to determine which it could be',\n", " 'default': 'file',\n", " 'allOf': [{'$ref': '#/definitions/NodeType'}]}},\n", " 'required': ['name']}}}"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["DirectoryTree.schema()"]}, {"cell_type": "code", "execution_count": null, "id": "02c4a7a1-f145-40bc-83b8-4153a531a8eb", "metadata": {}, "outputs": [], "source": ["program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=DirectoryTree,\n", " prompt_template_str=\"{input_str}\",\n", " verbose=True,\n", ")"]}, {"cell_type": "code", "execution_count": null, "id": "c88cf49f-a52f-4415-bddc-14d70c897629", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Function call: DirectoryTree with args: {\n", " \"root\": {\n", " \"name\": \"root\",\n", " \"children\": [\n", " {\n", " \"name\": \"folder1\",\n", " \"children\": [\n", " {\n", " \"name\": \"file1.txt\",\n", " \"children\": [],\n", " \"node_type\": \"file\"\n", " },\n", " {\n", " \"name\": \"file2.txt\",\n", " \"children\": [],\n", " \"node_type\": \"file\"\n", " }\n", " ],\n", " \"node_type\": \"folder\"\n", " },\n", " {\n", " \"name\": \"folder2\",\n", " \"children\": [\n", " {\n", " \"name\": \"file3.txt\",\n", " \"children\": [],\n", " \"node_type\": 
\"file\"\n", " },\n", " {\n", " \"name\": \"subfolder1\",\n", " \"children\": [\n", " {\n", " \"name\": \"file4.txt\",\n", " \"children\": [],\n", " \"node_type\": \"file\"\n", " }\n", " ],\n", " \"node_type\": \"folder\"\n", " }\n", " ],\n", " \"node_type\": \"folder\"\n", " }\n", " ],\n", " \"node_type\": \"folder\"\n", " }\n", "}\n"]}], "source": ["\n", "input_str = \"\"\"\n", "根目录\n", "├── 文件夹1\n", "│ ├── 文件1.txt\n", "│ └── 文件2.txt\n", "└── 文件夹2\n", " ├── 文件3.txt\n", " └── 子文件夹1\n", " └── 文件4.txt\n", "\"\"\"\n", "\n", "output = program(input_str=input_str)\n"]}, {"cell_type": "markdown", "id": "0bf23c34-2cc7-4d0f-9389-75f7cf6bf9a2", "metadata": {}, "source": ["输出是一个完整的DirectoryTree结构,其中包含递归的`Node`对象。\n"]}, {"cell_type": "code", "execution_count": null, "id": "3885032f-0f3a-4afb-9157-54851e810843", "metadata": {}, "outputs": [{"data": {"text/plain": ["DirectoryTree(root=Node(name='root', children=[Node(name='folder1', children=[Node(name='file1.txt', children=[], node_type=), Node(name='file2.txt', children=[], node_type=)], node_type=), Node(name='folder2', children=[Node(name='file3.txt', children=[], node_type=), Node(name='subfolder1', children=[Node(name='file4.txt', children=[], node_type=)], node_type=)], node_type=)], node_type=))"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["output"]}], "metadata": {"kernelspec": {"display_name": "llama-index-vs8PXMh0-py3.11", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 5}