{"cells": [{"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["\"在\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# 使用函数调用进行微调\n", "\n", "在这个笔记本中,我们将介绍如何使用函数调用对gpt-3.5-turbo进行微调。这里的主要用例是结构化数据提取。我们的主要重点是提炼GPT-4的输出,以帮助改进gpt-3.5-turbo的函数调用能力。\n", "\n", "我们将从简单到高级逐个示例进行讲解:\n", "1. 在通过我们的OpenAI Pydantic程序对象记录的一些玩具消息/结构化输出上进行微调。\n", "2. 在整个文档语料库上进行上下文增强的查询/结构化输出微调。在RAG系统中使用这个功能。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["%pip install llama-index-finetuning\n", "%pip install llama-index-llms-openai\n", "%pip install llama-index-finetuning-callbacks\n", "%pip install llama-index-readers-file pymupdf\n", "%pip install llama-index-program-openai"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import nest_asyncio\n", "\n", "nest_asyncio.apply()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import os\n", "import openai"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n", "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 使用GPT-4 Pydantic程序进行微调\n", "\n", "在本节中,我们将展示如何通过我们的低级Pydantic程序模块记录输入/输出。我们将使用该数据集对LLM进行微调。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 定义 Pydantic 模型 + 程序\n", "\n", "在这里,我们定义了由 GPT-4 提供支持的函数调用程序,该程序将生成结构化输出到一个 Pydantic 对象(一个专辑)。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.program.openai import OpenAIPydanticProgram\n", "from pydantic import BaseModel\n", "from llama_index.llms.openai import OpenAI\n", "from llama_index.finetuning.callbacks import OpenAIFineTuningHandler\n", "from llama_index.core.callbacks import CallbackManager\n", "from typing import List\n", "\n", "\n", "class Song(BaseModel):\n", " \"\"\"歌曲的数据模型。\"\"\"\n", "\n", " title: str\n", 
" length_seconds: int\n", "\n", "\n", "class Album(BaseModel):\n", " \"\"\"专辑的数据模型。\"\"\"\n", "\n", " name: str\n", " artist: str\n", " songs: List[Song]\n", "\n", "\n", "finetuning_handler = OpenAIFineTuningHandler()\n", "callback_manager = CallbackManager([finetuning_handler])\n", "\n", "llm = OpenAI(model=\"gpt-4\", callback_manager=callback_manager)\n", "\n", "\n", "prompt_template_str = \"\"\"\\\n", "生成一个示例专辑,包括艺术家和歌曲列表。\\\n", "以电影 {movie_name} 作为灵感。\\\n", "\"\"\"\n", "program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=Album,\n", " prompt_template_str=prompt_template_str,\n", " llm=llm,\n", " verbose=False,\n", ")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 记录输入/输出\n", "\n", "我们定义一些示例电影名称作为输入,并通过函数调用程序记录输出。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# 注意:我们需要至少10部电影才能使用OpenAI微调\n", "movie_names = [\n", " \"闪灵\",\n", " \"无间道\",\n", " \"泰坦尼克号\",\n", " \"好家伙\",\n", " \"风月情人\",\n", " \"小鬼当家\",\n", " \"铁笼狂怒\",\n", " \"剪刀手爱德华\",\n", " \"全面回忆\",\n", " \"幽灵\",\n", " \"震颤\",\n", " \"机械战警\",\n", " \"洛基5\",\n", "]"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"application/json": {"ascii": false, "bar_format": null, "colour": null, "elapsed": 0.004143953323364258, "initial": 0, "n": 0, "ncols": null, "nrows": 25, "postfix": null, "prefix": "", "rate": null, "total": 13, "unit": "it", "unit_divisor": 1000, "unit_scale": false}, "application/vnd.jupyter.widget-view+json": {"model_id": "0c6e3e3e2da545d1a5bb23e93d867444", "version_major": 2, "version_minor": 0}, "text/plain": [" 0%| | 0/13 [00:00\" # 如果你有一个现有的作业,可以在这里指定id\n", " validate_json=False, # openai验证json代码尚不支持函数调用\n", ")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["finetune_engine.finetune()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"text/plain": [" JSON: {\n", " \"object\": 
\"fine_tuning.job\",\n", " \"id\": \"ftjob-uJ9kQ9pI0p0YNatBDxF3VITv\",\n", " \"model\": \"gpt-3.5-turbo-0613\",\n", " \"created_at\": 1696463378,\n", " \"finished_at\": 1696463749,\n", " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:llamaindex::8660TXqx\",\n", " \"organization_id\": \"org-1ZDAvajC6v2ZtAP9hLEIsXRz\",\n", " \"result_files\": [\n", " \"file-Hbpw15BAwyf3e4HK5Z9g4IK2\"\n", " ],\n", " \"status\": \"succeeded\",\n", " \"validation_file\": null,\n", " \"training_file\": \"file-MNh7snhv0triDIhsrErokSMY\",\n", " \"hyperparameters\": {\n", " \"n_epochs\": 7\n", " },\n", " \"trained_tokens\": 22834,\n", " \"error\": null\n", "}"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["finetune_engine.get_current_job()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 试一下!\n", "\n", "我们获得了经过微调的LLM,并将其与Pydantic程序一起使用。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["ft_program = OpenAIPydanticProgram.from_defaults(\n", " output_cls=Album,\n", " prompt_template_str=prompt_template_str,\n", " llm=ft_llm,\n", " verbose=False,\n", ")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"text/plain": ["Album(name='Goodfellas Soundtrack', artist='Various Artists', songs=[Song(title='Rags to Riches', length_seconds=180), Song(title='Gimme Shelter', length_seconds=270), Song(title='Layla', length_seconds=270), Song(title='Jump into the Fire', length_seconds=240), Song(title='Atlantis', length_seconds=180), Song(title='Beyond the Sea', length_seconds=180), Song(title='Sunshine of Your Love', length_seconds=240), Song(title='Mannish Boy', length_seconds=240), Song(title='Layla (Piano Exit)', length_seconds=120)])"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], 
"source": ["ft_program(movie_name=\"Goodfellas\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 通过RAG系统对结构化输出进行微调\n", "\n", "函数调用的一个用例是通过RAG系统获取结构化输出。\n", "\n", "在这里,我们展示如何创建一个训练数据集,其中包括上下文增强输入和未结构化文档上的结构化输出。然后,我们可以对LLM进行微调,并将其插入到RAG系统中,以执行检索和输出提取。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["--2023-10-04 23:46:36-- https://arxiv.org/pdf/2307.09288.pdf\n", "Resolving arxiv.org (arxiv.org)... 128.84.21.199\n", "Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 13661300 (13M) [application/pdf]\n", "Saving to: ‘data/llama2.pdf’\n", "\n", "data/llama2.pdf 100%[===================>] 13.03M 229KB/s in 45s \n", "\n", "2023-10-04 23:47:25 (298 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]\n"]}], "source": ["!mkdir data && wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from pydantic import Field\n", "from typing import List\n", "\n", "\n", "class Citation(BaseModel):\n", " \"\"\"引文类。\"\"\"\n", "\n", " author: str = Field(\n", " ..., description=\"推断出的第一作者(通常是姓氏)\"\n", " )\n", " year: int = Field(..., description=\"推断出的年份\")\n", " desc: str = Field(\n", " ...,\n", " description=(\n", " \"从作者被引用的作品的文本中推断出的描述\"\n", " ),\n", " )\n", "\n", "\n", "class Response(BaseModel):\n", " \"\"\"作者引文列表。\n", "\n", " 从非结构化文本中提取。\n", "\n", " \"\"\"\n", "\n", " citations: List[Citation] = Field(\n", " ...,\n", " description=(\n", " \"作者引文列表(按作者、年份和描述组织)。\"\n", " ),\n", " )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```python\n", "import pandas as pd\n", "\n", "# 读取数据\n", "data = pd.read_csv('data.csv')\n", "\n", "# 显示数据的前几行\n", "data.head()\n", "```\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from 
llama_index.readers.file import PyMuPDFReader\n", "from llama_index.core import Document\n", "from llama_index.core.node_parser import SentenceSplitter\n", "from pathlib import Path"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["loader = PyMuPDFReader()\n", "docs0 = loader.load(file_path=Path(\"./data/llama2.pdf\"))"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["doc_text = \"\\n\\n\".join([d.get_content() for d in docs0])\n", "metadata = {\n", "    \"paper_title\": \"Llama 2: Open Foundation and Fine-Tuned Chat Models\"\n", "}\n", "docs = [Document(text=doc_text, metadata=metadata)]"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["chunk_size = 1024\n", "node_parser = SentenceSplitter(chunk_size=chunk_size)\n", "nodes = node_parser.get_nodes_from_documents(docs)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"text/plain": ["89"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["len(nodes)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core import Settings\n", "\n", "finetuning_handler = OpenAIFineTuningHandler()\n", "callback_manager = CallbackManager([finetuning_handler])\n", "\n", "Settings.chunk_size = chunk_size\n", "\n", "gpt_4_llm = OpenAI(\n", "    model=\"gpt-4-0613\", temperature=0.3, callback_manager=callback_manager\n", ")\n", "\n", "gpt_35_llm = OpenAI(\n", "    model=\"gpt-3.5-turbo-0613\",\n", "    temperature=0.3,\n", "    callback_manager=callback_manager,\n", ")\n", "\n", "eval_llm = OpenAI(model=\"gpt-4-0613\", temperature=0)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Generate Dataset\n", "\n", "Here we show how to generate a training dataset over these unstructured chunks/nodes.\n", "\n", "We generate questions that extract citations from different contexts. We run these questions through a GPT-4 RAG pipeline, extract structured outputs, and log the inputs/outputs.\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": ["# 设置数据集生成器\n", "from llama_index.core.evaluation import DatasetGenerator\n", "from llama_index.core import SummaryIndex\n", "from llama_index.core import PromptTemplate\n", "from tqdm.notebook import tqdm\n", "from tqdm.asyncio import tqdm_asyncio\n", "\n", "\n", "fp = open(\"data/qa_pairs.jsonl\", \"w\")\n", "\n", "question_gen_prompt = PromptTemplate(\n", " \"\"\"\n", "{query_str}\n", "\n", "Context:\n", "{context_str}\n", "\n", "Questions:\n", "\"\"\"\n", ")\n", "\n", "question_gen_query = \"\"\"\\\n", "给定一篇研究论文的片段。它包含引用。\n", "请从文本中生成关于这些引用的问题。\n", "\n", "例如,以下是一些示例问题:\n", "哪些引用对应于变压器模型的相关工作?\n", "告诉我关于推进RLHF的作者。\n", "你能告诉我所有计算机视觉作品对应的引用吗?\\\n", "\"\"\"\n", "\n", "qr_pairs = []\n", "node_questions_tasks = []\n", "for idx, node in enumerate(nodes[:39]):\n", " num_questions = 1 # 更改此数字以增加节点数量\n", " dataset_generator = DatasetGenerator(\n", " [node],\n", " question_gen_query=question_gen_query,\n", " text_question_template=question_gen_prompt,\n", " llm=eval_llm,\n", " metadata_mode=\"all\",\n", " num_questions_per_chunk=num_questions,\n", " )\n", "\n", " task = dataset_generator.agenerate_questions_from_nodes(num=num_questions)\n", " node_questions_tasks.append(task)\n", "node_questions_lists = await tqdm_asyncio.gather(*node_questions_tasks)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["node_questions_lists"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core import VectorStoreIndex\n", "\n", "gpt4_index = VectorStoreIndex(nodes=nodes)\n", "gpt4_query_engine = gpt4_index.as_query_engine(\n", " output_cls=Response, similarity_top_k=1, llm=gpt_4_llm\n", ")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"application/json": {"ascii": false, "bar_format": null, "colour": null, "elapsed": 0.007736921310424805, "initial": 0, "n": 0, "ncols": null, "nrows": 15, "postfix": null, "prefix": "", 
"rate": null, "total": 39, "unit": "it", "unit_divisor": 1000, "unit_scale": false}, "application/vnd.jupyter.widget-view+json": {"model_id": "b4d406c5c7144773a6cc9698e30b9828", "version_major": 2, "version_minor": 0}, "text/plain": [" 0%| | 0/39 [00:00\" # 如果你有一个现有的作业,可以在这里指定id\n", " validate_json=False, # openai验证json代码尚不支持函数调用\n", ")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["finetune_engine.finetune()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"data": {"text/plain": [" JSON: {\n", " \"object\": \"fine_tuning.job\",\n", " \"id\": \"ftjob-ATYm4yZHP1QvXs1wx85Ix79F\",\n", " \"model\": \"gpt-3.5-turbo-0613\",\n", " \"created_at\": 1696497663,\n", " \"finished_at\": 1696498092,\n", " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:llamaindex::86EwPw83\",\n", " \"organization_id\": \"org-1ZDAvajC6v2ZtAP9hLEIsXRz\",\n", " \"result_files\": [\n", " \"file-wabcIIxjLqvhqOVohf4qSmE7\"\n", " ],\n", " \"status\": \"succeeded\",\n", " \"validation_file\": null,\n", " \"training_file\": \"file-WbYcsinIbH8vyCAstcoFEr92\",\n", " \"hyperparameters\": {\n", " \"n_epochs\": 3\n", " },\n", " \"trained_tokens\": 132678,\n", " \"error\": null\n", "}"]}, "execution_count": null, "metadata": {}, "output_type": "execute_result"}], "source": ["finetune_engine.get_current_job()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 在RAG Pipeline中使用\n", "\n", "让我们将经过微调的LLM插入到一个完整的RAG pipeline中,以输出结构化的结果。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from llama_index.core import VectorStoreIndex\n", "\n", "vector_index = VectorStoreIndex(nodes=nodes)\n", "query_engine = vector_index.as_query_engine(\n", " output_cls=Response, similarity_top_k=1, llm=ft_llm\n", ")"]}, {"cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": ["# 将基线设置为\n", "base_index = VectorStoreIndex(nodes=nodes)\n", "base_query_engine = base_index.as_query_engine(\n", " output_cls=Response, similarity_top_k=1, llm=gpt_35_llm\n", ")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{\"citations\": [{\"author\": \"Lin et al.\", \"year\": 2021, \"desc\": \"TruthfulQA, used for LLM hallucinations to measure whether a language model is truthful in generating answers to questions while being informative at the same time.\"}]}\n"]}], "source": ["query_str = \"\"\"\\\n", "用于衡量Llama 2真实性的引用是哪个?\\\n", "\"\"\"\n", "\n", "response = query_engine.query(query_str)\n", "print(str(response))"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{\"citations\": [{\"author\": \"Lin et al.\", \"year\": 2021, \"desc\": \"TruthfulQA\"}]}\n"]}], "source": ["base_response = base_query_engine.query(query_str)\n", "print(str(base_response))"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# 查看源代码\n", "print(response.source_nodes[0].get_content())"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{\"citations\": [{\"author\": \"Lin et al.\", \"year\": 2021, \"desc\": \"TruthfulQA, used for LLM hallucinations to measure whether a language model is truthful in generating answers to questions while being informative at the same time.\"}]}\n"]}], "source": ["# 作为参考,请查看GPT-4的响应\n", "gpt4_response = gpt4_query_engine.query(query_str)\n", "print(str(gpt4_response))"]}], "metadata": {"kernelspec": {"display_name": "llama_index_v2", "language": "python", "name": "llama_index_v2"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}}, "nbformat": 4, "nbformat_minor": 4}