<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/docs/examples/finetuning/openai_fine_tuning_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Fine-Tuning with Function Calling

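In this notebook, we walk through how to fine-tune gpt-3.5-turbo with function calls. The primary use case here is structured data extraction. Our main focus is distilling GPT-4 outputs to help improve gpt-3.5-turbo's function calling capabilities.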

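We will walk through some examples, from simple to advanced:
1. Fine-tuning on some toy messages/structured outputs logged through our OpenAI Pydantic program object.
2. Fine-tuning on context-augmented queries/structured outputs over an entire document corpus, and using the result in a RAG system.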


In [None]:
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-finetuning-callbacks
%pip install llama-index-readers-file pymupdf
%pip install llama-index-program-openai

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os
import openai

In [None]:
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

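## Fine-tuning Using GPT-4 Pydantic Programs

In this section we show how to log inputs/outputs through our low-level Pydantic program module. We then use that dataset to fine-tune an LLM.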


### Defining Pydantic Model + Program

Here, we define the GPT-4 powered function calling program that will generate structured outputs into a Pydantic object (an Album).
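
To make "structured outputs into an object" concrete, here is a minimal sketch of the last step of function calling: the model returns its arguments as a JSON string, and the program turns that string into typed objects. This is not how `OpenAIPydanticProgram` is implemented. It uses stdlib dataclasses as a stand-in for Pydantic (so there is no type validation), and the names `SongRecord`, `AlbumRecord`, `parse_album`, and the example payload are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SongRecord:
    title: str
    length_seconds: int


@dataclass
class AlbumRecord:
    name: str
    artist: str
    songs: List[SongRecord]


def parse_album(arguments_json: str) -> AlbumRecord:
    """Turn a function call's JSON arguments into typed objects."""
    raw = json.loads(arguments_json)
    songs = [SongRecord(**song) for song in raw["songs"]]
    return AlbumRecord(name=raw["name"], artist=raw["artist"], songs=songs)


# Hypothetical arguments string, shaped like a function-call payload.
example_arguments = json.dumps(
    {
        "name": "Example Album",
        "artist": "Example Artist",
        "songs": [{"title": "Track One", "length_seconds": 180}],
    }
)
album = parse_album(example_arguments)
print(album.songs[0].title)
```

A Pydantic model does the same conversion but also checks field types and reports validation errors, which is why the notebook uses it for the real program.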


In [None]:
from llama_index.program.openai import OpenAIPydanticProgram
from pydantic import BaseModel
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from typing import List


class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]


finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-4", callback_manager=callback_manager)


prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Use the movie {movie_name} as inspiration.\
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    llm=llm,
    verbose=False,
)

### Log Inputs/Outputs

We define some sample movie names as inputs and log the outputs through the function calling program.


In [None]:
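# NOTE: OpenAI fine-tuning requires at least 10 examples, so we need 10+ movies
movie_names = [
    "The Shining",
    "The Departed",
    "Titanic",
    "Goodfellas",
    "Pretty Woman",
    "Home Alone",
    "Caged Fury",
    "Edward Scissorhands",
    "Total Recall",
    "Ghost",
    "Tremors",
    "RoboCop",
    "Rocky V",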
]

In [None]:
from tqdm.notebook import tqdm

for movie_name in tqdm(movie_names):
    output = program(movie_name=movie_name)
    print(output.json())

  0%|          | 0/13 [00:00<?, ?it/s]

{"name": "The Shining", "artist": "Various Artists", "songs": [{"title": "Main Title", "length_seconds": 180}, {"title": "Opening Credits", "length_seconds": 120}, {"title": "The Overlook Hotel", "length_seconds": 240}, {"title": "Redrum", "length_seconds": 150}, {"title": "Here's Johnny!", "length_seconds": 200}]}
{"name": "The Departed Soundtrack", "artist": "Various Artists", "songs": [{"title": "Gimme Shelter", "length_seconds": 272}, {"title": "Comfortably Numb", "length_seconds": 383}, {"title": "I'm Shipping Up to Boston", "length_seconds": 166}, {"title": "Sweet Dreams (Are Made of This)", "length_seconds": 216}, {"title": "I'm Shipping Up to Boston (Instrumental)", "length_seconds": 166}, {"title": "The Departed Tango", "length_seconds": 123}, {"title": "Thief's Theme", "length_seconds": 201}, {"title": "Well Well Well", "length_seconds": 126}, {"title": "Comfortably Numb (Live)", "length_seconds": 383}, {"title": "Sail On, Sailor", "length_seconds": 181}]}
{"name": "Titanic S

In [None]:
finetuning_handler.save_finetuning_events("mock_finetune_songs.jsonl")

Wrote 14 examples to mock_finetune_songs.jsonl


In [None]:
!cat mock_finetune_songs.jsonl
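
Besides printing the file with `cat`, it can help to check that every line parses as JSON and to see which top-level keys the records carry. The helper below is a small sketch for that; `summarize_jsonl` is a hypothetical name, and it makes no assumption about the record layout beyond one JSON object per line.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count records and top-level keys in a JSONL file."""
    num_records = 0
    key_counts = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # raises ValueError on malformed JSON
        num_records += 1
        key_counts.update(record.keys())
    return num_records, dict(key_counts)


# Only runs if the file from the previous cell exists in the working directory.
if Path("mock_finetune_songs.jsonl").exists():
    print(summarize_jsonl("mock_finetune_songs.jsonl"))
```

The record count should match the number reported by `save_finetuning_events`.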

### Fine-tune on the Dataset

We now define a fine-tuning engine and fine-tune on the mock dataset.


In [None]:

from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "mock_finetune_songs.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, you can specify its id here
    validate_json=False,  # OpenAI's JSON validation code doesn't support function calling yet
)

In [None]:
finetune_engine.finetune()

In [None]:
finetune_engine.get_current_job()

<FineTuningJob fine_tuning.job id=ftjob-uJ9kQ9pI0p0YNatBDxF3VITv at 0x172a5c9a0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-uJ9kQ9pI0p0YNatBDxF3VITv",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1696463378,
  "finished_at": 1696463749,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::8660TXqx",
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [
    "file-Hbpw15BAwyf3e4HK5Z9g4IK2"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-MNh7snhv0triDIhsrErokSMY",
  "hyperparameters": {
    "n_epochs": 7
  },
  "trained_tokens": 22834,
  "error": null
}
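
Fine-tuning jobs take a while, so `get_current_job()` usually has to be called more than once before `status` reads `succeeded`. The sketch below shows one way to poll until a job reaches a terminal state. `wait_for_terminal_status` is a hypothetical helper, and the commented usage line assumes the job object exposes a `.status` attribute matching the `"status"` field printed above.

```python
import time


def wait_for_terminal_status(
    get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep
):
    """Poll get_status() until it returns a terminal status or we time out."""
    terminal = {"succeeded", "failed", "cancelled"}
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Assumed usage with the engine defined above (not executed here):
# wait_for_terminal_status(lambda: finetune_engine.get_current_job().status)
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting.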

### Try it Out!

We obtain the fine-tuned LLM and use it with the Pydantic program.


In [None]:
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

In [None]:
ft_program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album,
    prompt_template_str=prompt_template_str,
    llm=ft_llm,
    verbose=False,
)

In [None]:
ft_program(movie_name="Goodfellas")

Album(name='Goodfellas Soundtrack', artist='Various Artists', songs=[Song(title='Rags to Riches', length_seconds=180), Song(title='Gimme Shelter', length_seconds=270), Song(title='Layla', length_seconds=270), Song(title='Jump into the Fire', length_seconds=240), Song(title='Atlantis', length_seconds=180), Song(title='Beyond the Sea', length_seconds=180), Song(title='Sunshine of Your Love', length_seconds=240), Song(title='Mannish Boy', length_seconds=240), Song(title='Layla (Piano Exit)', length_seconds=120)])

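## Fine-tuning Structured Outputs through a RAG System

One use case for function calling is getting structured outputs through a RAG system.

Here we show how to create a training dataset of context-augmented inputs and structured outputs over an unstructured document. We can then fine-tune the LLM and plug it into a RAG system to perform retrieval and output extraction.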


In [None]:
!mkdir data && wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

--2023-10-04 23:46:36--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2023-10-04 23:47:25 (298 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]


In [None]:
from pydantic import Field
from typing import List


class Citation(BaseModel):
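    """Citation class."""

    author: str = Field(
        ..., description="Inferred first author (usually last name)"
    )
    year: int = Field(..., description="Inferred year")
    desc: str = Field(
        ...,
        description=(
            "Inferred description from the text of the work that the author is cited for"
        ),
    )


class Response(BaseModel):
    """List of author citations.

    Extracted over unstructured text.

    """

    citations: List[Citation] = Field(
        ...,
        description=(
            "List of author citations (organized by author, year, and description)."
        ),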
    )


In [None]:
from llama_index.readers.file import PyMuPDFReader
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from pathlib import Path

In [None]:
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

In [None]:
doc_text = "\n\n".join([d.get_content() for d in docs0])
metadata = {
    "paper_title": "Llama 2: Open Foundation and Fine-Tuned Chat Models"
}
docs = [Document(text=doc_text, metadata=metadata)]

In [None]:
chunk_size = 1024
node_parser = SentenceSplitter(chunk_size=chunk_size)
nodes = node_parser.get_nodes_from_documents(docs)

In [None]:
len(nodes)

89
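
For intuition about what sentence-aware chunking does, here is a deliberately simplified sketch that greedily packs whole sentences into chunks under a size limit. It is not the `SentenceSplitter` implementation: it counts whitespace-separated words rather than tokens, has no overlap between chunks, and `pack_sentences` is a hypothetical name.

```python
import re


def pack_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
print(pack_sentences(sample, max_words=5))
```

Keeping sentences intact means a citation and the sentence that describes it tend to stay in the same chunk, which matters for the extraction task later in the notebook.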

In [None]:
from llama_index.core import Settings

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

Settings.chunk_size = chunk_size

gpt_4_llm = OpenAI(
    model="gpt-4-0613", temperature=0.3, callback_manager=callback_manager
)

gpt_35_llm = OpenAI(
    model="gpt-3.5-turbo-0613",
    temperature=0.3,
    callback_manager=callback_manager,
)

eval_llm = OpenAI(model="gpt-4-0613", temperature=0)

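### Generate Dataset

Here we show how to generate a training dataset over these unstructured chunks/nodes.

We generate questions that ask for citations in different contexts. We run these questions through a GPT-4 RAG pipeline, extract structured outputs, and log the inputs/outputs.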


In [None]:
# set up the dataset generator
from llama_index.core.evaluation import DatasetGenerator
from llama_index.core import SummaryIndex
from llama_index.core import PromptTemplate
from tqdm.notebook import tqdm
from tqdm.asyncio import tqdm_asyncio


fp = open("data/qa_pairs.jsonl", "w")

question_gen_prompt = PromptTemplate(
    """
{query_str}

Context:
{context_str}

Questions:
"""
)

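question_gen_query = """\
Snippets from a research paper are given below. They contain citations.
Please generate questions from the text asking about these citations.

For instance, here are some sample questions:
Which citations correspond to related works on transformer models?
Tell me about authors that worked on advancing RLHF.
Can you tell me citations corresponding to all computer vision works?\
"""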

qr_pairs = []
node_questions_tasks = []
for idx, node in enumerate(nodes[:39]):
    num_questions = 1  # increase this to generate more questions per node
    dataset_generator = DatasetGenerator(
        [node],
        question_gen_query=question_gen_query,
        text_question_template=question_gen_prompt,
        llm=eval_llm,
        metadata_mode="all",
        num_questions_per_chunk=num_questions,
    )

    task = dataset_generator.agenerate_questions_from_nodes(num=num_questions)
    node_questions_tasks.append(task)
node_questions_lists = await tqdm_asyncio.gather(*node_questions_tasks)
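
The cell above builds one coroutine per node and awaits them all together. The sketch below shows the same fan-out pattern with plain `asyncio.gather` and a fake generator, so it runs without any API calls. `fake_generate_questions` and `generate_for_all` are hypothetical stand-ins, not LlamaIndex functions.

```python
import asyncio


async def fake_generate_questions(node_text, num):
    """Hypothetical stand-in for an LLM-backed question generator."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"Question {i + 1} about: {node_text}" for i in range(num)]


async def generate_for_all(node_texts, num_per_node=1):
    # Build one coroutine per node, then await them together.
    tasks = [fake_generate_questions(text, num_per_node) for text in node_texts]
    # gather returns results in the same order the awaitables were passed in.
    return await asyncio.gather(*tasks)


questions = asyncio.run(generate_for_all(["chunk A", "chunk B"]))
print(questions)
```

Because results come back in input order, the list at position `idx` belongs to the node at position `idx`, which is the alignment the later query loop depends on. Inside a notebook you would typically `await generate_for_all(...)` directly instead of calling `asyncio.run`.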

In [None]:
node_questions_lists

In [None]:
from llama_index.core import VectorStoreIndex

gpt4_index = VectorStoreIndex(nodes=nodes)
gpt4_query_engine = gpt4_index.as_query_engine(
    output_cls=Response, similarity_top_k=1, llm=gpt_4_llm
)

In [None]:
from json import JSONDecodeError

for idx, node in enumerate(tqdm(nodes[:39])):
    node_questions_0 = node_questions_lists[idx]
    for question in node_questions_0:
        try:
            # NOTE: we don't need the response; events are logged through the fine-tuning handler
            gpt4_query_engine.query(question)
        except Exception as e:
            print(f"Error for question {question}, {repr(e)}")

  0%|          | 0/39 [00:00<?, ?it/s]

Error for question Which citations are referred to in the discussion about safety investigations into pretraining data and pretrained models?, ValidationError(model='Response', errors=[{'loc': ('__root__',), 'msg': 'Expecting value: line 1 column 1 (char 0)', 'type': 'value_error.jsondecode', 'ctx': {'msg': 'Expecting value', 'doc': 'Empty Response', 'pos': 0, 'lineno': 1, 'colno': 1}}])


In [None]:
finetuning_handler.save_finetuning_events("llama2_citation_events.jsonl")

Wrote 83 examples to llama2_citation_events.jsonl


### Set Up Fine-tuning

We kick off fine-tuning over the generated dataset.


In [None]:

from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "llama2_citation_events.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, you can specify its id here
    validate_json=False,  # OpenAI's JSON validation code doesn't support function calling yet
)

In [None]:
finetune_engine.finetune()

In [None]:
finetune_engine.get_current_job()

<FineTuningJob fine_tuning.job id=ftjob-ATYm4yZHP1QvXs1wx85Ix79F at 0x1752b6b60> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-ATYm4yZHP1QvXs1wx85Ix79F",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1696497663,
  "finished_at": 1696498092,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::86EwPw83",
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [
    "file-wabcIIxjLqvhqOVohf4qSmE7"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-WbYcsinIbH8vyCAstcoFEr92",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 132678,
  "error": null
}
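
The job fields support some quick arithmetic. The sketch below derives the time between creation and completion and the tokens per epoch for the two jobs in this notebook. It assumes `trained_tokens` is the per-epoch token count multiplied by `n_epochs`, which the exact divisibility of both values is consistent with; `job_stats` is a hypothetical helper.

```python
def job_stats(created_at, finished_at, trained_tokens, n_epochs):
    """Derive elapsed seconds and tokens per epoch from job fields."""
    return {
        # Unix timestamps, so the difference is in seconds (includes queue time).
        "duration_seconds": finished_at - created_at,
        # Assumption: trained_tokens = tokens in the training file * n_epochs.
        "tokens_per_epoch": trained_tokens / n_epochs,
    }


# Values copied from the two job outputs in this notebook.
songs_job = job_stats(1696463378, 1696463749, 22834, 7)
citations_job = job_stats(1696497663, 1696498092, 132678, 3)
print(songs_job)      # {'duration_seconds': 371, 'tokens_per_epoch': 3262.0}
print(citations_job)  # {'duration_seconds': 429, 'tokens_per_epoch': 44226.0}
```

Under that assumption, the citation dataset is more than ten times larger per epoch than the toy songs dataset, yet it was trained for fewer epochs.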

### Use within RAG Pipeline

Let's plug the fine-tuned LLM into a full RAG pipeline that produces structured outputs.


In [None]:
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

In [None]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex(nodes=nodes)
query_engine = vector_index.as_query_engine(
    output_cls=Response, similarity_top_k=1, llm=ft_llm
)

In [None]:
# set up a baseline that uses the base (non-fine-tuned) gpt-3.5-turbo
base_index = VectorStoreIndex(nodes=nodes)
base_query_engine = base_index.as_query_engine(
    output_cls=Response, similarity_top_k=1, llm=gpt_35_llm
)

In [None]:
query_str = """\
Which citation is used to measure the truthfulness of Llama 2?\
"""

response = query_engine.query(query_str)
print(str(response))

{"citations": [{"author": "Lin et al.", "year": 2021, "desc": "TruthfulQA, used for LLM hallucinations to measure whether a language model is truthful in generating answers to questions while being informative at the same time."}]}


In [None]:
base_response = base_query_engine.query(query_str)
print(str(base_response))

{"citations": [{"author": "Lin et al.", "year": 2021, "desc": "TruthfulQA"}]}


In [None]:
# view the source node that was retrieved
print(response.source_nodes[0].get_content())

In [None]:
# as a reference, take a look at the GPT-4 response
gpt4_response = gpt4_query_engine.query(query_str)
print(str(gpt4_response))

{"citations": [{"author": "Lin et al.", "year": 2021, "desc": "TruthfulQA, used for LLM hallucinations to measure whether a language model is truthful in generating answers to questions while being informative at the same time."}]}
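
Since all three engines return JSON in the same shape, the outputs can be compared programmatically rather than by eye. The sketch below parses the fine-tuned and base outputs shown above and compares which citations were found and how detailed each description is. `citation_keys` and `desc_word_counts` are hypothetical helpers.

```python
import json

# Outputs copied verbatim from the cells above.
ft_output = (
    '{"citations": [{"author": "Lin et al.", "year": 2021, "desc": '
    '"TruthfulQA, used for LLM hallucinations to measure whether a '
    'language model is truthful in generating answers to questions '
    'while being informative at the same time."}]}'
)
base_output = (
    '{"citations": [{"author": "Lin et al.", "year": 2021, "desc": "TruthfulQA"}]}'
)


def citation_keys(output_json):
    """Return the set of (author, year) pairs in a Response-shaped JSON string."""
    return {(c["author"], c["year"]) for c in json.loads(output_json)["citations"]}


def desc_word_counts(output_json):
    """Return the word count of each citation's description."""
    return [len(c["desc"].split()) for c in json.loads(output_json)["citations"]]


print(citation_keys(ft_output) == citation_keys(base_output))
print(desc_word_counts(ft_output), desc_word_counts(base_output))
```

On this query both models find the same citation; the difference is in the description, where the fine-tuned model's output matches GPT-4's word for word while the base model returns only the benchmark name. A single query is an illustration, not an evaluation.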
