# [Beta] Multi-modal ReAct Agent

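In this tutorial, we show you how to build a multi-modal ReAct agent.

This is an agent that can take both text and images as its input task definition, and that uses chain-of-thought reasoning and tool use to try to solve the task.

It is implemented with our lower-level agent API, which lets us step through the ReAct loop explicitly so you can see what happens at each step.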
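
To make the idea of "stepping through the ReAct loop" concrete before we touch the real API, here is a minimal toy sketch. It calls no model and is not how the library is implemented: `StepOutput` and `ToyReActAgent` are hypothetical names, and the "reasoning" is scripted. It only shows the shape of the loop, where each step yields either a tool call with an observation or a final response.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StepOutput:
    """Result of one ReAct step (hypothetical, for illustration only)."""

    thought: str
    action: Optional[str] = None
    observation: Optional[str] = None
    response: Optional[str] = None

    @property
    def is_last(self) -> bool:
        # A step is final once it carries a response instead of a tool call.
        return self.response is not None


@dataclass
class ToyReActAgent:
    """Scripted stand-in for an LLM-driven agent: no model is called."""

    tools: Dict[str, Callable[[str], str]]
    history: List[StepOutput] = field(default_factory=list)

    def run_step(self, task: str) -> StepOutput:
        if not self.history:
            # First step: "decide" to call a tool and record its observation.
            name = next(iter(self.tools))
            out = StepOutput(
                thought="I need a tool to answer this.",
                action=name,
                observation=self.tools[name](task),
            )
        else:
            # Later step: answer from the most recent observation.
            out = StepOutput(
                thought="I can answer without using any more tools.",
                response=f"Answer based on: {self.history[-1].observation}",
            )
        self.history.append(out)
        return out


toy_agent = ToyReActAgent(tools={"lookup": lambda q: f"notes about '{q}'"})
step = toy_agent.run_step("new features")
while not step.is_last:
    print(f"Thought: {step.thought} | Action: {step.action}")
    step = toy_agent.run_step("new features")
print(step.response)
```

The real agent below follows the same pattern: run one step, inspect it, and stop once a step is marked as the last one.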

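We show two use cases:
1. **RAG Agent**: Given text/images, can query a RAG pipeline to look up the answers (given a screenshot from OpenAI Dev Day 2023).
2. **Web Agent**: Given text/images, can query a web tool to look up relevant information from the web (given a picture of shoes).

**NOTE**: This is explicitly a beta feature; the abstractions will likely change over time!

**NOTE**: This currently only works with GPT-4V.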


## Augment Image Analysis with a RAG Pipeline

In this section we create a multi-modal agent equipped with a RAG tool.


### Setup Data


In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-readers-web
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-tools-metaphor

In [None]:
# Create the target directory, then download the images we will run queries over
!mkdir -p other_images/openai
!wget "https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000" -O other_images/openai/dev_day.png
!wget "https://drive.google.com/uc?id=1B4f5ZSIKN0zTTPPRlZ915Ceb3_uF9Zlq&export=download" -O other_images/adidas.png

--2024-01-02 20:25:25-- https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000
Resolving images.openai.com (images.openai.com)... 2606:4700:4400::6812:28cd, 2606:4700:4400::ac40:9333, 172.64.147.51, ...
Connecting to images.openai.com (images.openai.com)|2606:4700:4400::6812:28cd|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300894 (294K) [image/jpeg]
Saving to: ‘other_images/openai/dev_day.png’


2024-01-02 20:25:25 (13.8 MB/s) - ‘other_images/openai/dev_day.png’ saved [300894/300894]



In [None]:
from llama_index.readers.web import SimpleWebPageReader

url = "https://openai.com/blog/new-models-and-developer-products-announced-at-devday"
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[url])

### Setup Tools


In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata

In [None]:
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

In [None]:
vector_index = VectorStoreIndex.from_documents(
    documents,
)

In [None]:
query_tool = QueryEngineTool(
    query_engine=vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_tool",
        description=(
            "Useful for looking up new features announced by OpenAI"
            # "Useful for looking up any information regarding the image"
        ),
    ),
)
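
The tool description matters because a ReAct-style agent generally picks a tool based on its name and description. The sketch below illustrates that idea with a hypothetical `ToyToolMetadata` class and a made-up text layout. It is not the prompt LlamaIndex actually builds, only a way to see what the model has to go on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToolMetadata:
    """Hypothetical stand-in for a tool's metadata (name plus description)."""

    name: str
    description: str


def render_tool_block(tools: List[ToyToolMetadata]) -> str:
    """Format tools as text an agent prompt could include (illustrative layout)."""
    lines = []
    for tool in tools:
        lines.append(f"> Tool Name: {tool.name}")
        lines.append(f"  Tool Description: {tool.description}")
    return "\n".join(lines)


toy_tools = [
    ToyToolMetadata(
        name="vector_tool",
        description="Useful for looking up new features announced by OpenAI",
    )
]
tool_block = render_tool_block(toy_tools)
print(tool_block)
```

A vague description (for example, "a tool") would give the model little reason to call the tool, so it is worth stating what the underlying index actually contains.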

### Setup Agent


In [None]:
from llama_index.core.agent.react_multimodal.step import (
    MultimodalReActAgentWorker,
)
from llama_index.core.multi_modal_llms import MultiModalLLM
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.agent import Task

mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Build the step-wise worker from the tool list and the multi-modal LLM
react_step_engine = MultimodalReActAgentWorker.from_tools(
    [query_tool],
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = react_step_engine.as_agent()

In [None]:
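query_str = (
    "The photo shows some new features released by OpenAI. "
    "Can you pinpoint the features in the photo and give more details using relevant tools?"
)

from llama_index.core.schema import ImageDocument

# Image document passed to the agent alongside the text query
image_document = ImageDocument(image_path="other_images/openai/dev_day.png")

task = agent.create_task(
    query_str,
    extra_state={"image_docs": [image_document]},
)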
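
Because the images come from a download step, a failed download can leave a missing or empty file behind, and the agent would then be reasoning over nothing. A small sanity check before creating the task might look like the sketch below. `usable_image_paths` is a hypothetical helper, not part of LlamaIndex, and the demo uses temporary files so it runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def usable_image_paths(paths: Iterable[str]) -> List[Path]:
    """Keep only paths that point to existing, non-empty files."""
    return [p for p in map(Path, paths) if p.is_file() and p.stat().st_size > 0]


# Demo with temporary files: one with content, one empty, one missing.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.png"
    good.write_bytes(b"\x89PNG placeholder bytes")
    empty = Path(tmp) / "empty.png"
    empty.touch()
    missing = Path(tmp) / "missing.png"

    checked = usable_image_paths([str(good), str(empty), str(missing)])
    print([p.name for p in checked])
```

In the notebook, the same check could be run on `other_images/openai/dev_day.png` before it is wrapped in an `ImageDocument`.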

In [None]:
from llama_index.core.agent import AgentRunner


def execute_step(agent: AgentRunner, task: Task):
    step_output = agent.run_step(task.task_id)
    if step_output.is_last:
        response = agent.finalize_response(task.task_id)
        print(f"> Agent finished: {str(response)}")
        return response
    else:
        return None


def execute_steps(agent: AgentRunner, task: Task):
    response = execute_step(agent, task)
    while response is None:
        response = execute_step(agent, task)
    return response
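
Note that `execute_steps` keeps looping until the agent reports a final step. If you would rather cap the number of steps, a bounded variant is sketched below. `execute_steps_bounded`, `max_steps`, and `StubAgent` are hypothetical names introduced for this illustration; the stub only mimics the `run_step` / `finalize_response` / `is_last` interface used in the cell above so the sketch can run without an LLM.

```python
from types import SimpleNamespace


def execute_steps_bounded(agent, task, max_steps=10):
    """Run steps until the agent finishes or max_steps is reached."""
    for _ in range(max_steps):
        step_output = agent.run_step(task.task_id)
        if step_output.is_last:
            return agent.finalize_response(task.task_id)
    raise RuntimeError(f"Agent did not finish within {max_steps} steps")


class StubAgent:
    """Hypothetical stand-in that finishes after a fixed number of steps."""

    def __init__(self, steps_needed):
        self.steps_needed = steps_needed
        self.steps_run = 0

    def run_step(self, task_id):
        self.steps_run += 1
        return SimpleNamespace(is_last=self.steps_run >= self.steps_needed)

    def finalize_response(self, task_id):
        return f"done after {self.steps_run} steps"


stub_task = SimpleNamespace(task_id="demo")
result = execute_steps_bounded(StubAgent(steps_needed=3), stub_task)
print(result)
```

Raising an error on the cap is one design choice; returning `None` instead would let the caller decide whether to keep stepping manually.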

In [None]:
# Run this instead of the cells below if you just want to run everything at once.
# response = execute_steps(agent, task)

In [None]:
response = execute_step(agent, task)

Thought: I need to use a tool to help me identify the new features released by OpenAI as shown in the photo.
Action: vector_tool
Action Input: {'input': 'new features released by OpenAI'}
Observation: OpenAI has released several new features, including the GPT-4 Turbo model, the Assistants API, and multimodal capabilities. The GPT-4 Turbo model is more capable, cheaper, and supports a 128K context window. The Assistants API makes it easier for developers to build their own assistive AI apps with goals and the ability to call models and tools. The multimodal capabilities include vision, image creation (DALL·E 3), and text-to-speech (TTS). These new features are being rolled out to OpenAI customers starting at 1pm PT today.

In [None]:
response = execute_step(agent, task)

Thought: The observation provided information about the new features released by OpenAI, which I can now relate to the image provided.
Response: The photo shows a user interface with a section titled "Playground" and several options such as "GPT-4.0-turbo," "Code Interpreter," "Translate," and "Chat." Based on the observation from the tool, these features are part of the new releases by OpenAI. Specifically, "GPT-4.0-turbo" likely refers to the GPT-4 Turbo model, which is a more capable and cost-effective version of the language model with a larger context window. The "Code Interpreter" could be related to the Assistants API, which allows developers to build AI apps that can interpret and execute code. The "Translate" and "Chat" options might be part of the multimodal capabilities, with "Translate" possibly involving text-to-text language translation and "Chat" involving conversational AI capabilities. The multimodal capabilities also include vision and image creation, w
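
The verbose traces printed by the steps above follow a `Thought` / `Action` / `Action Input` layout. If you want to inspect a step programmatically, a rough parser can be sketched as follows. `parse_react_step` is a hypothetical helper, and it assumes the exact line layout seen in this notebook's output, which may differ in other versions.

```python
import ast
import re
from typing import Any, Dict, Optional


def parse_react_step(text: str) -> Dict[str, Optional[Any]]:
    """Pull the Thought, Action and Action Input fields out of one trace step."""

    def first(pattern: str) -> Optional[str]:
        match = re.search(pattern, text, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    raw_input = first(r"^Action Input:\s*(\{.*\})\s*$")
    return {
        "thought": first(r"^Thought:\s*(.*)$"),
        "action": first(r"^Action:\s*(\S+)"),
        # The trace prints a Python-style dict, so literal_eval is used, not json.
        "action_input": ast.literal_eval(raw_input) if raw_input else None,
    }


sample_step = (
    "Thought: I need to use a tool to help me identify the new features.\n"
    "Action: vector_tool\n"
    "Action Input: {'input': 'new features released by OpenAI'}\n"
)
parsed = parse_react_step(sample_step)
print(parsed["action"], parsed["action_input"])
```

For a final step, which has a `Response` line instead of an `Action`, the `action` and `action_input` fields come back as `None`.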

In [None]:
print(str(response))

The photo shows a user interface with a section titled "Playground" and several options such as "GPT-4.0-turbo," "Code Interpreter," "Translate," and "Chat." Based on the observation from the tool, these features are part of the new releases by OpenAI. Specifically, "GPT-4.0-turbo" likely refers to the GPT-4 Turbo model, which is a more capable and cost-effective version of the language model with a larger context window. The "Code Interpreter" could be related to the Assistants API, which allows developers to build AI apps that can interpret and execute code. The "Translate" and "Chat" options might be part of the multimodal capabilities, with "Translate" possibly involving text-to-text language translation and "Chat" involving conversational AI capabilities. The multimodal capabilities also include vision and image creation, which could be represented in the Playground interface but are not visible in the provided section of the photo.


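## Augment Image Analysis with Web Search

In this example we show you how to set up a GPT-4V powered agent that looks up information on the web to help better explain a given image.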


In [None]:
from llama_index.tools.metaphor import MetaphorToolSpec
from llama_index.core.agent.react_multimodal.step import (
    MultimodalReActAgentWorker,
)
from llama_index.core.agent import AgentRunner
from llama_index.core.multi_modal_llms import MultiModalLLM
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.agent import Task

metaphor_tool_spec = MetaphorToolSpec(
    api_key="",
)
metaphor_tools = metaphor_tool_spec.to_tool_list()
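
The `api_key` above is left empty and must be filled in with your own key. Rather than pasting the key into the notebook, one option is to read it from an environment variable, as in this sketch. `require_env` is a hypothetical helper, and the variable names are assumptions (the demo sets `DEMO_API_KEY` itself purely so the snippet runs).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# Demo only: set a placeholder so the sketch is runnable on its own.
os.environ["DEMO_API_KEY"] = "abc123"
demo_key = require_env("DEMO_API_KEY")
print(len(demo_key))
```

In the cell above this could be used as `MetaphorToolSpec(api_key=require_env("METAPHOR_API_KEY"))`, where `METAPHOR_API_KEY` is simply a name we picked for illustration.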

In [None]:
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# Build the step-wise worker from the Metaphor tools and the multi-modal LLM
react_step_engine = MultimodalReActAgentWorker.from_tools(
    metaphor_tools,
    # [],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = react_step_engine.as_agent()

In [None]:
from llama_index.core.schema import ImageDocument

query_str = "Look up some reviews regarding these shoes."
image_document = ImageDocument(image_path="other_images/adidas.png")

task = agent.create_task(
    query_str, extra_state={"image_docs": [image_document]}
)

In [None]:
response = execute_step(agent, task)

Thought: The image shows a pair of shoes from a website that appears to be selling them. The user is asking for reviews of these shoes, but the image does not provide specific details such as the brand or model name. I will need to use a search tool to find reviews based on the visual information provided.
Action: search
Action Input: {'query': 'reviews for yellow and white running shoes with black stripes'}
[Metaphor Tool] Autoprompt: Here is a review for a great pair of yellow and white running shoes with black stripes:
Observation: [{'title': '| On', 'url': 'https://www.on-running.com/en-us/', 'id': 'bO8WCIY4qIAlfi5MbHMw7A'}, {'title': 'ASICS Gel Nimbus 21 Review 2023, Facts, Deals ($81)', 'url': 'https://runrepeat.com/asics-gel-nimbus-21', 'id': 'l-1YebY9dIRt1d8MeHyaBg'}, {'title': 'ASICS UK | Official Running Shoes & Clothing | ASICS', 'url': 'https://www.asics.com/gb/en-gb/', 'id': 'aAY5Tpax5jevadvNMj34_w'}, {'title': 'Asics Gel Nimbus', 'url': 'https:

In [None]:
response = execute_step(agent, task)

Thought: The search results returned a variety of shoe brands and models, but none of them seem to match the specific shoes in the image. To find reviews for the exact shoes, I need more information about the brand or model. Since the image shows a website interface with the brand likely being "UltraBOOST 1.0 DNA" and a discount code "CYBER," I can use this information to refine my search for reviews.
Action: search
Action Input: {'query': 'UltraBOOST 1.0 DNA shoes reviews'}
[Metaphor Tool] Autoprompt: Here is a review of the UltraBOOST 1.0 DNA shoes:
Observation: [{'title': 'Shoe', 'url': 'https://therunningclinic.com/shoe/?id=81645', 'id': 'SR5Ar004nuT97AkWDTdekg'}, {'title': 'Shoe', 'url': 'https://therunningclinic.com/shoe/?id=81644', 'id': 'kiEvTAolb2Kh4OrvSxnqUA'}, {'title': 'On Cloudboom Review', 'url': 'https://www.doctorsofrunning.com/2020/12/on-cloudboom-review.html', 'id': 'J3ARnLPH1KWzrIdGogwxnw'}, {'title': 'Adidas UltraBOOST 5.0 DNA Glow now av

In [None]:
# Run another step and capture the agent's response
response = execute_step(agent, task)

In [None]:
response = execute_step(agent, task)

Thought: I can answer without using any more tools.
Response: The Adidas Ultraboost is reviewed as an expensive but versatile shoe suitable for various activities, from running a half marathon to going out on a date. They are considered a good value due to their durability, with the reviewer getting hundreds of miles out of them. The shoes are described as lightweight, breathable, and comfortable enough to wear without socks. However, they are not recommended for wet climates as they do not perform well in the rain. The reviewer also mentions owning seven different models of Adidas Boost, indicating a strong preference for the brand.
> Agent finished: The Adidas Ultraboost is reviewed as an expensive but versatile shoe suitable for various activities, from running a half marathon to going out on a date. They are considered a good value due to their durability, with the reviewer getting hundreds of miles out of them. The shoes are described as lightweight, breathable,

In [None]:
print(str(response))

The Adidas Ultraboost is reviewed as an expensive but versatile shoe suitable for various activities, from running a half marathon to going out on a date. They are considered a good value due to their durability, with the reviewer getting hundreds of miles out of them. The shoes are described as lightweight, breathable, and comfortable enough to wear without socks. However, they are not recommended for wet climates as they do not perform well in the rain. The reviewer also mentions owning seven different models of Adidas Boost, indicating a strong preference for the brand.
