# Evaluating LlamaIndex Agents

Building agents that can intelligently use tools and make decisions is only half the journey; ensuring that these agents are accurate, reliable, and performant is what truly defines their success. [LlamaIndex](https://docs.llamaindex.ai/en/stable/understanding/agent/) provides various ways to create agents, including [FunctionAgents](https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/), [CodeActAgents](https://docs.llamaindex.ai/en/stable/examples/agent/code_act_agent/), and [ReActAgents](https://docs.llamaindex.ai/en/stable/examples/agent/react_agent/). In this tutorial, we will explore how to evaluate these different agent types using both pre-built Ragas metrics and custom evaluation metrics.

Let's get started.

The tutorial is divided into three sections:

1. **Evaluating with Off-the-Shelf Ragas Metrics**
   Here we will examine two fundamental evaluation tools: AgentGoalAccuracy, which measures how effectively an agent identifies and achieves the user's intended objective, and Tool Call Accuracy, which assesses the agent's ability to select and invoke appropriate tools in the correct sequence to complete tasks.

2. **Custom Metrics for CodeActAgent Evaluation**
   This section focuses on LlamaIndex's prebuilt CodeActAgent, demonstrating how to develop tailored evaluation metrics that address the specific requirements and capabilities of code-generating agents.

3. **Query Engine Tool Assessment**
   The final section explores how to leverage Ragas RAG metrics to evaluate query engine functionality within agents, providing insights into retrieval effectiveness and response quality when agents access information systems.

## Ragas Agentic Metrics

To demonstrate evaluations using Ragas metrics, we will create a simple workflow with a single LlamaIndex Function Agent, and use that to cover the basic functionality.

??? note "Click to View the Function Agent Setup"

    ```python
    from llama_index.llms.openai import OpenAI


    async def send_message(to: str, content: str) -> str:
        """Dummy function to simulate sending an email."""
        return f"Successfully sent mail to {to}"


    llm = OpenAI(model="gpt-4o-mini")
    ```

    ```python
    from llama_index.core.agent.workflow import FunctionAgent

    agent = FunctionAgent(
        tools=[send_message],
        llm=llm,
        system_prompt="You are a helpful assistant of Jane",
    )
    ```

### Agent Goal Accuracy

The true value of an AI agent lies in its ability to understand what users want and deliver it effectively. Agent Goal Accuracy serves as a fundamental metric that evaluates whether an agent successfully accomplishes what the user intended. This measurement is crucial as it directly reflects how well the agent interprets user needs and takes appropriate actions to fulfill them.

Ragas provides two key variants of this metric:

- [AgentGoalAccuracyWithReference](../../concepts/metrics/available_metrics/agents.md#with-reference) - A binary assessment (1 or 0) that compares the agent's final outcome against a predefined expected result.
- [AgentGoalAccuracyWithoutReference](../../concepts/metrics/available_metrics/agents.md#without-reference) - A binary assessment (1 or 0) that evaluates whether the agent achieved the user's goal based on inferred intent rather than predefined expectations.

The with-reference variant is ideal for scenarios where the expected outcome is well-defined, such as in controlled testing environments or when testing against ground truth data.

```python
from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream,
    ToolCall as LlamaToolCall,
    ToolCallResult,
)

handler = agent.run(user_msg="Send a message to jhon asking for a meeting")

events = []

async for ev in handler.stream_events():
    # Collect the events that Ragas needs to rebuild the conversation
    if isinstance(ev, (AgentInput, AgentOutput, LlamaToolCall, ToolCallResult)):
        events.append(ev)
    # Stream the agent's text output as it arrives
    elif isinstance(ev, AgentStream):
        print(f"{ev.delta}", end="", flush=True)

response = await handler
```

Output:
```
I have successfully sent a message to Jhon asking for a meeting.
```

```python
from ragas.integrations.llama_index import convert_to_ragas_messages

ragas_messages = convert_to_ragas_messages(events)
```

```python
from ragas.metrics import AgentGoalAccuracyWithoutReference
from ragas.llms import LlamaIndexLLMWrapper
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import ToolCall as RagasToolCall

evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

sample = MultiTurnSample(
    user_input=ragas_messages,
)

agent_goal_accuracy_without_reference = AgentGoalAccuracyWithoutReference(llm=evaluator_llm)
await agent_goal_accuracy_without_reference.multi_turn_ascore(sample)
```

Output:
```
1.0
```

```python
from ragas.metrics import AgentGoalAccuracyWithReference

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference="Successfully sent a message to Jhon asking for a meeting"
)

agent_goal_accuracy_with_reference = AgentGoalAccuracyWithReference(llm=evaluator_llm)
await agent_goal_accuracy_with_reference.multi_turn_ascore(sample)
```

Output:
```
1.0
```

### Tool Call Accuracy

In agentic workflows, an AI agent's effectiveness depends heavily on its ability to select and use the right tools at the right time. The Tool Call Accuracy metric evaluates how precisely an agent identifies and invokes appropriate tools in the correct sequence to complete a user's request.
This measurement ensures that agents not only understand what tools are available but also how to orchestrate them effectively to achieve the intended outcome.

- [ToolCallAccuracy](../../concepts/metrics/available_metrics/agents.md#tool-call-accuracy) compares the agent's actual tool usage against a reference sequence of expected tool calls. If the agent's tool selection or sequence differs from the reference, the metric returns a score of 0, indicating a failure to follow the optimal path to task completion.

```python
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=ragas_messages,
    reference_tool_calls=[
        RagasToolCall(
            name="send_message",
            args={'to': 'jhon', 'content': 'Hi Jhon,\n\nI hope this message finds you well. I would like to schedule a meeting to discuss some important matters. Please let me know your availability.\n\nBest regards,\nJane'},
        ),
    ],
)

tool_accuracy_scorer = ToolCallAccuracy()
await tool_accuracy_scorer.multi_turn_ascore(sample)
```

Output:
```
1.0
```

## Evaluating LlamaIndex CodeAct Agents

LlamaIndex offers a prebuilt CodeAct Agent that can be used to write and execute code, inspired by the original CodeAct paper. The idea is that, instead of outputting a simple JSON object, a Code Agent generates an executable code block, typically in a high-level language like Python. Writing actions in code rather than JSON-like snippets provides better:

- Composability: Code naturally allows nesting and reuse of functions; JSON actions lack this flexibility.
- Object management: Code elegantly handles operation outputs (`image = generate_image()`); JSON has no clean equivalent.
- Generality: Code expresses any computational task; JSON imposes unnecessary constraints.
- Representation in LLM training data: LLMs already understand code from training data, making it a more natural interface than specialized JSON.

??? note "Click to View the CodeActAgent Setup"

    ### Defining Functions

    ```python
    from llama_index.llms.openai import OpenAI

    # Configure the LLM
    llm = OpenAI(model="gpt-4o-mini")


    # Define a few helper functions
    def add(a: int, b: int) -> int:
        """Add two numbers together"""
        return a + b


    def subtract(a: int, b: int) -> int:
        """Subtract two numbers"""
        return a - b


    def multiply(a: int, b: int) -> int:
        """Multiply two numbers"""
        return a * b


    def divide(a: int, b: int) -> float:
        """Divide two numbers"""
        return a / b
    ```

    ### Create a Code Executor

    The CodeActAgent requires a `code_execute_fn` to execute the code generated by the agent.

    ```python
    from typing import Any, Dict
    import io
    import contextlib
    import ast
    import traceback


    class SimpleCodeExecutor:
        """
        A simple code executor that runs Python code with state persistence.

        This executor maintains a global and local state between executions,
        allowing for variables to persist across multiple code runs.

        NOTE: not safe for production use! Use with caution.
        """

        def __init__(self, locals: Dict[str, Any], globals: Dict[str, Any]):
            """
            Initialize the code executor.

            Args:
                locals: Local variables to use in the execution context
                globals: Global variables to use in the execution context
            """
            # State that persists between executions
            self.globals = globals
            self.locals = locals

        def execute(self, code: str) -> str:
            """
            Execute Python code and capture output and return values.

            Args:
                code: Python code to execute

            Returns:
                The captured stdout/stderr (or the error traceback), followed
                by the value of the final expression, if there is one.
            """
            # Capture stdout and stderr
            stdout = io.StringIO()
            stderr = io.StringIO()

            output = ""
            return_value = None
            try:
                # Execute with captured output
                with contextlib.redirect_stdout(
                    stdout
                ), contextlib.redirect_stderr(stderr):
                    # Try to detect if there's a return value (last expression)
                    try:
                        tree = ast.parse(code)
                        last_node = tree.body[-1] if tree.body else None

                        # If the last statement is an expression, capture its value
                        if isinstance(last_node, ast.Expr):
                            # Split code to add a return value assignment
                            last_line = code.rstrip().split("\n")[-1]
                            exec_code = (
                                code[: -len(last_line)]
                                + "\n__result__ = "
                                + last_line
                            )

                            # Execute modified code
                            exec(exec_code, self.globals, self.locals)
                            return_value = self.locals.get("__result__")
                        else:
                            # Normal execution
                            exec(code, self.globals, self.locals)
                    except Exception:
                        # If parsing fails, just execute the code as is
                        exec(code, self.globals, self.locals)

                # Get output
                output = stdout.getvalue()
                if stderr.getvalue():
                    output += "\n" + stderr.getvalue()

            except Exception as e:
                # Capture exception information
                output = f"Error: {type(e).__name__}: {str(e)}\n"
                output += traceback.format_exc()

            if return_value is not None:
                output += "\n\n" + str(return_value)

            return output
    ```

    ```python
    code_executor = SimpleCodeExecutor(
        # give access to our functions defined above
        locals={
            "add": add,
            "subtract": subtract,
            "multiply": multiply,
            "divide": divide,
        },
        globals={
            # give access to all builtins
            "__builtins__": __builtins__,
            # give access to numpy
            "np": __import__("numpy"),
        },
    )
    ```

    ### Setup the CodeAct Agent

    ```python
    from llama_index.core.agent.workflow import CodeActAgent
    from llama_index.core.workflow import Context

    agent = CodeActAgent(
        code_execute_fn=code_executor.execute,
        llm=llm,
        tools=[add, subtract, multiply, divide],
    )

    # context to hold the agent's session/state/chat history
    ctx = Context(agent)
    ```

### Running and Evaluating the CodeAct agent

```python
from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    AgentStream,
    ToolCall,
    ToolCallResult,
)

handler = agent.run("Calculate the sum of the first 10 fibonacci numbers", ctx=ctx)

events = []

async for event in handler.stream_events():
    if isinstance(event, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(event)
    elif isinstance(event, AgentStream):
        print(f"{event.delta}", end="", flush=True)
```

    The first 10 Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, and 34. I will calculate their sum.

    def fibonacci(n):
        fib_sequence = [0, 1]
        for i in range(2, n):
            next_fib = fib_sequence[-1] + fib_sequence[-2]
            fib_sequence.append(next_fib)
        return fib_sequence

    # Calculate the first 10 Fibonacci numbers
    first_10_fib = fibonacci(10)

    # Calculate the sum of the first 10 Fibonacci numbers
    sum_fib = sum(first_10_fib)
    print(sum_fib)

    The sum of the first 10 Fibonacci numbers is 88.

### Extract the ToolCall

```python
CodeAct_agent_tool_call = events[2]
agent_code = CodeAct_agent_tool_call.tool_kwargs["code"]

print(agent_code)
```

Output
```
def fibonacci(n):
    fib_sequence = [0, 1]
    for i in range(2, n):
        next_fib = fib_sequence[-1] + fib_sequence[-2]
        fib_sequence.append(next_fib)
    return fib_sequence

# Calculate the first 10 Fibonacci numbers
first_10_fib = fibonacci(10)

# Calculate the sum of the first 10 Fibonacci numbers
sum_fib = sum(first_10_fib)
print(sum_fib)
```

When assessing CodeAct agents, we can begin with foundational metrics that examine basic functionality, such as code compilability or appropriate argument selection. These straightforward evaluations provide a solid foundation before advancing to more sophisticated assessment approaches.

Ragas offers powerful custom metric capabilities that enable increasingly nuanced evaluation as your requirements evolve.

- [AspectCritic](../../concepts/metrics/available_metrics/aspect_critic.md) - Provides a binary evaluation (pass/fail) that determines whether an agent's response satisfies specific user-defined criteria, using LLM-based judgment to deliver clear success indicators.
- [RubricScoreMetric](../../concepts/metrics/available_metrics/general_purpose.md#rubrics-based-criteria-scoring) - Evaluates agent responses against comprehensive, predefined quality rubrics with discrete scoring levels, enabling consistent performance assessment across multiple dimensions.

```python
def is_compilable(code_str: str, mode="exec") -> bool:
    """Return True if the code string compiles without raising an error."""
    try:
        compile(code_str, "", mode)
        return True
    except Exception:
        return False


is_compilable(agent_code)
```

Output
```
True
```

```python
from llama_index.llms.openai import OpenAI
from ragas.metrics import AspectCritic
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LlamaIndexLLMWrapper

llm = OpenAI(model="gpt-4o-mini")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

correct_tool_args = AspectCritic(
    name="correct_tool_args",
    llm=evaluator_llm,
    definition="Score 1 if the tool arguments used in the tool call are correct and 0 otherwise",
)

sample = SingleTurnSample(
    user_input="Calculate the sum of the first 10 fibonacci numbers",
    response=agent_code,
)

await correct_tool_args.single_turn_ascore(sample)
```

Output:
```
1
```

## Evaluating Query Engine Tool

When evaluating with Ragas metrics, we need to ensure that our data is formatted suitably for evaluations. When working with a query engine tool within an agentic system, we can approach the evaluation as we would for any retrieval-augmented generation (RAG) system.

We will extract all instances where the query engine tool was called during user interactions. Using that, we can construct a Ragas RAG evaluation dataset based on our event stream data. Once the dataset is ready, we can apply the full suite of Ragas evaluation metrics. In this section, we will set up a Function Agent with Query Engine Tools.
The agent has access to two "tools": one to query the 2021 Lyft 10-K and the other to query the 2021 Uber 10-K.

??? note "Click to View the Agent Setup"

    ### Setting the LLMs

    ```python
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.core import Settings

    Settings.llm = OpenAI(model="gpt-4o-mini")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    ```

    ### Build Query Engine Tools

    ```python
    from llama_index.core import StorageContext, load_index_from_storage

    # Reuse previously persisted indexes if they exist
    try:
        storage_context = StorageContext.from_defaults(
            persist_dir="./storage/lyft"
        )
        lyft_index = load_index_from_storage(storage_context)

        storage_context = StorageContext.from_defaults(
            persist_dir="./storage/uber"
        )
        uber_index = load_index_from_storage(storage_context)

        index_loaded = True
    except Exception:
        index_loaded = False
    ```

    ```python
    !mkdir -p 'data/10k/'
    !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
    !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
    ```

    ```python
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    if not index_loaded:
        # load data
        lyft_docs = SimpleDirectoryReader(
            input_files=["./data/10k/lyft_2021.pdf"]
        ).load_data()
        uber_docs = SimpleDirectoryReader(
            input_files=["./data/10k/uber_2021.pdf"]
        ).load_data()

        # build index
        lyft_index = VectorStoreIndex.from_documents(lyft_docs)
        uber_index = VectorStoreIndex.from_documents(uber_docs)

        # persist index
        lyft_index.storage_context.persist(persist_dir="./storage/lyft")
        uber_index.storage_context.persist(persist_dir="./storage/uber")
    ```

    ```python
    lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)
    uber_engine = uber_index.as_query_engine(similarity_top_k=3)
    ```

    ```python
    from llama_index.core.tools import QueryEngineTool

    query_engine_tools = [
        QueryEngineTool.from_defaults(
            query_engine=lyft_engine,
            name="lyft_10k",
            description=(
                "Provides information about Lyft financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
        QueryEngineTool.from_defaults(
            query_engine=uber_engine,
            name="uber_10k",
            description=(
                "Provides information about Uber financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ]
    ```

    ### Agent Setup

    ```python
    from llama_index.core.agent.workflow import FunctionAgent, ReActAgent
    from llama_index.core.workflow import Context

    agent = FunctionAgent(tools=query_engine_tools, llm=OpenAI(model="gpt-4o-mini"))

    # context to hold the session/state
    ctx = Context(agent)
    ```

### Running and Evaluating Agents

```python
from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    ToolCall,
    ToolCallResult,
    AgentStream,
)

handler = agent.run("What's the revenue for Lyft in 2021 vs Uber?", ctx=ctx)

events = []

async for ev in handler.stream_events():
    if isinstance(ev, (AgentInput, AgentOutput, ToolCall, ToolCallResult)):
        events.append(ev)
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

response = await handler
```

Output:
```
In 2021, Lyft generated a total revenue of $3.21 billion, while Uber's total revenue was significantly higher at $17.455 billion.
```

We will extract every `ToolCallResult` event produced when a query engine tool was called during the interaction. From these events, we can construct a RAG evaluation dataset based on the event stream data.

```python
from ragas.dataset_schema import SingleTurnSample

ragas_samples = []

for event in events:
    if isinstance(event, ToolCallResult):
        if event.tool_name in ["lyft_10k", "uber_10k"]:
            sample = SingleTurnSample(
                user_input=event.tool_kwargs["input"],
                response=event.tool_output.content,
                retrieved_contexts=[
                    node.text
                    for node in event.tool_output.raw_output.source_nodes
                ],
            )
            ragas_samples.append(sample)
```

```python
from ragas.dataset_schema import EvaluationDataset

dataset = EvaluationDataset(samples=ragas_samples)
dataset.to_pandas()
```

Output:

|   | user_input | retrieved_contexts | response |
|---|------------|--------------------|----------|
| 0 | What was the total revenue for Uber in the yea... | [Financial and Operational Highlights\nYear En... | The total revenue for Uber in the year 2021 wa... |
| 1 | What was the total revenue for Lyft in the yea... | [Significant items\n subject to estimates and ... | The total revenue for Lyft in the year 2021 wa... |

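
Before scoring, it can help to sanity-check the extracted rows, since a sample with an empty response or no retrieved contexts would not produce meaningful RAG scores. The following is a minimal sketch that works on plain dictionaries mirroring the three columns above. `find_incomplete_rows` is a hypothetical helper written for illustration (it is not part of Ragas or LlamaIndex), and the example rows are placeholders rather than real agent output.

```python
from typing import Any, Dict, List


def find_incomplete_rows(rows: List[Dict[str, Any]]) -> List[int]:
    """Return indices of rows missing a query, a response, or retrieved contexts."""
    incomplete = []
    for i, row in enumerate(rows):
        has_query = bool(str(row.get("user_input") or "").strip())
        has_response = bool(str(row.get("response") or "").strip())
        contexts = row.get("retrieved_contexts") or []
        # At least one non-blank context chunk is required
        has_context = any(str(chunk).strip() for chunk in contexts)
        if not (has_query and has_response and has_context):
            incomplete.append(i)
    return incomplete


# Placeholder rows for illustration only (hypothetical, not agent output)
rows = [
    {
        "user_input": "What was the total revenue for Uber in 2021?",
        "retrieved_contexts": ["<retrieved chunk>"],
        "response": "<tool response>",
    },
    {
        "user_input": "What was the total revenue for Lyft in 2021?",
        "retrieved_contexts": [],  # nothing retrieved, so this row is flagged
        "response": "<tool response>",
    },
]

print(find_incomplete_rows(rows))  # [1]
```

With the dataset built above, the rows could be obtained with `dataset.to_pandas().to_dict("records")` and passed to the same helper.
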
The resulting dataset will not include reference answers by default, so we’ll be limited to using metrics that do not require references. However, if you wish to run reference-based evaluations, you can add a reference column to the dataset and then apply the relevant Ragas metrics.

### Evaluating using Ragas RAG Metrics

Let's assess the effectiveness of the query engines, particularly regarding retrieval quality and hallucination prevention. To accomplish this evaluation, we will employ two key Ragas metrics: faithfulness and context relevance. For the full list of metrics, see the [available metrics](../../concepts/metrics/available_metrics/).

This evaluation approach allows us to identify potential issues with either retrieval quality or response generation that could impact overall system performance.

- [Faithfulness](../../concepts/metrics/available_metrics/faithfulness.md) - Measures how accurately the generated response adheres to the facts presented in the retrieved context, ensuring claims made by the system can be directly supported by the information provided.
- [Context Relevance](../../concepts/metrics/available_metrics/nvidia_metrics.md#context-relevance) - Evaluates how effectively the retrieved information addresses the user's specific query by assessing its pertinence through dual LLM judgment mechanisms.

```python
from ragas import evaluate
from ragas.metrics import Faithfulness, ContextRelevance
from ragas.llms import LlamaIndexLLMWrapper
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
evaluator_llm = LlamaIndexLLMWrapper(llm=llm)

faithfulness = Faithfulness(llm=evaluator_llm)
context_relevance = ContextRelevance(llm=evaluator_llm)

result = evaluate(dataset, metrics=[faithfulness, context_relevance])
```

```
Evaluating: 100%|██████████| 4/4 [00:03<00:00,  1.19it/s]
```

```python
result.to_pandas()
```

Output:

|   | user_input | retrieved_contexts | response | faithfulness | nv_context_relevance |
|---|------------|--------------------|----------|--------------|----------------------|
| 0 | What was the total revenue for Uber in the yea... | [Financial and Operational Highlights\nYear En... | The total revenue for Uber in the year 2021 wa... | 1.0 | 1.0 |
| 1 | What was the total revenue for Lyft in the yea... | [Significant items\n subject to estimates and ... | The total revenue for Lyft in the year 2021 wa... | 1.0 | 1.0 |

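
One way to act on these scores is to separate retrieval problems from generation problems: low context relevance points at the retriever, while low faithfulness despite relevant context points at response generation. Below is a minimal sketch of that triage, assuming rows keyed by the two score columns shown above. The `triage` helper and the 0.5 threshold are illustrative assumptions, not Ragas defaults, and the second and third rows are hypothetical scores added to exercise each branch.

```python
from typing import Dict, List


def triage(row: Dict[str, float], threshold: float = 0.5) -> str:
    """Label a scored sample as a retrieval issue, a generation issue, or ok."""
    # Check retrieval first: irrelevant context makes faithfulness hard to interpret
    if row["nv_context_relevance"] < threshold:
        return "retrieval"
    if row["faithfulness"] < threshold:
        return "generation"
    return "ok"


scores: List[Dict[str, float]] = [
    {"faithfulness": 1.0, "nv_context_relevance": 1.0},  # same scores as the rows above
    {"faithfulness": 0.2, "nv_context_relevance": 1.0},  # hypothetical
    {"faithfulness": 0.9, "nv_context_relevance": 0.0},  # hypothetical
]

print([triage(row) for row in scores])  # ['ok', 'generation', 'retrieval']
```

The threshold is a judgment call and would normally be tuned against samples you have reviewed by hand.
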