{
"cells": [
{
"cell_type": "markdown",
"id": "6ff95379",
"metadata": {},
"source": [
"# Tool Evaluation with OpenAI Evals\n",
"\n",
"This cookbook shows how to **measure and improve a model’s ability to extract structured information from source code** with tool evaluation. In this case, the set of *symbols* (functions, classes, methods, and variables) defined in Python files. "
]
},
{
"cell_type": "markdown",
"id": "4cc30394",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Install the latest **openai** Python package ≥ 1.14.0 and set your `OPENAI_API_KEY` environment variable. If you also want to evaluate an *assistant with tools*, enable the *Assistants v2 beta* in your account.\n",
"\n",
"```bash\n",
"pip install --upgrade openai\n",
"export OPENAI_API_KEY=sk‑...\n",
"```\n",
"Below we import the SDK, create a client, and define a helper that builds a small dataset from files inside the **openai** package itself."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "acd0d746",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade openai pandas jinja2 rich --quiet\n",
"\n",
"import os\n",
"import time\n",
"import openai\n",
"from rich import print\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
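{
"cell_type": "markdown",
"id": "a1b2c3d0",
"metadata": {},
"source": [
"Before going further, it can help to confirm the environment is ready. The check below is an optional sketch: it assumes the `OPENAI_API_KEY` (or the `_OPENAI_API_KEY` fallback used above) is set, and it prints the installed SDK version.\n",
"\n",
"```python\n",
"# Optional sanity check (uses the same environment variables as the client above)\n",
"assert os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"), \"Set OPENAI_API_KEY first\"\n",
"print(\"openai SDK version:\", openai.__version__)\n",
"```"
]
},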
{
"cell_type": "markdown",
"id": "80618b60",
"metadata": {},
"source": [
"### Dataset factory & grading rubric\n",
"* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
"* `structured_output_grader` defines a detailed evaluation rubric. \n",
"* `sampled.output_tools[0].function.arguments.symbols` specifies the extracted symbols from the code file based on the tool invocation.\n",
"* `client.evals.create(...)` registers the eval with the platform."
]
},
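{
"cell_type": "markdown",
"id": "b7e31f09",
"metadata": {},
"source": [
"To make the templating concrete, here is a minimal sketch of what one dataset item looks like and what the `{{item.input}}` and `{{sample.output_tools[0].function.arguments.symbols}}` placeholders resolve to at grading time. The literal values are illustrative only, not taken from a real run.\n",
"\n",
"```python\n",
"# Illustrative shapes only (not real run data)\n",
"example_item = {\"input\": \"def add(a, b):\\n    return a + b\\n\"}\n",
"\n",
"# {{item.input}} resolves to example_item[\"input\"]; the sample placeholder\n",
"# resolves to the \"symbols\" argument of the model's tool call, e.g.:\n",
"example_tool_arguments = {\n",
"    \"symbols\": [\n",
"        {\"name\": \"add\", \"symbol_type\": \"function\"},\n",
"    ]\n",
"}\n",
"```"
]
},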
{
"cell_type": "code",
"execution_count": null,
"id": "120b6e4d",
"metadata": {
"tags": [
"original"
]
},
"outputs": [],
"source": [
"def get_dataset(limit=None):\n",
" openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
"\n",
" file_paths = [\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
" ]\n",
"\n",
" items = []\n",
" for file_path in file_paths:\n",
" items.append({\"input\": open(file_path, \"r\").read()})\n",
" if limit:\n",
" return items[:limit]\n",
" return items\n",
"\n",
"\n",
"structured_output_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of extracted information from a code file.\n",
"You will be given a code file and a list of extracted information.\n",
"You should grade the quality of the extracted information.\n",
"\n",
"You should grade the quality on a scale of 1 to 7.\n",
"You should apply the following criteria, and calculate your score as follows:\n",
"You should first check for completeness on a scale of 1 to 7.\n",
"Then you should apply a quality modifier.\n",
"\n",
"The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
"If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
"If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
"etc.\n",
"\"\"\"\n",
"\n",
"structured_output_grader_user_prompt = \"\"\"\n",
"\n",
"{{item.input}}\n",
"\n",
"\n",
"\n",
"{{sample.output_tools[0].function.arguments.symbols}}\n",
"\n",
"\"\"\""
]
},
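{
"cell_type": "markdown",
"id": "c2d4a8e1",
"metadata": {},
"source": [
"To make the rubric concrete: if the grader judges completeness at 6 out of 7 but applies a quality modifier of 0.8, it returns 6 × 0.8 = 4.8. With the pass threshold of 5.0 configured in the next cell, that sample would be marked as failed."
]
},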
{
"cell_type": "markdown",
"id": "d7f66a56",
"metadata": {},
"source": [
"### Evals Creation\n",
"\n",
"Here we create an eval that will be used to evaluate the quality of extracted information from code files.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95a5eaf6",
"metadata": {},
"outputs": [],
"source": [
"logs_eval = client.evals.create(\n",
" name=\"Code QA Eval\",\n",
" data_source_config={\n",
" \"type\": \"custom\",\n",
" \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n",
" \"include_sample_schema\": True,\n",
" },\n",
" testing_criteria=[\n",
" {\n",
" \"type\": \"score_model\",\n",
" \"name\": \"General Evaluator\",\n",
" \"model\": \"o3\",\n",
" \"input\": [\n",
" {\"role\": \"system\", \"content\": structured_output_grader},\n",
" {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n",
" ],\n",
" \"range\": [1, 7],\n",
" \"pass_threshold\": 5.0,\n",
" }\n",
" ],\n",
")\n",
"\n",
"symbol_tool = {\n",
" \"name\": \"extract_symbols\",\n",
" \"description\": \"Extract the symbols from the code file\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"symbols\": {\n",
" \"type\": \"array\",\n",
" \"description\": \"A list of symbols extracted from Python code.\",\n",
" \"items\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n",
" \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n",
" },\n",
" \"required\": [\"name\", \"symbol_type\"],\n",
" \"additionalProperties\": False,\n",
" },\n",
" }\n",
" },\n",
" \"required\": [\"symbols\"],\n",
" \"additionalProperties\": False,\n",
" },\n",
"}"
]
},
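{
"cell_type": "markdown",
"id": "d9f0b3c7",
"metadata": {},
"source": [
"As a quick sanity check on the schema, here is an illustrative example of the `arguments` payload a model could produce when calling `extract_symbols`; the specific symbols are made up for illustration.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Hypothetical tool-call arguments that conform to symbol_tool[\"parameters\"]\n",
"example_arguments = {\n",
"    \"symbols\": [\n",
"        {\"name\": \"Evals\", \"symbol_type\": \"class\"},\n",
"        {\"name\": \"create\", \"symbol_type\": \"function\"},\n",
"        {\"name\": \"__all__\", \"symbol_type\": \"variable\"},\n",
"    ]\n",
"}\n",
"print(json.dumps(example_arguments, indent=2))\n",
"```"
]
},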
{
"cell_type": "markdown",
"id": "73ae7e5e",
"metadata": {},
"source": [
"### Kick off model runs\n",
"Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d650e02",
"metadata": {},
"outputs": [],
"source": [
"gpt_4one_completions_run = client.evals.runs.create(\n",
" name=\"gpt-4.1\",\n",
" eval_id=logs_eval.id,\n",
" data_source={\n",
" \"type\": \"completions\",\n",
" \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
" \"input_messages\": {\n",
" \"type\": \"template\",\n",
" \"template\": [\n",
" {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
" {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
" ],\n",
" },\n",
" \"model\": \"gpt-4.1\",\n",
" \"sampling_params\": {\n",
" \"seed\": 42,\n",
" \"temperature\": 0.7,\n",
" \"max_completions_tokens\": 10000,\n",
" \"top_p\": 0.9,\n",
" \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n",
" },\n",
" },\n",
")\n",
"\n",
"gpt_4one_responses_run = client.evals.runs.create(\n",
" name=\"gpt-4.1-mini\",\n",
" eval_id=logs_eval.id,\n",
" data_source={\n",
" \"type\": \"responses\",\n",
" \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
" \"input_messages\": {\n",
" \"type\": \"template\",\n",
" \"template\": [\n",
" {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
" {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
" ],\n",
" },\n",
" \"model\": \"gpt-4.1-mini\",\n",
" \"sampling_params\": {\n",
" \"seed\": 42,\n",
" \"temperature\": 0.7,\n",
" \"max_completions_tokens\": 10000,\n",
" \"top_p\": 0.9,\n",
" \"tools\": [{\"type\": \"function\", **symbol_tool}],\n",
" },\n",
" },\n",
")"
]
},
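{
"cell_type": "markdown",
"id": "e4a6c2d8",
"metadata": {},
"source": [
"Note the small difference in how the tool is passed to the two runs above: the Completions data source uses the Chat Completions tool format, where the schema sits under a `function` key, while the Responses data source uses the flattened Responses tool format. A minimal sketch of the two shapes:\n",
"\n",
"```python\n",
"# Chat Completions tool format: schema nested under \"function\"\n",
"completions_tool = {\"type\": \"function\", \"function\": symbol_tool}\n",
"\n",
"# Responses tool format: schema fields flattened onto the tool object\n",
"responses_tool = {\"type\": \"function\", **symbol_tool}\n",
"```"
]
},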
{
"cell_type": "markdown",
"id": "6ea31f2a",
"metadata": {},
"source": [
"### Utility Poller\n",
"\n",
"We create a utility poller that will be used to poll for the results of the eval runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb8f3df4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
evalrun_6848e2269570819198b757fe12b979da completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_6848e2269570819198b757fe12b979da completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre>evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"</pre>\n"
],
"text/plain": [
"evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def poll_runs(eval_id, run_ids):\n",
" # poll both runs at the same time, until they are complete or failed\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
" if all(run.status in (\"completed\", \"failed\") for run in runs):\n",
" break\n",
" time.sleep(5)\n",
"\n",
"\n",
"poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4014cde",
"metadata": {
"tags": [
"original"
]
},
"outputs": [],
"source": [
"\n",
"### Get Output\n",
"completions_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
")\n",
"\n",
"responses_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
")\n"
]
},
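{
"cell_type": "markdown",
"id": "f1b8d3a9",
"metadata": {},
"source": [
"Before parsing the tool calls, it can help to glance at one raw output item. The snippet below is a sketch: it assumes at least one output item came back and uses the pydantic `model_dump()` helper to print a truncated JSON view.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Peek at the first completions output item (truncated for readability)\n",
"first_item = completions_output.data[0]\n",
"print(json.dumps(first_item.model_dump(), indent=2, default=str)[:800])\n",
"```"
]
},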
{
"cell_type": "markdown",
"id": "88ae7e17",
"metadata": {},
"source": [
"### Inspecting results\n",
"\n",
"For both completions and responses, we print the *symbols* dictionary that the model returned. You can diff this against the reference answer or compute precision / recall."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0cddb6d",
"metadata": {
"tags": [
"original"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" Completions vs Responses Output Symbols\n",
"
\n",
"
\n",
" \n",
" \n",
" | Completions Output | \n",
" Responses Output | \n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" | name | \n",
" symbol_type | \n",
" \n",
" \n",
" \n",
" \n",
" | Evals | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvals | \n",
" class | \n",
" \n",
" \n",
" | EvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | EvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __all__ | \n",
" variable | \n",
" \n",
" \n",
" \n",
" | \n",
" \n",
"\n",
" \n",
" \n",
" | name | \n",
" symbol_type | \n",
" \n",
" \n",
" \n",
" \n",
" | Evals | \n",
" class | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | with_raw_response | \n",
" function | \n",
" \n",
" \n",
" | with_streaming_response | \n",
" function | \n",
" \n",
" \n",
" | create | \n",
" function | \n",
" \n",
" \n",
" | retrieve | \n",
" function | \n",
" \n",
" \n",
" | update | \n",
" function | \n",
" \n",
" \n",
" | list | \n",
" function | \n",
" \n",
" \n",
" | delete | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvals | \n",
" class | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | with_raw_response | \n",
" function | \n",
" \n",
" \n",
" | with_streaming_response | \n",
" function | \n",
" \n",
" \n",
" | create | \n",
" function | \n",
" \n",
" \n",
" | retrieve | \n",
" function | \n",
" \n",
" \n",
" | update | \n",
" function | \n",
" \n",
" \n",
" | list | \n",
" function | \n",
" \n",
" \n",
" | delete | \n",
" function | \n",
" \n",
" \n",
" | EvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | EvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
"
\n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import json\n",
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"\n",
"def extract_symbols(output_list):\n",
" symbols_list = []\n",
" for item in output_list:\n",
" try:\n",
" args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n",
" symbols = json.loads(args)[\"symbols\"]\n",
" symbols_list.append(symbols)\n",
" except Exception as e:\n",
" symbols_list.append([{\"error\": str(e)}])\n",
" return symbols_list\n",
"\n",
"completions_symbols = extract_symbols(completions_output)\n",
"responses_symbols = extract_symbols(responses_output)\n",
"\n",
"def symbols_to_html_table(symbols):\n",
" if symbols and isinstance(symbols, list):\n",
" df = pd.DataFrame(symbols)\n",
" return (\n",
" df.style\n",
" .set_properties(**{\n",
" 'white-space': 'pre-wrap',\n",
" 'word-break': 'break-word',\n",
" 'padding': '2px 6px',\n",
" 'border': '1px solid #C3E7FA',\n",
" 'font-size': '0.92em',\n",
" 'background-color': '#FDFEFF'\n",
" })\n",
" .set_table_styles([{\n",
" 'selector': 'th',\n",
" 'props': [\n",
" ('font-size', '0.95em'),\n",
" ('background-color', '#1CA7EC'),\n",
" ('color', '#fff'),\n",
" ('border-bottom', '1px solid #18647E'),\n",
" ('padding', '2px 6px')\n",
" ]\n",
" }])\n",
" .hide(axis='index')\n",
" .to_html()\n",
" )\n",
" return f\"{str(symbols)}
\"\n",
"\n",
"table_rows = []\n",
"max_len = max(len(completions_symbols), len(responses_symbols))\n",
"for i in range(max_len):\n",
" c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n",
" r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n",
" table_rows.append(f\"\"\"\n",
" \n",
" | {c_html} | \n",
" {r_html} | \n",
"
\n",
" \"\"\")\n",
"\n",
"table_html = f\"\"\"\n",
"\n",
"
\n",
" Completions vs Responses Output Symbols\n",
"
\n",
"
\n",
" \n",
" \n",
" | Completions Output | \n",
" Responses Output | \n",
"
\n",
" \n",
" \n",
" {''.join(table_rows)}\n",
" \n",
"
\n",
"
\n",
"\"\"\"\n",
"\n",
"display(HTML(table_html))\n"
]
},
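{
"cell_type": "markdown",
"id": "0a9c7e2b",
"metadata": {},
"source": [
"If you have ground truth for a file, you can also score the extraction directly. The sketch below assumes a hypothetical hand-written reference list (`reference_symbols`) for the first file and computes precision and recall over `(name, symbol_type)` pairs from the completions run.\n",
"\n",
"```python\n",
"# Hypothetical reference list for the first file; replace with real ground truth.\n",
"reference_symbols = [\n",
"    {\"name\": \"Evals\", \"symbol_type\": \"class\"},\n",
"    {\"name\": \"AsyncEvals\", \"symbol_type\": \"class\"},\n",
"    {\"name\": \"__all__\", \"symbol_type\": \"variable\"},\n",
"]\n",
"\n",
"predicted = {(s.get(\"name\"), s.get(\"symbol_type\")) for s in completions_symbols[0]}\n",
"expected = {(s[\"name\"], s[\"symbol_type\"]) for s in reference_symbols}\n",
"\n",
"true_positives = len(predicted & expected)\n",
"precision = true_positives / len(predicted) if predicted else 0.0\n",
"recall = true_positives / len(expected) if expected else 0.0\n",
"print(f\"precision={precision:.2f} recall={recall:.2f}\")\n",
"```"
]
},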
{
"cell_type": "markdown",
"id": "e8e4ca5a",
"metadata": {},
"source": [
"### Visualize Evals Dashboard\n",
"\n",
"You can navigate to the Evals Dashboard in order to visualize the data.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"You can also take a look at the explanation of the failed results in the Evals Dashboard after the run is complete as shown in the image below.\n",
"\n",
"\n",
"\n",
"\n"
]
},
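{
"cell_type": "markdown",
"id": "1b3d5f7a",
"metadata": {},
"source": [
"To jump straight from the notebook to the dashboard, you can also print the report URL attached to each run object (assuming the runs created above expose the `report_url` field returned by the Evals API).\n",
"\n",
"```python\n",
"# Print dashboard links for both runs\n",
"for run in (gpt_4one_completions_run, gpt_4one_responses_run):\n",
"    print(run.name, run.report_url)\n",
"```"
]
},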
{
"cell_type": "markdown",
"id": "50ad84ad",
"metadata": {},
"source": [
"This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls. \n",
"\n",
"\n",
"OpenAI Evals provides a robust, reproducible framework for evaluating LLMs on structured extraction tasks. By combining clear tool schemas, rigorous grading rubrics, and well-structured datasets, you can measure and improve overall performance.\n",
"\n",
"*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}