{ "cells": [ { "cell_type": "markdown", "id": "6ff95379", "metadata": {}, "source": [ "# Tool Evaluation with OpenAI Evals\n", "\n", "This cookbook shows how to **measure and improve a model’s ability to extract structured information from source code** with tool evaluation. In this case, the set of *symbols* (functions, classes, methods, and variables) defined in Python files. " ] }, { "cell_type": "markdown", "id": "4cc30394", "metadata": {}, "source": [ "## Setup\n", "\n", "Install the latest **openai** Python package ≥ 1.14.0 and set your `OPENAI_API_KEY` environment variable. If you also want to evaluate an *assistant with tools*, enable the *Assistants v2 beta* in your account.\n", "\n", "```bash\n", "pip install --upgrade openai\n", "export OPENAI_API_KEY=sk‑...\n", "```\n", "Below we import the SDK, create a client, and define a helper that builds a small dataset from files inside the **openai** package itself." ] }, { "cell_type": "code", "execution_count": 2, "id": "acd0d746", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install --upgrade openai pandas jinja2 rich --quiet\n", "\n", "import os\n", "import time\n", "import openai\n", "from rich import print\n", "\n", "client = openai.OpenAI(\n", " api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n", ")" ] }, { "cell_type": "markdown", "id": "80618b60", "metadata": {}, "source": [ "### Dataset factory & grading rubric\n", "* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n", "* `structured_output_grader` defines a detailed evaluation rubric. \n", "* `sampled.output_tools[0].function.arguments.symbols` specifies the extracted symbols from the code file based on the tool invocation.\n", "* `client.evals.create(...)` registers the eval with the platform." 
] }, { "cell_type": "code", "execution_count": null, "id": "120b6e4d", "metadata": { "tags": [ "original" ] }, "outputs": [], "source": [ "def get_dataset(limit=None):\n", " openai_sdk_file_path = os.path.dirname(openai.__file__)\n", "\n", " file_paths = [\n", " os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n", " os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n", " os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n", " os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n", " os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n", " ]\n", "\n", " items = []\n", " for file_path in file_paths:\n", " items.append({\"input\": open(file_path, \"r\").read()})\n", " if limit:\n", " return items[:limit]\n", " return items\n", "\n", "\n", "structured_output_grader = \"\"\"\n", "You are a helpful assistant that grades the quality of extracted information from a code file.\n", "You will be given a code file and a list of extracted information.\n", "You should grade the quality of the extracted information.\n", "\n", "You should grade the quality on a scale of 1 to 7.\n", "You should apply the following criteria, and calculate your score as follows:\n", "You should first check for completeness on a scale of 1 to 7.\n", "Then you should apply a quality modifier.\n", "\n", "The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n", "If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n", "If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n", "etc.\n", "\"\"\"\n", "\n", "structured_output_grader_user_prompt = \"\"\"\n", "\n", "{{item.input}}\n", "\n", "\n", "\n", "{{sample.output_tools[0].function.arguments.symbols}}\n", "\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "d7f66a56", "metadata": {}, "source": [ "### Evals Creation\n", "\n", "Here we create an eval that will be used to evaluate the quality of extracted information from code files.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "95a5eaf6", "metadata": {}, "outputs": [], "source": [ "logs_eval = client.evals.create(\n", " name=\"Code QA Eval\",\n", " data_source_config={\n", " \"type\": \"custom\",\n", " \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n", " \"include_sample_schema\": True,\n", " },\n", " testing_criteria=[\n", " {\n", " \"type\": \"score_model\",\n", " \"name\": \"General Evaluator\",\n", " \"model\": \"o3\",\n", " \"input\": [\n", " {\"role\": \"system\", \"content\": structured_output_grader},\n", " {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n", " ],\n", " \"range\": [1, 7],\n", " \"pass_threshold\": 5.0,\n", " }\n", " ],\n", ")\n", "\n", "symbol_tool = {\n", " \"name\": \"extract_symbols\",\n", " \"description\": \"Extract the symbols from the code file\",\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"symbols\": {\n", " \"type\": \"array\",\n", " \"description\": \"A list of symbols extracted from Python code.\",\n", " \"items\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n", " \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n", " },\n", " \"required\": [\"name\", \"symbol_type\"],\n", " 
\"additionalProperties\": False,\n", " },\n", " }\n", " },\n", " \"required\": [\"symbols\"],\n", " \"additionalProperties\": False,\n", " },\n", "}" ] }, { "cell_type": "markdown", "id": "73ae7e5e", "metadata": {}, "source": [ "### Kick off model runs\n", "Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "0d650e02", "metadata": {}, "outputs": [], "source": [ "gpt_4one_completions_run = client.evals.runs.create(\n", " name=\"gpt-4.1\",\n", " eval_id=logs_eval.id,\n", " data_source={\n", " \"type\": \"completions\",\n", " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", " \"input_messages\": {\n", " \"type\": \"template\",\n", " \"template\": [\n", " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", " ],\n", " },\n", " \"model\": \"gpt-4.1\",\n", " \"sampling_params\": {\n", " \"seed\": 42,\n", " \"temperature\": 0.7,\n", " \"max_completions_tokens\": 10000,\n", " \"top_p\": 0.9,\n", " \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n", " },\n", " },\n", ")\n", "\n", "gpt_4one_responses_run = client.evals.runs.create(\n", " name=\"gpt-4.1-mini\",\n", " eval_id=logs_eval.id,\n", " data_source={\n", " \"type\": \"responses\",\n", " \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n", " \"input_messages\": {\n", " \"type\": \"template\",\n", " \"template\": [\n", " {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n", " {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n", " ],\n", " },\n", " \"model\": \"gpt-4.1-mini\",\n", " \"sampling_params\": {\n", " \"seed\": 42,\n", " \"temperature\": 0.7,\n", " \"max_completions_tokens\": 10000,\n", " \"top_p\": 0.9,\n", " \"tools\": [{\"type\": \"function\", **symbol_tool}],\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "6ea31f2a", "metadata": {}, "source": [ "### Utility Poller\n", "\n", "We create a utility poller that will be used to poll for the results of the eval runs." ] }, { "cell_type": "code", "execution_count": null, "id": "fb8f3df4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<pre>evalrun_6848e2269570819198b757fe12b979da completed\n",
       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
       "</pre>
\n" ], "text/plain": [ "evalrun_6848e2269570819198b757fe12b979da completed\n", "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<pre>evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
       "ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
       "</pre>
\n" ], "text/plain": [ "evalrun_6848e227d3a481918a9b970c897b5998 completed\n", "\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def poll_runs(eval_id, run_ids):\n", " # poll both runs at the same time, until they are complete or failed\n", " while True:\n", " runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n", " for run in runs:\n", " print(run.id, run.status, run.result_counts)\n", " if all(run.status in (\"completed\", \"failed\") for run in runs):\n", " break\n", " time.sleep(5)\n", "\n", "\n", "poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])" ] }, { "cell_type": "code", "execution_count": null, "id": "f4014cde", "metadata": { "tags": [ "original" ] }, "outputs": [], "source": [ "\n", "### Get Output\n", "completions_output = client.evals.runs.output_items.list(\n", " run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n", ")\n", "\n", "responses_output = client.evals.runs.output_items.list(\n", " run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n", ")\n" ] }, { "cell_type": "markdown", "id": "88ae7e17", "metadata": {}, "source": [ "### Inspecting results\n", "\n", "For both completions and responses, we print the *symbols* dictionary that the model returned. You can diff this against the reference answer or compute precision / recall." ] }, { "cell_type": "code", "execution_count": null, "id": "c0cddb6d", "metadata": { "tags": [ "original" ] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

\n", " Completions vs Responses Output Symbols\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Completions OutputResponses Output
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesymbol_type
Evalsclass
AsyncEvalsclass
EvalsWithRawResponseclass
AsyncEvalsWithRawResponseclass
EvalsWithStreamingResponseclass
AsyncEvalsWithStreamingResponseclass
__all__variable
\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesymbol_type
Evalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
AsyncEvalsclass
runsfunction
with_raw_responsefunction
with_streaming_responsefunction
createfunction
retrievefunction
updatefunction
listfunction
deletefunction
EvalsWithRawResponseclass
__init__function
runsfunction
AsyncEvalsWithRawResponseclass
__init__function
runsfunction
EvalsWithStreamingResponseclass
__init__function
runsfunction
AsyncEvalsWithStreamingResponseclass
__init__function
runsfunction
\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import json\n", "import pandas as pd\n", "from IPython.display import display, HTML\n", "\n", "def extract_symbols(output_list):\n", " symbols_list = []\n", " for item in output_list:\n", " try:\n", " args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n", " symbols = json.loads(args)[\"symbols\"]\n", " symbols_list.append(symbols)\n", " except Exception as e:\n", " symbols_list.append([{\"error\": str(e)}])\n", " return symbols_list\n", "\n", "completions_symbols = extract_symbols(completions_output)\n", "responses_symbols = extract_symbols(responses_output)\n", "\n", "def symbols_to_html_table(symbols):\n", " if symbols and isinstance(symbols, list):\n", " df = pd.DataFrame(symbols)\n", " return (\n", " df.style\n", " .set_properties(**{\n", " 'white-space': 'pre-wrap',\n", " 'word-break': 'break-word',\n", " 'padding': '2px 6px',\n", " 'border': '1px solid #C3E7FA',\n", " 'font-size': '0.92em',\n", " 'background-color': '#FDFEFF'\n", " })\n", " .set_table_styles([{\n", " 'selector': 'th',\n", " 'props': [\n", " ('font-size', '0.95em'),\n", " ('background-color', '#1CA7EC'),\n", " ('color', '#fff'),\n", " ('border-bottom', '1px solid #18647E'),\n", " ('padding', '2px 6px')\n", " ]\n", " }])\n", " .hide(axis='index')\n", " .to_html()\n", " )\n", " return f\"
{str(symbols)}
\"\n", "\n", "table_rows = []\n", "max_len = max(len(completions_symbols), len(responses_symbols))\n", "for i in range(max_len):\n", " c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n", " r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n", " table_rows.append(f\"\"\"\n", " \n", " {c_html}\n", " {r_html}\n", " \n", " \"\"\")\n", "\n", "table_html = f\"\"\"\n", "
\n", "

\n", " Completions vs Responses Output Symbols\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " {''.join(table_rows)}\n", " \n", "
Completions OutputResponses Output
\n", "
\n", "\"\"\"\n", "\n", "display(HTML(table_html))\n" ] }, { "cell_type": "markdown", "id": "e8e4ca5a", "metadata": {}, "source": [ "### Visualize Evals Dashboard\n", "\n", "You can navigate to the Evals Dashboard in order to visualize the data.\n", "\n", "\n", "![evals_tool_dashboard](../../../images/evals_tool_dashboard.png)\n", "\n", "\n", "You can also take a look at the explanation of the failed results in the Evals Dashboard after the run is complete as shown in the image below.\n", "\n", "![evals_tool_failed](../../../images/eval_tools_fail.png)\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "50ad84ad", "metadata": {}, "source": [ "This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls. \n", "\n", "\n", "OpenAI Evals provides a robust, reproducible framework for evaluating LLMs on structured extraction tasks. By combining clear tool schemas, rigorous grading rubrics, and well-structured datasets, you can measure and improve overall performance.\n", "\n", "*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 5 }