{
"cells": [
{
"cell_type": "markdown",
"id": "6ff95379",
"metadata": {},
"source": [
"# Tool Evaluation with OpenAI Evals\n",
"\n",
"This cookbook shows how to **measure and improve a model’s ability to extract structured information from source code** with tool evaluation. In this case, the set of *symbols* (functions, classes, methods, and variables) defined in Python files. "
]
},
{
"cell_type": "markdown",
"id": "4cc30394",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Install the latest **openai** Python package ≥ 1.14.0 and set your `OPENAI_API_KEY` environment variable. If you also want to evaluate an *assistant with tools*, enable the *Assistants v2 beta* in your account.\n",
"\n",
"```bash\n",
"pip install --upgrade openai\n",
"export OPENAI_API_KEY=sk‑...\n",
"```\n",
"Below we import the SDK, create a client, and define a helper that builds a small dataset from files inside the **openai** package itself."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "acd0d746",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade openai pandas jinja2 rich --quiet\n",
"\n",
"import os\n",
"import time\n",
"import openai\n",
"from rich import print\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
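{
"cell_type": "markdown",
"id": "a1b2c3d0",
"metadata": {},
"source": [
"Before going further, it can help to confirm the environment is ready. The check below is an optional sketch: it assumes the `OPENAI_API_KEY` (or the `_OPENAI_API_KEY` fallback used above) is set, and it prints the installed SDK version.\n",
"\n",
"```python\n",
"# Optional sanity check (uses the same environment variables as the client above)\n",
"assert os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"), \"Set OPENAI_API_KEY first\"\n",
"print(\"openai SDK version:\", openai.__version__)\n",
"```"
]
},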
{
"cell_type": "markdown",
"id": "80618b60",
"metadata": {},
"source": [
"### Dataset factory & grading rubric\n",
"* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
"* `structured_output_grader` defines a detailed evaluation rubric. \n",
"* `sampled.output_tools[0].function.arguments.symbols` specifies the extracted symbols from the code file based on the tool invocation.\n",
"* `client.evals.create(...)` registers the eval with the platform."
]
},
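{
"cell_type": "markdown",
"id": "b7e31f09",
"metadata": {},
"source": [
"To make the templating concrete, here is a minimal sketch of what one dataset item looks like and what the `{{item.input}}` and `{{sample.output_tools[0].function.arguments.symbols}}` placeholders resolve to at grading time. The literal values are illustrative only, not taken from a real run.\n",
"\n",
"```python\n",
"# Illustrative shapes only (not real run data)\n",
"example_item = {\"input\": \"def add(a, b):\\n    return a + b\\n\"}\n",
"\n",
"# {{item.input}} resolves to example_item[\"input\"]; the sample placeholder\n",
"# resolves to the \"symbols\" argument of the model's tool call, e.g.:\n",
"example_tool_arguments = {\n",
"    \"symbols\": [\n",
"        {\"name\": \"add\", \"symbol_type\": \"function\"},\n",
"    ]\n",
"}\n",
"```"
]
},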
{
"cell_type": "code",
"execution_count": null,
"id": "120b6e4d",
"metadata": {
"tags": [
"original"
]
},
"outputs": [],
"source": [
"def get_dataset(limit=None):\n",
" openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
"\n",
" file_paths = [\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
" ]\n",
"\n",
" items = []\n",
" for file_path in file_paths:\n",
" items.append({\"input\": open(file_path, \"r\").read()})\n",
" if limit:\n",
" return items[:limit]\n",
" return items\n",
"\n",
"\n",
"structured_output_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of extracted information from a code file.\n",
"You will be given a code file and a list of extracted information.\n",
"You should grade the quality of the extracted information.\n",
"\n",
"You should grade the quality on a scale of 1 to 7.\n",
"You should apply the following criteria, and calculate your score as follows:\n",
"You should first check for completeness on a scale of 1 to 7.\n",
"Then you should apply a quality modifier.\n",
"\n",
"The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
"If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
"If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
"etc.\n",
"\"\"\"\n",
"\n",
"structured_output_grader_user_prompt = \"\"\"\n",
"\n",
"{{item.input}}\n",
"\n",
"\n",
"\n",
"{{sample.output_tools[0].function.arguments.symbols}}\n",
"\n",
"\"\"\""
]
},
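{
"cell_type": "markdown",
"id": "c2d4a8e1",
"metadata": {},
"source": [
"To make the rubric concrete: if the grader judges completeness at 6 out of 7 but applies a quality modifier of 0.8, it returns 6 × 0.8 = 4.8. With the pass threshold of 5.0 configured in the next cell, that sample would be marked as failed."
]
},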
{
"cell_type": "markdown",
"id": "d7f66a56",
"metadata": {},
"source": [
"### Evals Creation\n",
"\n",
"Here we create an eval that will be used to evaluate the quality of extracted information from code files.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95a5eaf6",
"metadata": {},
"outputs": [],
"source": [
"logs_eval = client.evals.create(\n",
" name=\"Code QA Eval\",\n",
" data_source_config={\n",
" \"type\": \"custom\",\n",
" \"item_schema\": {\"type\": \"object\", \"properties\": {\"input\": {\"type\": \"string\"}}},\n",
" \"include_sample_schema\": True,\n",
" },\n",
" testing_criteria=[\n",
" {\n",
" \"type\": \"score_model\",\n",
" \"name\": \"General Evaluator\",\n",
" \"model\": \"o3\",\n",
" \"input\": [\n",
" {\"role\": \"system\", \"content\": structured_output_grader},\n",
" {\"role\": \"user\", \"content\": structured_output_grader_user_prompt},\n",
" ],\n",
" \"range\": [1, 7],\n",
" \"pass_threshold\": 5.0,\n",
" }\n",
" ],\n",
")\n",
"\n",
"symbol_tool = {\n",
" \"name\": \"extract_symbols\",\n",
" \"description\": \"Extract the symbols from the code file\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"symbols\": {\n",
" \"type\": \"array\",\n",
" \"description\": \"A list of symbols extracted from Python code.\",\n",
" \"items\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"name\": {\"type\": \"string\", \"description\": \"The name of the symbol.\"},\n",
" \"symbol_type\": {\"type\": \"string\", \"description\": \"The type of the symbol, e.g., variable, function, class.\"},\n",
" },\n",
" \"required\": [\"name\", \"symbol_type\"],\n",
" \"additionalProperties\": False,\n",
" },\n",
" }\n",
" },\n",
" \"required\": [\"symbols\"],\n",
" \"additionalProperties\": False,\n",
" },\n",
"}"
]
},
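{
"cell_type": "markdown",
"id": "d9f0b3c7",
"metadata": {},
"source": [
"As a quick sanity check on the schema, here is an illustrative example of the `arguments` payload a model could produce when calling `extract_symbols`; the specific symbols are made up for illustration.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Hypothetical tool-call arguments that conform to symbol_tool[\"parameters\"]\n",
"example_arguments = {\n",
"    \"symbols\": [\n",
"        {\"name\": \"Evals\", \"symbol_type\": \"class\"},\n",
"        {\"name\": \"create\", \"symbol_type\": \"function\"},\n",
"        {\"name\": \"__all__\", \"symbol_type\": \"variable\"},\n",
"    ]\n",
"}\n",
"print(json.dumps(example_arguments, indent=2))\n",
"```"
]
},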
{
"cell_type": "markdown",
"id": "73ae7e5e",
"metadata": {},
"source": [
"### Kick off model runs\n",
"Here we launch two runs against the same eval: one that calls the **Completions** endpoint, and one that calls the **Responses** endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d650e02",
"metadata": {},
"outputs": [],
"source": [
"gpt_4one_completions_run = client.evals.runs.create(\n",
" name=\"gpt-4.1\",\n",
" eval_id=logs_eval.id,\n",
" data_source={\n",
" \"type\": \"completions\",\n",
" \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
" \"input_messages\": {\n",
" \"type\": \"template\",\n",
" \"template\": [\n",
" {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
" {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
" ],\n",
" },\n",
" \"model\": \"gpt-4.1\",\n",
" \"sampling_params\": {\n",
" \"seed\": 42,\n",
" \"temperature\": 0.7,\n",
" \"max_completions_tokens\": 10000,\n",
" \"top_p\": 0.9,\n",
" \"tools\": [{\"type\": \"function\", \"function\": symbol_tool}],\n",
" },\n",
" },\n",
")\n",
"\n",
"gpt_4one_responses_run = client.evals.runs.create(\n",
" name=\"gpt-4.1-mini\",\n",
" eval_id=logs_eval.id,\n",
" data_source={\n",
" \"type\": \"responses\",\n",
" \"source\": {\"type\": \"file_content\", \"content\": [{\"item\": item} for item in get_dataset(limit=1)]},\n",
" \"input_messages\": {\n",
" \"type\": \"template\",\n",
" \"template\": [\n",
" {\"type\": \"message\", \"role\": \"system\", \"content\": {\"type\": \"input_text\", \"text\": \"You are a helpful assistant.\"}},\n",
" {\"type\": \"message\", \"role\": \"user\", \"content\": {\"type\": \"input_text\", \"text\": \"Extract the symbols from the code file {{item.input}}\"}},\n",
" ],\n",
" },\n",
" \"model\": \"gpt-4.1-mini\",\n",
" \"sampling_params\": {\n",
" \"seed\": 42,\n",
" \"temperature\": 0.7,\n",
" \"max_completions_tokens\": 10000,\n",
" \"top_p\": 0.9,\n",
" \"tools\": [{\"type\": \"function\", **symbol_tool}],\n",
" },\n",
" },\n",
")"
]
},
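{
"cell_type": "markdown",
"id": "e4a6c2d8",
"metadata": {},
"source": [
"Note the small difference in how the tool is passed to the two runs above: the Completions data source uses the Chat Completions tool format, where the schema sits under a `function` key, while the Responses data source uses the flattened Responses tool format. A minimal sketch of the two shapes:\n",
"\n",
"```python\n",
"# Chat Completions tool format: schema nested under \"function\"\n",
"completions_tool = {\"type\": \"function\", \"function\": symbol_tool}\n",
"\n",
"# Responses tool format: schema fields flattened onto the tool object\n",
"responses_tool = {\"type\": \"function\", **symbol_tool}\n",
"```"
]
},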
{
"cell_type": "markdown",
"id": "6ea31f2a",
"metadata": {},
"source": [
"### Utility Poller\n",
"\n",
"We create a utility poller that will be used to poll for the results of the eval runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb8f3df4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
evalrun_6848e2269570819198b757fe12b979da completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_6848e2269570819198b757fe12b979da completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre>evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"</pre>\n"
],
"text/plain": [
"evalrun_6848e227d3a481918a9b970c897b5998 completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def poll_runs(eval_id, run_ids):\n",
" # poll both runs at the same time, until they are complete or failed\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(run_id, eval_id=eval_id) for run_id in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
" if all(run.status in (\"completed\", \"failed\") for run in runs):\n",
" break\n",
" time.sleep(5)\n",
"\n",
"\n",
"poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4014cde",
"metadata": {
"tags": [
"original"
]
},
"outputs": [],
"source": [
"\n",
"### Get Output\n",
"completions_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
")\n",
"\n",
"responses_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
")\n"
]
},
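{
"cell_type": "markdown",
"id": "f1b8d3a9",
"metadata": {},
"source": [
"Before parsing the tool calls, it can help to glance at one raw output item. The snippet below is a sketch: it assumes at least one output item came back and uses the pydantic `model_dump()` helper to print a truncated JSON view.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Peek at the first completions output item (truncated for readability)\n",
"first_item = completions_output.data[0]\n",
"print(json.dumps(first_item.model_dump(), indent=2, default=str)[:800])\n",
"```"
]
},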
{
"cell_type": "markdown",
"id": "88ae7e17",
"metadata": {},
"source": [
"### Inspecting results\n",
"\n",
"For both completions and responses, we print the *symbols* dictionary that the model returned. You can diff this against the reference answer or compute precision / recall."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0cddb6d",
"metadata": {
"tags": [
"original"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" Completions vs Responses Output Symbols\n",
"
\n",
"
\n",
" \n",
" \n",
" | Completions Output | \n",
" Responses Output | \n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" | name | \n",
" symbol_type | \n",
" \n",
" \n",
" \n",
" \n",
" | Evals | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvals | \n",
" class | \n",
" \n",
" \n",
" | EvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | EvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | AsyncEvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __all__ | \n",
" variable | \n",
" \n",
" \n",
" \n",
" | \n",
" \n",
"\n",
" \n",
" \n",
" | name | \n",
" symbol_type | \n",
" \n",
" \n",
" \n",
" \n",
" | Evals | \n",
" class | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | with_raw_response | \n",
" function | \n",
" \n",
" \n",
" | with_streaming_response | \n",
" function | \n",
" \n",
" \n",
" | create | \n",
" function | \n",
" \n",
" \n",
" | retrieve | \n",
" function | \n",
" \n",
" \n",
" | update | \n",
" function | \n",
" \n",
" \n",
" | list | \n",
" function | \n",
" \n",
" \n",
" | delete | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvals | \n",
" class | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | with_raw_response | \n",
" function | \n",
" \n",
" \n",
" | with_streaming_response | \n",
" function | \n",
" \n",
" \n",
" | create | \n",
" function | \n",
" \n",
" \n",
" | retrieve | \n",
" function | \n",
" \n",
" \n",
" | update | \n",
" function | \n",
" \n",
" \n",
" | list | \n",
" function | \n",
" \n",
" \n",
" | delete | \n",
" function | \n",
" \n",
" \n",
" | EvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvalsWithRawResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | EvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" | AsyncEvalsWithStreamingResponse | \n",
" class | \n",
" \n",
" \n",
" | __init__ | \n",
" function | \n",
" \n",
" \n",
" | runs | \n",
" function | \n",
" \n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
"
\n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import json\n",
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"\n",
"def extract_symbols(output_list):\n",
" symbols_list = []\n",
" for item in output_list:\n",
" try:\n",
" args = item.sample.output[0].tool_calls[0][\"function\"][\"arguments\"]\n",
" symbols = json.loads(args)[\"symbols\"]\n",
" symbols_list.append(symbols)\n",
" except Exception as e:\n",
" symbols_list.append([{\"error\": str(e)}])\n",
" return symbols_list\n",
"\n",
"completions_symbols = extract_symbols(completions_output)\n",
"responses_symbols = extract_symbols(responses_output)\n",
"\n",
"def symbols_to_html_table(symbols):\n",
" if symbols and isinstance(symbols, list):\n",
" df = pd.DataFrame(symbols)\n",
" return (\n",
" df.style\n",
" .set_properties(**{\n",
" 'white-space': 'pre-wrap',\n",
" 'word-break': 'break-word',\n",
" 'padding': '2px 6px',\n",
" 'border': '1px solid #C3E7FA',\n",
" 'font-size': '0.92em',\n",
" 'background-color': '#FDFEFF'\n",
" })\n",
" .set_table_styles([{\n",
" 'selector': 'th',\n",
" 'props': [\n",
" ('font-size', '0.95em'),\n",
" ('background-color', '#1CA7EC'),\n",
" ('color', '#fff'),\n",
" ('border-bottom', '1px solid #18647E'),\n",
" ('padding', '2px 6px')\n",
" ]\n",
" }])\n",
" .hide(axis='index')\n",
" .to_html()\n",
" )\n",
" return f\"{str(symbols)}
\"\n",
"\n",
"table_rows = []\n",
"max_len = max(len(completions_symbols), len(responses_symbols))\n",
"for i in range(max_len):\n",
" c_html = symbols_to_html_table(completions_symbols[i]) if i < len(completions_symbols) else \"\"\n",
" r_html = symbols_to_html_table(responses_symbols[i]) if i < len(responses_symbols) else \"\"\n",
" table_rows.append(f\"\"\"\n",
" \n",
" | {c_html} | \n",
" {r_html} | \n",
"
\n",
" \"\"\")\n",
"\n",
"table_html = f\"\"\"\n",
"\n",
"
\n",
" Completions vs Responses Output Symbols\n",
"
\n",
"
\n",
" \n",
" \n",
" | Completions Output | \n",
" Responses Output | \n",
"
\n",
" \n",
" \n",
" {''.join(table_rows)}\n",
" \n",
"
\n",
"
\n",
"\"\"\"\n",
"\n",
"display(HTML(table_html))\n"
]
},
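{
"cell_type": "markdown",
"id": "0a9c7e2b",
"metadata": {},
"source": [
"If you have ground truth for a file, you can also score the extraction directly. The sketch below assumes a hypothetical hand-written reference list (`reference_symbols`) for the first file and computes precision and recall over `(name, symbol_type)` pairs from the completions run.\n",
"\n",
"```python\n",
"# Hypothetical reference list for the first file; replace with real ground truth.\n",
"reference_symbols = [\n",
"    {\"name\": \"Evals\", \"symbol_type\": \"class\"},\n",
"    {\"name\": \"AsyncEvals\", \"symbol_type\": \"class\"},\n",
"    {\"name\": \"__all__\", \"symbol_type\": \"variable\"},\n",
"]\n",
"\n",
"predicted = {(s.get(\"name\"), s.get(\"symbol_type\")) for s in completions_symbols[0]}\n",
"expected = {(s[\"name\"], s[\"symbol_type\"]) for s in reference_symbols}\n",
"\n",
"true_positives = len(predicted & expected)\n",
"precision = true_positives / len(predicted) if predicted else 0.0\n",
"recall = true_positives / len(expected) if expected else 0.0\n",
"print(f\"precision={precision:.2f} recall={recall:.2f}\")\n",
"```"
]
},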
{
"cell_type": "markdown",
"id": "e8e4ca5a",
"metadata": {},
"source": [
"### Visualize Evals Dashboard\n",
"\n",
"You can navigate to the Evals Dashboard in order to visualize the data.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"You can also take a look at the explanation of the failed results in the Evals Dashboard after the run is complete as shown in the image below.\n",
"\n",
"\n",
"\n",
"\n"
]
},
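{
"cell_type": "markdown",
"id": "1b3d5f7a",
"metadata": {},
"source": [
"To jump straight from the notebook to the dashboard, you can also print the report URL attached to each run object (assuming the runs created above expose the `report_url` field returned by the Evals API).\n",
"\n",
"```python\n",
"# Print dashboard links for both runs\n",
"for run in (gpt_4one_completions_run, gpt_4one_responses_run):\n",
"    print(run.name, run.report_url)\n",
"```"
]
},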
{
"cell_type": "markdown",
"id": "50ad84ad",
"metadata": {},
"source": [
"This notebook demonstrated how to use OpenAI Evals to assess and improve a model’s ability to extract structured information from Python code using tool calls. \n",
"\n",
"\n",
"OpenAI Evals provides a robust, reproducible framework for evaluating LLMs on structured extraction tasks. By combining clear tool schemas, rigorous grading rubrics, and well-structured datasets, you can measure and improve overall performance.\n",
"\n",
"*For more details, see the [OpenAI Evals documentation](https://platform.openai.com/docs/guides/evals).*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}