{
"cells": [
{
"cell_type": "markdown",
"id": "0a2d56c0",
"metadata": {},
"source": [
"\n",
"# Structured Output Evaluation Cookbook\n",
" \n",
"This notebook walks you through a set of focused, runnable examples how to use the OpenAI **Evals** framework to **test, grade, and iterate on tasks that require large‑language models to produce structured outputs**.\n",
"\n",
"> **Why does this matter?** \n",
"> Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can *codify* expectations as automated evals and let your team ship with safety bricks instead of sand.\n"
]
},
{
"cell_type": "markdown",
"id": "45eee293",
"metadata": {},
"source": [
"\n",
"## Quick Tour\n",
"\n",
"* **Section 1 – Prerequisites**: environment variables and package setup \n",
"* **Section 2 – Walk‑through: Code‑symbol extraction**: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it. \n",
"* **Section 3 – Additional Recipes**: sketches of common production patterns such as sentiment extraction as additional code sample for evaluation.\n",
"* **Section 4 – Result Exploration**: lightweight helpers for pulling run output and digging into failures. \n"
]
},
{
"cell_type": "markdown",
"id": "e027be46",
"metadata": {},
"source": [
"\n",
"## Prerequisites\n",
"\n",
"1. **Install dependencies** (minimum versions shown):\n",
"\n",
"```bash\n",
"pip install --upgrade openai\n",
"```\n",
"\n",
"2. **Authenticate** by exporting your key:\n",
"\n",
"```bash\n",
"export OPENAI_API_KEY=\"sk‑...\"\n",
"```\n",
"\n",
"3. **Optional**: if you plan to run evals in bulk, set up an [organization‑level key](https://platform.openai.com/account/org-settings) with appropriate limits.\n"
]
},
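{
"cell_type": "markdown",
"id": "b9e4d1a0",
"metadata": {},
"source": [
"A quick, optional sanity check that the key is actually visible to the Python process before you start creating evals:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"assert os.getenv(\"OPENAI_API_KEY\"), \"OPENAI_API_KEY is not set\"\n",
"```\n"
]
},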
{
"cell_type": "markdown",
"id": "4592675d",
"metadata": {},
"source": [
"### Use Case 1: Code symbol extraction"
]
},
{
"cell_type": "markdown",
"id": "d2a32d53",
"metadata": {},
"source": [
"\n",
"The goal is to **extract all function, class, and constant symbols from python files inside the OpenAI SDK**. \n",
"For each file we ask the model to emit structured JSON like:\n",
"\n",
"```json\n",
"{\n",
" \"symbols\": [\n",
" {\"name\": \"OpenAI\", \"kind\": \"class\"},\n",
" {\"name\": \"Evals\", \"kind\": \"module\"},\n",
" ...\n",
" ]\n",
"}\n",
"```\n",
"\n",
"A rubric model then grades **completeness** (did we capture every symbol?) and **quality** (are the kinds correct?) on a 1‑7 scale.\n"
]
},
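{
"cell_type": "markdown",
"id": "a7c1d3f5",
"metadata": {},
"source": [
"To make the *structured* part stick, you would typically pin the model to a JSON Schema rather than relying on the prompt alone to produce valid JSON. Below is a minimal, illustrative sketch of such a schema for the `symbols` payload; the exact wrapper it goes into (for example `response_format` for Chat Completions) depends on which endpoint you call.\n",
"\n",
"```python\n",
"# Illustrative JSON Schema for the extraction output; adapt the wrapper to your endpoint.\n",
"symbols_schema = {\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        \"symbols\": {\n",
"            \"type\": \"array\",\n",
"            \"items\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"name\": {\"type\": \"string\"},\n",
"                    \"kind\": {\"type\": \"string\"},\n",
"                },\n",
"                \"required\": [\"name\", \"kind\"],\n",
"                \"additionalProperties\": False,\n",
"            },\n",
"        }\n",
"    },\n",
"    \"required\": [\"symbols\"],\n",
"    \"additionalProperties\": False,\n",
"}\n",
"```\n"
]
},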
{
"cell_type": "markdown",
"id": "9dd88e7c",
"metadata": {},
"source": [
"### Evaluating Code Quality Extraction with a Custom Dataset"
]
},
{
"cell_type": "markdown",
"id": "64bf0667",
"metadata": {},
"source": [
"Let us walk though an example to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
]
},
{
"cell_type": "markdown",
"id": "c95faa47",
"metadata": {},
"source": [
"### Initialize SDK client\n",
"Creates an `openai.OpenAI` client using the `OPENAI_API_KEY` we exported above. Nothing will run without this."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eacc6ac7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade openai pandas rich --quiet\n",
"\n",
"\n",
"\n",
"import os\n",
"import time\n",
"import openai\n",
"from rich import print\n",
"import pandas as pd\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8200aaf1",
"metadata": {},
"source": [
"### Dataset factory & grading rubric\n",
"* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
"* `structured_output_grader` defines a detailed evaluation rubric.\n",
"* `client.evals.create(...)` registers the eval with the platform."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b272e193",
"metadata": {},
"outputs": [],
"source": [
"def get_dataset(limit=None):\n",
" openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
"\n",
" file_paths = [\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
" ]\n",
"\n",
" items = []\n",
" for file_path in file_paths:\n",
" items.append({\"input\": open(file_path, \"r\").read()})\n",
" if limit:\n",
" return items[:limit]\n",
" return items\n",
"\n",
"\n",
"structured_output_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of extracted information from a code file.\n",
"You will be given a code file and a list of extracted information.\n",
"You should grade the quality of the extracted information.\n",
"\n",
"You should grade the quality on a scale of 1 to 7.\n",
"You should apply the following criteria, and calculate your score as follows:\n",
"You should first check for completeness on a scale of 1 to 7.\n",
"Then you should apply a quality modifier.\n",
"\n",
"The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
"If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
"If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
"etc.\n",
"\"\"\"\n",
"\n",
"structured_output_grader_user_prompt = \"\"\"\n",
"\n",
"{{item.input}}\n",
"\n",
"\n",
"
evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"### Utility poller\n",
"def poll_runs(eval_id, run_ids):\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
" if all(run.status in {\"completed\", \"failed\"} for run in runs):\n",
" # dump results to file\n",
" for run in runs:\n",
" with open(f\"{run.id}.json\", \"w\") as f:\n",
" f.write(\n",
" client.evals.runs.output_items.list(\n",
" run_id=run.id, eval_id=eval_id\n",
" ).model_dump_json(indent=4)\n",
" )\n",
" break\n",
" time.sleep(5)\n",
"\n",
"poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
]
},
{
"cell_type": "markdown",
"id": "77331859",
"metadata": {},
"source": [
"### Load outputs for quick inspection\n",
"We will fetch the output items for both runs so we can print or post‑process them."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c316e6eb",
"metadata": {},
"outputs": [],
"source": [
"completions_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
")\n",
"\n",
"responses_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1cc61c54",
"metadata": {},
"source": [
"### Human-readable dump\n",
"Let us print a side-by-side view of completions vs responses."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "9f1b502e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"| Completions Output | \n", "Responses Output | \n", "
|---|---|
| {\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvals\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithStreamingResponse\",\"symb... | \n", "{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"runs\",\"symbol_type\":\"property\"},{\"name\":\"with_raw_response\",\"symbol_type\":\"property\"},{\"name\":\"with_streaming_response\",\"symbol_type\":\"property\"},{\"name\":\"create\",\"symbol_type\":\"function\"},{... | \n", "