{
"cells": [
{
"cell_type": "markdown",
"id": "0a2d56c0",
"metadata": {},
"source": [
"\n",
"# Structured Output Evaluation Cookbook\n",
" \n",
"This notebook walks you through a set of focused, runnable examples how to use the OpenAI **Evals** framework to **test, grade, and iterate on tasks that require large‑language models to produce structured outputs**.\n",
"\n",
"> **Why does this matter?** \n",
"> Production systems often depend on JSON, SQL, or domain‑specific formats. Relying on spot checks or ad‑hoc prompt tweaks quickly breaks down. Instead, you can *codify* expectations as automated evals and let your team ship with safety bricks instead of sand.\n"
]
},
{
"cell_type": "markdown",
"id": "45eee293",
"metadata": {},
"source": [
"\n",
"## Quick Tour\n",
"\n",
"* **Section 1 – Prerequisites**: environment variables and package setup \n",
"* **Section 2 – Walk‑through: Code‑symbol extraction**: end‑to‑end demo that grades the model’s ability to extract function and class names from source code. We keep the original logic intact and simply layer documentation around it. \n",
"* **Section 3 – Additional Recipes**: sketches of common production patterns such as sentiment extraction as additional code sample for evaluation.\n",
"* **Section 4 – Result Exploration**: lightweight helpers for pulling run output and digging into failures. \n"
]
},
{
"cell_type": "markdown",
"id": "e027be46",
"metadata": {},
"source": [
"\n",
"## Prerequisites\n",
"\n",
"1. **Install dependencies** (minimum versions shown):\n",
"\n",
"```bash\n",
"pip install --upgrade openai\n",
"```\n",
"\n",
"2. **Authenticate** by exporting your key:\n",
"\n",
"```bash\n",
"export OPENAI_API_KEY=\"sk‑...\"\n",
"```\n",
"\n",
"3. **Optional**: if you plan to run evals in bulk, set up an [organization‑level key](https://platform.openai.com/account/org-settings) with appropriate limits.\n"
]
},
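{
"cell_type": "markdown",
"id": "b9e4d1a0",
"metadata": {},
"source": [
"A quick, optional sanity check that the key is actually visible to the Python process before you start creating evals:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"assert os.getenv(\"OPENAI_API_KEY\"), \"OPENAI_API_KEY is not set\"\n",
"```\n"
]
},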
{
"cell_type": "markdown",
"id": "4592675d",
"metadata": {},
"source": [
"### Use Case 1: Code symbol extraction"
]
},
{
"cell_type": "markdown",
"id": "d2a32d53",
"metadata": {},
"source": [
"\n",
"The goal is to **extract all function, class, and constant symbols from python files inside the OpenAI SDK**. \n",
"For each file we ask the model to emit structured JSON like:\n",
"\n",
"```json\n",
"{\n",
" \"symbols\": [\n",
" {\"name\": \"OpenAI\", \"kind\": \"class\"},\n",
" {\"name\": \"Evals\", \"kind\": \"module\"},\n",
" ...\n",
" ]\n",
"}\n",
"```\n",
"\n",
"A rubric model then grades **completeness** (did we capture every symbol?) and **quality** (are the kinds correct?) on a 1‑7 scale.\n"
]
},
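{
"cell_type": "markdown",
"id": "a7c1d3f5",
"metadata": {},
"source": [
"To make the *structured* part stick, you would typically pin the model to a JSON Schema rather than relying on the prompt alone to produce valid JSON. Below is a minimal, illustrative sketch of such a schema for the `symbols` payload; the exact wrapper it goes into (for example `response_format` for Chat Completions) depends on which endpoint you call.\n",
"\n",
"```python\n",
"# Illustrative JSON Schema for the extraction output; adapt the wrapper to your endpoint.\n",
"symbols_schema = {\n",
"    \"type\": \"object\",\n",
"    \"properties\": {\n",
"        \"symbols\": {\n",
"            \"type\": \"array\",\n",
"            \"items\": {\n",
"                \"type\": \"object\",\n",
"                \"properties\": {\n",
"                    \"name\": {\"type\": \"string\"},\n",
"                    \"kind\": {\"type\": \"string\"},\n",
"                },\n",
"                \"required\": [\"name\", \"kind\"],\n",
"                \"additionalProperties\": False,\n",
"            },\n",
"        }\n",
"    },\n",
"    \"required\": [\"symbols\"],\n",
"    \"additionalProperties\": False,\n",
"}\n",
"```\n"
]
},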
{
"cell_type": "markdown",
"id": "9dd88e7c",
"metadata": {},
"source": [
"### Evaluating Code Quality Extraction with a Custom Dataset"
]
},
{
"cell_type": "markdown",
"id": "64bf0667",
"metadata": {},
"source": [
"Let us walk though an example to evaluate a model's ability to extract symbols from code using the OpenAI **Evals** framework with a custom in-memory dataset."
]
},
{
"cell_type": "markdown",
"id": "c95faa47",
"metadata": {},
"source": [
"### Initialize SDK client\n",
"Creates an `openai.OpenAI` client using the `OPENAI_API_KEY` we exported above. Nothing will run without this."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eacc6ac7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade openai pandas rich --quiet\n",
"\n",
"\n",
"\n",
"import os\n",
"import time\n",
"import openai\n",
"from rich import print\n",
"import pandas as pd\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8200aaf1",
"metadata": {},
"source": [
"### Dataset factory & grading rubric\n",
"* `get_dataset` builds a small in-memory dataset by reading several SDK files.\n",
"* `structured_output_grader` defines a detailed evaluation rubric.\n",
"* `client.evals.create(...)` registers the eval with the platform."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b272e193",
"metadata": {},
"outputs": [],
"source": [
"def get_dataset(limit=None):\n",
" openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
"\n",
" file_paths = [\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
" os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
" ]\n",
"\n",
" items = []\n",
" for file_path in file_paths:\n",
" items.append({\"input\": open(file_path, \"r\").read()})\n",
" if limit:\n",
" return items[:limit]\n",
" return items\n",
"\n",
"\n",
"structured_output_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of extracted information from a code file.\n",
"You will be given a code file and a list of extracted information.\n",
"You should grade the quality of the extracted information.\n",
"\n",
"You should grade the quality on a scale of 1 to 7.\n",
"You should apply the following criteria, and calculate your score as follows:\n",
"You should first check for completeness on a scale of 1 to 7.\n",
"Then you should apply a quality modifier.\n",
"\n",
"The quality modifier is a multiplier from 0 to 1 that you multiply by the completeness score.\n",
"If there is 100% coverage for completion and it is all high quality, then you would return 7*1.\n",
"If there is 100% coverage for completion but it is all low quality, then you would return 7*0.5.\n",
"etc.\n",
"\"\"\"\n",
"\n",
"structured_output_grader_user_prompt = \"\"\"\n",
"\n",
"{{item.input}}\n",
"\n",
"\n",
"
evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_68487dcc749081918ec2571e76cc9ef6 completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
"ResultCounts(errored=0, failed=1, passed=0, total=1)\n",
"\n"
],
"text/plain": [
"evalrun_68487dcdaba0819182db010fe5331f2e completed\n",
"\u001b[1;35mResultCounts\u001b[0m\u001b[1m(\u001b[0m\u001b[33merrored\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mfailed\u001b[0m=\u001b[1;36m1\u001b[0m, \u001b[33mpassed\u001b[0m=\u001b[1;36m0\u001b[0m, \u001b[33mtotal\u001b[0m=\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"### Utility poller\n",
"def poll_runs(eval_id, run_ids):\n",
" while True:\n",
" runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]\n",
" for run in runs:\n",
" print(run.id, run.status, run.result_counts)\n",
" if all(run.status in {\"completed\", \"failed\"} for run in runs):\n",
" # dump results to file\n",
" for run in runs:\n",
" with open(f\"{run.id}.json\", \"w\") as f:\n",
" f.write(\n",
" client.evals.runs.output_items.list(\n",
" run_id=run.id, eval_id=eval_id\n",
" ).model_dump_json(indent=4)\n",
" )\n",
" break\n",
" time.sleep(5)\n",
"\n",
"poll_runs(logs_eval.id, [gpt_4one_completions_run.id, gpt_4one_responses_run.id])"
]
},
{
"cell_type": "markdown",
"id": "77331859",
"metadata": {},
"source": [
"### Load outputs for quick inspection\n",
"We will fetch the output items for both runs so we can print or post‑process them."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c316e6eb",
"metadata": {},
"outputs": [],
"source": [
"completions_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_completions_run.id, eval_id=logs_eval.id\n",
")\n",
"\n",
"responses_output = client.evals.runs.output_items.list(\n",
" run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1cc61c54",
"metadata": {},
"source": [
"### Human-readable dump\n",
"Let us print a side-by-side view of completions vs responses."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "9f1b502e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"| Completions Output | \n", "Responses Output | \n", "
|---|---|
| {\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvals\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"AsyncEvalsWithRawResponse\",\"symbol_type\":\"class\"},{\"name\":\"EvalsWithStreamingResponse\",\"symb... | \n", "{\"symbols\":[{\"name\":\"Evals\",\"symbol_type\":\"class\"},{\"name\":\"runs\",\"symbol_type\":\"property\"},{\"name\":\"with_raw_response\",\"symbol_type\":\"property\"},{\"name\":\"with_streaming_response\",\"symbol_type\":\"property\"},{\"name\":\"create\",\"symbol_type\":\"function\"},{... | \n", "