{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Eval-Driven System Design: From Prototype to Production\n", "\n", "## Overview\n", "\n", "### Purpose of This Cookbook\n", "\n", "This cookbook provides a **practical**, end-to-end guide on how to effectively use \n", "evals as the core process in creating a production-grade autonomous system to \n", "replace a labor-intensive human workflow. It's a direct product of collaborative \n", "experience dealing with projects where users may not have started with pristine \n", "labeled data or a perfect understanding of the problem - two issues that most tutorials gloss \n", "over but are in practice almost always serious challenges.\n", "\n", "Making evals the core process prevents poke-and-hope guesswork and impressionistic\n", "judgments of accuracy, instead demanding engineering rigor. This means we can make\n", "principled decisions about cost trade-offs and investment. \n", "\n", "### Target Audience\n", "\n", "This guide is designed for ML/AI engineers and Solution Architects who are\n", "looking for practical guidance beyond introductory tutorials. This notebook is fully\n", "executable and organized to be as modular as possible to support using code\n", "samples directly in your own applications.\n", "\n", "### Guiding Narrative: From Tiny Seed to Production System\n", "\n", "We'll follow a realistic storyline: replacing a manual receipt-analysis service for validating expenses.\n", "\n", "* **Start Small:** Begin with a very small set of labeled data (retail receipts). Many businesses don't have good ground truth data sets. \n", "* **Build Incrementally:** Develop a minimal viable system and establish initial evals. \n", "* **Business Alignment:** Evaluate eval performance in the context of business KPIs and\n", " dollar impact, and target efforts to avoid working on low-impact improvements.\n", "* **Eval-Driven Iteration:** Iteratively improve by using eval scores to power model\n", " improvements, then by using better models on more data to expand evals and identify more\n", " areas for improvement.\n", "\n", "### How to Use This Cookbook\n", "\n", "This cookbook is structured as an eval-centric guide through the lifecycle of building\n", "an LLM application.\n", "\n", "1. If you're primarily interested in the ideas presented, read through the text and skim over\n", " the code.\n", "2. If you're here because of something else you're working on, you can go ahead and jump to that\n", " section and dig into the code there, copy it, and adapt it to your needs.\n", "3. If you want to really understand how this all works, download this notebook and run\n", " the cells as you read through it; edit the code to make your own changes, test your\n", " hypotheses, and make sure you actually understand how it all works together.\n", "\n", "> Note: If your OpenAI organization has a Zero Data Retention (ZDR) policy, Evals will still be available, but will retain data to maintain application state." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Case: Receipt Parsing\n", "\n", "In order to condense this guide we'll be using a small hypothetical problem that's still complex\n", "enough to merit detailed and multi-faceted evals. 
In particular, we'll be focused on how\n", "to solve a problem given a limited amount of data to work with, so we're working with a\n", "dataset that's quite small.\n", "\n", "### Problem Definition\n", "\n", "For this guide, we assume that we are starting with a workflow for reviewing and filing \n", "receipts. While, in general, this is a problem that already has a lot of established \n", "solutions, it's analogous to other problems that don't have nearly so much prior work; \n", "further, even when good enterprise solutions exist, there is often a \n", "\"last mile\" problem that still requires human time.\n", "\n", "In our case, we'll assume we have a pipeline where:\n", "\n", "* People upload photos of receipts\n", "* An accounting team reviews each receipt to categorize and approve or audit the expense\n", "\n", "Based on interviews with the accounting team, they make their decisions based on:\n", "\n", "1. Merchant\n", "2. Geographic location\n", "3. Expense amount\n", "4. Items or services purchased\n", "5. Handwritten notes or annotations\n", "\n", "Our system will be expected to handle most receipts without any human intervention, but\n", "escalate low-confidence decisions for human QA. We'll be focused on reducing the total\n", "cost of the accounting process, which depends on:\n", "\n", "1. How much the previous / current system cost to run per-receipt\n", "2. How many receipts the new system sends to QA\n", "3. How much the system costs to run per-receipt, plus any fixed costs\n", "4. What the business impact is of mistakes, either receipts kicked out for review or mistakes missed\n", "5. The cost of engineering to develop and integrate the system\n", "\n", "### Dataset Overview\n", "\n", "The receipt images come from the CC BY 4.0 licensed\n", "[Receipt Handwriting Detection Computer Vision Project](https://universe.roboflow.com/newreceipts/receipt-handwriting-detection)\n", "dataset published by Roboflow. We've added our own labels and narrative spin in order to\n", "tell a story with a small number of examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project Lifecycle\n", "\n", "Not every project will proceed in the same way, but projects generally have some \n", "important components in common.\n", "\n", "![Project Lifecycle](../../../images/partner_project_lifecycle.png)\n", "\n", "The solid arrows show the primary progressions or steps, while the dotted line \n", "represents the ongoing nature of problem understanding - uncovering more about\n", "the customer domain will influence every step of the process. We will examine \n", "several of these iterative cycles of refinement in detail below.\n", "\n", "### 1. Understand the Problem\n", "\n", "Usually, the decision to start an engineering process is made by leadership who\n", "understand the business impact but don't need to know the process details. In our\n", "example, we're building a system designed to replace a non-AI workflow. 
In a sense this\n", "is ideal: we have a set of domain experts, *the people currently doing the task*, whom we\n", "can interview to understand the task details and whom we can lean on to help develop\n", "appropriate evals.\n", "\n", "This step doesn't end before we start building our system; invariably, our initial\n", "assessments are an incomplete understanding of the problem space and we will continue to\n", "refine our understanding as we get closer to a solution.\n", "\n", "### 2. Assemble Examples (Gather Data)\n", "\n", "It's very rare for a real-world project to begin with all the data necessary to achieve a satisfactory solution, let alone establish confidence.\n", "\n", "In our case, we'll assume we have a decent sample of system *inputs*, in the form of receipt images, but start without any fully annotated data. We find this is a not-unusual situation when automating an existing process. We'll walk through the process of incrementally expanding our test and training sets in collaboration with domain experts as we go along and make our evals progressively more comprehensive.\n", "\n", "### 3. Build an End-to-End V0 System\n", "\n", "We want to get the skeleton of a system built as quickly as possible. We don't need a\n", "system that performs well - we just need something that accepts the right inputs and\n", "provides outputs of the correct type. Usually this is almost as simple as describing the\n", "task in a prompt, adding the inputs, and using a single model (usually with structured\n", "outputs) to make an initial best-effort attempt.\n", "\n", "### 4. Label Data and Build Initial Evals\n", "\n", "We've found that in the absence of an established ground truth, it's not uncommon to \n", "use an early version of a system to generate 'draft' truth data, which can be annotated \n", "or corrected by domain experts.\n", "\n", "Once we have an end-to-end system constructed, we can start processing the inputs we\n", "have to generate plausible outputs. We'll send these to our domain experts to grade \n", "and correct. We will use these corrections and conversations about how the experts \n", "are making their decisions to design further evals and to embed expertise in the system.\n", "\n", "### 5. Map Evals to Business Metrics\n", "\n", "Before we jump into correcting every error, we need to make sure that we're investing\n", "time effectively. The most critical task at this stage is to review our evals and\n", "gain an understanding of how they connect to our key objectives.\n", "\n", "- Step back and assess the potential costs and benefits of the system\n", "- Identify which eval measurements speak directly to those costs and benefits\n", "- For example, what does \"failure\" on a particular eval cost? Are we measuring\n", "  something worthwhile?\n", "- Create a (non-LLM) model that uses eval metrics to provide a dollar value\n", "- Balance performance (accuracy or speed) with cost to develop and run\n", "\n", "### 6. Progressively Improve System and Evals\n", "\n", "Having identified which efforts are most worth making, we can begin iterating on \n", "improvements to the system. The evals act as an objective guide so we know when we've\n", "made the system good enough, and ensure we avoid or identify regressions.\n", "\n", "### 7. Integrate QA Process and Ongoing Improvements\n", "\n", "Evals aren't just for development. 
Instrumenting all or a portion of a production\n", "service will surface more useful test and training samples over time, identifying\n", "incorrect assumptions or finding areas with insufficient coverage. This is also the only\n", "way you can ensure that your models continue performing well long after your initial\n", "development process is complete." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## V0 System Construction\n", "\n", "In practice, we would probably be building a system that operates via a REST API,\n", "possibly with some web frontend that would have access to some set of components and\n", "resources. For the purposes of this cookbook, we'll distill that down to a pair of\n", "functions, `extract_receipt_details` and `evaluate_receipt_for_audit` that collectively\n", "decide what we should do with a given receipt.\n", "\n", "- `extract_receipt_details` will take an image as input and produce structured output\n", " containing important details about the receipt.\n", "- `evaluate_receipt_for_audit` will take that structure as input and decide whether or\n", " not the receipt should be audited.\n", "\n", "> Breaking up a process into steps like this has both pros and cons; it is easier to\n", "> examine and develop if the process is made up of small isolated steps. But you can\n", "> progressively lose information, effectively letting your agents play \"telephone\". In\n", "> this notebook we break up the steps and don't let the auditor see the actual receipt\n", "> because it's more instructive for the evals we want to discuss.\n", "\n", "We'll start with the first step, the literal data extraction. This is *intermediate*\n", "data: it's information that people would examine implicitly, but often isn't recorded.\n", "And for this reason, we often don't have labeled data to work from." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade openai pydantic python-dotenv rich persist-cache -qqq\n", "%load_ext dotenv\n", "%dotenv\n", "\n", "# Place your API key in a file called .env\n", "# OPENAI_API_KEY=sk-..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Structured Output Model\n", "\n", "Capture the meaningful information in a structured output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel\n", "\n", "\n", "class Location(BaseModel):\n", " city: str | None\n", " state: str | None\n", " zipcode: str | None\n", "\n", "\n", "class LineItem(BaseModel):\n", " description: str | None\n", " product_code: str | None\n", " category: str | None\n", " item_price: str | None\n", " sale_price: str | None\n", " quantity: str | None\n", " total: str | None\n", "\n", "\n", "class ReceiptDetails(BaseModel):\n", " merchant: str | None\n", " location: Location\n", " time: str | None\n", " items: list[LineItem]\n", " subtotal: str | None\n", " tax: str | None\n", " total: str | None\n", " handwritten_notes: list[str]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> *Note*: Normally we would use `decimal.Decimal` objects for the numbers above and `datetime.datetime` objects for `time` field, but neither of those deserialize well. For the purposes of this cookbook, we'll work with strings, but in practice you'd want to have another level of translation to get the correct output validated." 
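, "\n", "\n", "As a minimal sketch of that translation layer (a hypothetical helper, not used elsewhere in this notebook), you might convert the extracted strings once parsing succeeds:\n", "\n", "```python\n", "from datetime import datetime\n", "from decimal import Decimal, InvalidOperation\n", "\n", "\n", "def parse_money(value: str | None) -> Decimal | None:\n", "    # Strip currency symbols and separators; return None so bad values surface for review.\n", "    if value is None:\n", "        return None\n", "    try:\n", "        return Decimal(value.replace(\"$\", \"\").replace(\",\", \"\"))\n", "    except InvalidOperation:\n", "        return None\n", "\n", "\n", "def parse_time(value: str | None) -> datetime | None:\n", "    try:\n", "        return datetime.fromisoformat(value) if value else None\n", "    except ValueError:\n", "        return None\n", "```"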
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Info Extraction\n", "\n", "Let's build our `extract_receipt_details` function.\n", "\n", "Usually, for the very first stab at something that might work, we'll simply feed ChatGPT\n", "the available documents we've assembled so far and ask it to generate a prompt. It's not\n", "worth spending too much time on prompt engineering before you have a benchmark to grade\n", "yourself against! This is a prompt produced by o4-mini based on the problem description\n", "above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "BASIC_PROMPT = \"\"\"\n", "Given an image of a retail receipt, extract all relevant information and format it as a structured response.\n", "\n", "# Task Description\n", "\n", "Carefully examine the receipt image and identify the following key information:\n", "\n", "1. Merchant name and any relevant store identification\n", "2. Location information (city, state, ZIP code)\n", "3. Date and time of purchase\n", "4. All purchased items with their:\n", " * Item description/name\n", " * Item code/SKU (if present)\n", " * Category (infer from context if not explicit)\n", " * Regular price per item (if available)\n", " * Sale price per item (if discounted)\n", " * Quantity purchased\n", " * Total price for the line item\n", "5. Financial summary:\n", " * Subtotal before tax\n", " * Tax amount\n", " * Final total\n", "6. Any handwritten notes or annotations on the receipt (list each separately)\n", "\n", "## Important Guidelines\n", "\n", "* If information is unclear or missing, return null for that field\n", "* Format dates as ISO format (YYYY-MM-DDTHH:MM:SS)\n", "* Format all monetary values as decimal numbers\n", "* Distinguish between printed text and handwritten notes\n", "* Be precise with amounts and totals\n", "* For ambiguous items, use your best judgment based on context\n", "\n", "Your response should be structured and complete, capturing all available information\n", "from the receipt.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import base64\n", "import mimetypes\n", "from pathlib import Path\n", "\n", "from openai import AsyncOpenAI\n", "\n", "client = AsyncOpenAI()\n", "\n", "\n", "async def extract_receipt_details(\n", " image_path: str, model: str = \"o4-mini\"\n", ") -> ReceiptDetails:\n", " \"\"\"Extract structured details from a receipt image.\"\"\"\n", " # Determine image type for data URI.\n", " mime_type, _ = mimetypes.guess_type(image_path)\n", "\n", " # Read and base64 encode the image.\n", " b64_image = base64.b64encode(Path(image_path).read_bytes()).decode(\"utf-8\")\n", " image_data_url = f\"data:{mime_type};base64,{b64_image}\"\n", "\n", " response = await client.responses.parse(\n", " model=model,\n", " input=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"input_text\", \"text\": BASIC_PROMPT},\n", " {\"type\": \"input_image\", \"image_url\": image_data_url},\n", " ],\n", " }\n", " ],\n", " text_format=ReceiptDetails,\n", " )\n", "\n", " return response.output_parsed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test on one receipt\n", "\n", "Let's evaluate just a single receipt and review it manually to see how well a smart model with a naive prompt can do." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Walmart_image\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rich import print\n", "\n", "receipt_image_dir = Path(\"data/test\")\n", "ground_truth_dir = Path(\"data/ground_truth\")\n", "\n", "example_receipt = Path(\n", " \"data/train/Supplies_20240322_220858_Raven_Scan_3_jpeg.rf.50852940734939c8838819d7795e1756.jpg\"\n", ")\n", "result = await extract_receipt_details(example_receipt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll get different answers if we re-run it, but it usually gets most things correct\n", "with a few errors. Here's a specific example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "walmart_receipt = ReceiptDetails(\n", " merchant=\"Walmart\",\n", " location=Location(city=\"Vista\", state=\"CA\", zipcode=\"92083\"),\n", " time=\"2023-06-30T16:40:45\",\n", " items=[\n", " LineItem(\n", " description=\"SPRAY 90\",\n", " product_code=\"001920056201\",\n", " category=None,\n", " item_price=None,\n", " sale_price=None,\n", " quantity=\"2\",\n", " total=\"28.28\",\n", " ),\n", " LineItem(\n", " description=\"LINT ROLLER 70\",\n", " product_code=\"007098200355\",\n", " category=None,\n", " item_price=None,\n", " sale_price=None,\n", " quantity=\"1\",\n", " total=\"6.67\",\n", " ),\n", " LineItem(\n", " description=\"SCRUBBER\",\n", " product_code=\"003444193232\",\n", " category=None,\n", " item_price=None,\n", " sale_price=None,\n", " quantity=\"2\",\n", " total=\"12.70\",\n", " ),\n", " LineItem(\n", " description=\"FLOUR SACK 10\",\n", " product_code=\"003444194263\",\n", " category=None,\n", " item_price=None,\n", " sale_price=None,\n", " quantity=\"1\",\n", " total=\"0.77\",\n", " ),\n", " ],\n", " subtotal=\"50.77\",\n", " tax=\"4.19\",\n", " total=\"54.96\",\n", " handwritten_notes=[],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model extracted a lot of things correctly, but renamed some of the line\n", "items - incorrectly, in fact. More importantly, it got some of the prices wrong, and it\n", "decided not to categorize any of the line items.\n", "\n", "That's okay, we don't expect to have perfect answers at this point! Instead, our\n", "objective is to build a basic system we can evaluate. Then, when we start iterating, we\n", "won't be 'vibing' our way to something that *looks* better -- we'll be engineering a\n", "reliable solution. But first, we'll add an action decision to complete our draft system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Action Decision\n", "\n", "Next, we need to close the loop and get to an actual decision based on receipts. This\n", "looks pretty similar, so we'll present the code without comment.\n", "\n", "Ordinarily one would start with the most capable model - `o3`, at this time - for a \n", "first pass, and then once correctness is established experiment with different models\n", "to analyze any tradeoffs for their business impact, and potentially consider whether \n", "they are remediable with iteration. A client may be willing to take a certain accuracy \n", "hit for lower latency or cost, or it may be more effective to change the architecture\n", "to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs\n", "explicitly and objectively later on. \n", "\n", "For this cookbook, `o3` might be too good. 
We'll use `o4-mini` for our first pass, so \n", "that we get a few reasoning errors we can use to illustrate the means of addressing\n", "them when they occur." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "\n", "audit_prompt = \"\"\"\n", "Evaluate this receipt data to determine if it needs to be audited based on the following\n", "criteria:\n", "\n", "1. NOT_TRAVEL_RELATED:\n", "   - IMPORTANT: For this criterion, travel-related expenses include but are not limited\n", "     to: gas, hotel, airfare, or car rental.\n", "   - If the receipt IS for a travel-related expense, set this to FALSE.\n", "   - If the receipt is NOT for a travel-related expense (like office supplies), set this\n", "     to TRUE.\n", "   - In other words, if the receipt shows FUEL/GAS, this would be FALSE because gas IS\n", "     travel-related.\n", "\n", "2. AMOUNT_OVER_LIMIT: The total amount exceeds $50\n", "\n", "3. MATH_ERROR: The math for computing the total doesn't add up (line items don't sum to\n", "   total)\n", "\n", "4. HANDWRITTEN_X: There is an \"X\" in the handwritten notes\n", "\n", "For each criterion, determine if it is violated (true) or not (false). Provide your\n", "reasoning for each decision, and make a final determination on whether the receipt needs\n", "auditing. A receipt needs auditing if ANY of the criteria are violated.\n", "\n", "Return a structured response with your evaluation.\n", "\"\"\"\n", "\n", "\n", "class AuditDecision(BaseModel):\n", "    not_travel_related: bool = Field(\n", "        description=\"True if the receipt is not travel-related\"\n", "    )\n", "    amount_over_limit: bool = Field(description=\"True if the total amount exceeds $50\")\n", "    math_error: bool = Field(description=\"True if there are math errors in the receipt\")\n", "    handwritten_x: bool = Field(\n", "        description=\"True if there is an 'X' in the handwritten notes\"\n", "    )\n", "    reasoning: str = Field(description=\"Explanation for the audit decision\")\n", "    needs_audit: bool = Field(\n", "        description=\"Final determination if receipt needs auditing\"\n", "    )\n", "\n", "\n", "async def evaluate_receipt_for_audit(\n", "    receipt_details: ReceiptDetails, model: str = \"o4-mini\"\n", ") -> AuditDecision:\n", "    \"\"\"Determine if a receipt needs to be audited based on defined criteria.\"\"\"\n", "    # Convert receipt details to JSON for the prompt\n", "    receipt_json = receipt_details.model_dump_json(indent=2)\n", "\n", "    response = await client.responses.parse(\n", "        model=model,\n", "        input=[\n", "            {\n", "                \"role\": \"user\",\n", "                \"content\": [\n", "                    {\"type\": \"input_text\", \"text\": audit_prompt},\n", "                    {\"type\": \"input_text\", \"text\": f\"Receipt details:\\n{receipt_json}\"},\n", "                ],\n", "            }\n", "        ],\n", "        text_format=AuditDecision,\n", "    )\n", "\n", "    return response.output_parsed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A schematic of the overall process shows two LLM calls:\n", "\n", "![Process Flowchart](../../../images/partner_process_flowchart.png)\n", "\n", "If we run our above example through this model, here's what we get -- again, we'll use \n", "an example result here. When you run the code you might get slightly different results."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audit_decision = await evaluate_receipt_for_audit(result)\n", "print(audit_decision)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audit_decision = AuditDecision(\n", " not_travel_related=True,\n", " amount_over_limit=True,\n", " math_error=False,\n", " handwritten_x=False,\n", " reasoning=\"\"\"\n", " The receipt from Walmart is for office supplies, which are not travel-related, thus NOT_TRAVEL_RELATED is TRUE.\n", " The total amount of the receipt is $54.96, which exceeds the limit of $50, making AMOUNT_OVER_LIMIT TRUE.\n", " The subtotal ($50.77) plus tax ($4.19) correctly sums to the total ($54.96), so there is no MATH_ERROR.\n", " There are no handwritten notes, so HANDWRITTEN_X is FALSE.\n", " Since two criteria (amount over limit and travel-related) are violated, the receipt\n", " needs auditing.\n", " \"\"\",\n", " needs_audit=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example illustrates why we care about end-to-end evals and why we can't use them in\n", "isolation. Here, the initial extraction had OCR errors and forwarded the prices to the\n", "auditor that don't add up to the total, but the auditor fails to detect it and asserts\n", "there are no math errors. However, missing this doesn't change the audit decision\n", "because it did pick up on the other two reasons the receipt needs to be audited.\n", "\n", "Thus, `AuditDecision` is factually incorrect, but the decision that we care about\n", "is correct. This gives us an edge to improve upon, but also guides us toward making\n", "sound choices for where and when we apply our engineering efforts.\n", "\n", "With that said, let's build ourselves some evals!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Evals\n", "\n", "Once we have a minimally functional system we should process more inputs and get domain\n", "experts to help develop ground-truth data. Domain experts doing expert tasks may not\n", "have much time to devote to our project, so we want to be efficient and start small,\n", "aiming for breadth rather than depth at first.\n", "\n", "> If your data *doesn't* require domain expertise, then you'd want to reach for a\n", "> labeling solution (such as [Label Studio](https://labelstud.io/)) and attempt to annotate\n", "> as much data as you can given the policy, budget, and data availability restrictions.\n", "> In this case, we're going to proceed as if data labeling is a scarce resource; one we\n", "> can rely on for small amounts each week, but these are people with other job\n", "> responsibilities whose time and willingness to help may be limited. Sitting with these\n", "> experts to help annotate examples can help make selecting future examples more\n", "> efficient.\n", "\n", "Because we have a chain of two steps, we'll be collecting tuples of type\n", "`[FilePath, ReceiptDetails, AuditDecision]`. 
Generally, the way to do this is to take\n", "unlabeled samples, run them through our model, and then have experts correct the output.\n", "For the purposes of this notebook, we've already gone through that process for all the\n", "receipt images in `data/test`.\n", "\n", "### Additional Considerations\n", "\n", "There's a little more to it than that though, because when you are evaluating a\n", "multistep process it's important to know both the end to end performance and the\n", "performance of each individual step, *conditioned on the output of the prior step*.\n", "\n", "In this case, we want to evaluate:\n", "\n", "1. Given an input image, how well do we extract the information we need?\n", "2. Given receipt information, how good is our **judgement** for our audit decision?\n", "3. Given an input image, how **successful** are we about making our final audit decision?\n", "\n", "The phrasing difference between #2 and #3 is because if we give our auditor incorrect\n", "data, we expect it to come to incorrect conclusions. What we *want* is to be confident\n", "that the auditor is making the correct decision based on the evidence available, even if\n", "that evidence is misleading. If we don't pay attention to that case, we can end up\n", "training the auditor to ignore its inputs and cause our overall performance to degrade." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graders\n", "\n", "The core component of an eval is the\n", "[grader](https://platform.openai.com/docs/guides/graders). Our eventual eval is going to\n", "use 18 of them, but we only use three kinds, and they're all quite conceptually\n", "straightforward.\n", "\n", "Here are examples of one of our string check graders, one of our text similarity\n", "graders, and finally one of our model graders." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example_graders = [\n", " {\n", " \"name\": \"Total Amount Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.total }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.total }}\",\n", " },\n", " {\n", " \"name\": \"Merchant Name Accuracy\",\n", " \"type\": \"text_similarity\",\n", " \"input\": \"{{ item.predicted_receipt_details.merchant }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.merchant }}\",\n", " \"pass_threshold\": 0.8,\n", " \"evaluation_metric\": \"bleu\",\n", " },\n", "]\n", "\n", "# A model grader needs a prompt to instruct it in what it should be scoring.\n", "missed_items_grader_prompt = \"\"\"\n", "Your task is to evaluate the correctness of a receipt extraction model.\n", "\n", "The following items are the actual (correct) line items from a specific receipt.\n", "\n", "{{ item.correct_receipt_details.items }}\n", "\n", "The following items are the line items extracted by the model.\n", "\n", "{{ item.predicted_receipt_details.items }}\n", "\n", "Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.\n", "\n", "The line items are permitted to have small differences or extraction mistakes, but each\n", "item from the actual receipt must be present in some form in the model's output. 
Only\n", "evaluate whether there are MISSED items; ignore other mistakes or extra items.\n", "\"\"\"\n", "\n", "example_graders.append(\n", " {\n", " \"name\": \"Missed Line Items\",\n", " \"type\": \"score_model\",\n", " \"model\": \"o4-mini\",\n", " \"input\": [{\"role\": \"system\", \"content\": missed_items_grader_prompt}],\n", " \"range\": [0, 1],\n", " \"pass_threshold\": 1,\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each grader evaluates some portion of a predicted output. This might be a very narrow\n", "check for a specific field in a structured output, or a more holistic check that\n", "judges an output in its entirety. Some graders can work without context, and evaluate an\n", "output in isolation (for example, an LLM judge that is evaluating if a paragraph is rude\n", "or inappropriate). Others can evaluate based on the input and output, while while the\n", "ones we're using here rely on an output and a ground-truth (correct) output to compare\n", "against.\n", "\n", "The most direct way of using Evals provides a prompt and a model, and lets the eval run\n", "on an input to generate output itself. Another useful method uses previously logged\n", "responses or completions as the source of the outputs. It's not quite as simple, but the\n", "most flexible thing we can do is to supply an item containing everything we want it to\n", "use—this allows us to have the \"prediction\" function be an arbitrary system rather than\n", "restricting it to a single model call. This is how we're using it in the examples below;\n", "the `EvaluationRecord` shown below will be used to populate the `{{ }}` template\n", "variables.\n", "\n", "> **Note on Model Selection:** \n", "> Selecting the right model is crucial. While faster, less expensive models are often preferable in production, development workflows benefit from prioritizing the most capable models available. For this guide, we use `o4-mini` for both system tasks and LLM-based grading—while `o3` is more capable, our experience suggests the difference in output quality is modest relative to the substantial increase in cost. In practice, spending $10+/day/engineer on evals is typical, but scaling to $100+/day/engineer may not be sustainable.\n", ">\n", "> Nonetheless, it's valuable to periodically benchmark with a more advanced model like `o3`. If you observe significant improvements, consider incorporating it for a representative subset of your evaluation data. Discrepancies between models can reveal important edge cases and guide system improvements." 
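, "\n", "\n", "To make the template mechanics concrete, here is an illustrative (made-up) record showing what the `{{ item.* }}` references in a `string_check` grader resolve to; the real records are built by the code below:\n", "\n", "```python\n", "# One dataset row, in the shape the graders read from (placeholder values only).\n", "record = {\n", "    \"item\": {\n", "        \"correct_receipt_details\": {\"merchant\": \"Walmart\", \"total\": \"54.96\"},\n", "        \"predicted_receipt_details\": {\"merchant\": \"WAL*MART\", \"total\": \"54.96\"},\n", "    }\n", "}\n", "\n", "# \"{{ item.predicted_receipt_details.total }}\" -> record[\"item\"][\"predicted_receipt_details\"][\"total\"]\n", "# The \"Total Amount Accuracy\" string_check grader compares that value for equality with\n", "# the reference template, so this record would pass that grader.\n", "```"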
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "\n", "class EvaluationRecord(BaseModel):\n", " \"\"\"Holds both the correct (ground truth) and predicted audit decisions.\"\"\"\n", "\n", " receipt_image_path: str\n", " correct_receipt_details: ReceiptDetails\n", " predicted_receipt_details: ReceiptDetails\n", " correct_audit_decision: AuditDecision\n", " predicted_audit_decision: AuditDecision\n", "\n", "\n", "async def create_evaluation_record(image_path: Path, model: str) -> EvaluationRecord:\n", " \"\"\"Create a ground truth record for a receipt image.\"\"\"\n", " extraction_path = ground_truth_dir / \"extraction\" / f\"{image_path.stem}.json\"\n", " correct_details = ReceiptDetails.model_validate_json(extraction_path.read_text())\n", " predicted_details = await extract_receipt_details(image_path, model)\n", "\n", " audit_path = ground_truth_dir / \"audit_results\" / f\"{image_path.stem}.json\"\n", " correct_audit = AuditDecision.model_validate_json(audit_path.read_text())\n", " predicted_audit = await evaluate_receipt_for_audit(predicted_details, model)\n", "\n", " return EvaluationRecord(\n", " receipt_image_path=image_path.name,\n", " correct_receipt_details=correct_details,\n", " predicted_receipt_details=predicted_details,\n", " correct_audit_decision=correct_audit,\n", " predicted_audit_decision=predicted_audit,\n", " )\n", "\n", "\n", "async def create_dataset_content(\n", " receipt_image_dir: Path, model: str = \"o4-mini\"\n", ") -> list[dict]:\n", " # Assemble paired samples of ground truth data and predicted results. You could\n", " # instead upload this data as a file and pass a file id when you run the eval.\n", " tasks = [\n", " create_evaluation_record(image_path, model)\n", " for image_path in receipt_image_dir.glob(\"*.jpg\")\n", " ]\n", " return [{\"item\": record.model_dump()} for record in await asyncio.gather(*tasks)]\n", "\n", "\n", "file_content = await create_dataset_content(receipt_image_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the graders and the data, creating and running our evals is very straightforward:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from persist_cache import cache\n", "\n", "\n", "# We're caching the output so that if we re-run this cell we don't create a new eval.\n", "@cache\n", "async def create_eval(name: str, graders: list[dict]):\n", " eval_cfg = await client.evals.create(\n", " name=name,\n", " data_source_config={\n", " \"type\": \"custom\",\n", " \"item_schema\": EvaluationRecord.model_json_schema(),\n", " \"include_sample_schema\": False, # Don't generate new completions.\n", " },\n", " testing_criteria=graders,\n", " )\n", " print(f\"Created new eval: {eval_cfg.id}\")\n", " return eval_cfg\n", "\n", "\n", "initial_eval = await create_eval(\n", " \"Initial Receipt Processing Evaluation\", example_graders\n", ")\n", "\n", "# Run the eval.\n", "eval_run = await client.evals.runs.create(\n", " name=\"initial-receipt-processing-run\",\n", " eval_id=initial_eval.id,\n", " data_source={\n", " \"type\": \"jsonl\",\n", " \"source\": {\"type\": \"file_content\", \"content\": file_content},\n", " },\n", ")\n", "print(f\"Evaluation run created: {eval_run.id}\")\n", "print(f\"View results at: {eval_run.report_url}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you run that eval you'll be able to view it in the UI, and should see something\n", "like the below. 
\n", "(Note: if you have a Zero Data Retention agreement, this data is not stored\n", "by OpenAI, so it will not be available in this interface.)\n", "\n", "![Summary UI](../../../images/partner_summary_ui.png)\n", "\n", "You can drill into the data tab to look at individual examples:\n", "\n", "![Details UI](../../../images/partner_details_ui.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connecting Evals to Business Metrics\n", "\n", "Evals show you where you can improve, and help track progress and regressions over time.\n", "But the three evals above are just measurements — we need to imbue them with raison\n", "d'être.\n", "\n", "The first thing we need is to add evaluations for the final stage of our receipt\n", "processing, so that we can start seeing the results of our audit decisions. The next\n", "thing we need, the most important, is a *model of business relevance*.\n", "\n", "### A Business Model\n", "\n", "It's almost never easy to work out what costs and benefits you could get out of a new\n", "system depending on how well it performs. Often people will avoid trying to put\n", "numbers to things because they know how much uncertainty there is and they don't want to\n", "make guesses that make them look bad. That's okay; we just have to make our best guess,\n", "and if we get more information later we can refine our model.\n", "\n", "For this cookbook, we're going to create a simple cost structure:\n", "\n", "- our company processes 1 million receipts a year, at a baseline cost of $0.20 /\n", "  receipt\n", "- auditing a receipt costs about $2\n", "- failing to audit a receipt we should have audited costs an average of $30\n", "- 5% of receipts need to be audited\n", "- the existing process\n", "  - identifies receipts that need to be audited 97% of the time\n", "  - misidentifies receipts that don't need to be audited 2% of the time\n", "\n", "This gives us two baseline comparisons:\n", "\n", "- if we identified every receipt correctly, we would spend $100,000 on audits\n", "- our current process spends $135,000 on audits and loses $45,000 to un-audited expenses\n", "\n", "On top of that, the human-driven process costs an additional $200,000.\n", "\n", "We're expecting our service to save money by costing less to run (≈1¢/receipt if we use\n", "the prompts from above with `o4-mini`), but whether we save or lose money on audits and\n", "missed audits depends on how well our system performs. 
It might be worth writing this as\n", "a simple function — written below is a version that includes the above factors but\n", "neglects nuance and ignores development, maintenance, and serving costs.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_costs(fp_rate: float, fn_rate: float, per_receipt_cost: float):\n", " audit_cost = 2\n", " missed_audit_cost = 30\n", " receipt_count = 1e6\n", " audit_fraction = 0.05\n", "\n", " needs_audit_count = receipt_count * audit_fraction\n", " no_needs_audit_count = receipt_count - needs_audit_count\n", "\n", " missed_audits = needs_audit_count * fn_rate\n", " total_audits = needs_audit_count * (1 - fn_rate) + no_needs_audit_count * fp_rate\n", "\n", " audit_cost = total_audits * audit_cost\n", " missed_audit_cost = missed_audits * missed_audit_cost\n", " processing_cost = receipt_count * per_receipt_cost\n", "\n", " return audit_cost + missed_audit_cost + processing_cost\n", "\n", "\n", "perfect_system_cost = calculate_costs(0, 0, 0)\n", "current_system_cost = calculate_costs(0.02, 0.03, 0.20)\n", "\n", "print(f\"Current system cost: ${current_system_cost:,.0f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Connecting Back To Evals\n", "\n", "The point of the above model is it lets us apply meaning to an eval that would\n", "otherwise just be a number. For instance, when we ran the system above we were wrong 85%\n", "of the time for merchant names. But digging in, it seems like most instances are\n", "capitalization issues or \"Shell Gasoline\" vs. \"Shell Oil #2144\" — problems that when\n", "we follow through, do not appear to affect our audit decision or change our fundamental\n", "costs.\n", "\n", "On the other hand, it seems like we fail to catch handwritten \"X\"s on receipts about\n", "half the time, and about half of the time when there's an \"X\" on a receipt that gets\n", "missed, it results in a receipt not getting audited when it should. Those are\n", "overrepresented in our dataset, but if that makes up even 1% of receipts, that 50%\n", "failure would cost us $75,000 a year.\n", "\n", "Similarly, it seems like we have OCR errors that cause us to audit receipts quite often\n", "on account of the math not working out, up to 20% of the time. This could cost us almost\n", "$400,000!\n", "\n", "Now, we're in a place to add more graders and start working backwards from the audit\n", "decision accuracy to determine which problems we should focus on.\n", "\n", "Below are the rest of our graders and the results we get with our initial un-optimized\n", "prompts. Note that at this point we do quite badly! Across our 20 samples (8 positive,\n", "12 negative), we had two false negatives and two false positives. If we extrapolated to\n", "our entire business, we'd be losing $375,000 on audits we missed and $475,000 on\n", "unnecessary audits." 
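, "\n", "\n", "One way to make that extrapolation explicit is to turn the confusion counts from an eval run into rates and feed them into the `calculate_costs` model from above. This is a rough sketch under the same simplifying assumptions as that model:\n", "\n", "```python\n", "def extrapolate_annual_cost(fn: int, tp: int, fp: int, tn: int, per_receipt_cost: float) -> float:\n", "    # Share of receipts that needed an audit but were missed.\n", "    fn_rate = fn / (fn + tp)\n", "    # Share of clean receipts that were sent to audit anyway.\n", "    fp_rate = fp / (fp + tn)\n", "    return calculate_costs(fp_rate=fp_rate, fn_rate=fn_rate, per_receipt_cost=per_receipt_cost)\n", "\n", "\n", "# e.g. for the run described above (2 FN / 6 TP, 2 FP / 10 TN) at ~1¢ per receipt:\n", "# extrapolate_annual_cost(fn=2, tp=6, fp=2, tn=10, per_receipt_cost=0.01)\n", "```"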
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simple_extraction_graders = [\n", " {\n", " \"name\": \"Merchant Name Accuracy\",\n", " \"type\": \"text_similarity\",\n", " \"input\": \"{{ item.predicted_receipt_details.merchant }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.merchant }}\",\n", " \"pass_threshold\": 0.8,\n", " \"evaluation_metric\": \"bleu\",\n", " },\n", " {\n", " \"name\": \"Location City Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.location.city }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.location.city }}\",\n", " },\n", " {\n", " \"name\": \"Location State Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.location.state }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.location.state }}\",\n", " },\n", " {\n", " \"name\": \"Location Zipcode Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.location.zipcode }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.location.zipcode }}\",\n", " },\n", " {\n", " \"name\": \"Time Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.time }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.time }}\",\n", " },\n", " {\n", " \"name\": \"Subtotal Amount Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.subtotal }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.subtotal }}\",\n", " },\n", " {\n", " \"name\": \"Tax Amount Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.tax }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.tax }}\",\n", " },\n", " {\n", " \"name\": \"Total Amount Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_receipt_details.total }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.total }}\",\n", " },\n", " {\n", " \"name\": \"Handwritten Notes Accuracy\",\n", " \"type\": \"text_similarity\",\n", " \"input\": \"{{ item.predicted_receipt_details.handwritten_notes }}\",\n", " \"reference\": \"{{ item.correct_receipt_details.handwritten_notes }}\",\n", " \"pass_threshold\": 0.8,\n", " \"evaluation_metric\": \"fuzzy_match\",\n", " },\n", "]\n", "\n", "item_extraction_base = \"\"\"\n", "Your task is to evaluate the correctness of a receipt extraction model.\n", "\n", "The following items are the actual (correct) line items from a specific receipt.\n", "\n", "{{ item.correct_receipt_details.items }}\n", "\n", "The following items are the line items extracted by the model.\n", "\n", "{{ item.predicted_receipt_details.items }}\n", "\"\"\"\n", "\n", "missed_items_instructions = \"\"\"\n", "Score 0 if the sample evaluation missed any items from the receipt; otherwise score 1.\n", "\n", "The line items are permitted to have small differences or extraction mistakes, but each\n", "item from the actual receipt must be present in some form in the model's output. 
Only\n", "evaluate whether there are MISSED items; ignore other mistakes or extra items.\n", "\"\"\"\n", "\n", "extra_items_instructions = \"\"\"\n", "Score 0 if the sample evaluation extracted any extra items from the receipt; otherwise\n", "score 1.\n", "\n", "The line items are permitted to have small differences or extraction mistakes, but each\n", "item from the actual receipt must be present in some form in the model's output. Only\n", "evaluate whether there are EXTRA items; ignore other mistakes or missed items.\n", "\"\"\"\n", "\n", "item_mistakes_instructions = \"\"\"\n", "Score 0 to 10 based on the number and severity of mistakes in the line items.\n", "\n", "A score of 10 means that the two lists are perfectly identical.\n", "\n", "Remove 1 point for each minor mistake (typos, capitalization, category name\n", "differences), and up to 3 points for significant mistakes (incorrect quantity, price, or\n", "total, or categories that are not at all similar).\n", "\"\"\"\n", "\n", "item_extraction_graders = [\n", " {\n", " \"name\": \"Missed Line Items\",\n", " \"type\": \"score_model\",\n", " \"model\": \"o4-mini\",\n", " \"input\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": item_extraction_base + missed_items_instructions,\n", " }\n", " ],\n", " \"range\": [0, 1],\n", " \"pass_threshold\": 1,\n", " },\n", " {\n", " \"name\": \"Extra Line Items\",\n", " \"type\": \"score_model\",\n", " \"model\": \"o4-mini\",\n", " \"input\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": item_extraction_base + extra_items_instructions,\n", " }\n", " ],\n", " \"range\": [0, 1],\n", " \"pass_threshold\": 1,\n", " },\n", " {\n", " \"name\": \"Item Mistakes\",\n", " \"type\": \"score_model\",\n", " \"model\": \"o4-mini\",\n", " \"input\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": item_extraction_base + item_mistakes_instructions,\n", " }\n", " ],\n", " \"range\": [0, 10],\n", " \"pass_threshold\": 8,\n", " },\n", "]\n", "\n", "\n", "simple_audit_graders = [\n", " {\n", " \"name\": \"Not Travel Related Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_audit_decision.not_travel_related }}\",\n", " \"reference\": \"{{ item.correct_audit_decision.not_travel_related }}\",\n", " },\n", " {\n", " \"name\": \"Amount Over Limit Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_audit_decision.amount_over_limit }}\",\n", " \"reference\": \"{{ item.correct_audit_decision.amount_over_limit }}\",\n", " },\n", " {\n", " \"name\": \"Math Error Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_audit_decision.math_error }}\",\n", " \"reference\": \"{{ item.correct_audit_decision.math_error }}\",\n", " },\n", " {\n", " \"name\": \"Handwritten X Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_audit_decision.handwritten_x }}\",\n", " \"reference\": \"{{ item.correct_audit_decision.handwritten_x }}\",\n", " },\n", " {\n", " \"name\": \"Needs Audit Accuracy\",\n", " \"type\": \"string_check\",\n", " \"operation\": \"eq\",\n", " \"input\": \"{{ item.predicted_audit_decision.needs_audit }}\",\n", " \"reference\": \"{{ item.correct_audit_decision.needs_audit }}\",\n", " },\n", "]\n", "\n", "\n", "reasoning_eval_prompt = \"\"\"\n", "Your task is to evaluate the quality of *reasoning* for audit decisions on receipts.\n", "Here are the rules for 
audit decisions:\n", "\n", "Expenses should be audited if they violate any of the following criteria:\n", "1. Expenses must be travel-related\n", "2. Expenses must not exceed $50\n", "3. All math should be correct; the line items plus tax should equal the total\n", "4. There must not be an \"X\" in the handwritten notes\n", "\n", "If ANY of those criteria are violated, the expense should be audited.\n", "\n", "Here is the input to the grader:\n", "{{ item.predicted_receipt_details }}\n", "\n", "Below is the output of an authoritative grader making a decision about whether or not to\n", "audit an expense. This is a correct reference decision.\n", "\n", "GROUND TRUTH:\n", "{{ item.correct_audit_decision }}\n", "\n", "\n", "Here is the output of the model we are evaluating:\n", "\n", "MODEL GENERATED:\n", "{{ item.predicted_audit_decision }}\n", "\n", "\n", "Evaluate:\n", "1. For each of the 4 criteria, did the model correctly score it as TRUE or FALSE?\n", "2. Based on the model's *scoring* of the criteria (regardless if it scored it\n", " correctly), did the model reason appropriately about the criteria (i.e. did it\n", " understand and apply the prompt correctly)?\n", "3. Is the model's reasoning logically sound, sufficient, and comprehensible?\n", "4. Is the model's reasoning concise, without extraneous details?\n", "5. Is the final decision to audit or not audit correct?\n", "\n", "Grade the model with the following rubric:\n", "- (1) point for each of the 4 criteria that the model scored correctly\n", "- (3) points for each aspect of the model's reasoning that is meets the criteria\n", "- (3) points for the model's final decision to audit or not audit\n", "\n", "The total score is the sum of the points, and should be between 0 and 10 inclusive.\n", "\"\"\"\n", "\n", "\n", "model_judgement_graders = [\n", " {\n", " \"name\": \"Audit Reasoning Quality\",\n", " \"type\": \"score_model\",\n", " \"model\": \"o4-mini\",\n", " \"input\": [{\"role\": \"system\", \"content\": reasoning_eval_prompt}],\n", " \"range\": [0, 10],\n", " \"pass_threshold\": 8,\n", " },\n", "]\n", "\n", "full_eval = await create_eval(\n", " \"Full Receipt Processing Evaluation\",\n", " simple_extraction_graders\n", " + item_extraction_graders\n", " + simple_audit_graders\n", " + model_judgement_graders,\n", ")\n", "\n", "eval_run = await client.evals.runs.create(\n", " name=\"complete-receipt-processing-run\",\n", " eval_id=full_eval.id,\n", " data_source={\n", " \"type\": \"jsonl\",\n", " \"source\": {\"type\": \"file_content\", \"content\": file_content},\n", " },\n", ")\n", "\n", "eval_run.report_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Large Summary UI](../../../images/partner_large_summary_ui.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spin Up the Flywheel\n", "\n", "Having our business model means we have a map of what's worth doing and what isn't. Our\n", "initial evals are a road sign that lets us know we're moving in the right direction; but\n", "eventually we'll need more signage. At this point in the process we usually have a lot\n", "of different things we can work on, with a few linked cycles where improvement on one\n", "will open up more room for improvement on a different cycle.\n", "\n", "![Development Flywheel](../../../images/partner_development_flywheel.png)\n", "\n", "1. Our evals show us where we can improve, and we can immediately use them to guide us\n", " in model selection, prompt engineering, tool use, and fine-tuning strategies.\n", "2. 
We're not done once the system performs well according to our evals. That's when it's\n", "   time to *improve our evals*. We will process more data, give it to our domain experts\n", "   to review, and feed the corrections into building better, more comprehensive evals.\n", "\n", "This cycle can go on for a while. We can speed it along by identifying the efficient\n", "frontier of \"interesting\" data to examine. There are a few techniques for this, but an\n", "easy one is re-running models on inputs to prioritize labeling inputs that don't\n", "get consistent answers. This works especially well when using different underlying\n", "models, and often even benefits from using less-intelligent models (if a dumb model\n", "agrees with a smart model then it's probably not a hard problem).\n", "\n", "Once it seems like we've hit a point of diminishing returns on performance, we can keep\n", "using the same techniques to optimize model cost; if we have a system that performs\n", "quite well, then fine-tuning or some form of model distillation will probably allow us\n", "to get similar performance from smaller, cheaper, faster models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## System Improvements\n", "\n", "With our evals in place and an understanding of how they connect to our business metrics,\n", "we're finally ready to turn our attention to improving the output of our system.\n", "\n", "Above, we noted that we get merchant names wrong 85% of the time, more than any other\n", "output we're evaluating. This looks pretty bad, and it's probably something we can\n", "improve dramatically with only a little work, but instead let's start from the endpoint\n", "of our business metrics and work backwards to see what issues caused incorrect\n", "decisions.\n", "\n", "When we do that, we see that the mistakes we made on merchant names are completely\n", "uncorrelated with our final audit decision, and there's no evidence that they have any\n", "impact on that decision. Based on our business model, we don't actually see a need to\n", "improve it -- in other words, *not all evals matter*. Instead, we can examine\n", "specifically the examples where we made a bad audit decision. There are only two of them\n", "(out of 20). Examining them closely, we observe that in both cases the problem came from\n", "the second stage of the pipeline making a wrong decision based on a non-problematic\n", "extraction. And in fact, both of them come from a failure to reason correctly about\n", "travel-related expenses.\n", "\n", "In the first case, the purchase is a snowbroom from an auto-parts store. This is a\n", "little bit of an edge case, but our domain experts identified this as a valid travel\n", "expense (because drivers might need one to clear their windshield). It seems likely that\n", "explaining the decision process in more detail and providing an analogous example would\n", "correct the error.\n", "\n", "In the second case, the purchase is some tools from a home improvement store. The tools\n", "don't have anything to do with normal driving, so this receipt should be audited as a\n", "\"non-travel-related expense\". In this case our model *correctly* identifies it as an\n", "expense that's not travel-related, but then reasons incorrectly about that fact,\n", "apparently misunderstanding that `true` for `not_travel_related` should imply `true` for\n", "`needs_audit`. 
Again, this seems like an example where more clarity in our instructions\n", "and a few examples should fix the issue.\n", "\n", "Connecting this back to our cost model, we note that we have 1 false negative and 1\n", "false positive, along with 7 true positives and 11 true negatives. Extrapolating this to\n", "the frequencies we see in production, this would increase our overall costs by $63,000\n", "per year.\n", "\n", "Let's modify the prompt and re-run our evals to see how we do. We'll provide more\n", "guidance in the form of a specific example in the instructions about engine oil\n", "(different from a snow broom, but requires the same reasoning), and we'll include three\n", "examples pulled from our training set (`data/train`) as few-shot guidance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_ai_system_cost = calculate_costs(\n", " fp_rate=1 / 12, fn_rate=1 / 8, per_receipt_cost=0.01\n", ")\n", "\n", "print(f\"First version of our system, estimated cost: ${first_ai_system_cost:,.0f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nursery_receipt_details = ReceiptDetails(\n", " merchant=\"WESTERN SIERRA NURSERY\",\n", " location=Location(city=\"Oakhurst\", state=\"CA\", zipcode=\"93644\"),\n", " time=\"2024-09-27T12:33:38\",\n", " items=[\n", " LineItem(\n", " description=\"Plantskydd Repellent RTU 1 Liter\",\n", " product_code=None,\n", " category=\"Garden/Pest Control\",\n", " item_price=\"24.99\",\n", " sale_price=None,\n", " quantity=\"1\",\n", " total=\"24.99\",\n", " )\n", " ],\n", " subtotal=\"24.99\",\n", " tax=\"1.94\",\n", " total=\"26.93\",\n", " handwritten_notes=[],\n", ")\n", "\n", "nursery_audit_decision = AuditDecision(\n", " not_travel_related=True,\n", " amount_over_limit=False,\n", " math_error=False,\n", " handwritten_x=False,\n", " reasoning=\"\"\"\n", " 1. The merchant is a plant nursery and the item purchased an insecticide, so this\n", " purchase is not travel-related (criterion 1 violated).\n", " 2. The total is $26.93, under $50, so criterion 2 is not violated.\n", " 3. The line items (1 * $24.99 + $1.94 tax) sum to $26.93, so criterion 3 is not\n", " violated.\n", " 4. There are no handwritten notes or 'X's, so criterion 4 is not violated.\n", " Since NOT_TRAVEL_RELATED is true, the receipt must be audited.\n", " \"\"\",\n", " needs_audit=True,\n", ")\n", "\n", "flying_j_details = ReceiptDetails(\n", " merchant=\"Flying J #616\",\n", " location=Location(city=\"Frazier Park\", state=\"CA\", zipcode=None),\n", " time=\"2024-10-01T13:23:00\",\n", " items=[\n", " LineItem(\n", " description=\"Unleaded\",\n", " product_code=None,\n", " category=\"Fuel\",\n", " item_price=\"4.459\",\n", " sale_price=None,\n", " quantity=\"11.076\",\n", " total=\"49.39\",\n", " )\n", " ],\n", " subtotal=\"49.39\",\n", " tax=None,\n", " total=\"49.39\",\n", " handwritten_notes=[\"yos -> home sequoia\", \"236660\"],\n", ")\n", "flying_j_audit_decision = AuditDecision(\n", " not_travel_related=False,\n", " amount_over_limit=False,\n", " math_error=False,\n", " handwritten_x=False,\n", " reasoning=\"\"\"\n", " 1. The only item purchased is Unleaded gasoline, which is travel-related so\n", " NOT_TRAVEL_RELATED is false.\n", " 2. The total is $49.39, which is under $50, so AMOUNT_OVER_LIMIT is false.\n", " 3. The line items ($4.459 * 11.076 = $49.387884) sum to the total of $49.39, so\n", " MATH_ERROR is false.\n", " 4. 
There is no \"X\" in the handwritten notes, so HANDWRITTEN_X is false.\n", " Since none of the criteria are violated, the receipt does not need auditing.\n", " \"\"\",\n", " needs_audit=False,\n", ")\n", "\n", "engine_oil_details = ReceiptDetails(\n", " merchant=\"O'Reilly Auto Parts\",\n", " location=Location(city=\"Sylmar\", state=\"CA\", zipcode=\"91342\"),\n", " time=\"2024-04-26T8:43:11\",\n", " items=[\n", " LineItem(\n", " description=\"VAL 5W-20\",\n", " product_code=None,\n", " category=\"Auto\",\n", " item_price=\"12.28\",\n", " sale_price=None,\n", " quantity=\"1\",\n", " total=\"12.28\",\n", " )\n", " ],\n", " subtotal=\"12.28\",\n", " tax=\"1.07\",\n", " total=\"13.35\",\n", " handwritten_notes=[\"vista -> yos\"],\n", ")\n", "engine_oil_audit_decision = AuditDecision(\n", " not_travel_related=False,\n", " amount_over_limit=False,\n", " math_error=False,\n", " handwritten_x=False,\n", " reasoning=\"\"\"\n", " 1. The only item purchased is engine oil, which might be required for a vehicle\n", " while traveling, so NOT_TRAVEL_RELATED is false.\n", " 2. The total is $13.35, which is under $50, so AMOUNT_OVER_LIMIT is false.\n", " 3. The line items ($12.28 + $1.07 tax) sum to the total of $13.35, so\n", " MATH_ERROR is false.\n", " 4. There is no \"X\" in the handwritten notes, so HANDWRITTEN_X is false.\n", " None of the criteria are violated so the receipt does not need to be audited.\n", " \"\"\",\n", " needs_audit=False,\n", ")\n", "\n", "examples = [\n", " {\"input\": nursery_receipt_details, \"output\": nursery_audit_decision},\n", " {\"input\": flying_j_details, \"output\": flying_j_audit_decision},\n", " {\"input\": engine_oil_details, \"output\": engine_oil_audit_decision},\n", "]\n", "\n", "# Format the examples as JSON, with each example wrapped in XML tags.\n", "example_format = \"\"\"\n", "\n", " \n", " {input}\n", " \n", " \n", " {output}\n", " \n", "\n", "\"\"\"\n", "\n", "examples_string = \"\"\n", "for example in examples:\n", " example_input = example[\"input\"].model_dump_json()\n", " correct_output = example[\"output\"].model_dump_json()\n", " examples_string += example_format.format(input=example_input, output=correct_output)\n", "\n", "audit_prompt = f\"\"\"\n", "Evaluate this receipt data to determine if it need to be audited based on the following\n", "criteria:\n", "\n", "1. NOT_TRAVEL_RELATED:\n", " - IMPORTANT: For this criterion, travel-related expenses include but are not limited\n", " to: gas, hotel, airfare, or car rental.\n", " - If the receipt IS for a travel-related expense, set this to FALSE.\n", " - If the receipt is NOT for a travel-related expense (like office supplies), set this\n", " to TRUE.\n", " - In other words, if the receipt shows FUEL/GAS, this would be FALSE because gas IS\n", " travel-related.\n", " - Travel-related expenses include anything that could be reasonably required for\n", " business-related travel activities. For instance, an employee using a personal\n", " vehicle might need to change their oil; if the receipt is for an oil change or the\n", " purchase of oil from an auto parts store, this would be acceptable and counts as a\n", " travel-related expense.\n", "\n", "2. AMOUNT_OVER_LIMIT: The total amount exceeds $50\n", "\n", "3. 
MATH_ERROR: The math for computing the total doesn't add up (line items don't sum to\n", " total)\n", " - Add up the price and quantity of each line item to get the subtotal\n", " - Add tax to the subtotal to get the total\n", " - If the total doesn't match the amount on the receipt, this is a math error\n", " - If the total is off by no more than $0.01, this is NOT a math error\n", "\n", "4. HANDWRITTEN_X: There is an \"X\" in the handwritten notes\n", "\n", "For each criterion, determine if it is violated (true) or not (false). Provide your\n", "reasoning for each decision, and make a final determination on whether the receipt needs\n", "auditing. A receipt needs auditing if ANY of the criteria are violated.\n", "\n", "Note that violation of a criterion means that it is `true`. If any of the above four\n", "values are `true`, then the receipt needs auditing (`needs_audit` should be `true`: it\n", "functions as a boolean OR over all four criteria).\n", "\n", "If the receipt contains non-travel expenses, then NOT_TRAVEL_RELATED should be `true`\n", "and therefore NEEDS_AUDIT must also be set to `true`. IF THE RECEIPT LISTS ITEMS THAT\n", "ARE NOT TRAVEL-RELATED, THEN IT MUST BE AUDITED. Here are some example inputs to\n", "demonstrate how you should act:\n", "\n", "<examples>\n", "{examples_string}\n", "</examples>\n", "\n", "Return a structured response with your evaluation.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The modifications we made to the prompt above are:\n", "\n", "1. Under item 1 concerning travel-related expenses, we added a bullet point:\n", "\n", "```\n", "- Travel-related expenses include anything that could be reasonably required for\n", " business-related travel activities. For instance, an employee using a personal\n", " vehicle might need to change their oil; if the receipt is for an oil change or the\n", " purchase of oil from an auto parts store, this would be acceptable and counts as a\n", " travel-related expense.\n", "```\n", "\n", "2. We added more prescriptive guidance on how to evaluate for a math error.\n", " Specifically, we added the bullet points:\n", "\n", "```\n", " - Add up the price and quantity of each line item to get the subtotal\n", " - Add tax to the subtotal to get the total\n", " - If the total doesn't match the amount on the receipt, this is a math error\n", " - If the total is off by no more than $0.01, this is NOT a math error\n", "```\n", "\n", " This doesn't actually have to do with the issues we mentioned, but is another issue\n", " we noticed as a flaw in the reasoning provided by the audit model.\n", "\n", "3. We added very strong guidance (we actually needed to state it and restate it\n", " emphatically) to say that non-travel-related expenses should be audited.\n", "\n", "```\n", "Note that violation of a criterion means that it is `true`. If any of the above four\n", "values are `true`, then the receipt needs auditing (`needs_audit` should be `true`: it\n", "functions as a boolean OR over all four criteria).\n", "\n", "If the receipt contains non-travel expenses, then NOT_TRAVEL_RELATED should be `true`\n", "and therefore NEEDS_AUDIT must also be set to `true`. IF THE RECEIPT LISTS ITEMS THAT\n", "ARE NOT TRAVEL-RELATED, THEN IT MUST BE AUDITED.\n", "```\n", "\n", "4. We added three examples, JSON input/output pairs wrapped in XML tags.\n",
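"\n", "Before regenerating the eval data, it can be worth a quick sanity check that the few-shot\n", "block is well formed. A minimal sketch, assuming the `examples`, `ReceiptDetails`,\n", "`AuditDecision`, and `audit_prompt` objects defined above:\n", "\n", "```python\n", "# Optional sanity check (not part of the pipeline): confirm each few-shot example\n", "# round-trips through the schema models, then eyeball the assembled prompt.\n", "for example in examples:\n", "    ReceiptDetails.model_validate_json(example[\"input\"].model_dump_json())\n", "    AuditDecision.model_validate_json(example[\"output\"].model_dump_json())\n", "\n", "print(audit_prompt[:1500])\n", "```\n",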
"\n", "With our prompt revisions, we'll regenerate the data to evaluate and re-run the same\n", "eval to compare our results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_content = await create_dataset_content(receipt_image_dir)\n", "\n", "eval_run = await client.evals.runs.create(\n", " name=\"updated-receipt-processing-run\",\n", " eval_id=full_eval.id,\n", " data_source={\n", " \"type\": \"jsonl\",\n", " \"source\": {\"type\": \"file_content\", \"content\": file_content},\n", " },\n", ")\n", "\n", "eval_run.report_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we ran the eval again, we actually still got two audit decisions wrong. Digging into\n", "the examples we made a mistake on, it turns out that we completely fixed the issues we\n", "identified, but our examples improved the reasoning step and caused two other issues to\n", "surface. Specifically:\n", "\n", "1. One receipt needed to be audited only because there was a mistake in extraction and\n", " a handwritten \"X\" wasn't identified. The audit model reasoned correctly, but based on\n", " incorrect data.\n", "2. One receipt was extracted in such a way that a $0.35 debit fee wasn't visible, so the\n", " audit model identified a math error. This almost certainly happened because we\n", " provided it with more detailed instructions and clear examples that demonstrated it\n", " needed to actually add up all the line items in order to decide whether there was a\n", " math error. Again, this demonstrates correct behavior on the part of the audit model\n", " and suggests we need to correct the extraction model.\n", "\n", "This is great, and we'll continue iterating on issues as we uncover them. This is the\n", "cycle of improvement!\n", "\n", "### Model Choice\n", "\n", "When beginning a project, we usually start with one of the most capable models available, such as `o4-mini`, to establish a performance baseline. Once we’re confident in the model’s ability to solve the task, the next step is to explore smaller, faster, or more cost-effective alternatives.\n", "\n", "Optimizing for inference cost and latency is essential, especially for production or customer-facing systems, where these factors can significantly impact overall expenses and user experience. For instance, switching from `o4-mini` to `gpt-4.1-mini` could reduce inference costs by nearly two-thirds—an example where thoughtful model selection leads to meaningful savings.\n", "\n", "In the next section, we’ll rerun our evaluations using `gpt-4.1-mini` for both extraction and audit steps to see how well a more efficient model performs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file_content = await create_dataset_content(receipt_image_dir, model=\"gpt-4.1-mini\")\n", "\n", "eval_run = await client.evals.runs.create(\n", " name=\"receipt-processing-run-gpt-4-1-mini\",\n", " eval_id=full_eval.id,\n", " data_source={\n", " \"type\": \"jsonl\",\n", " \"source\": {\"type\": \"file_content\", \"content\": file_content},\n", " },\n", ")\n", "\n", "eval_run.report_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are pretty promising. It doesn't look like the extraction accuracy suffered\n", "at all. 
We see one regression (the snow broom again), but our audit decision is correct\n", "twice as often as it was before our prompt changes.\n", "\n", "![Eval Variations](../../../images/partner_eval_variations.png)\n", "\n", "This is great evidence that we'll be able to switch to a cheaper model, but it might\n", "require more prompt engineering, fine-tuning, or some form of model distillation. Note\n", "however that according to our current cost model this would already be saving us money. We\n", "don't quite believe that yet because we don't have a large enough sample — our real\n", "false negative rate will be more than the 0 we see here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "system_cost_4_1_mini = calculate_costs(\n", " fp_rate=1 / 12, fn_rate=0, per_receipt_cost=0.003\n", ")\n", "\n", "print(f\"Cost using gpt-4.1-mini: ${system_cost_4_1_mini:,.0f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Further improvements\n", "\n", "This cookbook focuses on the philosophy and practicalities of evals, not the full range of model improvement techniques. For boosting or maintaining model performance (especially when moving to smaller, faster, or cheaper models), consider these steps in order—start from the top, and only proceed down if needed. For example, always optimize your prompt before resorting to fine-tuning; fine-tuning on a weak prompt can lock in bad performance even if you improve the prompt later.\n", "\n", "![Model Improvement Waterfall](../../../images/partner_model_improvement_waterfall.png)\n", "\n", "1. **Model selection:** try smarter models, or increase their reasoning budget.\n", "2. **Prompt tuning:** clarify instructions and provide very explicit rules.\n", "3. **Examples and context:** add few- or many-shot examples, or more context for the\n", " problem. RAG fits in here, and may be used to dynamically select similar examples.\n", "4. **Tool use:** provide tools to solve specific problems, including access to external\n", " APIs, the ability to query databases, or otherwise enable the model to have its own\n", " questions answered.\n", "5. **Accessory models:** add models to perform limited sub-tasks, to supervise and provide\n", " guardrails, or use a mixture of experts and aggregate solutions from multiple\n", " sub-models.\n", "6. **Fine-tuning:** use labeled training data for supervised fine-tuning, eval\n", " graders for reinforcement fine-tuning, or different outputs for direct preference\n", " optimization.\n", "\n", "The above options are all tools to maximize performance. Once you're trying to optimize\n", "for a price:performance ratio, you'll usually have already done all of the above and\n", "likely don't need to repeat most steps, but you can still fine-tune smaller models or\n", "use your best model to train a smaller model (model distillation).\n", "\n", "> One really excellent thing about OpenAI Evals is that you can use the same graders for\n", "> [Reinforcement Fine-Tuning](https://cookbook.openai.com/examples/reinforcement_fine_tuning)\n", "> to produce better model performance in an extremely sample-efficient manner. One note\n", "> of caution is to make sure that you use separate training data and don't leak your\n", "> eval datasets during RFT." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploying and Post-Development\n", "\n", "Building and deploying an LLM application is just the beginning—the real value comes from ongoing improvement. 
Once your system is live, prioritize continuous monitoring: log traces, track outputs, and proactively sample real user interactions for human review using smart sampling techniques.\n", "\n", "Production data is your most authentic source for evolving your evaluation and training datasets. Regularly collect and curate fresh samples from actual use cases to identify gaps, edge cases, and new opportunities for enhancement.\n", "\n", "In practice, leverage this data for rapid iteration. Automate periodic fine-tuning pipelines that retrain your models on recent, high-quality samples and automatically deploy new versions when they outperform existing ones in your evals. Capture user corrections and feedback, then systematically feed these insights back into your prompts or retraining process—especially when they highlight persistent issues.\n", "\n", "By embedding these feedback loops into your post-development workflow, you ensure your LLM applications continuously adapt, stay robust, and remain closely aligned with user needs as they evolve." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Contributors\n", "This cookbook serves as a joint collaboration effort between OpenAI and [Fractional](https://www.fractional.ai/).\n", "\n", "- Hugh Wimberly\n", "- Joshua Marker\n", "- Eddie Siegel\n", "- Shikhar Kwatra" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 4 }