{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating Web Search Quality with a Custom Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates how to evaluate a model's ability to retrieve correct answers from the web using the OpenAI **Evals** framework with a custom in-memory dataset.\n",
"\n",
"**Goals:**\n",
"- Show how to set up and run an evaluation for web search quality.\n",
"- Provide a template for evaluating information retrieval capabilities of LLMs.\n",
"\n",
"\n",
"\n",
"## Environment Setup\n",
"\n",
"We begin by upgrading the `openai` package, then import the required libraries and configure the OpenAI client. \n",
"This gives us access to the OpenAI API and the utilities needed for evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Update OpenAI client\n",
"%pip install --upgrade openai --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import time\n",
"import pandas as pd\n",
"from IPython.display import display\n",
"\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
" api_key=os.getenv(\"OPENAI_API_KEY\") or os.getenv(\"_OPENAI_API_KEY\"),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the Custom Evaluation Dataset\n",
"\n",
"We define a small, in-memory dataset of question-answer pairs for web search evaluation. \n",
"Each item contains a `query` (the user's search prompt) and an `answer` (the expected ground truth).\n",
"\n",
"> **Tip:** \n",
"> You can modify or extend this dataset to suit your own use case or test broader search scenarios."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_dataset(limit=None):\n",
" dataset = [\n",
" {\n",
" \"query\": \"coolest person in the world, the 100m dash at the 2008 olympics was the best sports event of all time\",\n",
" \"answer\": \"usain bolt\",\n",
" },\n",
" {\n",
" \"query\": \"best library in the world, there is nothing better than a dataframe\",\n",
" \"answer\": \"pandas\",\n",
" },\n",
" {\n",
" \"query\": \"most fun place to visit, I am obsessed with the Philbrook Museum of Art\",\n",
" \"answer\": \"tulsa, oklahoma\",\n",
" },\n",
" {\n",
" \"query\": \"who created the python programming language, beloved by data scientists everywhere\",\n",
" \"answer\": \"guido van rossum\",\n",
" },\n",
" {\n",
" \"query\": \"greatest chess player in history, famous for the 1972 world championship\",\n",
" \"answer\": \"bobby fischer\",\n",
" },\n",
" {\n",
" \"query\": \"the city of lights, home to the eiffel tower and louvre museum\",\n",
" \"answer\": \"paris\",\n",
" },\n",
" {\n",
" \"query\": \"most popular search engine, whose name is now a verb\",\n",
" \"answer\": \"google\",\n",
" },\n",
" {\n",
" \"query\": \"the first man to walk on the moon, giant leap for mankind\",\n",
" \"answer\": \"neil armstrong\",\n",
" },\n",
" {\n",
" \"query\": \"groundbreaking electric car company founded by elon musk\",\n",
" \"answer\": \"tesla\",\n",
" },\n",
" {\n",
" \"query\": \"founder of microsoft, philanthropist and software pioneer\",\n",
" \"answer\": \"bill gates\",\n",
" },\n",
" ]\n",
" return dataset[:limit] if limit is not None else dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Grading Logic\n",
"\n",
"To evaluate the model’s answers, we use an LLM-based pass/fail grader. It checks whether the model’s answer (from web search) matches the expected answer (ground truth) or contains the correct information.\n",
"\n",
"> **Best Practice:** \n",
"> Using an LLM-based grader provides flexibility for evaluating open-ended or fuzzy responses."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"pass_fail_grader = \"\"\"\n",
"You are a helpful assistant that grades the quality of a web search.\n",
"You will be given a query, the web search result, and the expected answer.\n",
"Grade whether the web search result answers the query correctly.\n",
"\n",
"Respond with \"pass\" if the web search result contains the expected answer, and \"fail\" otherwise.\n",
"\n",
"\"\"\"\n",
"\n",
"pass_fail_grader_user_prompt = \"\"\"\n",
|   | GPT-4.1 Output | GPT-4.1-mini Output |
|---|---|---|
| 0 | If you're captivated by the Philbrook Museum o... | Bobby Fischer is widely regarded as one of the... |
| 1 | \n## [Paris, France](https://www.google.com/ma... | The 2008 Olympic 100m dash is widely regarded ... |
| 2 | Bill Gates, born on October 28, 1955, in Seatt... | If you're looking for fun places to visit in T... |
| 3 | Usain Bolt's performance in the 100-meter fina... | On July 20, 1969, astronaut Neil Armstrong bec... |
| 4 | It seems you're interested in both the world's... | Bill Gates is a renowned software pioneer, phi... |
| 5 | Neil Armstrong was the first person to walk on... | Your statement, "there is nothing better than ... |
| 6 | Tesla, Inc. is an American electric vehicle an... | The search engine whose name has become synony... |
| 7 | Bobby Fischer, widely regarded as one of the g... | \n## [Paris, France](https://www.google.com/ma... |
| 8 | Guido van Rossum, a Dutch programmer born on J... | Guido van Rossum, a Dutch programmer born on J... |
| 9 | The most popular search engine whose name has ... | Elon Musk is the CEO and largest shareholder o... |