{ "cells": [ { "cell_type": "markdown", "id": "a5e91602", "metadata": {}, "source": [ "# Practical Guide for Model Selection for Real‑World Use Cases\n", "\n", "## Purpose & Audience\n", "\n", "This cookbook serves as your practical guide to selecting, prompting, and deploying the right OpenAI model (between GPT 4.1, o3, and o4-mini) for specific workloads. Instead of exhaustive documentation, we provide actionable decision frameworks and real-world examples that help Solutions Engineers, Technical Account Managers, Partner Architects, and semi-technical practitioners quickly build working solutions. The content focuses on current model capabilities, vertical-specific implementations, and today's industry needs, with clear pathways from model selection to production deployment. Each section offers concise, adaptable code examples that you can immediately apply to your use cases while pointing to existing resources for deeper dives into specific topics.\n", "\n", "> Note: The below prescriptive guidance and experimentation has been conducted with latest SOTA models available today. These metrics are bound to change in the future with different scenarios and timeline into consideration.\n", "\n", "## How to Use This Cookbook\n", "\n", "This cookbook is organized into distinct sections to help you quickly find the information you need. Each section covers a specific aspect of model selection, implementation, and deployment.\n", "\n", "1. **[Purpose & Audience](#purpose-audience)**: An overview of who this cookbook is for and what it covers.\n", "2. **[Model Guide](#model-guide)**: A quick reference to help you select the right model for your needs, including model comparisons and evolution diagrams based on mapping different use-case scenarios.\n", "3. **Use Cases**:\n", " - **[3A. Long-Context RAG for Legal Q&A](#3a-use-case-long-context-rag-for-legal-qa)**: Building an agentic system to answer questions from complex legal documents.\n", " - **[3B. AI Co-Scientist for Pharma R&D](#3b-use-case-ai-co-scientist-for-pharma-rd)**: Accelerating experimental design in pharmaceutical research with multi-agent systems.\n", " - **[3C. Insurance Claim Processing](#3c-use-case-insurance-claim-processing)**: Digitizing and validating handwritten insurance forms with vision and reasoning.\n", "4. **[Prototype to Production](#prototype-to-production)**: A checklist to help you transition from prototype to production.\n", "5. **[Adaptation Decision Tree](#adaptation-decision-tree)**: A flowchart to guide your model selection based on specific requirements.\n", "6. **[Appendices](#appendices)**: Reference materials including pricing, latency, prompt patterns, and links to external resources.\n", "\n", "For quick decisions, focus on the Model Guide and Adaptation Decision Tree sections. 
For implementation details, explore the specific use cases relevant to your needs.\n", "\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## Model Guide\n", "\n", "## 2.1 Model‑Intro Matrix\n", "\n", "| Model | Core strength | Ideal first reach‑for | Watch‑outs | Escalate / Downgrade path |\n", "| :---- | :---- | :---- | :---- | :---- |\n", "| GPT‑4o | Real‑time voice / vision chat | Live multimodal agents | Slightly below 4.1 on text SOTA (state-of-the-art) | Need deep reasoning → o4‑mini |\n", "| GPT‑4.1 | 1 M‑token text accuracy king | Long‑doc analytics, code review | Cannot natively reason; higher cost than minis | Tight budget → 4.1‑mini / nano |\n", "| o3 | Deep tool‑using agent | High‑stakes, multi‑step reasoning | Latency & price | Cost/latency → o4‑mini |\n", "| o4‑mini | Cheap, fast reasoning | High‑volume \"good‑enough\" logic | Depth ceiling vs o3 | Accuracy critical → o3 |\n", "\n", "*(Full price and utility table → [Section 6.1](#appendices))*\n", "\n", "## 2.2 Model Evolution at a Glance\n", "\n", "OpenAI's model lineup has evolved to address specialized needs across different dimensions. These diagrams showcase the current model families and their relationships.\n", "\n", "### Fundamental Differences: \"o-series\" vs \"GPT\" Models\n", "\n", "OpenAI offers two distinct model families, each with unique strengths:\n", "\n", "- **GPT Models (4o, 4.1)**: Optimized for general-purpose tasks with excellent instruction following. GPT-4.1 excels with long contexts (1M tokens) while GPT-4o has variants for realtime speech, text-to-speech, and speech-to-text. GPT-4.1 also comes in mini and nano variants, while GPT-4o has a mini variant. These variants are cheaper and faster than their full-size counterparts.\n", "\n", "- **o-series Models (o3, o4-mini)**: Specialized for deep reasoning and step-by-step problem solving. These models excel at complex, multi-stage tasks requiring logical thinking and tool use. Choose these when accuracy and reasoning depth are paramount. These models also have an optional `reasoning_effort` parameter (that can be set to `low`, `medium`, or `high`), which allows users to control the number of tokens used for reasoning.\n", "\n", "### OpenAI Model Evolution \n", "\n", "![OpenAI Model Evolution](../../../images/2.2_model_evolution.png)\n", "\n", "### Key Characteristics\n", "\n", "- **GPT-4.1 Family**: Optimized for long context processing with 1M token context window.\n", "- **o3**: Specialized for deep multi-step reasoning. \n", "- **o4-mini**: Combines reasoning capabilities with vision at lower cost.\n", "\n", "Each model excels in different scenarios, with complementary strengths that can be combined for complex workflows.\n", "\n", "In this cookbook we only experimented with the GPT-4.1 series models, o3, and o4-mini. We didn't experiment with the GPT-4o series models.\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## 3A. Use Case: Long-Context RAG for Legal Q&A\n", "\n", "![Long-Context RAG for Legal Q&A](../../../images/3A_rag_task_card.png)\n", "## 🗂️ TL;DR Matrix\n", "\n", "This table summarizes the core technology choices and their rationale for **this specific Long-Context Agentic RAG implementation**.\n", "\n", "| Layer | Choice | Utility |\n", "| :---- | :---- | :---- |\n", "| **Chunking** | Sentence-aware Splitter | Splits the document into 20 roughly equal chunks, respecting sentence boundaries. 
|\n", "| **Routing** | `gpt-4.1-mini` | Uses natural language understanding to identify relevant chunks without embedding index. |\n", "| **Path Selection** | `select(ids=[...])` and `scratchpad(text=\"...\")` | Records reasoning while drilling down through document hierarchy. |\n", "| **Citation** | Paragraph-level | Balances precision with cost; provides meaningful context for answers. |\n", "| **Synthesis** | `gpt-4.1` (Structured Output) | Generates answers directly from selected paragraphs with citations. |\n", "| **Verification** | `o4-mini` (LLM-as-Judge) | Validates factual accuracy and citation correctness. |\n", "\n", "*Note: Prices and model identifiers accurate as of April 2025, subject to change.*\n", "\n", "This section outlines the construction of a Retrieval-Augmented Generation (RAG) system designed to accurately answer questions about complex and lengthy procedural texts, using the *Trademark Trial and Appeal Board Manual of Procedure (TBMP)* as a representative case. The TBMP is an essential legal resource detailing the procedures governing trademark litigation before the USPTO's Trademark Trial and Appeal Board, and is frequently consulted by intellectual property attorneys and legal professionals. By leveraging the latest OpenAI models, the system enhances understanding and interpretability of dense legal content, enabling precise, contextually aware responses through advanced language understanding and dynamic retrieval capabilities.\n", "\n", "These approaches can also be applied to other use cases that require precise information retrieval from complex documentation, such as healthcare compliance manuals, financial regulatory frameworks, or technical documentation systems where accuracy, citation, and auditability are mission-critical requirements.\n", "\n", "## 1\\. Scenario Snapshot\n", "\n", "* **Corpus:** The primary document is the [Trademark Trial and Appeal Board Manual of Procedure (TBMP, 2024 version)](https://www.uspto.gov/sites/default/files/documents/tbmp-Master-June2024.pdf). This manual contains detailed procedural rules and guidelines, coming to 1194 pages total. \n", "* **Users:** The target users are intellectual property (IP) litigation associates and paralegals who need quick, accurate answers to procedural questions based *only* on the TBMP. \n", "* **Typical Asks:** Users pose questions requiring synthesis and citation, such as: \n", " 1. \"What are the requirements for filing a motion to compel discovery according to the TBMP?\" \n", " 2. \"What deadlines apply to discovery conferences as specified in the manual?\" \n", " 3. \"Explain how the Board handles claims of attorney-client privilege during depositions according to the TBMP.\" \n", " 4. \"Enumerate the Fed. R. Civ. P. 11 sanctions the Board can invoke according to the TBMP.\" \n", "\n", "*Note: Depending on your specific deployment environment, you may need to adapt some implementation steps to match your infrastructure requirements.*\n", "\n", "> While OpenAI's File Search tool offers a good starting point for many use cases, this section introduces a different approach that takes advantage of million-token context windows to process large documents without any preprocessing or vector database. The agentic approach described here enables zero-latency ingestion, dynamic granularity of retrieval, and fine-grained citation traceability.\n", "\n", "## 2\\. Agentic RAG Flow\n", "\n", "Before diving into the implementation, let's understand the overall approach:\n", "\n", "1. 
**Load the entire document** into the context window\n", "2. **Split into 20 chunks** that respect sentence boundaries\n", "3. **Ask the model** which chunks might contain relevant information\n", "4. **Drill down** into selected chunks by splitting them further\n", "5. **Repeat** until we reach paragraph-level content\n", "6. **Generate an answer** based on the selected paragraphs\n", "7. **Verify the answer** for factual accuracy\n", "\n", "This hierarchical navigation approach mimics how a human might skim a document, focus on relevant chapters, then specific sections, and finally read only the most relevant paragraphs." ] }, { "cell_type": "markdown", "id": "db9bad1b", "metadata": {}, "source": [ "![Hierarchical Router](../../../images/3A_rag_hierarchical_router.png)\n", "\n", "\n", "## Agentic RAG System: Model Usage\n", "\n", "| Process Stage | Model Used | Purpose |\n", "|---------------|------------|---------|\n", "| Initial Routing | `gpt-4.1-mini` | Identifies which document chunks might contain relevant information |\n", "| Hierarchical Navigation | `gpt-4.1-mini` | Continues drilling down to find most relevant paragraphs |\n", "| Answer Generation | `gpt-4.1` | Creates structured response with citations from selected paragraphs |\n", "| Answer Verification | `o4-mini` | Validates factual accuracy and proper citation usage |\n", "\n", "This zero-preprocessing approach leverages large context windows to navigate documents on-the-fly, mimicking how a human would skim a document to find relevant information. " ] }, { "cell_type": "markdown", "id": "df87f0ac", "metadata": {}, "source": [ "## 3\\. Implementation\n", "\n", "Let's implement this approach step by step.\n", "\n", "Start by installing the required packages." ] }, { "cell_type": "code", "execution_count": 1, "id": "63c78cd6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install tiktoken pypdf nltk openai pydantic --quiet" ] }, { "cell_type": "markdown", "id": "cd1d7d60", "metadata": {}, "source": [ "### 3.1 Document Loading\n", "\n", "First, let's load the document and check its size. For this guide, we'll focus on sections 100-900, which cover the core procedural aspects through Review of Decision of Board. Sections 1000 and beyond (Interferences, Concurrent Use Proceedings, Ex Parte Appeals) are specialized procedures outside our current scope." ] }, { "cell_type": "code", "execution_count": 3, "id": "dd5fb149", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt_tab to\n", "[nltk_data] /Users/kmurali/nltk_data...\n", "[nltk_data] Package punkt_tab is already up-to-date!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading document from https://www.uspto.gov/sites/default/files/documents/tbmp-Master-June2024.pdf...\n", "Document loaded: 1194 pages, 595197 words, 932964 tokens\n", "\n", "Document preview (first 500 chars):\n", "--------------------------------------------------\n", "TRADEMARK TRIAL AND\n", "APPEAL BOARD MANUAL\n", "OF PROCEDURE (TBMP)\n", " June 2024\n", "June 2024\n", "United States Patent and Trademark Office\n", "PREFACE TO THE JUNE 2024 REVISION\n", "The June 2024 revision of the Trademark Trial and Appeal Board Manual of Procedure is an update of the\n", "June 2023 edition. 
This update is moderate in nature and incorporates relevant case law issued between March\n", "3, 2023 and March 1, 2024.\n", "The title of the manual is abbreviated as “TBMP.” A citation to a section of the manual may be written\n", "--------------------------------------------------\n" ] } ], "source": [ "import requests\n", "from io import BytesIO\n", "from pypdf import PdfReader\n", "import re\n", "import tiktoken\n", "from nltk.tokenize import sent_tokenize\n", "import nltk\n", "from typing import List, Dict, Any\n", "\n", "# Download nltk data if not already present\n", "nltk.download('punkt_tab')\n", "\n", "def load_document(url: str) -> str:\n", " \"\"\"Load a document from a URL and return its text content.\"\"\"\n", " print(f\"Downloading document from {url}...\")\n", " response = requests.get(url)\n", " response.raise_for_status()\n", " pdf_bytes = BytesIO(response.content)\n", " pdf_reader = PdfReader(pdf_bytes)\n", " \n", " full_text = \"\"\n", " \n", "\n", " max_page = 920 # Page cutoff before section 1000 (Interferences)\n", " for i, page in enumerate(pdf_reader.pages):\n", " if i >= max_page:\n", " break\n", " full_text += page.extract_text() + \"\\n\"\n", " \n", " # Count words and tokens\n", " word_count = len(re.findall(r'\\b\\w+\\b', full_text))\n", " \n", " tokenizer = tiktoken.get_encoding(\"o200k_base\")\n", " token_count = len(tokenizer.encode(full_text))\n", " \n", " print(f\"Document loaded: {len(pdf_reader.pages)} pages, {word_count} words, {token_count} tokens\")\n", " return full_text\n", "\n", "# Load the document\n", "tbmp_url = \"https://www.uspto.gov/sites/default/files/documents/tbmp-Master-June2024.pdf\"\n", "document_text = load_document(tbmp_url)\n", "\n", "# Show the first 500 characters\n", "print(\"\\nDocument preview (first 500 chars):\")\n", "print(\"-\" * 50)\n", "print(document_text[:500])\n", "print(\"-\" * 50)" ] }, { "cell_type": "markdown", "id": "4bf86c84", "metadata": {}, "source": [ "We can see that the document is over 900k tokens long! While we could fit that into GPT-4.1's context window, we also want to have verifiable citations, so we're going to proceed with a recursive chunking strategy." ] }, { "cell_type": "markdown", "id": "445cbcaa", "metadata": {}, "source": [ "### 3.2 Improved 20-Chunk Splitter with Minimum Token Size\n", "\n", "Now, let's create an improved function to split the document into 20 chunks, ensuring each has a minimum token size and respecting sentence boundaries.\n", "\n", "> 20 is an empirically chosen number for this specific document and task; it may need tuning for other documents depending on their size and structure (the higher the number, the more fine-grained the chunks). The key principle, however, is to split the document into sections so that the language model can decide which parts are relevant. The same reasoning applies to the `max_depth` parameter introduced later in the cookbook."
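, "\n", "\n", "As a rough illustration of how you might tune this number for other corpora, the sketch below derives a chunk count from the document's total token count. This helper, its name, and the ~50k-token target per top-level chunk are illustrative assumptions, not part of the implementation that follows:\n", "\n", "```python\n", "import tiktoken\n", "\n", "# Hypothetical helper: suggest how many top-level chunks to use, based on the\n", "# document's token count and an assumed target of ~50k tokens per chunk\n", "# (which lands near 20 chunks for this ~930k-token manual).\n", "def suggest_chunk_count(text: str, target_tokens: int = 50_000,\n", "                        lo: int = 5, hi: int = 40) -> int:\n", "    n_tokens = len(tiktoken.get_encoding(\"o200k_base\").encode(text))\n", "    return max(lo, min(hi, n_tokens // target_tokens))\n", "```"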
] }, { "cell_type": "code", "execution_count": 4, "id": "604f869b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Split document into 20 chunks\n", "Chunk 0: 42326 tokens\n", "Chunk 1: 42093 tokens\n", "Chunk 2: 42107 tokens\n", "Chunk 3: 39797 tokens\n", "Chunk 4: 58959 tokens\n", "Chunk 5: 48805 tokens\n", "Chunk 6: 37243 tokens\n", "Chunk 7: 33453 tokens\n", "Chunk 8: 38644 tokens\n", "Chunk 9: 49402 tokens\n", "Chunk 10: 51568 tokens\n", "Chunk 11: 49586 tokens\n", "Chunk 12: 47722 tokens\n", "Chunk 13: 48952 tokens\n", "Chunk 14: 44994 tokens\n", "Chunk 15: 50286 tokens\n", "Chunk 16: 54424 tokens\n", "Chunk 17: 62651 tokens\n", "Chunk 18: 47430 tokens\n", "Chunk 19: 42507 tokens\n" ] } ], "source": [ "# Global tokenizer name to use consistently throughout the code\n", "TOKENIZER_NAME = \"o200k_base\"\n", "\n", "def split_into_20_chunks(text: str, min_tokens: int = 500) -> List[Dict[str, Any]]:\n", " \"\"\"\n", " Split text into up to 20 chunks, respecting sentence boundaries and ensuring\n", " each chunk has at least min_tokens (unless it's the last chunk).\n", " \n", " Args:\n", " text: The text to split\n", " min_tokens: The minimum number of tokens per chunk (default: 500)\n", " \n", " Returns:\n", " A list of dictionaries where each dictionary has:\n", " - id: The chunk ID (0-19)\n", " - text: The chunk text content\n", " \"\"\"\n", " # First, split the text into sentences\n", " sentences = sent_tokenize(text)\n", " \n", " # Get tokenizer for counting tokens\n", " tokenizer = tiktoken.get_encoding(TOKENIZER_NAME)\n", " \n", " # Create chunks that respect sentence boundaries and minimum token count\n", " chunks = []\n", " current_chunk_sentences = []\n", " current_chunk_tokens = 0\n", " \n", " for sentence in sentences:\n", " # Count tokens in this sentence\n", " sentence_tokens = len(tokenizer.encode(sentence))\n", " \n", " # If adding this sentence would make the chunk too large AND we already have the minimum tokens,\n", " # finalize the current chunk and start a new one\n", " if (current_chunk_tokens + sentence_tokens > min_tokens * 2) and current_chunk_tokens >= min_tokens:\n", " chunk_text = \" \".join(current_chunk_sentences)\n", " chunks.append({\n", " \"id\": len(chunks), # Integer ID instead of string\n", " \"text\": chunk_text\n", " })\n", " current_chunk_sentences = [sentence]\n", " current_chunk_tokens = sentence_tokens\n", " else:\n", " # Add this sentence to the current chunk\n", " current_chunk_sentences.append(sentence)\n", " current_chunk_tokens += sentence_tokens\n", " \n", " # Add the last chunk if there's anything left\n", " if current_chunk_sentences:\n", " chunk_text = \" \".join(current_chunk_sentences)\n", " chunks.append({\n", " \"id\": len(chunks), # Integer ID instead of string\n", " \"text\": chunk_text\n", " })\n", " \n", " # If we have more than 20 chunks, consolidate them\n", " if len(chunks) > 20:\n", " # Recombine all text\n", " all_text = \" \".join(chunk[\"text\"] for chunk in chunks)\n", " # Re-split into exactly 20 chunks, without minimum token requirement\n", " sentences = sent_tokenize(all_text)\n", " sentences_per_chunk = len(sentences) // 20 + (1 if len(sentences) % 20 > 0 else 0)\n", " \n", " chunks = []\n", " for i in range(0, len(sentences), sentences_per_chunk):\n", " # Get the sentences for this chunk\n", " chunk_sentences = sentences[i:i+sentences_per_chunk]\n", " # Join the sentences into a single text\n", " chunk_text = \" \".join(chunk_sentences)\n", " # Create a chunk object with ID and 
text\n", " chunks.append({\n", " \"id\": len(chunks), # Integer ID instead of string\n", " \"text\": chunk_text\n", " })\n", " \n", " # Print chunk statistics\n", " print(f\"Split document into {len(chunks)} chunks\")\n", " for i, chunk in enumerate(chunks):\n", " token_count = len(tokenizer.encode(chunk[\"text\"]))\n", " print(f\"Chunk {i}: {token_count} tokens\")\n", " \n", " return chunks\n", "\n", "# Split the document into 20 chunks with minimum token size\n", "document_chunks = split_into_20_chunks(document_text, min_tokens=500)" ] }, { "cell_type": "markdown", "id": "dccc89e6", "metadata": {}, "source": [ "### 3.3 Router Function with Improved Tool Schema\n", "\n", "Now, let's create the router function that will select relevant chunks and maintain a scratchpad.\n", "\n", "> Maintaining a scratchpad allows the model to track decision criteria and reasoning over time. This implementation uses a two-pass approach with GPT-4.1-mini: first requiring the model to update the scratchpad via a tool call (tool_choice=\"required\"), then requesting structured JSON output for chunk selection. This approach provides better visibility into the model's reasoning process while ensuring consistent structured outputs for downstream processing." ] }, { "cell_type": "code", "execution_count": 5, "id": "a8373af1", "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "import json\n", "from typing import List, Dict, Any\n", "\n", "# Initialize OpenAI client\n", "client = OpenAI()\n", "\n", "def route_chunks(question: str, chunks: List[Dict[str, Any]], \n", " depth: int, scratchpad: str = \"\") -> Dict[str, Any]:\n", " \"\"\"\n", " Ask the model which chunks contain information relevant to the question.\n", " Maintains a scratchpad for the model's reasoning.\n", " Uses structured output for chunk selection and required tool calls for scratchpad.\n", " \n", " Args:\n", " question: The user's question\n", " chunks: List of chunks to evaluate\n", " depth: Current depth in the navigation hierarchy\n", " scratchpad: Current scratchpad content\n", " \n", " Returns:\n", " Dictionary with selected IDs and updated scratchpad\n", " \"\"\"\n", " print(f\"\\n==== ROUTING AT DEPTH {depth} ====\")\n", " print(f\"Evaluating {len(chunks)} chunks for relevance\")\n", " \n", " # Build system message\n", " system_message = \"\"\"You are an expert document navigator. Your task is to:\n", "1. Identify which text chunks might contain information to answer the user's question\n", "2. Record your reasoning in a scratchpad for later reference\n", "3. Choose chunks that are most likely relevant. Be selective, but thorough. 
Choose as many chunks as you need to answer the question, but avoid selecting too many.\n", "\n", "First think carefully about what information would help answer the question, then evaluate each chunk.\n", "\"\"\"\n", "\n", " # Build user message with chunks and current scratchpad\n", " user_message = f\"QUESTION: {question}\\n\\n\"\n", " \n", " if scratchpad:\n", " user_message += f\"CURRENT SCRATCHPAD:\\n{scratchpad}\\n\\n\"\n", " \n", " user_message += \"TEXT CHUNKS:\\n\\n\"\n", " \n", " # Add each chunk to the message\n", " for chunk in chunks:\n", " user_message += f\"CHUNK {chunk['id']}:\\n{chunk['text']}\\n\\n\"\n", " \n", " # Define function schema for scratchpad tool calling\n", " tools = [\n", " {\n", " \"type\": \"function\",\n", " \"name\": \"update_scratchpad\",\n", " \"description\": \"Record your reasoning about why certain chunks were selected\",\n", " \"strict\": True,\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"text\": {\n", " \"type\": \"string\",\n", " \"description\": \"Your reasoning about the chunk(s) selection\"\n", " }\n", " },\n", " \"required\": [\"text\"],\n", " \"additionalProperties\": False\n", " }\n", " }\n", " ]\n", " \n", " # Define JSON schema for structured output (selected chunks)\n", " text_format = {\n", " \"format\": {\n", " \"type\": \"json_schema\",\n", " \"name\": \"selected_chunks\",\n", " \"strict\": True,\n", " \"schema\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"chunk_ids\": {\n", " \"type\": \"array\",\n", " \"items\": {\"type\": \"integer\"},\n", " \"description\": \"IDs of the selected chunks that contain information to answer the question\"\n", " }\n", " },\n", " \"required\": [\n", " \"chunk_ids\"\n", " ],\n", " \"additionalProperties\": False\n", " }\n", " }\n", " }\n", " \n", " # First pass: Call the model to update scratchpad (required tool call)\n", " messages = [\n", " {\"role\": \"system\", \"content\": system_message},\n", " {\"role\": \"user\", \"content\": user_message + \"\\n\\nFirst, you must use the update_scratchpad function to record your reasoning.\"}\n", " ]\n", " \n", " response = client.responses.create(\n", " model=\"gpt-4.1-mini\",\n", " input=messages,\n", " tools=tools,\n", " tool_choice=\"required\"\n", " )\n", " \n", " # Process the scratchpad tool call\n", " new_scratchpad = scratchpad\n", " \n", " for tool_call in response.output:\n", " if tool_call.type == \"function_call\" and tool_call.name == \"update_scratchpad\":\n", " args = json.loads(tool_call.arguments)\n", " scratchpad_entry = f\"DEPTH {depth} REASONING:\\n{args.get('text', '')}\"\n", " if new_scratchpad:\n", " new_scratchpad += \"\\n\\n\" + scratchpad_entry\n", " else:\n", " new_scratchpad = scratchpad_entry\n", " \n", " # Add function call and result to messages\n", " messages.append(tool_call)\n", " messages.append({\n", " \"type\": \"function_call_output\",\n", " \"call_id\": tool_call.call_id,\n", " \"output\": \"Scratchpad updated successfully.\"\n", " })\n", " \n", " # Second pass: Get structured output for chunk selection\n", " messages.append({\"role\": \"user\", \"content\": \"Now, select the chunks that could contain information to answer the question. 
Return a JSON object with the list of chunk IDs.\"})\n", " \n", " response_chunks = client.responses.create(\n", " model=\"gpt-4.1-mini\",\n", " input=messages,\n", " text=text_format\n", " )\n", " \n", " # Extract selected chunk IDs from structured output\n", " selected_ids = []\n", " if response_chunks.output_text:\n", " try:\n", " # The output_text should already be in JSON format due to the schema\n", " chunk_data = json.loads(response_chunks.output_text)\n", " selected_ids = chunk_data.get(\"chunk_ids\", [])\n", " except json.JSONDecodeError:\n", " print(\"Warning: Could not parse structured output as JSON\")\n", " \n", " # Display results\n", " print(f\"Selected chunks: {', '.join(str(id) for id in selected_ids)}\")\n", " print(f\"Updated scratchpad:\\n{new_scratchpad}\")\n", " \n", " return {\n", " \"selected_ids\": selected_ids,\n", " \"scratchpad\": new_scratchpad\n", " }" ] }, { "cell_type": "markdown", "id": "c11654a9", "metadata": {}, "source": [ "### 3.4 Recursive Navigation Function\n", "\n", "Now, let's create the recursive navigation function that drills down through the document. `max_depth` is the maximum number of levels to drill down (keeping token minimums in mind):" ] }, { "cell_type": "code", "execution_count": 6, "id": "876940b7", "metadata": {}, "outputs": [], "source": [ "def navigate_to_paragraphs(document_text: str, question: str, max_depth: int = 1) -> Dict[str, Any]:\n", " \"\"\"\n", " Navigate through the document hierarchy to find relevant paragraphs.\n", " \n", " Args:\n", " document_text: The full document text\n", " question: The user's question\n", " max_depth: Maximum depth to navigate before returning paragraphs (default: 1)\n", " \n", " Returns:\n", " Dictionary with selected paragraphs and final scratchpad\n", " \"\"\"\n", " scratchpad = \"\"\n", " \n", " # Get initial chunks with min 500 tokens\n", " chunks = split_into_20_chunks(document_text, min_tokens=500)\n", " \n", " # Navigator state - track chunk paths to maintain hierarchy\n", " chunk_paths = {} # Maps numeric IDs to path strings for display\n", " for chunk in chunks:\n", " chunk_paths[chunk[\"id\"]] = str(chunk[\"id\"])\n", " \n", " # Navigate through levels until max_depth or until no chunks remain\n", " for current_depth in range(max_depth + 1):\n", " # Call router to get relevant chunks\n", " result = route_chunks(question, chunks, current_depth, scratchpad)\n", " \n", " # Update scratchpad\n", " scratchpad = result[\"scratchpad\"]\n", " \n", " # Get selected chunks\n", " selected_ids = result[\"selected_ids\"]\n", " selected_chunks = [c for c in chunks if c[\"id\"] in selected_ids]\n", " \n", " # If no chunks were selected, return empty result\n", " if not selected_chunks:\n", " print(\"\\nNo relevant chunks found.\")\n", " return {\"paragraphs\": [], \"scratchpad\": scratchpad}\n", " \n", " # If we've reached max_depth, return the selected chunks\n", " if current_depth == max_depth:\n", " print(f\"\\nReturning {len(selected_chunks)} relevant chunks at depth {current_depth}\")\n", " \n", " # Update display IDs to show hierarchy\n", " for chunk in selected_chunks:\n", " chunk[\"display_id\"] = chunk_paths[chunk[\"id\"]]\n", " \n", " return {\"paragraphs\": selected_chunks, \"scratchpad\": scratchpad}\n", " \n", " # Prepare next level by splitting selected chunks further\n", " next_level_chunks = []\n", " next_chunk_id = 0 # Counter for new chunks\n", " \n", " for chunk in selected_chunks:\n", " # Split this chunk into smaller pieces\n", " sub_chunks = 
split_into_20_chunks(chunk[\"text\"], min_tokens=200)\n", " \n", " # Update IDs and maintain path mapping\n", " for sub_chunk in sub_chunks:\n", " path = f\"{chunk_paths[chunk['id']]}.{sub_chunk['id']}\"\n", " sub_chunk[\"id\"] = next_chunk_id\n", " chunk_paths[next_chunk_id] = path\n", " next_level_chunks.append(sub_chunk)\n", " next_chunk_id += 1\n", " \n", " # Update chunks for next iteration\n", " chunks = next_level_chunks" ] }, { "cell_type": "markdown", "id": "0d803dfc", "metadata": {}, "source": [ "### 3.5 Run the Improved Navigation for a Sample Question\n", "\n", "Let's run the navigation for a sample question with our improved approach:" ] }, { "cell_type": "code", "execution_count": 7, "id": "f6e29008", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Split document into 20 chunks\n", "Chunk 0: 42326 tokens\n", "Chunk 1: 42093 tokens\n", "Chunk 2: 42107 tokens\n", "Chunk 3: 39797 tokens\n", "Chunk 4: 58959 tokens\n", "Chunk 5: 48805 tokens\n", "Chunk 6: 37243 tokens\n", "Chunk 7: 33453 tokens\n", "Chunk 8: 38644 tokens\n", "Chunk 9: 49402 tokens\n", "Chunk 10: 51568 tokens\n", "Chunk 11: 49586 tokens\n", "Chunk 12: 47722 tokens\n", "Chunk 13: 48952 tokens\n", "Chunk 14: 44994 tokens\n", "Chunk 15: 50286 tokens\n", "Chunk 16: 54424 tokens\n", "Chunk 17: 62651 tokens\n", "Chunk 18: 47430 tokens\n", "Chunk 19: 42507 tokens\n", "\n", "==== ROUTING AT DEPTH 0 ====\n", "Evaluating 20 chunks for relevance\n", "Selected chunks: 0, 1, 2, 3, 4, 5, 6, 7, 8\n", "Updated scratchpad:\n", "DEPTH 0 REASONING:\n", "The user wants to know the format requirements for filing a motion to compel discovery and how signatures should be handled for such motions. \n", "\n", "Based on the evaluation of chunks:\n", "- Chunks 0, 1, 2, 3, 4, 5, 6, 7, 8 are highly relevant since they cover general requirements for submissions, motions, signatures, service, and specifically for motions and discovery in TTAB proceedings.\n", "- These chunks contain detailed info about electronic filing (via ESTTA), paper filing exceptions, signature requirements, service requirements, format of submissions (including motions), timing rules, and professionals' responsibilities.\n", "- Additionally, the rules for motions to compel, including required attachments, timing, and certification of good faith efforts to resolve discovery disputes, are specifically outlined.\n", "- Chunks 11-19 mostly cover post-trial and appeal procedures, less directly relevant.\n", "\n", "I will select these relevant chunks to provide a thorough answer about how motions to compel discovery should be filed and how signatures on such motions are handled.\n", "Split document into 20 chunks\n", "Chunk 0: 3539 tokens\n", "Chunk 1: 2232 tokens\n", "Chunk 2: 1746 tokens\n", "Chunk 3: 3078 tokens\n", "Chunk 4: 1649 tokens\n", "Chunk 5: 2779 tokens\n", "Chunk 6: 2176 tokens\n", "Chunk 7: 1667 tokens\n", "Chunk 8: 1950 tokens\n", "Chunk 9: 1730 tokens\n", "Chunk 10: 1590 tokens\n", "Chunk 11: 1964 tokens\n", "Chunk 12: 1459 tokens\n", "Chunk 13: 2070 tokens\n", "Chunk 14: 2422 tokens\n", "Chunk 15: 1976 tokens\n", "Chunk 16: 2335 tokens\n", "Chunk 17: 2694 tokens\n", "Chunk 18: 2282 tokens\n", "Chunk 19: 982 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 2880 tokens\n", "Chunk 1: 1323 tokens\n", "Chunk 2: 2088 tokens\n", "Chunk 3: 1493 tokens\n", "Chunk 4: 2466 tokens\n", "Chunk 5: 2563 tokens\n", "Chunk 6: 2981 tokens\n", "Chunk 7: 2723 tokens\n", "Chunk 8: 2264 tokens\n", "Chunk 9: 1900 tokens\n", "Chunk 10: 2134 
tokens\n", "Chunk 11: 1778 tokens\n", "Chunk 12: 2484 tokens\n", "Chunk 13: 1922 tokens\n", "Chunk 14: 2237 tokens\n", "Chunk 15: 2044 tokens\n", "Chunk 16: 2097 tokens\n", "Chunk 17: 1326 tokens\n", "Chunk 18: 2427 tokens\n", "Chunk 19: 962 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 2341 tokens\n", "Chunk 1: 1724 tokens\n", "Chunk 2: 2042 tokens\n", "Chunk 3: 3225 tokens\n", "Chunk 4: 1617 tokens\n", "Chunk 5: 2247 tokens\n", "Chunk 6: 1741 tokens\n", "Chunk 7: 1914 tokens\n", "Chunk 8: 2027 tokens\n", "Chunk 9: 2596 tokens\n", "Chunk 10: 2366 tokens\n", "Chunk 11: 2164 tokens\n", "Chunk 12: 2471 tokens\n", "Chunk 13: 1821 tokens\n", "Chunk 14: 1496 tokens\n", "Chunk 15: 1712 tokens\n", "Chunk 16: 1909 tokens\n", "Chunk 17: 1961 tokens\n", "Chunk 18: 2309 tokens\n", "Chunk 19: 2419 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 2304 tokens\n", "Chunk 1: 2140 tokens\n", "Chunk 2: 1845 tokens\n", "Chunk 3: 3053 tokens\n", "Chunk 4: 2008 tokens\n", "Chunk 5: 2052 tokens\n", "Chunk 6: 2240 tokens\n", "Chunk 7: 1943 tokens\n", "Chunk 8: 1732 tokens\n", "Chunk 9: 1507 tokens\n", "Chunk 10: 1453 tokens\n", "Chunk 11: 1976 tokens\n", "Chunk 12: 1871 tokens\n", "Chunk 13: 1620 tokens\n", "Chunk 14: 1906 tokens\n", "Chunk 15: 1558 tokens\n", "Chunk 16: 1889 tokens\n", "Chunk 17: 2233 tokens\n", "Chunk 18: 2208 tokens\n", "Chunk 19: 2259 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 4620 tokens\n", "Chunk 1: 3446 tokens\n", "Chunk 2: 1660 tokens\n", "Chunk 3: 3203 tokens\n", "Chunk 4: 4373 tokens\n", "Chunk 5: 4233 tokens\n", "Chunk 6: 3651 tokens\n", "Chunk 7: 3820 tokens\n", "Chunk 8: 3018 tokens\n", "Chunk 9: 3018 tokens\n", "Chunk 10: 4201 tokens\n", "Chunk 11: 3043 tokens\n", "Chunk 12: 2438 tokens\n", "Chunk 13: 3295 tokens\n", "Chunk 14: 2578 tokens\n", "Chunk 15: 2423 tokens\n", "Chunk 16: 1386 tokens\n", "Chunk 17: 1482 tokens\n", "Chunk 18: 1615 tokens\n", "Chunk 19: 1454 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 1468 tokens\n", "Chunk 1: 1946 tokens\n", "Chunk 2: 2020 tokens\n", "Chunk 3: 3384 tokens\n", "Chunk 4: 2458 tokens\n", "Chunk 5: 3535 tokens\n", "Chunk 6: 3059 tokens\n", "Chunk 7: 2027 tokens\n", "Chunk 8: 2417 tokens\n", "Chunk 9: 2772 tokens\n", "Chunk 10: 1913 tokens\n", "Chunk 11: 2674 tokens\n", "Chunk 12: 2131 tokens\n", "Chunk 13: 1409 tokens\n", "Chunk 14: 3256 tokens\n", "Chunk 15: 2827 tokens\n", "Chunk 16: 2547 tokens\n", "Chunk 17: 4187 tokens\n", "Chunk 18: 1527 tokens\n", "Chunk 19: 1246 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 1272 tokens\n", "Chunk 1: 1646 tokens\n", "Chunk 2: 1643 tokens\n", "Chunk 3: 2279 tokens\n", "Chunk 4: 1451 tokens\n", "Chunk 5: 1635 tokens\n", "Chunk 6: 1983 tokens\n", "Chunk 7: 1337 tokens\n", "Chunk 8: 1820 tokens\n", "Chunk 9: 2269 tokens\n", "Chunk 10: 2894 tokens\n", "Chunk 11: 2176 tokens\n", "Chunk 12: 1401 tokens\n", "Chunk 13: 1882 tokens\n", "Chunk 14: 2114 tokens\n", "Chunk 15: 2240 tokens\n", "Chunk 16: 1900 tokens\n", "Chunk 17: 1550 tokens\n", "Chunk 18: 1713 tokens\n", "Chunk 19: 2035 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 2694 tokens\n", "Chunk 1: 1808 tokens\n", "Chunk 2: 1874 tokens\n", "Chunk 3: 1328 tokens\n", "Chunk 4: 1552 tokens\n", "Chunk 5: 1436 tokens\n", "Chunk 6: 1367 tokens\n", "Chunk 7: 1333 tokens\n", "Chunk 8: 978 tokens\n", "Chunk 9: 1303 tokens\n", "Chunk 10: 1738 tokens\n", "Chunk 11: 1509 tokens\n", "Chunk 12: 1875 tokens\n", "Chunk 13: 1524 tokens\n", "Chunk 14: 1597 tokens\n", "Chunk 15: 1807 tokens\n", "Chunk 16: 2449 
tokens\n", "Chunk 17: 2271 tokens\n", "Chunk 18: 1467 tokens\n", "Chunk 19: 1540 tokens\n", "Split document into 20 chunks\n", "Chunk 0: 1597 tokens\n", "Chunk 1: 1554 tokens\n", "Chunk 2: 1685 tokens\n", "Chunk 3: 1416 tokens\n", "Chunk 4: 1702 tokens\n", "Chunk 5: 1575 tokens\n", "Chunk 6: 1842 tokens\n", "Chunk 7: 1981 tokens\n", "Chunk 8: 1393 tokens\n", "Chunk 9: 1562 tokens\n", "Chunk 10: 1569 tokens\n", "Chunk 11: 1898 tokens\n", "Chunk 12: 3186 tokens\n", "Chunk 13: 2337 tokens\n", "Chunk 14: 1889 tokens\n", "Chunk 15: 1948 tokens\n", "Chunk 16: 1628 tokens\n", "Chunk 17: 3544 tokens\n", "Chunk 18: 2454 tokens\n", "Chunk 19: 1882 tokens\n", "\n", "==== ROUTING AT DEPTH 1 ====\n", "Evaluating 180 chunks for relevance\n", "Selected chunks: 5, 6, 7, 17, 18, 19, 20, 400, 401, 408, 410\n", "Updated scratchpad:\n", "DEPTH 0 REASONING:\n", "The user wants to know the format requirements for filing a motion to compel discovery and how signatures should be handled for such motions. \n", "\n", "Based on the evaluation of chunks:\n", "- Chunks 0, 1, 2, 3, 4, 5, 6, 7, 8 are highly relevant since they cover general requirements for submissions, motions, signatures, service, and specifically for motions and discovery in TTAB proceedings.\n", "- These chunks contain detailed info about electronic filing (via ESTTA), paper filing exceptions, signature requirements, service requirements, format of submissions (including motions), timing rules, and professionals' responsibilities.\n", "- Additionally, the rules for motions to compel, including required attachments, timing, and certification of good faith efforts to resolve discovery disputes, are specifically outlined.\n", "- Chunks 11-19 mostly cover post-trial and appeal procedures, less directly relevant.\n", "\n", "I will select these relevant chunks to provide a thorough answer about how motions to compel discovery should be filed and how signatures on such motions are handled.\n", "\n", "DEPTH 1 REASONING:\n", "The user's question asks about the format requirements for filing a motion to compel discovery and how signatures should be handled. Relevant information will likely involve sections on \"motions\" specifically \"motion to compel discovery,\" filing format, signature requirements, and related procedural rules in TTAB practice. \n", "\n", "Based on the large amount and depth of the provided chunks, I identified the following relevant topics and chunks addressing them:\n", "\n", "1. Signature Requirements & Acceptable Formats for Motions and Submissions\n", "- Detailed rules for signatures on submissions including motions are in chunks 5, 6, 7.\n", "- These include rules on electronic filing, use of ESTTA, required signature format including electronic signatures with the symbol method \"/sig/\".\n", "\n", "2. Format of Submissions and Use of ESTTA\n", "- Filing requirements, printing format, size, paper submissions, and special exceptions are found in chunks 7, 8, 9, 10, 11, 12, 13.\n", "- Motions generally must be filed via ESTTA, with exceptions requiring petitions to Director with reasons.\n", "\n", "3. Motions to Compel and Discovery Motions\n", "- Specific rules related to filing motions such as motions to compel discovery, service, and timing are expected in the portions covering discovery and motions.\n", "- Discovery and related motions are introduced in chapters starting from chunk 400 and beyond.\n", "\n", "4. 
Service and Certificates of Service\n", "- How motions must be served and proof of service with certificates is discussed in chunks 17, 18, 19, 20.\n", "- These include requirements that every submission in inter partes cases, except notice of opposition or petition to cancel, must be served on adversary and proof of service provided.\n", "\n", "5. Motions to Compel Discovery Details\n", "- Discovery and motion procedure, filing format, timing, service, and related sanctions are extensively covered in chunks 400 and following.\n", "- These include disclosures, discovery conferences, timing for discovery requests, responses, motions to compel, and sanctions.\n", "\n", "From the above, the following chunks are most likely to provide the requested information:\n", "- Chunks 5, 6, 7: Signature rules and filing format including motions.\n", "- Chunks 17, 18, 19, 20: Service of submissions and certificates of service.\n", "- Chunks 400 to 410 plus related portions (401.01, 401.02, 401.03, 408, 410): Discovery rules, motions to compel details.\n", "\n", "These cover the format of motions including motions to compel discovery, signature rules, service and proof of service, and discovery procedure and rules governing motions.\n", "\n", "Less relevant chunks to the question are routine procedural provisions on oppositions, petitions to cancel, answers, which do not specifically address filing or signatures of motions to compel discovery.\n", "\n", "Plan: Select the above relevant chunks and report key procedural points on the format in which a motion to compel discovery must be filed and how signatures must be handled.\n", "Split document into 8 chunks\n", "Chunk 0: 398 tokens\n", "Chunk 1: 256 tokens\n", "Chunk 2: 389 tokens\n", "Chunk 3: 356 tokens\n", "Chunk 4: 401 tokens\n", "Chunk 5: 277 tokens\n", "Chunk 6: 435 tokens\n", "Chunk 7: 265 tokens\n", "Split document into 6 chunks\n", "Chunk 0: 353 tokens\n", "Chunk 1: 393 tokens\n", "Chunk 2: 388 tokens\n", "Chunk 3: 398 tokens\n", "Chunk 4: 397 tokens\n", "Chunk 5: 247 tokens\n", "Split document into 5 chunks\n", "Chunk 0: 325 tokens\n", "Chunk 1: 389 tokens\n", "Chunk 2: 303 tokens\n", "Chunk 3: 344 tokens\n", "Chunk 4: 306 tokens\n", "Split document into 8 chunks\n", "Chunk 0: 396 tokens\n", "Chunk 1: 354 tokens\n", "Chunk 2: 361 tokens\n", "Chunk 3: 378 tokens\n", "Chunk 4: 388 tokens\n", "Chunk 5: 394 tokens\n", "Chunk 6: 361 tokens\n", "Chunk 7: 61 tokens\n", "Split document into 7 chunks\n", "Chunk 0: 396 tokens\n", "Chunk 1: 355 tokens\n", "Chunk 2: 377 tokens\n", "Chunk 3: 362 tokens\n", "Chunk 4: 326 tokens\n", "Chunk 5: 397 tokens\n", "Chunk 6: 69 tokens\n", "Split document into 3 chunks\n", "Chunk 0: 388 tokens\n", "Chunk 1: 373 tokens\n", "Chunk 2: 221 tokens\n", "Split document into 8 chunks\n", "Chunk 0: 360 tokens\n", "Chunk 1: 314 tokens\n", "Chunk 2: 369 tokens\n", "Chunk 3: 363 tokens\n", "Chunk 4: 361 tokens\n", "Chunk 5: 393 tokens\n", "Chunk 6: 361 tokens\n", "Chunk 7: 358 tokens\n", "\n", "==== ROUTING AT DEPTH 2 ====\n", "Evaluating 45 chunks for relevance\n", "Selected chunks: 0, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36\n", "Updated scratchpad:\n", "DEPTH 0 REASONING:\n", "The user wants to know the format requirements for filing a motion to compel discovery and how signatures should be handled for such motions. 
\n", "\n", "Based on the evaluation of chunks:\n", "- Chunks 0, 1, 2, 3, 4, 5, 6, 7, 8 are highly relevant since they cover general requirements for submissions, motions, signatures, service, and specifically for motions and discovery in TTAB proceedings.\n", "- These chunks contain detailed info about electronic filing (via ESTTA), paper filing exceptions, signature requirements, service requirements, format of submissions (including motions), timing rules, and professionals' responsibilities.\n", "- Additionally, the rules for motions to compel, including required attachments, timing, and certification of good faith efforts to resolve discovery disputes, are specifically outlined.\n", "- Chunks 11-19 mostly cover post-trial and appeal procedures, less directly relevant.\n", "\n", "I will select these relevant chunks to provide a thorough answer about how motions to compel discovery should be filed and how signatures on such motions are handled.\n", "\n", "DEPTH 1 REASONING:\n", "The user's question asks about the format requirements for filing a motion to compel discovery and how signatures should be handled. Relevant information will likely involve sections on \"motions\" specifically \"motion to compel discovery,\" filing format, signature requirements, and related procedural rules in TTAB practice. \n", "\n", "Based on the large amount and depth of the provided chunks, I identified the following relevant topics and chunks addressing them:\n", "\n", "1. Signature Requirements & Acceptable Formats for Motions and Submissions\n", "- Detailed rules for signatures on submissions including motions are in chunks 5, 6, 7.\n", "- These include rules on electronic filing, use of ESTTA, required signature format including electronic signatures with the symbol method \"/sig/\".\n", "\n", "2. Format of Submissions and Use of ESTTA\n", "- Filing requirements, printing format, size, paper submissions, and special exceptions are found in chunks 7, 8, 9, 10, 11, 12, 13.\n", "- Motions generally must be filed via ESTTA, with exceptions requiring petitions to Director with reasons.\n", "\n", "3. Motions to Compel and Discovery Motions\n", "- Specific rules related to filing motions such as motions to compel discovery, service, and timing are expected in the portions covering discovery and motions.\n", "- Discovery and related motions are introduced in chapters starting from chunk 400 and beyond.\n", "\n", "4. Service and Certificates of Service\n", "- How motions must be served and proof of service with certificates is discussed in chunks 17, 18, 19, 20.\n", "- These include requirements that every submission in inter partes cases, except notice of opposition or petition to cancel, must be served on adversary and proof of service provided.\n", "\n", "5. 
Motions to Compel Discovery Details\n", "- Discovery and motion procedure, filing format, timing, service, and related sanctions are extensively covered in chunks 400 and following.\n", "- These include disclosures, discovery conferences, timing for discovery requests, responses, motions to compel, and sanctions.\n", "\n", "From the above, the following chunks are most likely to provide the requested information:\n", "- Chunks 5, 6, 7: Signature rules and filing format including motions.\n", "- Chunks 17, 18, 19, 20: Service of submissions and certificates of service.\n", "- Chunks 400 to 410 plus related portions (401.01, 401.02, 401.03, 408, 410): Discovery rules, motions to compel details.\n", "\n", "These cover the format of motions including motions to compel discovery, signature rules, service and proof of service, and discovery procedure and rules governing motions.\n", "\n", "Less relevant chunks to the question are routine procedural provisions on oppositions, petitions to cancel, answers, which do not specifically address filing or signatures of motions to compel discovery.\n", "\n", "Plan: Select the above relevant chunks and report key procedural points on the format in which a motion to compel discovery must be filed and how signatures must be handled.\n", "\n", "DEPTH 2 REASONING:\n", "The user's question is about the format for filing a motion to compel discovery and handling of signatures. Relevant information is likely contained in sections addressing motions, discovery procedures, submission format, signature requirements, and service rules. \n", "\n", "Chunks covering signature requirements (5-12) provide detailed rules on legal signatures, electronic signatures, who must sign (attorneys or parties with legal authority), and signature content.\n", "\n", "Chunks 0, 4, 7-10, 15-18 discuss the required format for submissions, including motions, the mandate to file electronically via ESTTA, and exceptions for paper filings.\n", "\n", "Chunks 23-35 address service of submissions, including requirements for service on all parties, methods of service, and certificates of service.\n", "\n", "Finally, discovery-related motions such as motions to compel discovery and their filing details should be in chunks from 400 onwards (although these aren't fully visible here, the rationale included these chunks as likely relevant).\n", "\n", "Therefore, chunks 0,4,5,6,7,8,9,10,11,12,15,16,17,18,23,24,25,26,27,28,29,30,31,32,33,34,35,36 are selected as most relevant to provide a thorough answer on the filing format and signatures for a motion to compel discovery.\n", "\n", "Returning 28 relevant chunks at depth 2\n", "\n", "==== FIRST 3 RETRIEVED PARAGRAPHS ====\n", "\n", "PARAGRAPH 1 (ID: 0.0.5.0):\n", "----------------------------------------\n", "104 Business to be Conducted in Writing\n", "37 C.F.R. § 2.190(b) Electronic trademark documents. … Documents that r elate to proceedings before\n", "the Trademark Trial and Appeal Board must be filed electronically with the Board through ESTTA. 37 C.F.R. § 2.191 Action of the Office based on the written record. All business with the Office must be\n", "transacted in writing. The action of the Office will be based exclusively on the written record. No consideration\n", "will be given to any alleged oral promise, stipulation, or understanding when there is disagreement or doubt. 
With the exceptions of discovery conferences with Board participation, see TBMP § 401.01, and telephone\n", "conferences, see TBMP § 413.01 and TBMP § 502.06, all business with the Board should be transacted in\n", "writing. 37 C.F.R. § 2.191 . The personal attendance of parties or their attorne ys or other authorized\n", "representatives at the offices of the Board is unnecessary , except in the case of a pretrial conference as\n", "provided in 37 C.F.R. § 2.120(j), or upon oral argument at final hearing, if a party so desires, as pro vided\n", "in 37 C.F.R. § 2.129. Decisions of the Board will be based exclusively on the written record before it. [Note\n", "1.] Documents filed in proceedings before the Board must be filed through ESTT A. 37 C.F.R. § 2.190(b). See TBMP § 110.01(a). Board proceedings are conducted in English. If a party intends to rely upon an y submissions that are in a\n", "language other than English, the party should also file a translation of the submissions. If a translation is\n", "not filed, the submissions may not be considered. [Note 2.] NOTES:\n", "1. Cf.\n", "----------------------------------------\n", "\n", "PARAGRAPH 2 (ID: 0.0.5.4):\n", "----------------------------------------\n", "The document should\n", "also include a title describing its nature, e.g., “Notice of Opposition,” “Answer,” “Motion to Compel,” “Brief\n", "in Opposition to Respondent’s Motion for Summary Judgment,” or “Notice of Reliance.”\n", "Documents filed in an application which is the subject of an inter partes proceeding before the Board should\n", "be filed with the Board, not the Trademark Operation, and should bear at the top of the first page both the\n", "application serial number, and the inter partes proceeding number and caption. Similarly , requests under\n", "Trademark Act § 7, 15 U.S.C. § 1057, to amend, correct, or surrender a registration which is the subject of\n", "a Board inter partes proceeding, and any new power of attorney, designation of domestic representative, or\n", "change of address submitted in connection with such a registration, should be filed with the Board, not with\n", "the Trademark Operation, and should bear at the top of its first page the re gistration number, and the inter\n", "partes proceeding number and the proceeding caption. [Note 2.] 100-14June 2024\n", "TRADEMARK TRIAL AND APPEAL BOARD MANUAL OF PROCEDURE§ 105\n", "NOTES:\n", "1. 37 C.F.R. § 2.194. 2. 37 C.F.R. § 2.194. 106.02 Signature of Submissions\n", "37 C.F.R. § 2.119(e) Every submission filed in an inter partes proceeding, and every request for an extension\n", "of time to file an opposition, must be signed by the party filing it, or by the party’s attorney or other authorized\n", "representative, but an unsigned submission will not be r efused consideration if a signed copy is submitted\n", "to the Office within the time limit set in the notification of this defect by the Office. 37 C.F.R. § 11.14(e) Appearance.\n", "----------------------------------------\n", "\n", "PARAGRAPH 3 (ID: 0.0.5.5):\n", "----------------------------------------\n", "No individual other than those specified in par agraphs (a), (b), and (c)\n", "of this section will be permitted to pr actice before the Office in tr ademark matters on behalf of a client. 
Except as specified in § 2.11(a) of this chapter, an individual may appear in a trademark or other non-patent\n", "matter in his or her own behalf or on behalf of:\n", "(1) A firm of which he or she is a member;\n", "(2) A partnership of which he or she is a partner; or\n", "(3) A corporation or association of which he or she is an officer and which he or she is authorized to\n", "represent. 37 C.F.R. § 11.18 Signature and certificate for correspondence filed in the Office. (a) For all documents filed in the Office in patent, trademark, and other non-patent matters, and all\n", "documents filed with a hearing officer in a disciplinary proceeding, except for correspondence that is\n", "required to be signed by the applicant or party, each piece of correspondence filed by a practitioner in the\n", "Office must bear a signature, personally signed or inserted by such practitioner, in compliance with §\n", "1.4(d)(1), § 1.4(d)(2), or § 2.193(a) of this chapter.\n", "----------------------------------------\n" ] } ], "source": [ "# Run the navigation for a sample question\n", "question = \"What format should a motion to compel discovery be filed in? How should signatures be handled?\"\n", "navigation_result = navigate_to_paragraphs(document_text, question, max_depth=2)\n", "\n", "# Sample retrieved paragraph\n", "print(\"\\n==== FIRST 3 RETRIEVED PARAGRAPHS ====\")\n", "for i, paragraph in enumerate(navigation_result[\"paragraphs\"][:3]):\n", " display_id = paragraph.get(\"display_id\", str(paragraph[\"id\"]))\n", " print(f\"\\nPARAGRAPH {i+1} (ID: {display_id}):\")\n", " print(\"-\" * 40)\n", " print(paragraph[\"text\"])\n", " print(\"-\" * 40)" ] }, { "cell_type": "markdown", "id": "dcf85b3e", "metadata": {}, "source": [ "GPT-4.1-mini's results show the iterative extraction of relevant components from the document, with the scratchpad explaining its thought process at each step! At depth 1, the model identifies \"*Detailed rules for signatures on submissions including motions*\" and \"*use of ESTTA, required signature format including electronic signatures with the symbol method '/sig/'*\" as critical components needed to answer the query.\n", "\n", "By depth 2, the scratchpad demonstrates sophisticated judgment by isolating precisely which chunks contain vital regulations about electronic signatures (chunks 5-12) while maintaining awareness of absent content, noting \"*discovery-related motions... should be in chunks from 400 onwards (although these aren't fully visible here...)*\".\n", "\n", "This process shows how GPT-4.1 mimics a legal analyst, iteratively digging deeper into relevant content and explaining its reasoning along the way (making it easier to debug *why* the model selected the chunks it did)." ] }, { "cell_type": "markdown", "id": "495a5230", "metadata": {}, "source": [ "### 3.6 Answer Generation\n", "\n", "Now, let's generate an answer using GPT-4.1 with the retrieved paragraphs. \n", "\n", "> We do a nifty trick here where we dynamically construct a List of Literals (which forces the model's answers to be one of the options we provide -- in this case the paragraph IDs). There are some restrictions on the number of options we can provide, so if you find your system citing > 500 documents, then this solution might not work. In that case, you can either filter the candidates down to at most 500 potential citations, or you can ask the model to cite the exact ID in its response, then post-process the response to extract the IDs, and thus the citations (e.g. it might say \"... 
[doc 0.0.12]\", and you could use some regex to extract the citation).\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "c74cfe50", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "==== GENERATING ANSWER ====\n", "\n", "Answer: A motion to compel discovery must be filed electronically with the Trademark Trial and Appeal Board (TTAB) through ESTTA, unless ESTTA is unavailable due to technical problems or there are extraordinary circumstances, in which case a paper submission may be permitted with a written explanation (\"Documents that relate to proceedings before the Trademark Trial and Appeal Board must be filed electronically with the Board through ESTTA\"; \"The rules require that all submissions must be made to the Board electronically, currently through ESTTA, subject to certain limited exceptions permitting submissions to be made on paper. Any permitted paper submission must be accompanied by a written explanation showing that ESTTA was unavailable due to technical problems, or that extraordinary circumstances are present, and, where required, a Petition to the Director with the requisite petition fee\" 0.0.5.0, 0.0.5.5.7.3).\n", "\n", "The motion should include a title describing its nature, such as “Motion to Compel,” and should bear the appropriate proceeding number and caption at the top of the first page (\"The document should also include a title describing its nature, e.g., 'Motion to Compel'... should bear at the top of the first page both the application serial number, and the inter partes proceeding number and caption\" 0.0.5.4).\n", "\n", "Every submission, including a motion to compel discovery, must be signed by the party filing it, or by the party’s attorney or other authorized representative. For electronic filings through ESTTA, a conventional handwritten signature is not required; instead, an electronic signature is used. The signatory must personally enter a combination of letters, numbers, spaces, and/or punctuation marks between two forward slash ('/') symbols (e.g., /John Smith/), and the signatory's name and title or position must appear immediately below or adjacent to the signature (\"Documents filed electronically, including through ESTTA, do not require a conventional signature. Electronic signatures pursuant to 37 C.F.R. § 2.193(c) are required for electronic filings. The party or its representative enters a 'symbol' that has been adopted as a signature. The Board will accept any combination of letters, numbers, space and/or punctuation marks as a valid signature if it is placed between two forward slash ('/') symbols\"; \"The first and last name, and the title or position, of the person who signs a document in connection with a trademark application, registration, or proceeding before the Trademark Trial and Appeal Board must be set forth immediately below or adjacent to the signature\" 0.0.5.5.6.2, 0.0.5.5.6.0).\n", "\n", "If a document is filed on behalf of a party by the party’s attorney or other authorized representative, it must bear the signature of that attorney or representative, unless the document is one required to be signed personally by the party (0.0.5.5.6.3). 
If an unsigned or improperly signed document is filed, it will not be refused consideration if a properly signed copy is submitted within the time limit set in the notification of the defect by the Board (0.0.5.5.6.4).\n", "\n", "In summary: File the motion to compel discovery electronically via ESTTA, use an electronic signature as described above, and ensure the signatory's name and title are included. If filing on paper is necessary, follow the specific requirements for paper submissions and signatures.\n", "Citations: ['0.0.5.0', '0.0.5.4', '0.0.5.5.6.0', '0.0.5.5.6.2', '0.0.5.5.6.3', '0.0.5.5.6.4', '0.0.5.5.7.3']\n" ] } ], "source": [ "from typing import List, Dict, Any\n", "from pydantic import BaseModel, field_validator\n", "\n", "class LegalAnswer(BaseModel):\n", " \"\"\"Structured response format for legal questions\"\"\"\n", " answer: str\n", " citations: List[str]\n", " \n", " @field_validator('citations')\n", " def validate_citations(cls, citations, info):\n", " # Access valid_citations from the model_config\n", " valid_citations = info.data.get('_valid_citations', [])\n", " if valid_citations:\n", " for citation in citations:\n", " if citation not in valid_citations:\n", " raise ValueError(f\"Invalid citation: {citation}. Must be one of: {valid_citations}\")\n", " return citations\n", "\n", "def generate_answer(question: str, paragraphs: List[Dict[str, Any]], \n", " scratchpad: str) -> LegalAnswer:\n", " \"\"\"Generate an answer from the retrieved paragraphs.\"\"\"\n", " print(\"\\n==== GENERATING ANSWER ====\")\n", " \n", " # Extract valid citation IDs\n", " valid_citations = [str(p.get(\"display_id\", str(p[\"id\"]))) for p in paragraphs]\n", " \n", " if not paragraphs:\n", " return LegalAnswer(\n", " answer=\"I couldn't find relevant information to answer this question in the document.\",\n", " citations=[],\n", " _valid_citations=[]\n", " )\n", " \n", " # Prepare context for the model\n", " context = \"\"\n", " for paragraph in paragraphs:\n", " display_id = paragraph.get(\"display_id\", str(paragraph[\"id\"]))\n", " context += f\"PARAGRAPH {display_id}:\\n{paragraph['text']}\\n\\n\"\n", " \n", " system_prompt = \"\"\"You are a legal research assistant answering questions about the \n", "Trademark Trial and Appeal Board Manual of Procedure (TBMP).\n", "\n", "Answer questions based ONLY on the provided paragraphs. Do not rely on any foundation knowledge or external information or extrapolate from the paragraphs.\n", "Cite phrases of the paragraphs that are relevant to the answer. This will help you be more specific and accurate.\n", "Include citations to paragraph IDs for every statement in your answer. 
Valid citation IDs are: {valid_citations_str}\n", "Keep your answer clear, precise, and professional.\n", "\"\"\"\n", " valid_citations_str = \", \".join(valid_citations)\n", " \n", " # Call the model using structured output\n", " response = client.responses.parse(\n", " model=\"gpt-4.1\",\n", " input=[\n", " {\"role\": \"system\", \"content\": system_prompt.format(valid_citations_str=valid_citations_str)},\n", " {\"role\": \"user\", \"content\": f\"QUESTION: {question}\\n\\nSCRATCHPAD (Navigation reasoning):\\n{scratchpad}\\n\\nPARAGRAPHS:\\n{context}\"}\n", " ],\n", " text_format=LegalAnswer,\n", " temperature=0.3\n", " )\n", " \n", " # Add validation information after parsing\n", " response.output_parsed._valid_citations = valid_citations\n", " \n", " print(f\"\\nAnswer: {response.output_parsed.answer}\")\n", " print(f\"Citations: {response.output_parsed.citations}\")\n", "\n", " return response.output_parsed\n", "\n", "# Generate an answer\n", "answer = generate_answer(question, navigation_result[\"paragraphs\"], \n", " navigation_result[\"scratchpad\"])" ] }, { "cell_type": "markdown", "id": "83d5e682", "metadata": {}, "source": [ "GPT 4.1 effectively integrates citations throughout its response while maintaining a clear flow of information. Each procedural requirement is linked to specific authoritative references (like \"0.0.5.0\" and \"0.0.5.5.6.2\"), creating a response that's both informative and precisely sourced. \n", "\n", "Rather than simply listing citations at the end, it weaves them directly into the content using parenthetical notation after each key requirement. This approach transforms a standard recitation of rules into a well-supported legal analysis where statements about ESTTA filing procedures, electronic signature requirements, and paper submission exceptions are immediately backed by their corresponding regulatory citations." ] }, { "cell_type": "markdown", "id": "b9cfe43b", "metadata": {}, "source": [ "### 3.7 Answer Verification\n", "\n", "Let's first look at the cited paragraphs:" ] }, { "cell_type": "code", "execution_count": 9, "id": "4b5e9cd9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "==== CITED PARAGRAPHS ====\n", "\n", "PARAGRAPH 1 (ID: 0.0.5.0):\n", "----------------------------------------\n", "104 Business to be Conducted in Writing\n", "37 C.F.R. § 2.190(b) Electronic trademark documents. … Documents that r elate to proceedings before\n", "the Trademark Trial and Appeal Board must be filed electronically with the Board through ESTTA. 37 C.F.R. § 2.191 Action of the Office based on the written record. All business with the Office must be\n", "transacted in writing. The action of the Office will be based exclusively on the written record. No consideration\n", "will be given to any alleged oral promise, stipulation, or understanding when there is disagreement or doubt. With the exceptions of discovery conferences with Board participation, see TBMP § 401.01, and telephone\n", "conferences, see TBMP § 413.01 and TBMP § 502.06, all business with the Board should be transacted in\n", "writing. 37 C.F.R. § 2.191 . The personal attendance of parties or their attorne ys or other authorized\n", "representatives at the offices of the Board is unnecessary , except in the case of a pretrial conference as\n", "provided in 37 C.F.R. § 2.120(j), or upon oral argument at final hearing, if a party so desires, as pro vided\n", "in 37 C.F.R. § 2.129. 
Decisions of the Board will be based exclusively on the written record before it. [Note\n", "1.] Documents filed in proceedings before the Board must be filed through ESTT A. 37 C.F.R. § 2.190(b). See TBMP § 110.01(a). Board proceedings are conducted in English. If a party intends to rely upon an y submissions that are in a\n", "language other than English, the party should also file a translation of the submissions. If a translation is\n", "not filed, the submissions may not be considered. [Note 2.] NOTES:\n", "1. Cf.\n", "----------------------------------------\n", "\n", "PARAGRAPH 2 (ID: 0.0.5.4):\n", "----------------------------------------\n", "The document should\n", "also include a title describing its nature, e.g., “Notice of Opposition,” “Answer,” “Motion to Compel,” “Brief\n", "in Opposition to Respondent’s Motion for Summary Judgment,” or “Notice of Reliance.”\n", "Documents filed in an application which is the subject of an inter partes proceeding before the Board should\n", "be filed with the Board, not the Trademark Operation, and should bear at the top of the first page both the\n", "application serial number, and the inter partes proceeding number and caption. Similarly , requests under\n", "Trademark Act § 7, 15 U.S.C. § 1057, to amend, correct, or surrender a registration which is the subject of\n", "a Board inter partes proceeding, and any new power of attorney, designation of domestic representative, or\n", "change of address submitted in connection with such a registration, should be filed with the Board, not with\n", "the Trademark Operation, and should bear at the top of its first page the re gistration number, and the inter\n", "partes proceeding number and the proceeding caption. [Note 2.] 100-14June 2024\n", "TRADEMARK TRIAL AND APPEAL BOARD MANUAL OF PROCEDURE§ 105\n", "NOTES:\n", "1. 37 C.F.R. § 2.194. 2. 37 C.F.R. § 2.194. 106.02 Signature of Submissions\n", "37 C.F.R. § 2.119(e) Every submission filed in an inter partes proceeding, and every request for an extension\n", "of time to file an opposition, must be signed by the party filing it, or by the party’s attorney or other authorized\n", "representative, but an unsigned submission will not be r efused consideration if a signed copy is submitted\n", "to the Office within the time limit set in the notification of this defect by the Office. 37 C.F.R. § 11.14(e) Appearance.\n", "----------------------------------------\n", "\n", "PARAGRAPH 3 (ID: 0.0.5.5.6.0):\n", "----------------------------------------\n", "The Office will accept an electronic signature that meets the\n", "requirements of paragraph (c) of this section on correspondence filed on paper or through TEAS or ESTTA. (b) Copy of original signature. If a copy of an original signature is filed, the filer should retain the\n", "original as evidence of authenticity. If a question of authenticity arises, the Office may require submission\n", "of the original. (c) Requirements for electronic signature. A person signing a document electronically must:\n", "(1) Personally enter any combination of letters, numbers, spaces and/or punctuation marks that the\n", "signer has adopted as a signature, placed between two forward slash (“/”) symbols in the signature block\n", "on the electronic submission; or\n", "(2) Sign the verified statement using some other form of electronic signature specified by the Director. (d) Signatory must be identified. 
The first and last name, and the title or position, of the person who\n", "signs a document in connection with a trademark application, registration, or proceeding before the\n", "Trademark Trial and Appeal Board must be set forth immediately below or adjacent to the signature. (e) Proper person to sign. Documents filed in connection with a trademark application or registration\n", "must be signed as specified in paragraphs (e)(1) through (9) of this section. (2) Responses, amendments to applications, requests for express abandonment, requests for\n", "reconsideration of final actions, and requests to divide. Responses to Office actions, amendments to\n", "applications, requests for express abandonment, requests for reconsideration of final actions, and requests\n", "to divide must be signed by the owner of the application or registration, someone with legal authority to\n", "bind the owner (e.g.\n", "----------------------------------------\n", "\n", "PARAGRAPH 4 (ID: 0.0.5.5.6.2):\n", "----------------------------------------\n", "* * * *\n", "(i) Certified documents required by statute. When a statute requires that a document be certified, a\n", "copy or facsimile transmission of the certification is not acceptable. Every document filed in an inter partes or e x parte proceeding before the Board, and e very request for an\n", "extension of time to file an opposition, must be signed by the party filing it, or by the party’ s attorney or\n", "other authorized representative, as appropriate, and the signatory must be identified. [Note 1.] Documents filed electronically, including through ESTTA, do not require a conventional signature. Electronic\n", "signatures pursuant to 37 C.F.R. § 2.193(c) are required for electronic filings. The party or its representative\n", "enters a “symbol” that has been adopted as a signature. The Board will accept any combination of letters,\n", "numbers, space and/or punctuation marks as a valid signature if it is placed between two forward slash (“/”)\n", "symbols. [Note 2.] The electronic signature entered on the ESTTA form is sufficient as the required signature\n", "for the entire submission, including in the absence of a signature on any attachment to the filing form. [Note\n", "3.] The electronic filing cover sheet in ESTTA must be signed by the party filing it, the party’s attorney or\n", "other authorized representative, as appropriate. For further information regarding the filing of submissions\n", "using ESTTA, see TBMP § 110. A party may act in its own behalf in a proceeding before the Board, if the party is domiciled in the United\n", "States, or an attorney may represent the party. [Note 4.] See TBMP § 114 (Representation of a Party). When an individual who is a party to a Board proceeding elects to act in the indi vidual's own behalf, the\n", "individual must sign any documents that are filed with the Board.\n", "----------------------------------------\n", "\n", "PARAGRAPH 5 (ID: 0.0.5.5.6.3):\n", "----------------------------------------\n", "If a party which is a partnership elects to\n", "act in its own behalf, a partner should sign documents filed by the partnership. If a party which is a corporation\n", "or association elects to act in its own behalf, an officer thereof who is authorized to sign for the corporation\n", "or association should sign for that corporation or association. If joint applicants elect to act on their o wn\n", "behalf, all joint applicants must sign any documents filed with the Board. [Note 5.] 
If a document is filed on behalf of a party by the party’s attorney or other authorized representative, it must\n", "bear the signature of, and be personally signed or inserted by , that attorney or other representative, unless\n", "June 2024100-17\n", "§ 106.02GENERAL INFORMATION\n", "it is a document required to be signed personally by the party. An attorney or other authorized representative\n", "who signs a document, and then files it with the Board on behalf of a party , should remember that the\n", "signature to the document constitutes a certification of the elements specified in 37 C.F.R. § 11.18(b), and\n", "that a violation of the pro visions of that rule by may result in sanctions or disciplinary action. [Note 6.] SeeTBMP § 114.04 (regarding meaning of the designation “other authorized representati ve”) and TBMP\n", "§ 527.02 (regarding motions for Fed. R. Civ. P. 11 sanctions). A person transmitting paper documents, when\n", "permitted, for filing with the Board may sign a co ver letter or transmittal letter , and the Office does not\n", "require the party, attorney, or authorized representative to sign a cover or transmittal letter. It is not appropriate for one person to sign a document for another person, as, for example, “John Smith, for\n", "John Doe” or “John Doe, by John Smith.” [Note 7.]\n", "----------------------------------------\n", "\n", "PARAGRAPH 6 (ID: 0.0.5.5.6.4):\n", "----------------------------------------\n", "A document filed in a proceeding before the Board should include the first and last name, in typed or printed\n", "form, of the person who signed [Note 8]; a description of the capacity in which the person signed (e.g., as\n", "the individual who is a party, if the filing party is an individual; as a corporate officer, if the filing party is\n", "a corporation; or as the filing party’s attorney); and the business address and telephone number of the person. The inclusion of the signing person’s address and phone number on the submission itself is vital in the rare\n", "case any paper or physical submissions permitted under the rules because mail physically sent to the Office\n", "is opened in the Mail Room, and ordinarily the en velopes are discarded there before the mail is sent on to\n", "its ultimate destination within the Office. Thus, the Board rarely sees the return addresses on the mailing\n", "envelopes of papers filed in Board proceedings. In accordance with 37 C.F.R. § 2.193(b), a legible copy of the signed document is to be filed with the Board\n", "because filings are required to be submitted using ESTT A. The original should be retained as e vidence of\n", "authenticity. If a question as to the authenticity of a filed copy arises, the Office may require submission of\n", "the original. [Note 9.] Notwithstanding the requirement that a document filed before the Board be signed, an unsigned document\n", "filed in paper form, when permitted, will not be refused consideration if a signed cop y is submitted to the\n", "Board within the time limit set in the notification of this defect by the Board. [Note 10.] 
Similarly , an\n", "improperly signed document, whether filed in ESTT A or on paper , when permitted, will not be refused\n", "consideration if a properly signed cop y is submitted to the Board within the time set in the notification of\n", "this defect by the Board.\n", "----------------------------------------\n", "\n", "PARAGRAPH 7 (ID: 0.0.5.5.7.3):\n", "----------------------------------------\n", "long, and contain no tabs or other such devices extending beyond the edges of the paper;\n", "(3) If a paper submission contains dividers, the dividers must not have any extruding tabs or other\n", "devices, and must be on the same size and weight paper as the submission;\n", "(4) A paper submission must not be stapled or bound;\n", "(5) All pages of a paper submission must be numbered and exhibits shall be identified in the manner\n", "prescribed in § 2.123(g)(2);\n", "June 2024100-19\n", "§ 106.03GENERAL INFORMATION\n", "(6) Exhibits pertaining to a paper submission must be filed on paper and comply with the requirements\n", "for a paper submission. (c) To be handled as confidential, submissions to the Trademark Trial and Appeal Board that are\n", "confidential in whole or part pursuant to § 2.125(f) must be submitted using the “Confidential” selection\n", "available in ESTTA or, where appropriate, under a separate paper cover. Both the submission and its cover\n", "must be marked confidential and must identify the case number and the parties. A copy of the submission\n", "for public viewing with the confidential portions redacted must be submitted concurrently. The rules require that all submissions must be made to the Board electronically, currently through ESTTA,\n", "subject to certain limited e xceptions permitting submissions to be made on paper . Any permitted paper\n", "submission must be accompanied by a written e xplanation showing that ESTTA was unavailable due to\n", "technical problems, or that extraordinary circumstances are present, and, where required, a Petition to the\n", "Director with the requisite petition fee. [Note 1.]\n", "----------------------------------------\n" ] } ], "source": [ "cited_paragraphs = []\n", "for paragraph in navigation_result[\"paragraphs\"]:\n", " para_id = str(paragraph.get(\"display_id\", str(paragraph[\"id\"])))\n", " if para_id in answer.citations:\n", " cited_paragraphs.append(paragraph)\n", " \n", "\n", "# Display the cited paragraphs for the audience\n", "print(\"\\n==== CITED PARAGRAPHS ====\")\n", "for i, paragraph in enumerate(cited_paragraphs):\n", " display_id = paragraph.get(\"display_id\", str(paragraph[\"id\"]))\n", " print(f\"\\nPARAGRAPH {i+1} (ID: {display_id}):\")\n", " print(\"-\" * 40)\n", " print(paragraph[\"text\"])\n", " print(\"-\" * 40)" ] }, { "cell_type": "markdown", "id": "b36a8431", "metadata": {}, "source": [ "The \"List of Literals\" trick forces the model to cite only specific paragraph IDs (like \"0.0.5.4\") rather than making up its own references or highlighting random text — imagine it as creating a digital \"table of contents\" that GPT-4.1 can only select from. This solution ensures you get verifiable citation trails back to exact source material, solving an important problem in long-context RAG." ] }, { "cell_type": "markdown", "id": "d7b1eb2d", "metadata": {}, "source": [ "Finally, let's verify the answer with an LLM-as-judge approach." 
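, "\n", "\n", "Before doing so, here is a minimal, hypothetical sketch of the dynamic \"List of Literals\" constraint described above. The helper name `make_constrained_answer_model` is illustrative, and the sketch assumes Pydantic v2 with `typing.Literal`; it is not the code used earlier in this section.\n", "\n", "```python\n", "from typing import List, Literal\n", "from pydantic import create_model\n", "\n", "def make_constrained_answer_model(paragraph_ids: List[str]):\n", "    # Build a Literal type whose only allowed values are the retrieved paragraph IDs,\n", "    # then create an answer schema on the fly that cannot cite anything else.\n", "    CitationId = Literal[tuple(paragraph_ids)]  # type: ignore[valid-type]\n", "    return create_model(\n", "        'ConstrainedLegalAnswer',\n", "        answer=(str, ...),\n", "        citations=(List[CitationId], ...),\n", "    )\n", "\n", "# Example: only these two IDs would be accepted as citations; a model like this\n", "# could be passed as `text_format` in place of LegalAnswer.\n", "ConstrainedAnswer = make_constrained_answer_model(['0.0.5.0', '0.0.5.4'])\n", "```"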
] }, { "cell_type": "code", "execution_count": 10, "id": "a765a9ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "==== VERIFYING ANSWER ====\n", "\n", "Accuracy verification: PASSED\n", "Confidence: high\n", "Explanation: The answer correctly states that motions to compel discovery must be filed electronically through ESTTA, with paper submissions permitted only under the limited exceptions of technical failure or extraordinary circumstances (37 C.F.R. § 2.190(b) and 2.193(b)). It accurately describes the required title and caption placement (TBMP § 105), and it appropriately summarizes the signature requirements for electronic filings (37 C.F.R. § 2.193(c) and TBMP §§ 106.02, 106.02(b)–(e)), including the use of slash‐enclosed electronic signatures and identification of the signatory’s name and title. It also correctly notes the rule regarding defective signatures (37 C.F.R. § 2.119(e) and TBMP § 106.02). The citations align with the source paragraphs. \n", "\n", "==== FINAL VERIFIED ANSWER ====\n", "Verification: PASSED | Confidence: high\n", "\n", "Answer:\n", "A motion to compel discovery must be filed electronically with the Trademark Trial and Appeal Board (TTAB) through ESTTA, unless ESTTA is unavailable due to technical problems or there are extraordinary circumstances, in which case a paper submission may be permitted with a written explanation (\"Documents that relate to proceedings before the Trademark Trial and Appeal Board must be filed electronically with the Board through ESTTA\"; \"The rules require that all submissions must be made to the Board electronically, currently through ESTTA, subject to certain limited exceptions permitting submissions to be made on paper. Any permitted paper submission must be accompanied by a written explanation showing that ESTTA was unavailable due to technical problems, or that extraordinary circumstances are present, and, where required, a Petition to the Director with the requisite petition fee\" 0.0.5.0, 0.0.5.5.7.3).\n", "\n", "The motion should include a title describing its nature, such as “Motion to Compel,” and should bear the appropriate proceeding number and caption at the top of the first page (\"The document should also include a title describing its nature, e.g., 'Motion to Compel'... should bear at the top of the first page both the application serial number, and the inter partes proceeding number and caption\" 0.0.5.4).\n", "\n", "Every submission, including a motion to compel discovery, must be signed by the party filing it, or by the party’s attorney or other authorized representative. For electronic filings through ESTTA, a conventional handwritten signature is not required; instead, an electronic signature is used. The signatory must personally enter a combination of letters, numbers, spaces, and/or punctuation marks between two forward slash ('/') symbols (e.g., /John Smith/), and the signatory's name and title or position must appear immediately below or adjacent to the signature (\"Documents filed electronically, including through ESTTA, do not require a conventional signature. Electronic signatures pursuant to 37 C.F.R. § 2.193(c) are required for electronic filings. The party or its representative enters a 'symbol' that has been adopted as a signature. 
The Board will accept any combination of letters, numbers, space and/or punctuation marks as a valid signature if it is placed between two forward slash ('/') symbols\"; \"The first and last name, and the title or position, of the person who signs a document in connection with a trademark application, registration, or proceeding before the Trademark Trial and Appeal Board must be set forth immediately below or adjacent to the signature\" 0.0.5.5.6.2, 0.0.5.5.6.0).\n", "\n", "If a document is filed on behalf of a party by the party’s attorney or other authorized representative, it must bear the signature of that attorney or representative, unless the document is one required to be signed personally by the party (0.0.5.5.6.3). If an unsigned or improperly signed document is filed, it will not be refused consideration if a properly signed copy is submitted within the time limit set in the notification of the defect by the Board (0.0.5.5.6.4).\n", "\n", "In summary: File the motion to compel discovery electronically via ESTTA, use an electronic signature as described above, and ensure the signatory's name and title are included. If filing on paper is necessary, follow the specific requirements for paper submissions and signatures.\n", "\n", "Citations:\n", "- 0.0.5.0\n", "- 0.0.5.4\n", "- 0.0.5.5.6.0\n", "- 0.0.5.5.6.2\n", "- 0.0.5.5.6.3\n", "- 0.0.5.5.6.4\n", "- 0.0.5.5.7.3\n" ] } ], "source": [ "from typing import List, Dict, Any, Literal\n", "from pydantic import BaseModel\n", "\n", "class VerificationResult(BaseModel):\n", " \"\"\"Verification result format\"\"\"\n", " is_accurate: bool\n", " explanation: str\n", " confidence: Literal[\"high\", \"medium\", \"low\"]\n", "\n", "def verify_answer(question: str, answer: LegalAnswer, \n", " cited_paragraphs: List[Dict[str, Any]]) -> VerificationResult:\n", " \"\"\"\n", " Verify if the answer is grounded in the cited paragraphs.\n", " \n", " Args:\n", " question: The user's question\n", " answer: The generated answer\n", " cited_paragraphs: Paragraphs cited in the answer\n", " \n", " Returns:\n", " Verification result with accuracy assessment, explanation, and confidence level\n", " \"\"\"\n", " print(\"\\n==== VERIFYING ANSWER ====\")\n", " \n", " # Prepare context with the cited paragraphs\n", " context = \"\"\n", " for paragraph in cited_paragraphs:\n", " display_id = paragraph.get(\"display_id\", str(paragraph[\"id\"]))\n", " context += f\"PARAGRAPH {display_id}:\\n{paragraph['text']}\\n\\n\"\n", " \n", " # Prepare system prompt\n", " system_prompt = \"\"\"You are a fact-checker for legal information.\n", "Your job is to verify if the provided answer:\n", "1. Is factually accurate according to the source paragraphs\n", "2. 
Uses citations correctly\n", "\n", "Be critical and look for any factual errors or unsupported claims.\n", "Assign a confidence level based on how directly the paragraphs answer the question:\n", "- high: The answer is comprehensive, accurate, and directly supported by the paragraphs\n", "- medium: The answer is mostly accurate but may be incomplete or have minor issues\n", "- low: The answer has significant gaps, inaccuracies, or is poorly supported by the paragraphs\n", "\"\"\"\n", " \n", " response = client.responses.parse(\n", " model=\"o4-mini\",\n", " input=[\n", " {\"role\": \"system\", \"content\": system_prompt},\n", " {\"role\": \"user\", \"content\": f\"\"\"\n", "QUESTION: {question}\n", "\n", "ANSWER TO VERIFY:\n", "{answer.answer}\n", "\n", "CITATIONS USED: {', '.join(answer.citations)}\n", "\n", "SOURCE PARAGRAPHS:\n", "{context}\n", "\n", "Is this answer accurate and properly supported by the source paragraphs?\n", "Assign a confidence level (high, medium, or low) based on completeness and accuracy.\n", " \"\"\"}\n", " ],\n", " text_format=VerificationResult\n", " )\n", " \n", " # Log and return the verification result\n", " print(f\"\\nAccuracy verification: {'PASSED' if response.output_parsed.is_accurate else 'FAILED'}\")\n", " print(f\"Confidence: {response.output_parsed.confidence}\")\n", " print(f\"Explanation: {response.output_parsed.explanation}\")\n", " \n", " return response.output_parsed\n", "\n", "# Verify the answer using only the cited paragraphs\n", "verification = verify_answer(question, answer, cited_paragraphs)\n", "\n", "# Display final result with verification\n", "print(\"\\n==== FINAL VERIFIED ANSWER ====\")\n", "print(f\"Verification: {'PASSED' if verification.is_accurate else 'FAILED'} | Confidence: {verification.confidence}\")\n", "print(\"\\nAnswer:\")\n", "print(answer.answer)\n", "print(\"\\nCitations:\")\n", "for citation in answer.citations:\n", " print(f\"- {citation}\")" ] }, { "cell_type": "markdown", "id": "1004942a", "metadata": {}, "source": [ "The verification step produces a clean, structured assessment that references specific regulations and methodically checks both the answer's accuracy and its proper use of citations. Rather than just saying \"correct,\" it offers useful context by explaining exactly why the answer was correct, giving you the confidence to then present the answer to the user with specific citations" ] }, { "cell_type": "markdown", "id": "29bc9113", "metadata": {}, "source": [ "## 4. Infrastructure Costs\n", "\n", "Let's break down the cost structure for this agentic RAG approach:\n", "\n", "### Estimated Fixed vs. 
Variable Costs\n", "\n", "* **Estimated Fixed (One-time) Costs:** \n", " * **Traditional RAG:** ~$0.43 (embedding + metadata generation)\n", " * **Agentic RAG:** $0.00 (zero preprocessing required)\n", "\n", "\n", "* **Estimated Variable (Per-Query) Costs:** \n", " * **Router Model (`gpt-4.1-mini`):** \n", " * Initial routing (20 chunks): ~$0.10 \n", " * Two recursive levels: ~$0.20\n", " * **Synthesis (`gpt-4.1`):** ~$0.05\n", " * **Verification (`o4-mini`):** ~$0.01\n", " * **Total per query:** ~$0.36\n", "\n", "While the per-query cost is higher than traditional RAG, this approach offers:\n", "- Immediate results on new documents\n", "- More precise citations\n", "- Better handling of paraphrases and conceptual questions\n", "- No infrastructure maintenance overhead\n", "\n", "The cost can be optimized through:\n", "- Caching results for common queries\n", "- Limiting max tokens in the model calls\n", "- Using a hybrid approach that pre-filters the document first\n", "\n", "## 5. Benefits and Tradeoffs versus Traditional RAG\n", "\n", "### Benefits\n", "- **Zero-ingest latency**: Answer questions from new documents immediately, with no preprocessing.\n", "- **Dynamic navigation**: Mimics human reading patterns by focusing on promising sections.\n", "- **Cross-section reasoning**: Model can find connections across document sections that might be missed by independent chunk retrieval, potentially increasing accuracy of generated answers and saving time on optimizing retrieval pipelines.\n", "\n", "### Tradeoffs\n", "- **Higher per-query cost**: Requires more computation for each question compared to embedding-based retrieval.\n", "- **Increased latency**: Hierarchical navigation takes longer to process than simple vector lookups.\n", "- **Limited scalability**: May struggle with extremely large document collections where preprocessing becomes more efficient.\n", "\n", "## 6. Future Steps\n", "\n", "There are a few modifications we can make to the approach taken:\n", "- **Generating a Knowledge Graph**: We can use the large context window of GPT 4.1-mini to iteratively generate a detailed knowledge graph, and then GPT 4.1 can traverse this graph to answer questions. This way we only need to \"ingest\" the document once, regardless of the question.\n", "- **Improved Scratchpad Tool**: The scratchpad tool could be given more choices, such as editing or deleting past memory. This would allow the model to choose whatever is most relevant to the question at hand.\n", "- **Adjust Depth**: We can adjust the depth of the hierarchical navigation to find the right balance between cost and performance. Certain use cases will require sentence-level citations (like legal documents), while others may only require paragraph-level citations (like news articles). \n", "\n", "## 7. Takeaways\n", "\n", "1. **Context Window is a Superpower:** Million-token context windows make it possible to navigate documents on the fly.\n", "2. **Hierarchical Approach Mimics Human Reading:** Agentic routing works like a human skimming a document for relevant sections.\n", "3. **Scratchpad Enables Multi-Step Reasoning:** Maintaining a reasoning record improves navigation quality.\n", "4. **Fast Implementation, No Database:** The entire system can be built with just API calls, no infrastructure needed.\n", "5. **Verification Improves Reliability:** The LLM-as-judge pattern catches errors before they reach users.\n", "\n", "================================================================================\n", "\n", "## 3B. 
Use Case: AI Co-Scientist for Pharma R&D\n", "![AI Co-Scientist for Pharma R&D](../../../images/3B_reasoning_task_card.png)\n", "\n", "This section details how to build an AI system that functions as a \"co-scientist\" to accelerate experimental design in pharmaceutical R&D, focusing on optimizing a drug synthesis process under specific constraints.\n", "\n", "## 🗂️ TL;DR Matrix\n", "\n", "This table summarizes the core technology choices and their rationale for this specific AI Co-Scientist implementation.\n", "\n", "| Layer | Choice | Utility |\n", "| :----------------- | :------------------------------------------------------------------------ | :------------------------------------------------------------------------------------------------------- |\n", "| **Ideation** | `o4-mini` (Parallel Role-Playing Agents) | Generates diverse hypotheses & protocols rapidly and cost-effectively; role-playing enhances creativity. |\n", "| **Grounding** | External Tool Calls (`chem_lookup`, `cost_estimator`, `outcome_db`, etc.) | Ensures plans are based on real-world data (chemical properties, costs, past results). |\n", "| **Ranking** | `o4-mini` (Pairwise Tournament Comparison) | Nuanced evaluation beyond simple scoring; selects promising candidates efficiently. |\n", "| **Critique/Synth** | `o3` (Deep Review & Synthesis) | Provides rigorous, senior-level analysis, identifies risks, and ensures scientific validity. |\n", "| **Safety (Opt.)** | `gpt-4.1-mini` (Targeted Check) | Adds an extra layer of specialized safety review before human handoff. |\n", "| **Learning** | `o3` + Code Interpreter (Result Analysis → DB) | Captures experimental outcomes systematically, enabling continuous improvement over time. |\n", "| **Core Technique** | Multi-Agent Collaboration & Escalation | Leverages strengths of different models (speed vs. depth) for a complex, multi-step reasoning task. |\n", "\n", "*Note: Model identifiers accurate as of April 2025, subject to change.*\n", "\n", "## 1. Scenario Snapshot\n", "\n", "* **Problem Space:** Optimizing complex experimental procedures in pharmaceutical R&D, such as improving the synthesis yield of a new drug compound (\"XYZ-13\") while adhering to strict constraints.\n", "* **Users:** Research scientists and lab technicians involved in drug discovery and development.\n", "* **Typical Asks:**\n", " 1. Suggest 3 distinct protocols to increase XYZ-13 yield by ≥15% by testing different catalysts, staying under $15k using approved reagents.\n", " 2. Propose protocols to optimize XYZ-13 yield below 60°C (due to past heat issues), exploring different approved solvents within budget.\n", " 3. Design two XYZ-13 yield strategies (aiming for ≥15%): a. one maximizing potential yield within the \\$15k budget, b. one prioritizing cost under \\$10k.\n", "* **Constraints:**\n", " * **Budgetary:** Operate within defined financial limits (e.g., $15,000 per experiment series).\n", " * **Regulatory/Safety:** Use only pre-approved chemicals/reagents and adhere rigorously to safety protocols.\n", " * **Human Oversight:** Final experimental plans must be reviewed and validated by a human expert before execution.\n", "\n", "> Traditionally, optimizing such experiments involves weeks of manual planning, literature review, iterative benchwork, and analysis. This AI Co-Scientist approach aims to dramatically reduce the cycle time by automating hypothesis generation, protocol design, and preliminary evaluation, enabling scientists to focus on higher-level strategy and final validation. 
It shifts the scientist's role from manual execution of planning steps to expert oversight and collaboration with the AI.\n", "\n", "\n", "## 2. Architecture (Multi-Agent Reasoning)\n", "\n", "The system employs a multi-agent architecture that emulates a high-performing scientific team. Different AI components, acting in specialized roles (such as ideation, critique, and learning from outcomes), collaborate using various models and tools to execute the workflow.\n", "\n", "![AI Co-Scientist Architecture](../../../images/3B_coscientist_architecture.png)\n", "\n", "### 2.1. **Scientist Input & Constraints:** \n", "The process starts with the scientist defining the goal, target compound, and constraints." ] }, { "cell_type": "code", "execution_count": 11, "id": "abbeddb3", "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "from agent_utils import Context, call_openai, log_json\n", "\n", "# Example Initial Input\n", "user_input = {\n", " \"compound\": \"XYZ-13\",\n", " \"goal\": \"Improve synthesis yield by 15%\",\n", " \"budget\": 15000,\n", " \"time_h\": 48,\n", " \"previous\": \"Prior attempts failed at high temp; explore potential catalyst effects.\"\n", "}\n", "ctx = Context(client=OpenAI(), **user_input)" ] }, { "cell_type": "markdown", "id": "e791f29f", "metadata": {}, "source": [ "### 2.2. **Ideation (`o4-mini` + Tools):** \n", "Multiple `o4-mini` instances, prompted with different roles (e.g., `Hypothesis Agent`, `Protocol Agent`, `Resource Agent`), generate experimental plans in parallel. Assigning distinct personas encourages diverse perspectives and covers different aspects of the problem simultaneously during the ideation phase." ] }, { "cell_type": "code", "execution_count": 12, "id": "3f06fe8c", "metadata": {}, "outputs": [], "source": [ "ROLE_FOCUS = {\n", " # Hypothesis Agent Prompt\n", " \"hypothesis_agent\": \"\"\"You are a pharmaceutical hypothesis specialist. \n", " Focus exclusively on analyzing the compound structure and research goals to generate testable hypotheses. \n", " Consider mechanism of action, binding affinity predictions, and potential off-target effects.\"\"\",\n", "\n", " # Protocol Agent Prompt\n", " \"protocol_agent\" : \"\"\"You are a laboratory protocol specialist. \n", " Design experimental procedures that will effectively test the provided hypothesis. \n", " Focus on experimental conditions, controls, and measurement techniques.\"\"\",\n", "\n", " # Resource Agent Prompt\n", " \"resource_agent\" : \"\"\"You are a laboratory resource optimization specialist. \n", " Review the proposed protocol and optimize for efficiency. \n", " Identify opportunities to reduce reagent use, equipment time, and overall costs while maintaining scientific validity.\"\"\",\n", "}\n", "\n", "# Create a structured prompt template for ideation\n", "IDEATION_PROMPT = \"\"\"You are a pharmaceutical {role} specialist. 
Your goal is to {goal} for compound {compound}.\n", "Constraints:\n", "- Budget: ${budget}\n", "- Approved reagents only\n", "- Complete within {time_h} hours\n", "- Previous attempts: {previous}\n", "Respond with structured JSON describing your protocol.\"\"\"" ] }, { "cell_type": "code", "execution_count": 13, "id": "fcf9f5ef", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Run‑id 9835f69c Compound: XYZ-13\n", "Logs will be stored in: logs/9835f69c\n" ] } ], "source": [ "import json, logging\n", "from pathlib import Path\n", "from typing import Dict, List, Any, Optional\n", "from dataclasses import asdict\n", "from functools import partial\n", "\n", "MODEL_IDEATE = \"o4-mini-2025-04-16\" # o4-mini model for ideation - balances speed and quality\n", "\n", "# Configure logging to help with tracking experiment progress and debugging\n", "logging.basicConfig(level=logging.INFO, format=\"%(message)s\")\n", "logging.info(f\"Run‑id {ctx.run_id} Compound: {ctx.compound}\")\n", "logging.info(f\"Logs will be stored in: {Path('logs') / ctx.run_id}\")\n", "\n", "def ideation(ctx: Context):\n", " logging.info(\"Starting ideation phase...\")\n", " ideas = []\n", " for role, focus in ROLE_FOCUS.items():\n", " logging.info(f\"Running ideation agent ${role}\")\n", " sys = IDEATION_PROMPT.format(role=role, focus=focus, **ctx.prompt_vars())\n", " usr = f\"Design a protocol to {ctx.goal} within ${ctx.budget}.\"\n", " idea = call_openai(ctx.client, MODEL_IDEATE, sys, usr, ctx)\n", " ideas.append(idea)\n", " log_json(\"ideation_done\", ideas, ctx)\n", " return ideas" ] }, { "cell_type": "markdown", "id": "0384e0d5", "metadata": {}, "source": [ "The ideation agents can utilize external tools such as `literature_search`, `chem_lookup` (chemical database), `cost_estimator`, `outcome_db` (outcome of previous experiments) to ground their suggestions in data. Explicitly enabling and prompting models to use external tools ensures that generated plans are feasible, compliant, and informed by existing knowledge. The model decides when and which tool to call based on the task." 
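, "\n", "\n", "As a rough illustration, a tool such as `chem_lookup` might be exposed to the model through a function-calling definition like the sketch below. This is a hypothetical schema (the variable and field names here are illustrative); the definitions actually used in this notebook live in `agent_utils.py`, with mocked implementations in `tools.py`.\n", "\n", "```python\n", "# Hypothetical sketch of a `chem_lookup` tool definition; not the notebook's actual schema.\n", "chem_lookup_tool = {\n", "    'type': 'function',\n", "    'function': {\n", "        'name': 'chem_lookup',\n", "        'description': 'Look up properties, hazards, and approval status for a reagent.',\n", "        'parameters': {\n", "            'type': 'object',\n", "            'properties': {\n", "                'chemical_name': {'type': 'string', 'description': 'Reagent to look up.'},\n", "                'property': {'type': 'string', 'description': 'Optional specific property to return.'},\n", "            },\n", "            'required': ['chemical_name'],\n", "        },\n", "    },\n", "}\n", "\n", "# The ideation call can then pass tools=[chem_lookup_tool, ...] so the model\n", "# decides when to invoke it, with the tool results fed back on the next turn.\n", "```"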
] }, { "cell_type": "code", "execution_count": 14, "id": "a8f365d8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Starting ideation phase...\n", "Running ideation agent $hypothesis_agent\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) List available chemicals\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Outcome DB: XYZ-13, yield, 5\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Cost estimator: [{'name': 'Palladium chloride', 'amount': 0.05, 'unit': 'g'}, {'name': 'Triphenylphosphine', 'amount': 0.1, 'unit': 'g'}, {'name': 'Potassium carbonate', 'amount': 1, 'unit': 'g'}, {'name': 'Dimethylformamide', 'amount': 50, 'unit': 'mL'}, {'name': 'Toluene', 'amount': 50, 'unit': 'mL'}, {'name': 'Sodium borohydride', 'amount': 0.1, 'unit': 'g'}, {'name': 'Triethylamine', 'amount': 0.5, 'unit': 'mL'}], ['round-bottom flask', 'magnetic stirrer', 'reflux condenser'], 36\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Running ideation agent $protocol_agent\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Outcome DB: XYZ-13, yield, 5\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) List available chemicals\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Literature search: XYZ-13 synthesis palladium triphenylphosphine ligand yield improvement, None, 3\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Cost estimator: [{'name': 'Palladium acetate', 'amount': 0.05, 'unit': 'g'}, {'name': 'Triphenylphosphine', 'amount': 0.1, 'unit': 'g'}, {'name': 'Potassium carbonate', 'amount': 2, 'unit': 'g'}, {'name': 'Triethylamine', 'amount': 2, 'unit': 'mL'}, {'name': 'Dimethylformamide', 'amount': 100, 'unit': 'mL'}], ['Magnetic stirrer', 'Oil bath', 'Inert gas setup'], 48\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Running ideation agent $resource_agent\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Outcome DB: XYZ-13, yield, 5\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) List available chemicals\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Cost estimator: [{'name': 'Palladium acetate', 'amount': 0.05, 'unit': 'g'}, {'name': 'Triphenylphosphine', 'amount': 0.1, 'unit': 'g'}, {'name': 'Potassium carbonate', 'amount': 1, 'unit': 'g'}, {'name': 'Dimethylformamide', 'amount': 5, 'unit': 'mL'}, {'name': 'Triethylamine', 'amount': 2, 'unit': 'mL'}], ['Round-bottom flask', 'Reflux condenser', 'Heating mantle', 'Magnetic stirrer'], 36\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Chemical lookup: Sodium borohydride, None\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Ideation complete!\n" ] } ], "source": [ "IDEATION_PROMPT += \"\"\"\\nUse the following tools as appropriate:\n", "- Use the `list_available_chemicals` tool to get list of approved reagents.\n", "- Use the `chem_lookup` tool to verify properties of reagents mentioned.\n", "- Use the `cost_estimator` tool to calculate the 
approximate cost based on reagents and proposed steps.\n", "- Check the `outcome_db` for relevant prior experiments with {compound}\"\"\"\n", "\n", "ideas = ideation(ctx)\n", "logging.info(\"Ideation complete!\")" ] }, { "cell_type": "markdown", "id": "6f507348", "metadata": {}, "source": [ "These tools are defined in `agent_utils.py`. For purposes of this solution, the tool calls are mocked in `tools.py`. In a real use case, these tools would call real APIs.\n", "\n", "\n", "### 2.3. **Tournament Ranking (`o4-mini` / `o3`):** \n", "Generated protocols are compared pairwise based on criteria like expected effectiveness, feasibility, cost, and novelty. Instead of asking a model to score protocols in isolation, providing two protocols at a time and asking for a direct comparison against specific criteria often yields more reliable relative rankings.\n", "\n", "This Elo-style ranking identifies the most promising candidates for deeper review." ] }, { "cell_type": "code", "execution_count": 15, "id": "f85fe4b7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Starting tournament phase...\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Tournament winner picked!\n" ] } ], "source": [ "TOURNAMENT_PROMPT = \"\"\"\n", "Protocol A: [details...]\n", "Protocol B: [details...]\n", "\n", "Compare Protocol A and Protocol B for synthesizing {compound} aimed at {goal}. Score them on:\n", "1. Likelihood of achieving ≥ 15% yield increase.\n", "2. Practical feasibility (reagents, time).\n", "3. Estimated cost-efficiency (use tool if needed).\n", "4. Scientific novelty/risk.\n", "\n", "Return JSON {{\\\"winner\\\": \\\"A\\\"|\\\"B\\\", \\\"justification\\\": \\\"...\\\"}}.\"\"\"\n", "\n", "# This is a mock tourname implementation that only compares the first two protocols\n", "# A real implementation would compare pairs in a tournament bracket style\n", "def tournament(protocols: List[Dict[str, Any]], ctx: Context):\n", " logging.info(\"Starting tournament phase...\")\n", " if len(protocols) == 1:\n", " return protocols[:1]\n", " a, b = protocols[0], protocols[1]\n", " sys = TOURNAMENT_PROMPT.format(**ctx.prompt_vars())\n", " usr = json.dumps({\"A\": a, \"B\": b}, indent=2)\n", " res = call_openai(ctx.client, MODEL_IDEATE, sys, usr, ctx)\n", " winner = a if res.get(\"winner\", \"A\").upper() == \"A\" else b\n", " log_json(\"tournament\", res, ctx)\n", " return [winner]\n", "\n", "top_proto = tournament(ideas, ctx)[0]\n", "logging.info(\"Tournament winner picked!\")" ] }, { "cell_type": "markdown", "id": "41ad4731", "metadata": {}, "source": [ "> In early experiments, we found that asking models to score protocols on a 1-10 scale led to inconsistent results with score compression. The tournament approach solved this by forcing relative judgments that proved more reliable. This mirrors human expert behavior — scientists often find it easier to compare two options directly than to assign absolute scores.\n", "\n", "### 2.4. **Deep Critique & Synthesis (`o3`):** \n", "The top-ranked protocols are passed to `o3` for rigorous review. `o3` acts like a senior scientist, assessing scientific validity, methodology, safety, budget compliance, and suggesting improvements or synthesizing a final, refined protocol. It may also call tools for verification." 
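, "\n", "\n", "Before the critique step, it is worth noting that the mock `tournament` above compares only the first two protocols. A fuller bracket could run a simple round-robin over all pairs and keep win counts, roughly as in the hypothetical sketch below (it reuses `TOURNAMENT_PROMPT`, `call_openai`, and `MODEL_IDEATE` from the earlier cells and is not the notebook's implementation).\n", "\n", "```python\n", "from itertools import combinations\n", "\n", "def round_robin_tournament(protocols, ctx):\n", "    # Hypothetical extension: compare every pair once and return the protocol\n", "    # with the most pairwise wins.\n", "    wins = {i: 0 for i in range(len(protocols))}\n", "    for i, j in combinations(range(len(protocols)), 2):\n", "        sys = TOURNAMENT_PROMPT.format(**ctx.prompt_vars())\n", "        usr = json.dumps({'A': protocols[i], 'B': protocols[j]}, indent=2)\n", "        res = call_openai(ctx.client, MODEL_IDEATE, sys, usr, ctx)\n", "        winner = i if res.get('winner', 'A').upper() == 'A' else j\n", "        wins[winner] += 1\n", "    best = max(wins, key=wins.get)\n", "    return [protocols[best]]\n", "```"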
] }, { "cell_type": "code", "execution_count": 16, "id": "634ef4e2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Starting critique phase...\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Cost estimator: [{'name': 'Palladium chloride', 'amount': 0.0045, 'unit': 'g'}, {'name': 'Triphenylphosphine', 'amount': 0.013, 'unit': 'g'}, {'name': 'Sodium borohydride', 'amount': 0.0038, 'unit': 'g'}, {'name': 'Potassium carbonate', 'amount': 0.14, 'unit': 'g'}, {'name': 'Triethylamine', 'amount': 0.07, 'unit': 'mL'}, {'name': 'Dimethylformamide', 'amount': 2, 'unit': 'mL'}, {'name': 'Toluene', 'amount': 5, 'unit': 'mL'}], ['100 mL round-bottom flask', 'magnetic stirrer', 'reflux condenser', 'inert gas line'], 24\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Outcome DB: XYZ-13, None, 5\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Deep critique completed!\n" ] } ], "source": [ "# Deep critique phase using a more powerful model for rigorous review\n", "CRITIQUE_PROMPT = \"\"\"You are a senior researcher reviewing a proposed synthesis protocol \n", "for {compound} aiming for {goal}, budget ${budget} using approved reagents. Review the protocol below rigorously:\n", "1. Identify scientific flaws or methodological weaknesses.\n", "2. Assess safety risks and budget compliance (use `cost_estimator` tool if needed).\n", "3. Check for consistency with prior `outcome_db` results if relevant.\n", "4. Suggest concrete improvements or rewrite sections if necessary.\n", "5. Provide a final go/no-go recommendation.\n", "\n", "Return JSON {{\\\"revised_protocol\\\": ..., \\\"critique\\\": \\\"...\\\", \\\"recommendation\\\": \\\"go|no-go\\\"}}.\n", "\n", "Protocol to Review:\n", "[Protocol details...]\n", "\"\"\"\n", "\n", "MODEL_CRITIQUE = \"o3-2025-04-16\" # o3 model for deep critique\n", "\n", "def critique(protocol: Dict[str, Any], ctx: Context):\n", " logging.info(\"Starting critique phase...\")\n", " sys = CRITIQUE_PROMPT.format(**ctx.prompt_vars())\n", " usr = json.dumps(protocol, indent=2)\n", " crit = call_openai(ctx.client, MODEL_CRITIQUE, sys, usr, ctx)\n", " log_json(\"critique\", crit, ctx)\n", " return crit.get(\"revised_protocol\", protocol)\n", "\n", "critiqued = critique(top_proto, ctx)\n", "logging.info(\"Deep critique completed!\")" ] }, { "cell_type": "markdown", "id": "1fbd87a7", "metadata": {}, "source": [ "> We deliberately separate ideation from critique using different models and personas. Having the same model both generate and critique its own work often leads to self-justification rather than objective assessment. The o3 model, acting as a \"senior scientist,\" consistently identified methodological weaknesses that o4-mini missed during ideation.\n", "\n", "### 2.5. **(Optional) Safety Check:** \n", "A specialized model, such as `gpt-4.1-mini`, can perform a final check for specific safety concerns (e.g., hazardous reagent combos)." 
] }, { "cell_type": "code", "execution_count": 17, "id": "cc4405e4", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Starting safety assessment...\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Chemical lookup: Palladium chloride, None\n", "(Tool) Chemical lookup: Triphenylphosphine, None\n", "(Tool) Chemical lookup: Sodium borohydride, None\n", "(Tool) Chemical lookup: Potassium carbonate, None\n", "(Tool) Chemical lookup: Dimethylformamide, None\n", "(Tool) Chemical lookup: Toluene, None\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Safety check completed!\n" ] } ], "source": [ "# Optional safety check using a targeted model\n", "SAFETY_PROMPT = \"\"\"You are a lab‑safety specialist. \n", "Identify hazards, unsafe conditions, or compliance issues in this protocol for {compound}. \n", "Use `chem_lookup` tool if needed. Return JSON assessment.\"\"\"\n", "\n", "MODEL_SAFETY = \"gpt-4.1-mini-2025-04-14\" # gpt-4.1-mini model for safety checks - optimized for instruction following\n", "\n", "def safety(protocol: Dict[str, Any], ctx: Context):\n", " logging.info(\"Starting safety assessment...\")\n", " sys = SAFETY_PROMPT.format(**ctx.prompt_vars())\n", " usr = json.dumps(protocol, indent=2)\n", " assessment = call_openai(ctx.client, MODEL_SAFETY, sys, usr, ctx)\n", " log_json(\"safety\", assessment, ctx)\n", " return {\"protocol\": protocol, \"safety\": assessment}\n", "\n", "secured = safety(critiqued, ctx)\n", "logging.info(\"Safety check completed!\")" ] }, { "cell_type": "markdown", "id": "9dd93396", "metadata": {}, "source": [ "### 2.6. **Human Review:** \n", "The AI-generated final plan is presented to the human scientist via an interface for validation, potential edits, and final approval." 
] }, { "cell_type": "code", "execution_count": 18, "id": "e2d47339", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Awaiting human review...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "=== PROTOCOL FOR REVIEW: XYZ-13 - Improve synthesis yield by 15% ===\n", "DETAILS: {\n", " \"protocol_title\": \"Optimised In-Situ Pd(0)/PPh3 Coupling for XYZ-13 \\u2013 Target \\u2265 72 % Yield\",\n", " \"key_changes_vs_original\": [\n", " \"Catalyst loading reduced from 5 mol % to 2 mol % Pd to cut cost and metal contamination without loss of activity.\",\n", " \"Reaction run at 0.10 M substrate concentration (12 mL solvent total) instead of 50 mL; higher effective collision frequency boosts conversion and reduces waste.\",\n", " \"Single solvent system (toluene/DMF 4:1) avoids phase separation and simplifies work-up.\",\n", " \"Redundant triethylamine removed; K2CO3 (2.5 eq) provides sufficient basicity.\",\n", " \"Reaction temperature raised slightly to 80 \\u00b0C (still below side-reaction threshold found in exp-001) and time shortened to 24 h with in-process HPLC check at 6 h intervals.\",\n", " \"Work-up switched from large silica column to two-step: (a) aqueous EDTA wash to strip Pd, (b) recrystallisation from EtOAc/hexane \\u2013 typically 5\\u20138 % higher isolated yield on this substrate.\"\n", " ],\n", " \"objective\": \"Isolated yield \\u2265 72 % within 24 h, total direct cost \\u2264 US $5 000.\",\n", " \"scale\": \"0.5 mmol XYZ-13 (170 mg, assume MW \\u2248 340).\",\n", " \"reagents\": [\n", " {\n", " \"name\": \"Palladium chloride\",\n", " \"amount\": 0.02,\n", " \"unit\": \"g\",\n", " \"role\": \"precatalyst (2 mol %)\"\n", " },\n", " {\n", " \"name\": \"Triphenylphosphine\",\n", " \"amount\": 0.041,\n", " \"unit\": \"g\",\n", " \"role\": \"ligand (2 eq vs Pd)\"\n", " },\n", " {\n", " \"name\": \"Sodium borohydride\",\n", " \"amount\": 0.02,\n", " \"unit\": \"g\",\n", " \"role\": \"Pd(II)\\u2192Pd(0) reducer\"\n", " },\n", " {\n", " \"name\": \"Potassium carbonate\",\n", " \"amount\": 0.345,\n", " \"unit\": \"g\",\n", " \"role\": \"base (2.5 eq)\"\n", " },\n", " {\n", " \"name\": \"Dimethylformamide\",\n", " \"amount\": 2.0,\n", " \"unit\": \"mL\",\n", " \"role\": \"co-solvent (20 %)\"\n", " },\n", " {\n", " \"name\": \"Toluene\",\n", " \"amount\": 10.0,\n", " \"unit\": \"mL\",\n", " \"role\": \"primary solvent (80 %)\"\n", " }\n", " ],\n", " \"equipment\": [\n", " \"50 mL round-bottom flask\",\n", " \"magnetic stirrer\",\n", " \"reflux condenser\",\n", " \"argon line\"\n", " ],\n", " \"reaction_conditions\": {\n", " \"atmosphere\": \"Ar\",\n", " \"temperature\": \"80 \\u00b0C (oil bath)\",\n", " \"duration\": \"24 h\",\n", " \"stirring\": \"600 rpm\"\n", " },\n", " \"procedure\": [\n", " \"1. Charge dry 50 mL flask with PdCl2 (20 mg) and PPh3 (41 mg) under Ar. Add DMF (2 mL) and stir 5 min.\",\n", " \"2. Add NaBH4 (20 mg) portion-wise over 3 min; colour turns dark brown.\",\n", " \"3. Add XYZ-13 (170 mg, 0.50 mmol) and K2CO3 (345 mg). Add toluene (10 mL). Fit condenser.\",\n", " \"4. Heat to 80 \\u00b0C for 24 h. Take 0.1 mL aliquots at 6, 12, 18 h; quench in NH4Cl and analyse by HPLC to confirm \\u2265 95 % conversion.\",\n", " \"5. Cool to RT, add 10 mL 0.05 M EDTA (aq) and stir 5 min to complex Pd. Separate layers, extract aqueous twice with 5 mL toluene.\",\n", " \"6. Combine organic layers, wash with brine, dry (Na2SO4), filter, concentrate in vacuo.\",\n", " \"7. 
Recrystallise residue from 4:1 hexane/EtOAc (15 mL) to afford XYZ-13 as off-white solid. Record mass, calculate yield, check purity by HPLC.\"\n", " ],\n", " \"expected_outcome\": {\n", " \"projected_yield\": \"72\\u201378 %\",\n", " \"purity\": \"\\u2265 97 % (HPLC)\"\n", " },\n", " \"safety_and_waste\": [\n", " \"NaBH4 generates H2; add slowly behind blast shield.\",\n", " \"DMF and toluene are toxic/flammable \\u2013 use fume hood.\",\n", " \"EDTA washwater and Pd residues collected for heavy-metal disposal.\",\n", " \"Standard PPE (lab coat, gloves, goggles).\"\n", " ],\n", " \"cost_estimate_USD\": {\n", " \"reagents\": 1120,\n", " \"equipment_amortisation\": 150,\n", " \"labor (24 h @ $75/h)\": 1800,\n", " \"total\": 3070\n", " }\n", "}\n", "SAFETY: {\n", " \"hazards\": [\n", " {\n", " \"chemical\": \"Sodium borohydride\",\n", " \"hazard\": \"Flammable, water-reactive\",\n", " \"unsafe_condition\": \"Adding NaBH4 portion-wise generates hydrogen gas (H2) which is explosive; requires slow addition behind blast shield and in well-ventilated fume hood.\"\n", " },\n", " {\n", " \"chemical\": \"Dimethylformamide\",\n", " \"hazard\": \"Reproductive toxin, flammable\",\n", " \"compliance\": \"Use only in fume hood with appropriate PPE to avoid inhalation exposure; handle with care due to reproductive toxicity.\"\n", " },\n", " {\n", " \"chemical\": \"Toluene\",\n", " \"hazard\": \"Flammable, CNS depressant\",\n", " \"compliance\": \"Use in fume hood and avoid ignition sources; ensure proper ventilation to minimize exposure.\"\n", " },\n", " {\n", " \"chemical\": \"Palladium chloride\",\n", " \"hazard\": \"Irritant, potential carcinogen\",\n", " \"compliance\": \"Minimize exposure; use gloves and handle in fume hood. Collect and dispose of Pd-containing waste as hazardous heavy metal waste.\"\n", " },\n", " {\n", " \"chemical\": \"Potassium carbonate\",\n", " \"hazard\": \"Irritant\",\n", " \"compliance\": \"Use gloves to prevent skin irritation.\"\n", " },\n", " {\n", " \"chemical\": \"Triphenylphosphine\",\n", " \"hazard\": \"Irritant\",\n", " \"compliance\": \"Use gloves and avoid inhalation of dust.\"\n", " }\n", " ],\n", " \"unsafe_conditions\": [\n", " {\n", " \"condition\": \"Reaction temperature at 80 \\u00b0C with flammable solvents (toluene, DMF)\",\n", " \"recommendation\": \"Ensure all heating apparatus is explosion-proof; maintain constant stirring to avoid hot spots.\"\n", " },\n", " {\n", " \"condition\": \"Use of Argon atmosphere\",\n", " \"recommendation\": \"Ensure proper inert gas handling to prevent oxygen contamination; adequate ventilation to prevent asphyxiation risk.\"\n", " }\n", " ],\n", " \"compliance_issues\": [\n", " {\n", " \"issue\": \"Hydrogen gas evolution during NaBH4 addition\",\n", " \"recommendation\": \"Add NaBH4 slowly behind blast shield, wear full PPE including face shield, and perform operation in a well-ventilated fume hood.\"\n", " },\n", " {\n", " \"issue\": \"Heavy metal waste handling\",\n", " \"recommendation\": \"Collect EDTA wash water and palladium residues separately and dispose as hazardous heavy metal waste in compliance with local regulations.\"\n", " },\n", " {\n", " \"issue\": \"PPE not explicitly stating face shield\",\n", " \"recommendation\": \"Recommend including face shield during NaBH4 addition step for splash and blast protection.\"\n", " }\n", " ],\n", " \"general_comments\": [\n", " \"The protocol includes appropriate solvent proportions and reaction scale to reduce waste and cost.\",\n", " \"The use of EDTA wash for palladium 
removal and dual solvent recrystallization is a safer, more efficient approach than large silica columns.\",\n", " \"The procedural timing with intermittent HPLC monitoring is good practice to avoid over-reaction and side products.\",\n", " \"Standard lab safety practices are advised including lab coat, gloves, and goggles; upgrading to include face shield for hazardous steps is recommended.\",\n", " \"No major equipment safety issues identified with specified items. Ensure all glassware is rated for heating and inert atmosphere.\"\n", " ]\n", "}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Protocol approved\n" ] } ], "source": [ "def human_review(safety_package: Dict[str, Any], ctx: Context):\n", " logging.info(\"Awaiting human review...\")\n", " protocol = safety_package[\"protocol\"]\n", " safety_assessment = safety_package[\"safety\"]\n", " \n", " print(f\"\\n=== PROTOCOL FOR REVIEW: {ctx.compound} - {ctx.goal} ===\")\n", " print(f\"DETAILS: {json.dumps(protocol, indent=2)}\")\n", " print(f\"SAFETY: {json.dumps(safety_assessment, indent=2)}\")\n", " \n", " while True:\n", " approval = input(\"\\nApprove for execution? (yes/no): \").lower()\n", " if approval in ['yes', 'y', 'no', 'n']:\n", " approved = approval in ['yes', 'y']\n", " logging.info(f\"Protocol {'approved' if approved else 'rejected'}\")\n", " return {\"protocol\": protocol, \"approved\": approved}\n", " print(\"Please enter 'yes' or 'no'\")\n", "\n", "human_decision = human_review(secured, ctx)" ] }, { "cell_type": "markdown", "id": "e51e598b", "metadata": {}, "source": [ "### 2.7. **Execution & Learning (`o3` + Code Interpreter):** \n", "Once the human approves, the plan is sent for lab execution. After lab execution, results are fed back into the system. `o3` combined with the `Code Interpreter` analyzes the data, generates insights, and stores structured outcomes (protocol, parameters, results, insights) in a database (`Outcome DB`). This database informs future ideation cycles, creating a learning loop." ] }, { "cell_type": "code", "execution_count": 19, "id": "3894d1b3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Starting mock execution and analysis...\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Literature search: Pd(0) PPh3 coupling yield optimization EDTA work-up recrystallization losses, None, 3\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "(Tool) Outcome DB: XYZ-13, yield, 5\n", "HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", "Analysis complete\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "🎉 Completed. Summary written to output/9835f69c_summary.json\n" ] } ], "source": [ "# Simulating execution and analyzing results\n", "ANALYSIS_PROMPT = \"\"\"You are a data analyst. \n", "Did the experiment achieve {goal}? 
Analyse factors, suggest improvements, and return structured JSON.\n", "\"\"\"\n", "\n", "def execute_and_analyse(pkt: Dict[str, Any], ctx: Context):\n", " logging.info(\"Starting mock execution and analysis...\")\n", " # These are mock results for a lab experiment\n", " mock_results = {\n", " \"yield_improvement\": 12.5,\n", " \"success\": False,\n", " \"actual_cost\": ctx.budget * 0.85,\n", " \"notes\": \"Mock execution\"\n", " }\n", " sys = ANALYSIS_PROMPT.format(**ctx.prompt_vars())\n", " usr = json.dumps({\"protocol\": pkt, \"results\": mock_results}, indent=2)\n", " analysis = call_openai(ctx.client, MODEL_CRITIQUE, sys, usr, ctx)\n", " log_json(\"analysis\", analysis, ctx)\n", " return analysis\n", "\n", "# Only proceed to execution if approved by the human reviewer\n", "if human_decision[\"approved\"]:\n", " summary = execute_and_analyse(human_decision, ctx)\n", " logging.info(\"Analysis complete\")\n", "else:\n", " logging.info(\"Protocol rejected by human reviewer - execution skipped\")\n", " summary = None\n", "\n", "Path(\"output\").mkdir(exist_ok=True)\n", "out_path = Path(\"output\") / f\"{ctx.run_id}_summary.json\"\n", "out_path.write_text(json.dumps(summary, indent=2))\n", "print(f\"\\n🎉 Completed. Summary written to {out_path}\")" ] }, { "cell_type": "markdown", "id": "2f4ecb9f", "metadata": {}, "source": [ "## 3. Model Playbook\n", "\n", "Choosing between `o4-mini` and `o3` depends on the task's complexity and required depth. For other tasks, `gpt-4.1-mini` provides a balance between cost and performance, with the more powerful `gpt-4.1` recommended when greater capability or nuance is needed.\n", "\n", "| Task | Start With | Upgrade When... | Escalate To | Rationale |\n", "| :----------------- | :------------- | :--------------------------------------------------------- | :----------- | :------------------------------------------------------------------------------------------- |\n", "| Ideation & Protocol Generation | `o4-mini` | Hypotheses lack depth or creativity needed for complex chemical synthesis. | `o3` | `o4-mini` rapidly generates diverse protocols cost-effectively. `o3` provides deeper scientific reasoning when more nuanced approaches are required. |\n", "| Protocol Ranking | `o4-mini` | Comparison requires deeper scientific assessment or multi-factor trade-offs. | `o3` | Tournament-style ranking with `o4-mini` efficiently identifies promising candidates. Escalate when subtle questions of scientific validity need evaluation. |\n", "| Deep Critique & Synthesis | `o3` | N/A - Already using the most capable model for this critical task. | N/A | `o3` excels at rigorous scientific review, identifying methodological flaws, and synthesizing improvements across complex protocols. This task inherently requires deep reasoning. |\n", "| Safety Assessment | `gpt-4.1-mini` | Domain-specific hazards require higher accuracy or specialized knowledge. | `gpt-4.1` | `gpt-4.1-mini` offers a good balance of cost and performance for standard safety checks. Escalate to `gpt-4.1` when higher accuracy or more nuanced reasoning is needed for complex safety risks. |\n", "\n", "**Key Insight:**\n", "> This use case exemplifies a powerful pattern: using faster, cheaper models (`o4-mini`) for breadth and initial filtering, then escalating to more powerful models (`o3`) for depth, critical review, and synthesis. This layered approach optimizes for both creativity/speed and rigor/accuracy, while managing computational costs effectively. 
The integration with tools is essential for grounding the AI's reasoning in verifiable, real-world data.\n", "\n", "## 4. Deployment Notes\n", "\n", "Transitioning the AI Co-Scientist from prototype to lab use involves careful planning.\n", "\n", "* **Cost Control:**\n", " * Implement configurable \"modes\" (such as `Fast`, `Standard`, `Thorough`) that adjust the number of `o4-mini` ideation agents, the depth of `o3` critique, or the use of optional checks to balance result quality with cost and latency.\n", " * Track token usage per stage (ideation, ranking, critique) and per tool call for fine-grained cost monitoring.\n", "* **Observability:**\n", " * Log inputs, outputs, model choices, tool calls/responses, latencies, and token counts for each step.\n", " * Monitor the performance of the tournament ranking and the impact of `o3` critiques (such as how often plans are significantly altered or rejected).\n", " * Track user interactions: which plans are approved, edited, or rejected by the human scientist.\n", "* **Safety & Compliance:**\n", " * Implement multiple safety layers: constraints in prompts, tool-based checks (such as reagent compatibility via `chem_lookup`), optional dedicated model checks (`gpt-4.1-mini`), automated filters (such as for known hazardous combinations), and mandatory human review.\n", " * Ensure tool endpoints (such as internal databases) meet security requirements.\n", "* **Rollout Strategy:** \n", " * Begin with retrospective analysis of past experiments, then move to shadow mode (AI suggests plans alongside human planners), followed by limited live use cases with close monitoring before broader adoption.\n", "\n", "\n", "## 5. Takeaways\n", "\n", "1. **Model pairing creates synergy**: `o4-mini` covers more ground quickly; `o3` brings precision and depth.\n", "2. **Tool integration grounds reasoning in reality**: Real-world data such as chemical costs and safety constraints inform decision-making.\n", "3. **Human scientists remain central**: The system empowers experts by removing grunt work—not by replacing them.\n", "\n", "\n", "## 6. Useful Cookbooks & Resources\n", "\n", "Here are select resources that complement the design and implementation of the AI Co-Scientist system:\n", "\n", "- **[Orchestrating Agents: Routines and Handoffs](https://cookbook.openai.com/examples/orchestrating_agents)** Structuring multi-agent workflows with routines and handoffs, relevant to the ideation→ranking→critique pipeline.\n", "\n", "- **[GPT-4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide)** Advanced prompting, tool use, and task decomposition for improved accuracy in critique and safety reviews.\n", "\n", "- **[Structured Outputs for Multi-Agent Systems](https://cookbook.openai.com/examples/structured_outputs_multi_agent)** Enforcing consistent JSON outputs with schema validation for agent interoperability.\n", "\n", "- **[Agents - OpenAI API](https://platform.openai.com/docs/guides/agents)** \n", " Comprehensive guide to building multi-agent systems with OpenAI tools, covering orchestration, tool use, and best practices foundational to this system's architecture.\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## 3C. Use Case: Insurance Claim Processing\n", "\n", "![](../../../images/3C_insurance_task_card.png)\n", "\n", "Many businesses are faced with the task of digitizing hand-filled forms. 
In this section, we will demonstrate how OpenAI can be used to digitize and validate a hand-filled insurance form. While this is a common problem in insurance, the same techniques can be applied to a variety of other industries and forms, such as tax forms and invoices.\n", "\n", "## 🗂️ TL;DR Matrix\n", "\n", "This table summarizes the core technology choices and their rationale for this specific OCR implementation targeting the insurance use case.\n", "\n", "| Layer | Choice | Utility |\n", "| :---- | :---- | :---- |\n", "| JSON Output | Structured output with Pydantic | Easy to specify formatting, adheres to schema better than `JSON mode` |\n", "| OCR and Vision | `gpt-4.1` | Powerful OCR and vision capabilities, structured output |\n", "| Reasoning | `o4-mini` | Affordable but capable reasoning, function calling available |\n", "| Form Validation | Custom function calling | Can provide interaction with custom or internal databases |\n", "\n", "\\*Note: Prices and model identifiers accurate as of April 2025, subject to change.\n", "\n", "## 1\\. Scenario Snapshot\n", "\n", "* **Users:** The target users are insurance servicing and ops teams who need to ingest data from handwritten forms. \n", "* **Typical Asks:** Each form will have a different required structure, as well as different fields that need to be extracted. \n", "* **Constraints:** \n", " * **Accuracy:** High accuracy is required to ensure that the data is correct and complete. \n", " * **Uncertainty:** The system must handle uncertainty in the data, such as missing data, ambiguous data, and different formats of the same field. In the event that the model cannot resolve the uncertainty, the system requires a mechanism to request human review. \n", " * **Performance & Cost:** While system latency is not critical, high accuracy is required while keeping costs under control. We will aim for a cost target of $20 or less per 1000 pages processed.\n", "\n", "## 2\\. Architecture\n", "\n", "The high-level architecture of the solution is shown below.\n", "\n", "![](../../../images/3C_insurance_architecture.png)\n", "\n", "This task is complex and requires a wide variety of model capabilities, including vision, function calling, reasoning, and structured output. While `o3` is capable of doing all of these at once, we found during experimentation that `o4-mini` alone was not sufficient to achieve the necessary performance. Due to the higher relative costs of `o3`, we instead opted for a two-stage approach.\n", "\n", "1. Stage one is performed using the vision capabilities of GPT-4.1. This stage is optimized to extract text with maximum accuracy, leaving uncertainty for the reasoning stage and avoiding assumptions about anything not visible on the page. By doing OCR in the first stage, we do not require the reasoning model to work directly from an image, which can be challenging given all the other tasks the reasoning model must perform. \n", " \n", "2. Stage two takes advantage of the reasoning abilities of `o4-mini`. We use `o4-mini` to validate the accuracy of the OCR and to extract the data into a structured format. 
Importantly, we expect o4-mini to act as the secondary quality gate: if the OCR is incomplete at this stage, we can use o4-mini to refine and validate the original results.\n", "\n", "To demonstrate concretely how this works, let's look at a sample image of an insurance form.\n", "\n", "![](../../../images/3C_insurance_form.png)\n", "\n", "While the form itself is fairly straightforward, there is missing data and ambiguous information that will be difficult for a traditional OCR system to fill out correctly. First, notice that the zip code and county have been omitted. Second, the email address of the user is ambiguous: it could be `jsmith1@gmail.com` or `jsmithl@gmail.com`. In the following sections, we will walk through how a well-designed solution can handle these ambiguities and return the correct form results.\n", "\n", "**Environment Setup & Library Code:**\n", "\n", "To keep the example code clear, we have broken out environment setup (such as `pip install` commands) and library functions into a separate code block. This will make it easier to focus on only the relevant logic in each step of our solution." ] }, { "cell_type": "code", "execution_count": 6, "id": "923344db", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# Install Python requirements\n", "%pip install -qU pydantic \"openai>=1.76.0\"\n", "\n", "# All imports\n", "import os\n", "import json\n", "\n", "from pydantic import BaseModel\n", "\n", "# Create the OpenAI client\n", "from openai import OpenAI\n", "\n", "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"sk-dummykey\"))" ] }, { "cell_type": "code", "execution_count": 7, "id": "7ccd93f6", "metadata": {}, "outputs": [], "source": [ "def run_conversation_loop(\n", " client,\n", " messages,\n", " tools,\n", " tool_handlers,\n", " response_format,\n", " model,\n", "):\n", " \"\"\"Run the OpenAI Responses loop, resolving function calls via tool_handlers until a parsed final response is returned.\"\"\"\n", " summaries = []\n", " while True:\n", " print(\n", " f\"Requesting completion from model '{model}' (messages={len(messages)})\"\n", " )\n", " response = client.responses.parse(\n", " model=model,\n", " input=messages,\n", " tools=tools,\n", " text_format=response_format,\n", " reasoning={\"summary\": \"auto\"},\n", " )\n", " summaries.append(response.output[0].summary)\n", "\n", " if not response.output_parsed:\n", " print(\"Assistant requested tool calls, resolving ...\")\n", "\n", " # When parsing is incomplete, the output holds a reasoning item followed by a single tool call\n", " reasoning_msg, tool_call = response.output\n", " messages.append(reasoning_msg)\n", " messages.append({\n", " \"id\": tool_call.id,\n", " \"call_id\": tool_call.call_id,\n", " \"type\": tool_call.type,\n", " \"name\": tool_call.name,\n", " \"arguments\": tool_call.arguments,\n", " })\n", "\n", " if tool_call.name in tool_handlers:\n", " try:\n", " args = json.loads(tool_call.arguments)\n", " except Exception as exc:\n", " print(f\"Failed to parse {tool_call.name} arguments: {exc}\")\n", " args = {}\n", " result = tool_handlers[tool_call.name](**args)\n", " messages.append(\n", " {\n", " \"type\": \"function_call_output\",\n", " \"call_id\": tool_call.call_id,\n", " \"output\": str(result),\n", " }\n", " )\n", " print(f\"Tool call {tool_call.name} complete, result: {str(result)}\")\n", " else:\n", " print(f\"Unhandled function call: {tool_call.name}\")\n", "\n", " if response.output_parsed is not None:\n", " print(\"Received 
parsed result from model\")\n", " return response, summaries" ] }, { "cell_type": "markdown", "id": "76755e0d", "metadata": {}, "source": [ "**Flow Explanation: Stage 1**\n", "\n", "1. **Image:** The image of the form taken from the user's smartphone is passed to the model. OpenAI's models can accept a variety of image formats, but we typically use a PNG format to keep the text crisp and reduce artifacts. For this example, we pass the image to the model from a publicly available content URL. In a production environment, you likely would pass the image as a signed URL to an image hosted in your own cloud storage bucket. \n", " \n", "2. **Structured Output Schema:** We define a Pydantic model that sets the structure of the output data. The model includes all of the fields that we need to extract from the form, along with the appropriate types for each field. Our model is broken into several subcomponents, each of which is a Pydantic model itself and referenced by the parent model." ] }, { "cell_type": "code", "execution_count": 8, "id": "59263ec9", "metadata": {}, "outputs": [], "source": [ "class PersonContact(BaseModel):\n", " name: str\n", " home_phone: str\n", " work_phone: str\n", " cell_phone: str\n", " email: str\n", "\n", "class Address(BaseModel):\n", " street: str\n", " city: str\n", " state: str\n", " zip: str\n", " county: str\n", "\n", "class DwellingDetails(BaseModel):\n", " coverage_a_limit: str\n", " companion_policy_expiration_date: str\n", " occupancy_of_dwelling: str\n", " type_of_policy: str\n", " unrepaired_structural_damage: bool\n", " construction_type: str\n", " roof_type: str\n", " foundation_type: str\n", " has_post_and_pier_or_post_and_beam_foundation: bool\n", " cripple_walls: bool\n", " number_of_stories: str\n", " living_space_over_garage: bool\n", " number_of_chimneys: str\n", " square_footage: str\n", " year_of_construction: str\n", " anchored_to_foundation: bool\n", " water_heater_secured: bool\n", "\n", "class InsuranceFormData(BaseModel):\n", " applicant: PersonContact\n", " co_applicant: PersonContact\n", " risk_address: Address\n", " mailing_address_if_different_than_risk_address: Address\n", " participating_insurer: str\n", " companion_policy_number: str\n", " dwelling_details: DwellingDetails\n", " effective_date: str\n", " expiration_date: str" ] }, { "cell_type": "markdown", "id": "70e746a3", "metadata": {}, "source": [ "3. **Run OCR:** Using the vision capabilities of GPT-4.1, we run the first stage of our pipeline to extract the text from the document in a structured format. This initial stage aims to achieve high accuracy while passing through uncertainty to the second stage. Our prompt explicitly instructs the model to avoid inferring inputs and instead to fill out the details as exact as possible. For the image input, we set image input detail to `auto` to infer a detail level that's appropriate to the image. We found in our experiments that `auto` worked well, but if you are seeing quality issues in your OCR processing consider using `high`." 
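,
    "\n",
    "The next code cell passes the form image to the model via a public URL. If your images are not reachable at a URL, one common alternative is to embed the image as a base64 data URL in the `image_url` field. The snippet below is a minimal sketch of that approach; the local file name `form.png` is an assumed placeholder for illustration.\n",
    "\n",
    "```python\n",
    "import base64\n",
    "\n",
    "# Read the scanned form from disk (assumed local file) and embed it as a data URL\n",
    "with open(\"form.png\", \"rb\") as f:\n",
    "    b64_image = base64.b64encode(f.read()).decode(\"utf-8\")\n",
    "\n",
    "user_content = [\n",
    "    {\"type\": \"input_text\", \"text\": \"Here is a photo of the form filled out by the user:\"},\n",
    "    {\"type\": \"input_image\", \"image_url\": f\"data:image/png;base64,{b64_image}\", \"detail\": \"auto\"},\n",
    "]\n",
    "```\n"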
] }, { "cell_type": "code", "execution_count": 9, "id": "1537dad2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"applicant\": {\n", " \"name\": \"Smith, James L\",\n", " \"home_phone\": \"510 331 5555\",\n", " \"work_phone\": \"\",\n", " \"cell_phone\": \"510 212 5555\",\n", " \"email\": \"jsmithl@gmail.com OR jsmith1@gmail.com\"\n", " },\n", " \"co_applicant\": {\n", " \"name\": \"Roberts, Jesse T\",\n", " \"home_phone\": \"510 331 5555\",\n", " \"work_phone\": \"415 626 5555\",\n", " \"cell_phone\": \"\",\n", " \"email\": \"jrobertsjr@gmail.com\"\n", " },\n", " \"risk_address\": {\n", " \"street\": \"855 Brannan St\",\n", " \"city\": \"San Francisco\",\n", " \"state\": \"CA\",\n", " \"zip\": \"\",\n", " \"county\": \"\"\n", " },\n", " \"mailing_address_if_different_than_risk_address\": {\n", " \"street\": \"\",\n", " \"city\": \"\",\n", " \"state\": \"\",\n", " \"zip\": \"\",\n", " \"county\": \"\"\n", " },\n", " \"participating_insurer\": \"Acme Insurance Co\",\n", " \"companion_policy_number\": \"81265919\",\n", " \"dwelling_details\": {\n", " \"coverage_a_limit\": \"$900,000\",\n", " \"companion_policy_expiration_date\": \"5/31/27\",\n", " \"occupancy_of_dwelling\": \"Owner\",\n", " \"type_of_policy\": \"Homeowners\",\n", " \"unrepaired_structural_damage\": false,\n", " \"construction_type\": \"Frame\",\n", " \"roof_type\": \"Composition\",\n", " \"foundation_type\": \"Raised\",\n", " \"has_post_and_pier_or_post_and_beam_foundation\": false,\n", " \"cripple_walls\": false,\n", " \"number_of_stories\": \"Greater than 1 story\",\n", " \"living_space_over_garage\": true,\n", " \"number_of_chimneys\": \"2\",\n", " \"square_footage\": \"1200\",\n", " \"year_of_construction\": \"2005\",\n", " \"anchored_to_foundation\": true,\n", " \"water_heater_secured\": true\n", " },\n", " \"effective_date\": \"5/31/25\",\n", " \"expiration_date\": \"5/31/27\"\n", "}\n" ] } ], "source": [ "OCR_PROMPT = \"\"\"You are a helpful assistant who excels at processing insurance forms.\n", "\n", "You will be given an image of a hand-filled insurance form. Your job is to OCR the data into the given structured format.\n", "Fill out the fields as exactly as possible. If a written character could possibly be ambiguous (i.e. l or 1, o or 0), include all possiblities in the field separated by \"OR\", especially for email addresses.\n", "\"\"\"\n", "\n", "user_content = [\n", " {\"type\": \"input_text\", \"text\": \"Here is a photo of the form filled out by the user:\"},\n", " {\n", " \"type\": \"input_image\",\n", " \"image_url\": \"https://drive.usercontent.google.com/download?id=1-tZ526AW3mX1qthvgi8spaaxxeqFG5_6\",\n", " \"detail\": \"auto\",\n", " },\n", "]\n", "\n", "messages = [\n", " {\"role\": \"system\", \"content\": OCR_PROMPT},\n", " {\"role\": \"user\", \"content\": user_content},\n", "]\n", "\n", "response = client.responses.parse(\n", " model=\"gpt-4.1-2025-04-14\",\n", " input=messages,\n", " text_format=InsuranceFormData,\n", " # Set temp to 0 for reproducibility\n", " temperature=0,\n", ")\n", "\n", "s1_json_results = json.dumps(json.loads(response.output_parsed.model_dump_json()), indent=2)\n", "print(s1_json_results)" ] }, { "cell_type": "markdown", "id": "42296380", "metadata": {}, "source": [ "Notice that the output is missing several fields. In the next stage of processing we will take advantage of OpenAI's reasoning models to infer the missing fields where possible.\n", "\n", "**Flow Explanation: Stage 2**\n", "\n", "1. 
**Function Definitions:** We define a set of custom functions that the model can use to resolve uncertainty. In this case, we define a function that can validate email addresses by checking if the email exists. This can be used to resolve the ambiguous email address field where the model must choose between multiple possible values. We also define a `search_web` function that the model can call to resolve missing zip codes and incomplete addresses; in this example it is backed by a simple mock, though a real deployment could wire it to an actual search service or a built-in web search tool." ] }, { "cell_type": "code", "execution_count": 10, "id": "72dc150e", "metadata": {}, "outputs": [], "source": [ "tools = [{\n", " \"type\": \"function\",\n", " \"name\": \"validate_email\",\n", " \"description\": \"Check if an email address is valid and exists.\",\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"email\": {\n", " \"type\": \"string\",\n", " \"description\": \"The email address to validate.\"\n", " }\n", " },\n", " \"required\": [\n", " \"email\"\n", " ],\n", " \"additionalProperties\": False\n", " }\n", "},\n", "{\n", " \"type\": \"function\",\n", " \"name\": \"search_web\",\n", " \"description\": \"Perform a web search.\",\n", " \"parameters\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"query\": {\n", " \"type\": \"string\",\n", " \"description\": \"The search query to run through the search engine.\"\n", " }\n", " },\n", " \"required\": [\n", " \"query\"\n", " ],\n", " \"additionalProperties\": False\n", " }\n", "}]" ] }, { "cell_type": "markdown", "id": "f9a9b808", "metadata": {}, "source": [ "2. **Prompt:** We provide a prompt to the model explaining that we have extracted text via OCR and requesting that the model perform reasoning and function calling to fill in the missing or ambiguous fields." ] }, { "cell_type": "code", "execution_count": 11, "id": "ae8fcf6d", "metadata": {}, "outputs": [], "source": [ "PROMPT = \"\"\"You are a helpful assistant who excels at processing insurance forms.\n", "\n", "You will be given a JSON representation of an OCR'd document. Consider which fields are ambiguous and reason about how to fill them in. Fill in any missing fields that can be inferred from existing data, or search the web. If you cannot fill a field, reason about why.\n", "\n", "Use the tools provided if necessary to clarify the results. 
If the OCR system has provided two possibilities, do your best to definitely pick which option is correct.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 12, "id": "1d2b77ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requesting completion from model 'o4-mini-2025-04-16' (messages=2)\n", "Assistant requested tool calls, resolving ...\n", "Tool call validate_email complete, result: True\n", "Requesting completion from model 'o4-mini-2025-04-16' (messages=5)\n", "Assistant requested tool calls, resolving ...\n", "Tool call validate_email complete, result: False\n", "Requesting completion from model 'o4-mini-2025-04-16' (messages=8)\n", "Received parsed result from model\n", "{\n", " \"applicant\": {\n", " \"name\": \"Smith, James L\",\n", " \"home_phone\": \"510 331 5555\",\n", " \"work_phone\": \"\",\n", " \"cell_phone\": \"510 212 5555\",\n", " \"email\": \"jsmithl@gmail.com\"\n", " },\n", " \"co_applicant\": {\n", " \"name\": \"Roberts, Jesse T\",\n", " \"home_phone\": \"510 331 5555\",\n", " \"work_phone\": \"415 626 5555\",\n", " \"cell_phone\": \"\",\n", " \"email\": \"jrobertsjr@gmail.com\"\n", " },\n", " \"risk_address\": {\n", " \"street\": \"855 Brannan St\",\n", " \"city\": \"San Francisco\",\n", " \"state\": \"CA\",\n", " \"zip\": \"94107\",\n", " \"county\": \"San Francisco\"\n", " },\n", " \"mailing_address_if_different_than_risk_address\": {\n", " \"street\": \"855 Brannan St\",\n", " \"city\": \"San Francisco\",\n", " \"state\": \"CA\",\n", " \"zip\": \"94107\",\n", " \"county\": \"San Francisco\"\n", " },\n", " \"participating_insurer\": \"Acme Insurance Co\",\n", " \"companion_policy_number\": \"81265919\",\n", " \"dwelling_details\": {\n", " \"coverage_a_limit\": \"$900,000\",\n", " \"companion_policy_expiration_date\": \"5/31/27\",\n", " \"occupancy_of_dwelling\": \"Owner\",\n", " \"type_of_policy\": \"Homeowners\",\n", " \"unrepaired_structural_damage\": false,\n", " \"construction_type\": \"Frame\",\n", " \"roof_type\": \"Composition\",\n", " \"foundation_type\": \"Raised\",\n", " \"has_post_and_pier_or_post_and_beam_foundation\": false,\n", " \"cripple_walls\": false,\n", " \"number_of_stories\": \"Greater than 1 story\",\n", " \"living_space_over_garage\": true,\n", " \"number_of_chimneys\": \"2\",\n", " \"square_footage\": \"1200\",\n", " \"year_of_construction\": \"2005\",\n", " \"anchored_to_foundation\": true,\n", " \"water_heater_secured\": true\n", " },\n", " \"effective_date\": \"5/31/25\",\n", " \"expiration_date\": \"5/31/27\"\n", "}\n" ] } ], "source": [ "messages = [\n", " {\"role\": \"system\", \"content\": PROMPT},\n", " {\"role\": \"user\", \"content\": s1_json_results},\n", "]\n", "\n", "# For demonstration purposes, we'll hardcode the correct email answer.\n", "def email_mock(*args, **kwargs):\n", " if kwargs[\"email\"] == \"jsmithl@gmail.com\":\n", " return True\n", " return False\n", "\n", "# Reasoning models like `o4-mini` will soon support built-in web search, but for now\n", "# we demonstrate this capability using a simple mock function.\n", "def web_mock(*args, **kwargs):\n", " if \"855 Brannan\" in kwargs[\"query\"]:\n", " return \"855 Brannan St, San Francisco, 94103, San Francisco County\"\n", " \n", " return \"\"\n", " \n", "tool_handlers = {\"validate_email\": email_mock, \"search_web\": web_mock}\n", "\n", "response, summaries = run_conversation_loop(\n", " client=client,\n", " messages=messages,\n", " tools=tools,\n", " tool_handlers=tool_handlers,\n", " 
response_format=InsuranceFormData,\n", " model=\"o4-mini-2025-04-16\",\n", ")\n", "\n", "print(json.dumps(json.loads(response.output_parsed.model_dump_json()), indent=2))" ] }, { "cell_type": "markdown", "id": "cb3f3115", "metadata": {}, "source": [ "You can see that the email address has been refined to a single value, the zip code and county have been filled in, and the mailing address has been filled in by using the risk address. The model has also returned the results in a structured format (with appropriate types such as boolean for yes/no questions), which can be easily parsed by a downstream system.\n", "\n", "To help us understand and debug the model, we can also print the summary chain-of-thought reasoning produced by the model. This can help expose common failure modes, points where the model is unclear, or incorrect upstream details.\n", "\n", "While developing this solution, the chain-of-thought summaries exposed some incorrectly named and typed schema values." ] }, { "cell_type": "code", "execution_count": 13, "id": "ab1d4fbc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**Determining insurance form details**\n", "\n", "I have a JSON representation of a partially filled insurance form, and there are a few missing or ambiguous fields that I need to address.\n", "\n", "For the email address, I see two options. I can validate which one is correct by checking both with the tool.\n", "\n", "The risk address fields for zip code and county are empty. Based on the address \"855 Brannan St, San Francisco, CA,\" I can determine the correct zip code is 94107, as that area corresponds to South Beach. Lastly, since the mailing address is empty, I assume it's the same as the risk address.\n", "\n", "**Filling insurance form details**\n", "\n", "I think it’s best to set the mailing address to be the same as the risk address or clarify that a blank one implies the same. Since it’s an explicit instruction to fill missing fields, I’ll fill in the mailing address with the risk address to avoid confusion.\n", "\n", "All co-applicant fields are present, and dwelling details are complete. The effective and expiration dates are also provided. I plan to validate both email options by checking each one separately. Let's begin with validating the first email.\n", "\n" ] } ], "source": [ "for summary in summaries:\n", " for response in summary:\n", " print(response.text + '\\n')" ] }, { "cell_type": "markdown", "id": "f2bd52eb", "metadata": {}, "source": [ "## 3\\. Model and Capabilities Playbook\n", "\n", "Selecting the right tool for the job is key to getting the best results. In general, it's a good idea to start with the simplest solution that fits your needs and then upgrade if you need more capabilities.\n", "\n", "| Task | Start With | Upgrade When... | Escalate To | Rationale |\n", "| :---- | :---- | :---- | :---- | :---- |\n", "| OCR | `gpt-4.1` | Complex forms that are difficult to understand at a glance | `o3` | `gpt-4.1` is fast and cost-effective for most OCR. `o-3` has the ability to reason about form structure. |\n", "| Results Refinement | `o4-mini` | Complex logic for inferring details, many function calls required. | `o3` | Better for very long chains of reasoning, especially with both function calls and structured output. |\n", "\n", "## 4\\. 
Evaluation Metrics\n", "\n", "Track key metrics to ensure the system is performing accurately and as expected.\n", "\n", "### Critical Metrics\n", "\n", "* **OCR Accuracy:** Per-character and per-word accuracy. \n", "* **Inferred Field Rate:** Portion unfilled entries correctly inferred from either existing data or function calling. \n", "* **Human Intervention Rate:** How often a document contains an UNKNOWN and must be referred to a human.\n", "\n", "We recommend building a labeled hold-out set of forms and their expected responses. This dataset should be representative of the expected deployment environment, see the [OpenAI evals](https://platform.openai.com/docs/guides/evals) guide for more detailed information on building and evaluating your system.\n", "\n", "## 5\\. Deployment Notes\n", "\n", "Moving from prototype to a production-ready system requires attention to operational details (LLMOps).\n", "\n", "### Cost Breakdown\n", "\n", "We will assume that for document ingestion, [batch pricing](https://platform.openai.com/docs/guides/batch) is a viable option due to high latency tolerance (i.e. overnight runs are fine).\n", "\n", "#### **Stage 1: OCR (Optical Character Recognition)**\n", "\n", "**Model:** `gpt-4.1`\n", "\n", "| Type | Tokens | Rate (per 1M) | Cost |\n", "| :---- | :---- | :---- | :---- |\n", "| Input | 2,000 | $1.00 | $0.002 |\n", "| Output | 1,500 | $4.00 | $0.006 |\n", "| **Total for 1,000 pages (Stage 1\\)** | | | **$8.00** |\n", "\n", "#### **Stage 2: Reasoning**\n", "\n", "**Model:** `o4-mini`\n", "\n", "| Type | Tokens | Rate (per 1M) | Cost |\n", "| :---- | :---- | :---- | :---- |\n", "| Input | 2,000 | $0.55 | $0.0011 |\n", "| Output | 3,000 | $2.20 | $0.0066 |\n", "| **Total for 1,000 pages (Stage 2\\)** | | | **$7.70** |\n", "\n", "#### Grand Total (per 1,000 pages): **$15.70**\n", "\n", "Compare this cost to a one-stage `o3` deployment. Assuming equal token usage and batch usage, the additional cost of the more powerful reasoning model would come to $70/1000 pages.\n", "\n", "### Monitoring & Deployment\n", "\n", "Monitor your system by logging key metrics:\n", "\n", "* `llm_model_used`, `llm_input_tokens`, `llm_output_tokens`, `llm_latency_ms` per model \n", "* `total_query_latency_ms`, `estimated_query_cost` per model \n", "* `function_calls_per_document`, `num_email_validation_calls` \n", "* `human_review_required`\n", "\n", "Pin the specific model version identifier (e.g., `o4-mini-2025-04-16`) used in deployment via configuration/environment variables to prevent unexpected behavior from silent model updates.\n", "\n", "## 6\\. Useful Cookbooks & Resources\n", "\n", "Refer to these related resources for deeper dives into specific components:\n", "\n", "* [Structured Output](https://platform.openai.com/docs/guides/structured-outputs) \n", "* [Vision Models](https://platform.openai.com/docs/guides/images) \n", "* [Function Calling](https://platform.openai.com/docs/guides/function-calling)\n", "\n", "================================================================================\n", "\n", "\n", "

## Prototype to Production

\n", "\n", "Transitioning a prototype to production requires careful planning and execution. This checklist highlights critical steps, drawing from our flagship use cases, to ensure your deployment is robust, efficient, and meets business goals.\n", "\n", "## 🗂️ TL;DR Matrix\n", "\n", "| Checklist Area | Key Focus / Actions | Why it Matters |\n", "| :---- | :---- | :---- |\n", "| **Define Success Criteria** | • Define measurable KPIs & SLOs (accuracy, cost, latency). • Ensure targets are measurable via logs. | Provides clear targets; proves value. |\n", "| **Document Model Rationale** | • Select initial models deliberately based on trade-offs. • Document the \"why\" behind model choices. | Justifies choices; aids future updates. |\n", "| **Robust Evaluation & Testing** | • Build automated tests (\"eval suite\") using a golden set. • Focus on factuality, hallucinations, tool errors. • Test tool reliability & edge cases. | Ensures quality; prevents regressions before release. |\n", "| **Observability & Cost** | • Implement essential logging for monitoring & debugging. • Set cost guardrails (token limits, usage modes). | Enables tuning; keeps spending within budget. |\n", "| **Safety & Compliance** | • Use safety mechanisms (moderation APIs, prompts). • Enforce domain-specific compliance rules. • Mandate Human-in-the-Loop (HITL) for high-risk outputs. | Ensures responsible operation; meets requirements. |\n", "| **Model Updates & Versioning** | • Define version pinning strategy • Implement A/B testing for new versions • Create rollback procedures | Maintains stability while allowing improvements. |\n", "\n", "1. **Define Success Criteria Quantitatively:** Move beyond \"it works\" to measurable targets *before* major development. \n", " \n", " * **Set Key Performance Indicators (KPIs) & SLOs:** Define specific targets for business value (e.g., RAG accuracy \\> 95%, OCR cost \\< $X/page) and performance (e.g., P95 latency \\< 1s, error rates). \n", " * **Ensure Measurability:** Confirm that all KPIs and SLOs can be directly measured from system logs (e.g., tracking `total_tokens`, `critique_status`).\n", "\n", " \n", "\n", "2. **Document Initial Model Selection Rationale:** Justify your starting model choices for future reference. \n", " \n", " * **Choose Models Deliberately:** Use the Model-Intro Matrix and use cases to select appropriate models for each task (e.g., `o4-mini` for speed/cost, `gpt-4.1` for accuracy, `o3` for depth). \n", " * **Record the \"Why\":** Briefly document the reasoning behind your choices (cost, latency, capability trade-offs) in code comments or design docs so future teams understand the context.\n", "\n", " \n", "\n", "3. **Implement Robust Evaluation & Testing:** Verify quality and prevent regressions *before* shipping changes. \n", " \n", " * **Build an Automated Eval Suite:** Create a repeatable test process using a \"golden set\" (50-100 diverse, expert-verified examples). Focus tests on `factuality`, `hallucination rate`, `tool-error rate`, and task-specific metrics. \n", " * **Test Reliably:** Rigorously test integrated tool reliability (success rate, error handling) and system behavior under load and with edge cases (malformed data, adversarial inputs).\n", "\n", " \n", "\n", "4. **Establish Observability & Cost Controls:** Monitor performance and keep spending within budget. 
\n", " \n", " * **Set Cost Guardrails:** Prevent unexpected cost increases by defining max token limits per stage and considering operational modes (\"Fast,\" \"Standard,\" \"Thorough\") to balance cost and performance. \n", " * **Implement Essential Logging:** Capture key operational data via structured logs for each processing stage to enable debugging and monitoring.\n", "\n", " \n", "\n", "5. **Implement Safety & Compliance Guardrails:** Ensure responsible operation and meet requirements. \n", " \n", " * **Use Safety Mechanisms:** Employ tools like OpenAI's moderation APIs, safety-focused system prompts, or sentinel models for checks, especially with user input or sensitive topics. \n", " * **Enforce Compliance:** Build in checks relevant to your specific industry and risks (e.g., legal constraints, lab safety). \n", " * **Require Human-in-the-Loop (HITL):** Mandate human review for low-confidence outputs, high-risk scenarios, or critical decisions, ensuring the workflow flags these items clearly.\n", "\n", "\n", "6. **Manage Model Updates and Versioning:** Prepare for model evolution over time.\n", " \n", " * **Version Pinning Strategy:** Decide whether to pin to specific model versions for stability or automatically adopt new versions for improvements.\n", " * **A/B Testing Framework:** Establish a process to evaluate new model versions against your key metrics before full deployment.\n", " * **Rollback Plan:** Create a clear procedure for reverting to previous model versions if issues arise with updates.\n", " * **Monitor Version Performance:** Track metrics across model versions to identify performance trends and inform future selection decisions.\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## Adaptation Decision Tree\n", "\n", "![Model Selection Decision Tree](../../../images/3D_model_selection_flowchart.png)\n", "\n", "## Communicating Model Selection to Non-Technical Stakeholders\n", "\n", "When explaining your model choices to business stakeholders, focus on these key points:\n", "\n", "1. **Align with Business Outcomes**: Explain how your model selection directly supports specific business goals (time savings, cost reduction, improved accuracy).\n", "\n", "2. **Translate Technical Metrics**: Convert technical considerations into business impact:\n", " - \"This model reduces processing time from 5 seconds to 0.7 seconds, allowing us to handle customer inquiries 7x faster\"\n", " - \"By using the mini variant, we can process 5x more documents within the same budget\"\n", "\n", "3. **Highlight Trade-offs**: Present clear scenarios for different models:\n", " - \"Option A (GPT-4.1): Highest accuracy but higher cost - ideal for client-facing legal analysis\"\n", " - \"Option B (GPT-4.1 mini): 90% of the accuracy at 30% of the cost - perfect for internal document processing\"\n", "\n", "4. 
**Use Concrete Examples**: Demonstrate the practical difference in outputs between models to illustrate the value proposition of each option.\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## Appendices\n", "\n", "## Glossary of Key Terms\n", "\n", "| Term | Definition |\n", "|------|------------|\n", "| **Context Window** | The maximum number of tokens a model can process in a single request |\n", "| **Hallucination** | When a model generates content that appears plausible but is factually incorrect or unsupported |\n", "| **Latency** | The time delay between sending a request to a model and receiving a response |\n", "| **LLM** | Large Language Model; an AI system trained on vast amounts of text data |\n", "| **Prompt Engineering** | The practice of designing effective prompts to elicit desired outputs from AI models |\n", "| **RAG** | Retrieval-Augmented Generation; combining information retrieval with text generation |\n", "| **SOTA** | State-of-the-Art; representing the most advanced stage in a field at a given time |\n", "| **Token** | The basic unit of text that models process (roughly 0.75 words in English) |\n", "\n", "## 6.1 Price and Utility Table (Apr 2025)\n", "\n", "| Model | Context Window | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Best For |\n", "|-------|----------------|-----------------------------|-----------------------------|----------|\n", "| GPT-4.1 | 1M | \\$2.00 | \\$8.00 | Long-doc analytics, code review |\n", "| GPT-4.1 mini | 1M | \\$0.40 | \\$1.60 | Production agents, balanced cost/performance |\n", "| GPT-4.1 nano | 1M | \\$0.10 | \\$0.40 | High-throughput, cost-sensitive applications |\n", "| GPT-4o | 128K | \\$5.00 | \\$15.00 | Real-time voice/vision chat |\n", "| GPT-4o mini | 128K | \\$0.15 | \\$0.60 | Vision tasks, rapid analytics |\n", "| o3 (low) | 200K | \\$10.00* | \\$40.00* | Bulk triage, catalog enrichment |\n", "| o3 (med) | 200K | \\$10.00* | \\$40.00* | Knowledge base Q&A |\n", "| o3 (high) | 200K | \\$10.00* | \\$40.00* | Multi-step reasoning, troubleshooting |\n", "| o4-mini (low) | 200K | \\$1.10* | \\$4.40* | Vision tasks, rapid analytics |\n", "| o4-mini (med) | 200K | \\$1.10* | \\$4.40* | Balanced vision + reasoning |\n", "| o4-mini (high) | 200K | \\$1.10* | \\$4.40* | Deep reasoning with cost control |\n", "\n", "\\* *Note: The low/med/high settings affect token usage rather than base pricing. 
Higher settings may use more tokens for deeper reasoning, increasing per-request cost and latency.*\n", "\n", "## 6.2 Prompt-pattern Quick Sheet (Token vs Latency Deltas)\n", "\n", "| Prompt Pattern | Description | Token Impact | Latency Impact | Best Model Fit |\n", "|----------------|-------------|--------------|----------------|----------------|\n", "| **Self-Critique** | Ask model to evaluate its own answer before finalizing | +20-30% tokens | +15-25% latency | GPT-4.1, o3 |\n", "| **Chain-of-Thought (CoT)** | Explicitly instruct to \"think step by step\" | +40-80% tokens | +30-50% latency | o3, o4-mini (high) |\n", "| **Structured Outputs** | Use JSON schema or pydantic models for consistent formatting | +5-10% tokens | +5-10% latency | All models |\n", "| **Zero-Token Memory** | Store context in external DB rather than in conversation | -70-90% tokens | -5-10% latency | GPT-4.1 family |\n", "| **Skeleton-Fill-In** | Provide template structure for model to complete | -10-20% tokens | -5-15% latency | o4-mini, GPT-4.1 nano |\n", "| **Self-Consistency** | Generate multiple answers and select most consistent | +200-300% tokens | +150-250% latency | o3 (high) |\n", "| **Role-Playing** | Assign specific personas to model for specialized knowledge | +5-15% tokens | Neutral | GPT-4o, o4-mini |\n", "| **Tournament Ranking** | Compare options pairwise rather than scoring individually | +50-100% tokens | +30-60% latency | o3, o4-mini (high) |\n", "| **Tool-Calling Reflex** | Prompt model to call tools when uncertainty is detected | +10-30% tokens | +20-40% latency | o3, GPT-4.1 |\n", "\n", "## 6.3 Links to External Cookbooks & Docs\n", "\n", "### OpenAI Official Resources\n", "- [OpenAI Cookbook Main Repository](https://cookbook.openai.com/)\n", "- [Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)\n", "- [Vision Models Guide](https://platform.openai.com/docs/guides/vision)\n", "- [Agents Documentation](https://platform.openai.com/docs/guides/agents)\n", "- [Structured Outputs Guide](https://platform.openai.com/docs/guides/structured-outputs)\n", "\n", "### RAG & Retrieval\n", "- [RAG on PDFs](https://cookbook.openai.com/examples/file_search_responses)\n", "\n", "### Specialized Use Cases\n", "- [Voice Assistant with Agents SDK](https://cookbook.openai.com/examples/agents_sdk/app_assistant_voice_agents)\n", "- [Multi-Tool Orchestration](https://cookbook.openai.com/examples/responses_api/responses_api_tool_orchestration)\n", "- [Data Extraction and Transformation](https://cookbook.openai.com/examples/data_extraction_transformation)\n", "\n", "### Prompting & Model Selection\n", "- [GPT-4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide)\n", "- [Prompt Engineering Best Practices](https://platform.openai.com/docs/guides/prompt-engineering)\n", "\n", "### Evaluation & Deployment\n", "- [Getting Started with OpenAI Evals](https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals)\n", "- [How to use the Usage API and Cost API to monitor your OpenAI usage](https://cookbook.openai.com/examples/completions_usage_api)\n", "\n", "================================================================================\n", "\n", "\n", "\n", "## Contributors\n", "\n", " This cookbook serves as a joint collaboration effort between OpenAI and [Tribe AI](https://www.tribe.ai/)\n", "- [Kashyap Coimbatore Murali](https://www.linkedin.com/in/kashyap-murali/)\n", "- [Nate Harada](https://www.linkedin.com/in/nate-harada/) \n", "- [Sai 
Prashanth Soundararaj](https://www.linkedin.com/in/saiprashanths/)\n", "- [Shikhar Kwatra](https://www.linkedin.com/in/shikharkwatra/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }