{ "cells": [ { "cell_type": "markdown", "id": "intro-revised", "metadata": {}, "source": [ "# Build an Agent Improvement Loop with Traces, Evals, and Codex\n", "\n", "This notebook builds an improvement flywheel for an agent. We start with real traces, add human and model feedback, turn that feedback into evals, and use the resulting evidence to propose the next harness changes for Codex to implement.\n", "\n", "You will:\n", "\n", "- Create an OpenAI Agents SDK-backed financial analyst\n", "- Run it on synthetic company data and capture traces\n", "- Add example human feedback and LLM-generated feedback from those runs\n", "- Turn that feedback into [Promptfoo](https://www.promptfoo.dev/) evals that can be rerun later\n", "- Use [HALO](https://github.com/context-labs/halo) to rank the next harness changes and write a Codex-ready handoff\n", "\n", "In this notebook, the **harness** is the full contract around the model, including instructions, tools, routing, output requirements, and validation checks.\n", "\n", "The flywheel preserves what you learn from each run. Traces show what happened, feedback explains what mattered, evals make those expectations reusable, and Codex can act on the resulting change set.\n" ] }, { "cell_type": "markdown", "id": "what-you-will-build-revised", "metadata": {}, "source": [ "## What you will build\n", "\n", "![Agent improvement loop flywheel](../../images/agent-improvement-loop-flywheel.svg)\n", "\n", "By the end, you will have:\n", "\n", "1. An OpenAI Agents SDK-backed financial analyst that reviews a fictional company's diligence materials across five traced runs\n", "2. Human and LLM-generated feedback over those same traces\n", "3. An automatically generated Promptfoo eval suite\n", "4. A Promptfoo validation gate over the current agent behavior\n", "5. A HALO optimization pass over the traces, feedback, and eval results\n", "6. 
A developer-facing handoff to Codex so it can implement the recommended harness changes\n", "\n", "The agent supports acquisition diligence for a fictional company. It reviews financial exports, customer data, contracts, security notes, board materials, and management narratives, then answers diligence questions with citations and reviewable artifacts.\n", "\n", "The loop writes one file that carries the work forward: the generated `codex_handoff.md` file under `ARTIFACT_DIR`. It contains the full HALO diagnosis, the ranked recommendations, the evidence behind them, and the implementation guidance Codex needs for the next harness update.\n", "\n", "The degree of automation is up to the developer. You can use the loop to propose a reviewed change set, or connect it to a workflow that opens, merges, and deploys pull requests automatically. A common starting point is a reviewed loop, where the system proposes the change set and a developer approves the diff before merge. As the eval gate becomes more trusted, the same handoff can support deeper automation. 
The core workflow is the same in either case: traces plus human and model feedback become concrete harness changes instead of remaining disconnected comments.\n", "\n", "Compared with examples that stop at traces or evals, this notebook keeps traces, reviewer judgment, generated evals, optimization, and implementation handoff inside one runnable improvement loop.\n" ] }, { "cell_type": "markdown", "id": "83e1dc92", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Run this notebook from the repository root after installing the Python dependencies used by the example:\n", "\n", "```bash\n", "python -m venv .venv\n", "source .venv/bin/activate\n", "pip install openai openai-agents halo-engine\n", "```\n", "\n", "Promptfoo runs through `npx`, so you also need Node.js with `npx` available on your path.\n", "\n", "Set an API key before running the notebook:\n", "\n", "```bash\n", "export OPENAI_API_KEY=...\n", "```\n", "\n", "The example is intentionally live-only. The trace generation, model critique, eval generation, validation, and optimization steps all use fresh model outputs so the notebook demonstrates the actual loop rather than a scripted preview. The next cell exposes the model choices in one place so you can trade quality for cost by substituting cheaper models if desired.\n", "\n", "With the default five traces, budget about 20 minutes for a full run, though model latency and network conditions will move that up or down. The longest sections are usually Step 3, which runs the traced agent calls, and Step 7, where HALO analyzes the full loop. The feedback, eval-generation, and Promptfoo cells also make live calls, but are typically shorter. 
Long-running cells print progress or elapsed time as they work.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "install-runtime-dependencies", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Install or upgrade the Python dependencies used by this notebook.\n", "%pip install --quiet --upgrade openai openai-agents halo-engine\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "ca7b4551", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project root detected.\nModels: {'agent': 'gpt-5.5', 'analysis': 'gpt-5.5', 'eval_generation': 'gpt-5.5', 'judge': 'gpt-5.5', 'halo': 'gpt-5.5', 'promptfoo': '0.121.9'}\n" ] } ], "source": [ "from __future__ import annotations\n", "\n", "import asyncio\n", "import hashlib\n", "import json\n", "import os\n", "import re\n", "import shutil\n", "import subprocess\n", "import sys\n", "import tempfile\n", "import time\n", "import textwrap\n", "import threading\n", "from contextlib import contextmanager\n", "from dataclasses import asdict, dataclass, field\n", "from datetime import datetime, timezone\n", "from importlib.metadata import version\n", "from pathlib import Path\n", "from typing import Any, Iterable, Iterator, Mapping\n", "\n", "from IPython.display import Markdown, display\n", "from openai import OpenAI\n", "\n", "def find_project_root(start: Path | None = None) -> Path:\n", " current = (start or Path.cwd()).resolve()\n", " for candidate in [current, *current.parents]:\n", " if (candidate / \"registry.yaml\").exists():\n", " return candidate\n", " return current\n", "\n", "\n", "PROJECT_ROOT = find_project_root()\n", "\n", "if not os.getenv(\"OPENAI_API_KEY\"):\n", " raise RuntimeError(\"Set OPENAI_API_KEY before running this live notebook.\")\n", "if shutil.which(\"npx\") is None:\n", " raise RuntimeError(\"Install Node.js with npx before running the Promptfoo eval gate.\")\n", "\n", "# Edit these in one place if you want to use lower-cost models for part of the 
loop.\n", "AGENT_MODEL = os.getenv(\"OPENAI_AGENT_MODEL\", \"gpt-5.5\")\n", "ANALYSIS_MODEL = os.getenv(\"OPENAI_ANALYSIS_MODEL\", \"gpt-5.5\")\n", "EVAL_GENERATION_MODEL = os.getenv(\"OPENAI_EVAL_GENERATION_MODEL\", ANALYSIS_MODEL)\n", "JUDGE_MODEL = os.getenv(\"OPENAI_JUDGE_MODEL\", ANALYSIS_MODEL)\n", "HALO_MODEL = os.getenv(\"OPENAI_HALO_MODEL\", ANALYSIS_MODEL)\n", "PROMPTFOO_VERSION = os.getenv(\"PROMPTFOO_VERSION\", \"0.121.9\")\n", "\n", "client = OpenAI()\n", "\n", "\n", "def format_duration(seconds: float) -> str:\n", " minutes, remainder = divmod(int(round(seconds)), 60)\n", " return f\"{minutes}m {remainder:02d}s\" if minutes else f\"{remainder}s\"\n", "\n", "\n", "ARTIFACT_DIR = PROJECT_ROOT / \"examples\" / \"agents_sdk\" / \"agent_improvement_loop_artifacts\"\n", "TRACE_DIR = ARTIFACT_DIR / \"traces\"\n", "HALO_TRACE_PATH = ARTIFACT_DIR / \"halo_traces\" / \"traces.jsonl\"\n", "if ARTIFACT_DIR.exists():\n", " shutil.rmtree(ARTIFACT_DIR)\n", "ARTIFACT_DIR.mkdir(exist_ok=True)\n", "TRACE_DIR.mkdir(exist_ok=True)\n", "HALO_TRACE_PATH.parent.mkdir(exist_ok=True)\n", "\n", "print(\"Project root detected.\")\n", "print(\"Models:\", {\n", " \"agent\": AGENT_MODEL,\n", " \"analysis\": ANALYSIS_MODEL,\n", " \"eval_generation\": EVAL_GENERATION_MODEL,\n", " \"judge\": JUDGE_MODEL,\n", " \"halo\": HALO_MODEL,\n", " \"promptfoo\": PROMPTFOO_VERSION,\n", "})\n" ] }, { "cell_type": "markdown", "id": "step-2-revised", "metadata": {}, "source": [ "## Step 1. Create synthetic company data\n", "\n", "The notebook creates fictional diligence materials for a company that might be reviewed during an acquisition. 
The data mixes structured exports with narrative markdown documents so the agent has to decide which sources deserve more weight.\n", "\n", "### Narrative markdown files in the synthetic data\n", "\n", "| File | Why it is included |\n", "| --- | --- |\n", "| `overview.md` | Management's top-level company summary |\n", "| `product_strategy.md` | Roadmap context plus an unvalidated NRR estimate |\n", "| `go_to_market.md` | Sales-motion context that should be checked against pipeline data |\n", "| `board_deck.md` | A polished management narrative that can conflict with structured exports |\n", "| `financials/revenue_recognition_notes.md` | Accounting context for launch-stage ARR treatment |\n", "| `legal/contracts_summary.md` | Contract-level risk context |\n", "| `legal/open_issues.md` | Open legal matters that should remain visible |\n", "| `security/security_overview.md` | Security posture and certification wording |\n", "| `sales/security_faq.md` | Sales-facing security language that may overstate the evidence |\n", "| `hr/org_chart.md` | Operating context for leadership and staffing |\n", "| `sales/pipeline_notes.md` | Qualitative pipeline commentary |\n", "| `notes/qa_log.md` | Diligence questions and unresolved follow-ups |\n", "\n", "The example generates the synthetic company data at runtime so it stays self-contained while still giving the agent a realistic mix of structured exports and narrative documents to analyze.\n" ] }, { "cell_type": "markdown", "id": "workspace-files-heading", "metadata": {}, "source": [ "### Define the synthetic source files\n", "\n", "The next collapsed cell contains the source documents used to build the fictional company data.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "workspace-files", "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "from textwrap import dedent\n", "\n", "WORKSPACE_FILES = {\n", " \"overview.md\": \"\"\"\n", " # FictionalCorp XYZ\n", "\n", " FictionalCorp XYZ is 
a revenue intelligence software company with annual SaaS subscriptions, usage add-ons, and launch-stage commitments.\n", "\n", " Management reports FY2025 ARR of $43.0M and year-over-year growth of 71%.\n", "\n", " Management reports no legal-entity customer above 15% of booked ARR after excluding launch-stage usage add-ons.\n", "\n", " Legal summary: Management states legal matters are ordinary course and no contract terms should affect valuation.\n", " \"\"\",\n", " \"product_strategy.md\": \"\"\"\n", " # Product Strategy\n", "\n", " Core product lines:\n", "\n", " - Forecast Assist\n", " - Pipeline Quality Monitor\n", " - Renewal Risk Workbench\n", "\n", " Product roadmap priority is enterprise workflow depth. Management expects usage add-ons to increase expansion revenue.\n", "\n", " Sales leadership references a 122% NRR estimate in planning materials, but finance has not published official NRR and the estimate excludes selected downsell and churn adjustments.\n", " \"\"\",\n", " \"go_to_market.md\": \"\"\"\n", " # Go To Market\n", "\n", " FictionalCorp XYZ sells to CRO and RevOps buyers through a direct sales motion.\n", "\n", " The current plan assumes larger enterprise ACVs and partner-sourced pipeline. 
Pipeline conversion evidence should be checked against `sales/pipeline.csv`.\n", " \"\"\",\n", " \"board_deck.md\": \"\"\"\n", " # Board Packet - December 2025\n", "\n", " - FY2025 ending ARR: $43.0M\n", " - ARR growth: 71%\n", " - Gross margin: 69%\n", " - Cash burn: $2.9M per month\n", " - Runway: 11 months\n", "\n", " Management narrative: the company is positioned for efficient enterprise expansion.\n", "\n", " ARR note: the headline ARR view includes signed launch-stage commitments and a usage true-up view used for board planning.\n", "\n", " Management narrative: customer concentration is manageable when measured by legal entity and booked ARR.\n", " \"\"\",\n", " \"financials/revenue_recognition_notes.md\": \"\"\"\n", " # Revenue Recognition Notes\n", "\n", " Finance treats `financials/arr_bridge.csv` as the controlled FY2025 ARR bridge.\n", "\n", " The board deck ARR includes $2.8M of signed launch-stage commitments that were not live by 2025-12-31 and $1.1M of usage true-ups that finance does not classify as recurring ARR.\n", "\n", " RevOps also circulates a bookings-adjusted ARR view of $40.8M. That view is useful for pipeline planning but should not be silently reconciled with the controlled ARR bridge.\n", " \"\"\",\n", " \"legal/contracts_summary.md\": \"\"\"\n", " # Contracts Summary\n", "\n", " Standard customer contracts are annual SaaS agreements with security and DPA exhibits. The largest five customers account for $25.1M of ARR.\n", "\n", " Management summary: legal matters are ordinary course and no contract terms should affect valuation.\n", "\n", " Clause inventory has not been fully reconciled with this summary. Two strategic customer agreements are flagged for non-standard terms in `legal/clause_inventory.csv`.\n", " \"\"\",\n", " \"legal/open_issues.md\": \"\"\"\n", " # Open Legal Issues\n", "\n", " Former reseller DataHarbor filed a breach-of-contract claim seeking $3.2M plus accelerated commissions. 
Counsel estimates loss is possible but not probable. A clause review also identified two strategic customer MSAs with non-standard change-of-control notice rights and uncapped confidentiality indemnity language.\n", " \"\"\",\n", " \"security/security_overview.md\": \"\"\"\n", " # Security Overview\n", "\n", " SOC 2 Type I is complete. SOC 2 Type II fieldwork is in progress, and the Type II report has not been issued.\n", "\n", " Customer security reviews should verify the exact certification status before relying on SOC 2 claims.\n", " \"\"\",\n", " \"sales/security_faq.md\": \"\"\"\n", " # Sales Security FAQ\n", "\n", " Field guidance says FictionalCorp XYZ is \"SOC 2 complete\" for late-stage enterprise deals.\n", "\n", " Security team note: this wording was intended to refer to Type I readiness, not an issued Type II report. Do not use this FAQ as certification evidence without checking `security/security_overview.md`.\n", " \"\"\",\n", " \"hr/org_chart.md\": \"\"\"\n", " # Org Chart\n", "\n", " - CEO\n", " - CFO\n", " - VP Sales\n", " - VP Product\n", " - Head of Security\n", "\n", " Hiring plan assumes 14 net new GTM hires in 2026.\n", " \"\"\",\n", " \"sales/pipeline_notes.md\": \"\"\"\n", " # Pipeline Notes\n", "\n", " Commit-stage pipeline includes $1.6M of DataHarbor-sourced opportunities that may be affected by the reseller dispute.\n", "\n", " Northstar expansion pipeline assumes completion of SOC 2 Type II before procurement review. Finance has not included this expansion in controlled FY2025 ARR.\n", " \"\"\",\n", " \"notes/qa_log.md\": \"\"\"\n", " # Diligence Q&A Log\n", "\n", " - NRR was requested. 
RevOps provided a 122% management estimate, but finance has not validated official NRR and says the estimate excludes downsold Northstar entities and a churned reseller-sourced account.\n", " - CAC payback was requested but not provided.\n", " - Top-two customer ARR equals $12.4M, or 34% of FY2025 ARR based on `customers/top_customers.csv`.\n", " - Northstar Holdings parent-account ARR equals $12.4M, or 34% of FY2025 ARR based on `customers/account_hierarchy.csv`.\n", " - Board ARR should not be silently reconciled to finance ARR; use `financials/revenue_recognition_notes.md` for the difference.\n", " \"\"\",\n", " \"financials/arr_bridge.csv\": \"\"\"\n", " metric,value_m\n", " opening_arr_2025_m,21.58\n", " new_arr_m,8.1\n", " expansion_arr_m,3.2\n", " contraction_arr_m,1.1\n", " churn_arr_m,2.7\n", " ending_arr_2025_m,36.9\n", " bookings_adjusted_arr_m,40.8\n", " \"\"\",\n", " \"financials/monthly_kpis.csv\": \"\"\"\n", " month,ending_arr_m,new_arr_m,expansion_arr_m,churn_arr_m,gross_margin\n", " 2025-01,21.58,0.55,0.35,0.18,0.69\n", " 2025-02,23.28,0.59,0.37,0.20,0.69\n", " 2025-03,24.98,0.63,0.39,0.21,0.69\n", " 2025-04,26.69,0.67,0.41,0.22,0.69\n", " 2025-05,28.39,0.71,0.43,0.24,0.69\n", " 2025-06,30.09,0.75,0.45,0.26,0.69\n", " 2025-07,31.79,0.79,0.47,0.27,0.69\n", " 2025-09,33.50,0.83,0.49,0.28,0.69\n", " 2025-10,35.20,0.87,0.51,0.30,0.69\n", " 2025-12,36.90,0.91,0.53,0.32,0.69\n", " \"\"\",\n", " \"financials/p_and_l.csv\": \"\"\"\n", " period,revenue_m,gross_margin,opex_m,cash_burn_m,runway_months\n", " FY2025,30.26,0.69,47.71,2.9,11\n", " \"\"\",\n", " \"financials/retention_extract.csv\": \"\"\"\n", " metric,value,status,notes\n", " net_revenue_retention,122%,management_estimate_unvalidated,Sales deck estimate; excludes downsold Northstar entities and one churned reseller-sourced account.\n", " gross_revenue_retention,84%,finance_partial,Preliminary 2025 cohort; usage feeds incomplete for two enterprise customers.\n", " 
logo_retention,91%,finance_partial,\"Includes legal entities, not parent-account rollups.\"\n", " cac_payback_months,,not_provided,Requested by diligence team; no source schedule in dataroom.\n", " \"\"\",\n", " \"customers/top_customers.csv\": \"\"\"\n", " customer,parent_account,arr_m,arr_share,segment,renewal_date,inclusion_basis\n", " Northstar Bank,Northstar Holdings,7.8,0.2114,Enterprise,2026-02-15,controlled_arr_bridge\n", " Northstar Capital Markets,Northstar Holdings,4.6,0.1247,Enterprise,2026-04-01,controlled_arr_bridge\n", " Helio Retail,Helio Retail,6.9,0.1870,Enterprise,2026-05-15,controlled_arr_bridge\n", " BluePeak Logistics,BluePeak Logistics,3.6,0.0976,Mid-market,2026-06-30,controlled_arr_bridge\n", " Summit Foods,Summit Foods,2.2,0.0596,Mid-market,2026-02-28,controlled_arr_bridge\n", " \"\"\",\n", " \"customers/account_hierarchy.csv\": \"\"\"\n", " legal_entity,parent_account,parent_arr_m,note\n", " Northstar Bank,Northstar Holdings,12.4,Same procurement parent as Northstar Capital Markets.\n", " Northstar Capital Markets,Northstar Holdings,12.4,Managed by separate RevOps owner but same parent renewal committee.\n", " Helio Retail,Helio Retail,6.9,Standalone parent account.\n", " BluePeak Logistics,BluePeak Logistics,3.6,Standalone parent account; renewal issue open.\n", " \"\"\",\n", " \"customers/renewal_calendar.csv\": \"\"\"\n", " customer,renewal_date,renewal_risk,notes\n", " Northstar Bank,2026-02-15,medium,Expansion depends on completed SOC 2 Type II.\n", " Northstar Capital Markets,2026-04-01,medium,Same parent procurement committee as Northstar Bank.\n", " Helio Retail,2026-05-15,medium,Adoption below plan; forecast latency escalation remains in monitoring.\n", " BluePeak Logistics,2026-06-30,high,Open CRM sync errors and renewal risk.\n", " \"\"\",\n", " \"customers/customer_health.csv\": \"\"\"\n", " customer,health,primary_risk,signal_date,caveat\n", " Northstar Bank,green,none flagged,2025-10-31,\"Northstar health is recorded by legal 
entity, not parent account.\"\n", " Northstar Capital Markets,yellow,monitor adoption,2025-10-31,\"Northstar health is recorded by legal entity, not parent account.\"\n", " Helio Retail,yellow,monitor adoption,2025-12-15,\n", " BluePeak Logistics,red,renewal risk,2025-12-15,\n", " Summit Foods,yellow,monitor adoption,2025-12-15,\n", " \"\"\",\n", " \"legal/clause_inventory.csv\": \"\"\"\n", " customer,issue,exposure,confidence\n", " Northstar Bank,change_of_control_notice,customer may request transition plan within 10 days of a control transaction,medium\n", " Helio Retail,uncapped_confidentiality_indemnity,uncapped liability for confidentiality breach; not reflected in management summary,high\n", " BluePeak Logistics,service_credit_carveout,credits can exceed one month fees if CRM sync SLA missed for two consecutive months,medium\n", " \"\"\",\n", " \"sales/pipeline.csv\": \"\"\"\n", " stage,pipeline_m,historical_close_rate,quality_note\n", " commit,6.1,0.39,Includes security-dependent Northstar expansion.\n", " best_case,9.7,0.28,Includes DataHarbor-sourced opportunities under dispute.\n", " early,18.2,0.08,High volume but low conversion quality.\n", " \"\"\",\n", " \"support/escalations.csv\": \"\"\"\n", " customer,severity,issue,status\n", " Northstar Capital Markets,medium,Forecast latency,monitoring\n", " BluePeak Logistics,high,CRM sync errors,open\n", " Northstar Bank,medium,Security questionnaire blocked pending SOC 2 Type II report,open\n", " \"\"\",\n", "}\n", "\n" ] }, { "cell_type": "markdown", "id": "workspace-materialize-heading", "metadata": {}, "source": [ "### Materialize the synthetic data\n", "\n", "Write the source files to disk, add a manifest, and inspect the generated dataset.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "workspace-materialize", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset created: 24 files\n" ] } ], "source": [ "def write_workspace_file(path: Path, content: str) 
-> None:\n", " path.parent.mkdir(parents=True, exist_ok=True)\n", " path.write_text(dedent(content).strip() + \"\\n\", encoding=\"utf-8\")\n", "\n", "\n", "def generate_acquisition_diligence_workspace() -> Path:\n", " \"\"\"Create the synthetic acquisition-diligence workspace directly from notebook data.\"\"\"\n", " dataroom = ARTIFACT_DIR / \"synthetic_dataroom\"\n", " shutil.rmtree(dataroom, ignore_errors=True)\n", " for relative_path, content in WORKSPACE_FILES.items():\n", " write_workspace_file(dataroom / relative_path, content)\n", " manifest = {\n", " \"company_name\": \"FictionalCorp XYZ\",\n", " \"scenario\": \"adversarial_diligence\",\n", " \"files\": sorted(str(path.relative_to(dataroom)) for path in dataroom.rglob(\"*\") if path.is_file()),\n", " }\n", " write_workspace_file(dataroom / \"manifest.json\", json.dumps(manifest, indent=2))\n", " return dataroom\n", "\n", "\n", "dataset = generate_acquisition_diligence_workspace()\n", "files = sorted(str(path.relative_to(dataset)) for path in dataset.rglob(\"*\") if path.is_file())\n", "print(f\"Dataset created: {len(files)} files\")\n" ] }, { "cell_type": "markdown", "id": "step-1-revised", "metadata": {}, "source": [ "## Step 2. Define the Agents SDK-backed analyst\n", "\n", "The example agent performs acquisition diligence on a fictional SaaS company being reviewed as a possible acquisition target. The case materials contain both structured exports and management narratives. Some sources agree, some conflict, and some important claims are only partially supported. That gives us a realistic reason to improve the harness over time.\n", "\n", "The agent answers questions for an investment team using only the supplied company data. 
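\n", "\n", "For a flavor of the reviewable artifacts, a hypothetical entry in the agent's `risk_register.json` (the field names here are illustrative; the notebook does not pin an exact schema) could pair a risk with its supporting files:\n", "\n", "```json\n", "{\n", "  \"risk\": \"SOC 2 Type II report not yet issued\",\n", "  \"severity\": \"medium\",\n", "  \"evidence\": [\"security/security_overview.md\", \"support/escalations.csv\"]\n", "}\n", "```\n", "\n", "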
It should prefer structured financial evidence over narrative summaries when they disagree, preserve uncertainty when evidence is missing, and leave behind artifacts that another reviewer can inspect.\n", "\n", "The OpenAI Agents SDK provides the managed runner, sandbox execution, model settings, and tracing hooks this workflow needs. Together, the prompt, tools, routing rules, output requirements, and validation checks form the current **agent harness**.\n", "\n", "### Artifacts generated by the agent\n", "\n", "| Artifact | Why the agent writes it |\n", "| --- | --- |\n", "| `summary_answer.md` | The concise answer returned to the user |\n", "| `investment_memo.md` | A fuller review artifact for diligence readers |\n", "| `risk_register.json` | Structured risks with evidence that downstream systems can inspect |\n", "| `open_questions.md` | Missing evidence or unresolved questions that should stay visible |\n", "| `citations.json` | A machine-readable link from claims to source files |\n", "| `evidence_table.csv` | A tabular audit trail of claims and supporting sources |\n", "\n", "These artifacts keep the work reviewable by preserving supporting evidence, unresolved questions, and required files alongside the final answer.\n", "\n", "### Failure modes to watch for\n", "\n", "This notebook is designed to surface failures such as:\n", "\n", "- Treating management narrative as an official metric when the structured exports disagree\n", "- Reporting an unsupported NRR estimate as if finance had validated it\n", "- Collapsing parent-account concentration into a weaker legal-entity view\n", "- Saying \u201cSOC 2 complete\u201d when the evidence only supports Type I\n", "- Producing a polished answer while leaving citations, risk files, or evidence artifacts incomplete\n" ] }, { "cell_type": "markdown", "id": "harness-schema-heading", "metadata": {}, "source": [ "### Define the harness schema\n", "\n", "Start with small data structures for the model settings and 
promoted agent configuration. These make the harness explicit so later optimization can target more than prompt wording.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "harness-schema", "metadata": {}, "outputs": [], "source": [ "\n", "@dataclass(frozen=True)\n", "class ModelSettings:\n", " agent_model: str\n", " reasoning_effort: str\n", "\n", "\n", "@dataclass(frozen=True)\n", "class AgentConfig:\n", " version: str\n", " system_prompt: str\n", " model_settings: ModelSettings\n", " tool_policy: dict[str, Any]\n", " eval_metadata: dict[str, Any]\n", " path: Path = field(default_factory=lambda: Path(\"notebook_defined_agent_config\"))\n", "\n", " @property\n", " def required_artifacts(self) -> list[str]:\n", " return self.tool_policy[\"required_artifacts\"]\n", "\n", " def build_instructions(self) -> str:\n", " return \"\\n\\n\".join([\n", " self.system_prompt,\n", " format_policy_section(\"Tool policy\", self.tool_policy),\n", " f\"Runtime config:\\n- Config version: `{self.version}`.\\n- Treat this config as the promoted runtime contract.\\n- Do not modify the runtime config during the run.\",\n", " ]) + \"\\n\"\n", "\n", "\n", "def format_policy_section(title: str, policy: dict[str, Any]) -> str:\n", " lines = [f\"{title}:\"]\n", " for key, value in policy.items():\n", " lines.extend(format_policy_value(key, value))\n", " return \"\\n\".join(lines)\n", "\n", "\n", "def format_policy_value(key: str, value: Any, indent: int = 0) -> list[str]:\n", " prefix = \" \" * indent\n", " if isinstance(value, dict):\n", " lines = [f\"{prefix}- {key}:\"]\n", " for child_key, child_value in value.items():\n", " lines.extend(format_policy_value(child_key, child_value, indent + 1))\n", " return lines\n", " if isinstance(value, list):\n", " lines = [f\"{prefix}- {key}:\"]\n", " for item in value:\n", " if isinstance(item, dict):\n", " lines.append(f\"{prefix} -\")\n", " for child_key, child_value in item.items():\n", " lines.extend(format_policy_value(child_key, 
child_value, indent + 2))\n", " else:\n", " lines.append(f\"{prefix} - {item}\")\n", " return lines\n", " return [f\"{prefix}- {key}: {value}\"]\n" ] }, { "cell_type": "markdown", "id": "harness-policy-heading", "metadata": {}, "source": [ "### Configure instructions and policies\n", "\n", "The system prompt states the evidence rules, the tool policy defines what the agent may read and write, and the eval metadata records which version of the harness is currently promoted.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "harness-policy", "metadata": {}, "outputs": [], "source": [ "SYSTEM_PROMPT = \"\"\"\n", "You are a diligence analyst reviewing a synthetic company dataroom.\n", "\n", "Evidence scope:\n", "- Use only files under `data/`.\n", "- Do not use outside knowledge or assumptions.\n", "- Prefer structured CSV/JSON exports over narrative files when they conflict.\n", "\n", "Runtime tools:\n", "- The sandbox starts in the mounted workspace root. Use workspace-relative paths such as `data/...` and `outputs/...`; when running shell commands, omit `workdir` or use a relative path only. Never pass absolute temporary paths.\n", "- `data/tools/check_evidence_coverage.py`: use this before finalizing answers with material claims. Create a JSON list of claims with `claim`, `claim_type`, and `citations`, then run `python data/tools/check_evidence_coverage.py --claims-json outputs/claim_audit_input.json --dataset-root data --output outputs/evidence_coverage.json`.\n", "- `data/tools/validate_output_contract.py`: run this after writing the required artifacts and before final response with `python data/tools/validate_output_contract.py --outputs outputs --dataset-root data --output outputs/output_contract_validation.json`.\n", "- If either tool reports unsupported claims, missing citations, missing files, malformed JSON, or empty artifacts, revise the answer/artifacts before finalizing. 
If the evidence is unavailable, say the claim is unknown or unsupported.\n", "\n", "Citation rules:\n", "- Every material claim must cite one or more source filenames.\n", "- Cite filenames exactly as workspace-relative paths, for example `financials/arr_bridge.csv`.\n", "- Do not cite files that do not support the claim.\n", "\n", "Unknown-handling rules:\n", "- If evidence is missing, state that the answer is unknown or unsupported.\n", "- Never fabricate missing numbers.\n", "- If evidence conflicts, state the conflict explicitly instead of reconciling silently.\n", "\n", "Output rules:\n", "- Write `outputs/summary_answer.md`.\n", "- Write `outputs/investment_memo.md`.\n", "- Write `outputs/risk_register.json`.\n", "- Write `outputs/open_questions.md`.\n", "- Write `outputs/citations.json`.\n", "- Write `outputs/evidence_table.csv`.\n", "\"\"\".strip()\n", "\n", "MODEL_SETTINGS = {\n", " \"agent_model\": AGENT_MODEL,\n", " \"reasoning_effort\": \"medium\",\n", "}\n", "\n", "TOOL_POLICY = {\n", " \"allowed_data_root\": \"data\",\n", " \"writable_output_root\": \"outputs\",\n", " \"required_artifacts\": [\n", " \"summary_answer.md\",\n", " \"investment_memo.md\",\n", " \"risk_register.json\",\n", " \"open_questions.md\",\n", " \"citations.json\",\n", " \"evidence_table.csv\",\n", " ],\n", " \"evidence_preference\": [\n", " \"Prefer structured CSV or JSON exports over narrative summaries when sources conflict.\",\n", " \"Treat board materials as useful narrative evidence, not the final system of record for metrics.\",\n", " \"Surface unresolved conflicts instead of silently reconciling them.\",\n", " ],\n", " \"runtime_tools\": [\n", " {\n", " \"path\": \"data/tools/check_evidence_coverage.py\",\n", " \"purpose\": \"Audit drafted material claims against cited dataroom files before final answer.\",\n", " \"recommended_command\": \"python data/tools/check_evidence_coverage.py --claims-json outputs/claim_audit_input.json --dataset-root data --output 
outputs/evidence_coverage.json\",\n", " },\n", " {\n", " \"path\": \"data/tools/validate_output_contract.py\",\n", " \"purpose\": \"Validate required output artifacts, JSON shape, and citation/source file references.\",\n", " \"recommended_command\": \"python data/tools/validate_output_contract.py --outputs outputs --dataset-root data --output outputs/output_contract_validation.json\",\n", " },\n", " ],\n", " \"unknown_handling\": [\n", " \"Say unknown or unsupported when a metric is absent.\",\n", " \"Do not infer missing values from adjacent metrics.\",\n", " \"Keep facts, inferences, and open questions separate.\",\n", " ],\n", " \"mutation_policy\": [\n", " \"Write only to the configured outputs directory.\",\n", " \"Do not modify dataroom inputs.\",\n", " \"Do not modify runtime agent configuration during a run.\",\n", " ],\n", "}\n", "\n", "EVAL_METADATA = {\n", " \"version\": \"v001\",\n", " \"status\": \"promoted\",\n", " \"created_by\": \"manual_baseline\",\n", " \"promotion_gate\": \"manual_review\",\n", " \"description\": \"Baseline diligence analyst config with strict dataroom grounding, citation, unknown-handling, and artifact rules.\",\n", "}\n", "\n", "agent_config = AgentConfig(\n", " version=EVAL_METADATA[\"version\"],\n", " system_prompt=SYSTEM_PROMPT,\n", " model_settings=ModelSettings(**MODEL_SETTINGS),\n", " tool_policy=TOOL_POLICY,\n", " eval_metadata=EVAL_METADATA,\n", ")\n" ] }, { "cell_type": "markdown", "id": "harness-summary-heading", "metadata": {}, "source": [ "### Inspect the agent config\n", "\n", "This compact view shows the promoted config version, the selected models, the required artifacts, and the runtime tools the agent can use.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "harness-summary", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "\n", "### Agent config summary\n", "\n", "- **Version:** `v001`\n", "- **Agent model:** `gpt-5.5`\n", "- **Reasoning effort:** `medium`\n", "\n", "**Required 
artifacts**\n", "- `summary_answer.md`\n", "- `investment_memo.md`\n", "- `risk_register.json`\n", "- `open_questions.md`\n", "- `citations.json`\n", "- `evidence_table.csv`\n", "\n", "**Runtime tools**\n", "- `data/tools/check_evidence_coverage.py` \u2014 Audit drafted material claims against cited dataroom files before final answer.\n", "- `data/tools/validate_output_contract.py` \u2014 Validate required output artifacts, JSON shape, and citation/source file references.\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "required_artifacts_md = \"\\n\".join(\n", " f\"- `{artifact}`\" for artifact in agent_config.required_artifacts\n", ")\n", "runtime_tools_md = \"\\n\".join(\n", " f\"- `{tool['path']}` \u2014 {tool['purpose']}\"\n", " for tool in agent_config.tool_policy[\"runtime_tools\"]\n", ")\n", "\n", "display(Markdown(f\"\"\"\n", "### Agent config summary\n", "\n", "- **Version:** `{agent_config.version}`\n", "- **Agent model:** `{agent_config.model_settings.agent_model}`\n", "- **Reasoning effort:** `{agent_config.model_settings.reasoning_effort}`\n", "\n", "**Required artifacts**\n", "{required_artifacts_md}\n", "\n", "**Runtime tools**\n", "{runtime_tools_md}\n", "\"\"\"))\n", "\n" ] }, { "cell_type": "markdown", "id": "runtime-validation-heading", "metadata": {}, "source": [ "### Add validation tools\n", "\n", "The next helpers create two local tools inside the workspace: one checks whether drafted claims cite real dataroom files, and the other verifies that the required output artifacts exist and have the expected shape. 
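The coverage checker expects `--claims-json` to point at a JSON list of claim objects. A minimal illustrative input (the claim text here is hypothetical; the `claim`, `claim_type`, and `citations` field names match what the script reads):\n", "\n", "```json\n", "[\n", "  {\n", "    \"claim\": \"FY2025 ARR is $36.9M\",\n", "    \"claim_type\": \"metric\",\n", "    \"citations\": [\"financials/arr_bridge.csv\"]\n", "  }\n", "]\n", "```\n", "\n", "A claim with an empty `citations` list is reported under `missing_citations`, and a citation that does not resolve to a file under the dataset root lands in `unsupported_claims`. 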
The code is hidden by default to save space, but you can expand it if you want to inspect the implementation.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "runtime-validation-tools", "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "\n", "\n", "\n", "CHECK_EVIDENCE_COVERAGE = r'''#!/usr/bin/env python3\n", "\n", "import argparse\n", "import json\n", "from pathlib import Path\n", "\n", "\n", "def main() -> None:\n", " parser = argparse.ArgumentParser(description=\"Audit whether drafted claims cite existing dataroom files.\")\n", " parser.add_argument(\"--claims-json\", type=Path, required=True)\n", " parser.add_argument(\"--dataset-root\", type=Path, default=Path(\"data\"))\n", " parser.add_argument(\"--output\", type=Path, default=Path(\"outputs/evidence_coverage.json\"))\n", " args = parser.parse_args()\n", "\n", " claims = json.loads(args.claims_json.read_text(encoding=\"utf-8\"))\n", " if not isinstance(claims, list):\n", " raise ValueError(\"--claims-json must contain a JSON list of claim objects\")\n", "\n", " result = check_evidence_coverage(claims, args.dataset_root)\n", " args.output.parent.mkdir(parents=True, exist_ok=True)\n", " args.output.write_text(json.dumps(result, indent=2) + \"\\n\", encoding=\"utf-8\")\n", " print(json.dumps(result, indent=2))\n", "\n", "\n", "def check_evidence_coverage(claims: list[dict], dataset_root: Path) -> dict:\n", " supported = []\n", " unsupported = []\n", " missing_citations = []\n", "\n", " for raw in claims:\n", " claim = str(raw.get(\"claim\") or \"\").strip()\n", " claim_type = str(raw.get(\"claim_type\") or \"claim\")\n", " citations = [str(item).strip().removeprefix(\"data/\") for item in raw.get(\"citations\") or [] if str(item).strip()]\n", " row = {\"claim\": claim, \"claim_type\": claim_type, \"citations\": citations}\n", " if not citations:\n", " missing_citations.append({**row, \"issue\": \"No citation provided.\"})\n", " continue\n", " missing = [citation for 
citation in citations if not (dataset_root / citation).exists()]\n", " if missing:\n", " unsupported.append({**row, \"issue\": f\"Missing cited file(s): {', '.join(missing)}\"})\n", " else:\n", " supported.append(row)\n", "\n", " return {\n", " \"supported_claims\": supported,\n", " \"unsupported_claims\": unsupported,\n", " \"missing_citations\": missing_citations,\n", " \"recommended_caveats\": [\n", " \"Add valid source filenames or mark unsupported claims as unknown before final answer.\"\n", " ],\n", " \"passed\": not unsupported and not missing_citations,\n", " }\n", "\n", "\n", "if __name__ == \"__main__\":\n", " main()\n", "'''\n", "\n", "\n", "VALIDATE_OUTPUT_CONTRACT = r'''#!/usr/bin/env python3\n", "\n", "import argparse\n", "import csv\n", "import json\n", "from pathlib import Path\n", "\n", "\n", "REQUIRED_FILES = [\n", " \"summary_answer.md\",\n", " \"investment_memo.md\",\n", " \"risk_register.json\",\n", " \"open_questions.md\",\n", " \"citations.json\",\n", " \"evidence_table.csv\",\n", "]\n", "\n", "\n", "def main() -> None:\n", " parser = argparse.ArgumentParser(description=\"Validate diligence output artifacts before final answer.\")\n", " parser.add_argument(\"--outputs\", type=Path, default=Path(\"outputs\"))\n", " parser.add_argument(\"--dataset-root\", type=Path, default=Path(\"data\"))\n", " parser.add_argument(\"--output\", type=Path, default=Path(\"outputs/output_contract_validation.json\"))\n", " args = parser.parse_args()\n", "\n", " result = validate_output_contract(args.outputs, args.dataset_root)\n", " args.output.parent.mkdir(parents=True, exist_ok=True)\n", " args.output.write_text(json.dumps(result, indent=2) + \"\\n\", encoding=\"utf-8\")\n", " print(json.dumps(result, indent=2))\n", "\n", "\n", "def validate_output_contract(outputs: Path, dataset_root: Path) -> dict:\n", " issues = []\n", " for filename in REQUIRED_FILES:\n", " path = outputs / filename\n", " if not path.exists():\n", " issues.append({\"file\": filename, 
\"issue\": \"missing required artifact\"})\n", " elif path.stat().st_size == 0:\n", " issues.append({\"file\": filename, \"issue\": \"empty required artifact\"})\n", "\n", " risks = _read_json(outputs / \"risk_register.json\", default=[])\n", " citations = _read_json(outputs / \"citations.json\", default=[])\n", " if not isinstance(risks, list):\n", " issues.append({\"file\": \"risk_register.json\", \"issue\": \"must be a JSON list\"})\n", " risks = []\n", " if not isinstance(citations, list):\n", " issues.append({\"file\": \"citations.json\", \"issue\": \"must be a JSON list\"})\n", " citations = []\n", "\n", " for index, risk in enumerate(risks):\n", " evidence = risk.get(\"evidence\") if isinstance(risk, dict) else None\n", " if not evidence:\n", " issues.append({\"file\": \"risk_register.json\", \"risk_index\": index, \"issue\": \"risk lacks evidence\"})\n", " continue\n", " missing = [str(item).removeprefix(\"data/\") for item in evidence if not (dataset_root / str(item).removeprefix(\"data/\")).exists()]\n", " if missing:\n", " issues.append({\"file\": \"risk_register.json\", \"risk_index\": index, \"issue\": f\"missing evidence file(s): {', '.join(missing)}\"})\n", "\n", " for index, citation in enumerate(citations):\n", " sources = citation.get(\"sources\") if isinstance(citation, dict) else None\n", " if not sources:\n", " issues.append({\"file\": \"citations.json\", \"citation_index\": index, \"issue\": \"citation lacks sources\"})\n", " continue\n", " missing = [str(item).removeprefix(\"data/\") for item in sources if not (dataset_root / str(item).removeprefix(\"data/\")).exists()]\n", " if missing:\n", " issues.append({\"file\": \"citations.json\", \"citation_index\": index, \"issue\": f\"missing source file(s): {', '.join(missing)}\"})\n", "\n", " try:\n", " with (outputs / \"evidence_table.csv\").open(newline=\"\", encoding=\"utf-8\") as handle:\n", " rows = list(csv.DictReader(handle))\n", " if rows and not {\"claim_id\", \"claim\", 
\"sources\"}.issubset(rows[0].keys()):\n", " issues.append({\"file\": \"evidence_table.csv\", \"issue\": \"must include claim_id, claim, and sources columns\"})\n", " except FileNotFoundError:\n", " pass\n", "\n", " return {\"passed\": not issues, \"issues\": issues, \"required_files\": REQUIRED_FILES}\n", "\n", "\n", "def _read_json(path: Path, default):\n", " if not path.exists():\n", " return default\n", " try:\n", " return json.loads(path.read_text(encoding=\"utf-8\"))\n", " except json.JSONDecodeError as exc:\n", " return {\"error\": str(exc)}\n", "\n", "\n", "if __name__ == \"__main__\":\n", " main()\n", "'''\n", "\n", "\n", "def write_runtime_tools(dataset_dir: Path) -> list[str]:\n", " tools_dir = dataset_dir / \"tools\"\n", " tools_dir.mkdir(parents=True, exist_ok=True)\n", " files = {\n", " \"check_evidence_coverage.py\": CHECK_EVIDENCE_COVERAGE,\n", " \"validate_output_contract.py\": VALIDATE_OUTPUT_CONTRACT,\n", " }\n", " written: list[str] = []\n", " for filename, content in files.items():\n", " path = tools_dir / filename\n", " path.write_text(content, encoding=\"utf-8\")\n", " path.chmod(0o755)\n", " written.append(str(path.relative_to(dataset_dir)))\n", " return written\n" ] }, { "cell_type": "markdown", "id": "runtime-prompt-heading", "metadata": {}, "source": [ "### Build each user turn\n", "\n", "The prompt builder adds task-specific guidance only when it is needed, such as memo formatting, separate risk categories, or strict handling for unsupported NRR claims.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "runtime-prompt-builder", "metadata": {}, "outputs": [], "source": [ "def build_user_prompt(question: str, agent_config: Any | None = None) -> str:\n", " config_line = \"\"\n", " if agent_config is not None:\n", " config_line = f\"\\nActive agent config: `{agent_config.version}` from `{agent_config.path}`.\\n\"\n", " memo_instruction = \"\"\n", " if _asks_for_memo(question):\n", " memo_instruction = (\n", " \"\\nThe user asked for 
a memo-style deliverable. Return the memo content inline in \"\n", " \"your final answer and also write the required output artifacts. Do not answer only \"\n", " \"with a status update or artifact path list.\\n\"\n", " )\n", " risk_category_instruction = \"\"\n", " if _asks_for_top_risk_categories(question):\n", " risk_category_instruction = (\n", " \"\\nStructure the final answer with separate sections for Financial, Legal, and \"\n", " \"Customer concentration risks. Do not collapse customer concentration into the \"\n", " \"financial category.\\n\"\n", " )\n", " unsupported_metric_instruction = \"\"\n", " if _asks_for_net_revenue_retention(question):\n", " unsupported_metric_instruction = (\n", " \"\\nFor net revenue retention, report the metric only if the dataroom directly \"\n", " \"provides NRR/net revenue retention. Do not derive or estimate an NRR percentage \"\n", " \"from ARR bridge components unless the user explicitly asks for an estimate. If \"\n", " \"the metric is absent, say it is unknown or unsupported, cite the searched \"\n", " \"source files, and separate missing evidence from any directional inference.\\n\"\n", " )\n", " return f\"\"\"\n", "Answer this diligence question using only the mounted dataroom:\n", "\n", "{question}\n", "{config_line}\n", "{memo_instruction}\n", "{risk_category_instruction}\n", "{unsupported_metric_instruction}\n", "Also write the required output artifacts. 
Keep the answer concise, grounded, and citation-heavy.\n", "Use workspace-relative paths for shell commands and omit `workdir`; do not pass absolute temporary paths.\n", "\"\"\"\n", "\n", "\n", "def _asks_for_memo(question: str) -> bool:\n", " lower = question.lower()\n", " return \"memo\" in lower or \"ic-style\" in lower or \"investment committee\" in lower\n", "\n", "\n", "def _asks_for_top_risk_categories(question: str) -> bool:\n", " lower = question.lower()\n", " return all(term in lower for term in (\"financial\", \"legal\", \"customer\")) and \"risk\" in lower\n", "\n", "\n", "def _asks_for_net_revenue_retention(question: str) -> bool:\n", " lower = question.lower()\n", " return \"net revenue retention\" in lower or \"nrr\" in lower\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "runtime-tracing-heading", "metadata": {}, "source": [ "### Export traces for later optimization\n", "\n", "The local exporter converts Agents SDK events into the OpenTelemetry-style JSONL that HALO can read later. 
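Each JSONL line is one exported span. An abridged illustration of the shape (the IDs and timestamps are invented, and most attributes are omitted):\n", "\n", "```json\n", "{\n", "  \"trace_id\": \"0af7651916cd43dd8448eb211c80319c\",\n", "  \"span_id\": \"b7ad6b7169203331\",\n", "  \"parent_span_id\": \"\",\n", "  \"name\": \"response\",\n", "  \"kind\": \"SPAN_KIND_CLIENT\",\n", "  \"start_time\": \"2025-01-01T00:00:00.000000000Z\",\n", "  \"end_time\": \"2025-01-01T00:00:04.250000000Z\",\n", "  \"status\": {\"code\": \"STATUS_CODE_OK\", \"message\": \"\"},\n", "  \"resource\": {\"attributes\": {\"service.name\": \"financial-diligence-analyst\"}},\n", "  \"scope\": {\"name\": \"openai-agents-sdk\", \"version\": \"0.x\"},\n", "  \"attributes\": {\"openinference.span.kind\": \"LLM\"}\n", "}\n", "```\n", "\n", "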
It is implementation-heavy, so the code stays collapsed by default.\n" ] }, { "cell_type": "markdown", "id": "runtime-trace-config-heading", "metadata": {}, "source": [ "#### Configure the trace exporter\n", "\n", "Set up the exporter object that receives Agents SDK spans and writes one JSONL line per span.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "runtime-trace-config", "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "EXPORT_SCHEMA_VERSION = 1\n", "\n", "OBSERVATION_KIND_BY_TYPE = {\n", " \"agent\": \"AGENT\",\n", " \"generation\": \"LLM\",\n", " \"response\": \"LLM\",\n", " \"function\": \"TOOL\",\n", " \"mcp_tools\": \"TOOL\",\n", " \"handoff\": \"CHAIN\",\n", " \"guardrail\": \"GUARDRAIL\",\n", " \"custom\": \"SPAN\",\n", " \"task\": \"SPAN\",\n", " \"turn\": \"SPAN\",\n", " \"transcription\": \"SPAN\",\n", " \"speech\": \"SPAN\",\n", " \"speech_group\": \"SPAN\",\n", "}\n", "\n", "\n", "@dataclass(frozen=True)\n", "class HaloExportContext:\n", " project_id: str\n", " service_name: str\n", " service_version: str | None = None\n", " deployment_environment: str | None = None\n", " extra_resource_attributes: Mapping[str, Any] | None = None\n", "\n", "\n", "def setup_halo_tracing(\n", " path: str | Path,\n", " *,\n", " project_id: str = \"synthetic-dataroom-agent\",\n", " service_name: str = \"financial-diligence-analyst\",\n", " service_version: str | None = None,\n", " deployment_environment: str | None = None,\n", " extra_resource_attributes: Mapping[str, Any] | None = None,\n", "):\n", " from agents import set_trace_processors\n", "\n", " trace_path = Path(path)\n", " trace_path.parent.mkdir(parents=True, exist_ok=True)\n", " processor = HaloJsonlTraceProcessor(\n", " trace_path,\n", " ctx=HaloExportContext(\n", " project_id=project_id,\n", " service_name=service_name,\n", " service_version=service_version,\n", " deployment_environment=deployment_environment,\n", " 
extra_resource_attributes=extra_resource_attributes,\n", " ),\n", " )\n", " # Use only the local exporter for this cookbook workflow.\n", " # Hosted trace ingestion may be unavailable in some environments (for example ZDR orgs).\n", " set_trace_processors([processor])\n", " return processor\n", "\n", "\n", "class HaloJsonlTraceProcessor:\n", " def __init__(self, path: Path, *, ctx: HaloExportContext):\n", " self._path = path\n", " self._ctx = ctx\n", " self._lock = threading.Lock()\n", " self._handle = path.open(\"a\", encoding=\"utf-8\")\n", " self._trace_meta: dict[str, tuple[str | None, str | None, dict[str, Any]]] = {}\n", "\n", " def on_trace_start(self, trace) -> None: # noqa: ANN001\n", " data = trace.export() or {}\n", " trace_id = _strip_prefix(data.get(\"id\"), \"trace_\") or \"\"\n", " metadata = data.get(\"metadata\") if isinstance(data.get(\"metadata\"), dict) else {}\n", " self._trace_meta[trace_id] = (\n", " data.get(\"workflow_name\"),\n", " data.get(\"group_id\"),\n", " metadata,\n", " )\n", "\n", " def on_trace_end(self, trace) -> None: # noqa: ANN001\n", " data = trace.export() or {}\n", " trace_id = _strip_prefix(data.get(\"id\"), \"trace_\") or \"\"\n", " self._trace_meta.pop(trace_id, None)\n", "\n", " def on_span_start(self, span) -> None: # noqa: ANN001\n", " return None\n", "\n", " def on_span_end(self, span) -> None: # noqa: ANN001\n", " exported = span.export() or {}\n", " trace_id = _strip_prefix(exported.get(\"trace_id\"), \"trace_\") or \"\"\n", " workflow_name, group_id, trace_metadata = self._trace_meta.get(trace_id, (None, None, {}))\n", " line = span_to_halo_jsonl_line(\n", " span,\n", " ctx=self._ctx,\n", " workflow_name=workflow_name,\n", " group_id=group_id,\n", " trace_metadata=trace_metadata,\n", " )\n", " encoded = json.dumps(line, separators=(\",\", \":\"), ensure_ascii=False, default=str)\n", " with self._lock:\n", " self._handle.write(encoded)\n", " self._handle.write(\"\\n\")\n", "\n", " def shutdown(self) -> None:\n", " 
with self._lock:\n", " try:\n", " self._handle.flush()\n", " self._handle.close()\n", " except Exception:\n", " pass\n", "\n", " def force_flush(self) -> None:\n", " with self._lock:\n", " self._handle.flush()\n", "\n" ] }, { "cell_type": "markdown", "id": "runtime-trace-mapping-heading", "metadata": {}, "source": [ "#### Map SDK spans into HALO-readable fields\n", "\n", "These helpers translate each SDK span type into the attributes HALO will inspect later.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "runtime-trace-mapping", "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "def span_to_halo_jsonl_line(\n", " span,\n", " *,\n", " ctx: HaloExportContext,\n", " workflow_name: str | None = None,\n", " group_id: str | None = None,\n", " trace_metadata: Mapping[str, Any] | None = None,\n", ") -> dict[str, Any]:\n", " raw = span.export() or {}\n", " span_data = raw.get(\"span_data\") or {}\n", " span_type = str(span_data.get(\"type\") or \"custom\")\n", " error = raw.get(\"error\")\n", " resource_attributes: dict[str, Any] = {\"service.name\": ctx.service_name}\n", " if ctx.service_version:\n", " resource_attributes[\"service.version\"] = ctx.service_version\n", " if ctx.deployment_environment:\n", " resource_attributes[\"deployment.environment\"] = ctx.deployment_environment\n", " if ctx.extra_resource_attributes:\n", " resource_attributes.update(ctx.extra_resource_attributes)\n", "\n", " attributes, projection = _attributes_for_span_type(span_type, span_data)\n", " if workflow_name:\n", " attributes[\"agent.workflow.name\"] = workflow_name\n", " if group_id:\n", " attributes[\"agent.workflow.group_id\"] = group_id\n", " for key, value in (trace_metadata or {}).items():\n", " if _json_safe(value):\n", " attributes[f\"agent.trace_metadata.{key}\"] = value\n", " else:\n", " attributes[f\"agent.trace_metadata.{key}\"] = _json(value)\n", "\n", " attributes.update(\n", " {\n", " \"inference.export.schema_version\": 
EXPORT_SCHEMA_VERSION,\n", " \"inference.project_id\": ctx.project_id,\n", " \"inference.observation_kind\": OBSERVATION_KIND_BY_TYPE.get(span_type, \"SPAN\"),\n", " \"inference.llm.provider\": projection.get(\"llm_provider\"),\n", " \"inference.llm.model_name\": projection.get(\"llm_model_name\"),\n", " \"inference.llm.input_tokens\": projection.get(\"input_tokens\"),\n", " \"inference.llm.output_tokens\": projection.get(\"output_tokens\"),\n", " \"inference.llm.cost.total\": projection.get(\"cost_total\"),\n", " \"inference.user_id\": projection.get(\"user_id\"),\n", " \"inference.session_id\": group_id,\n", " \"inference.agent_name\": projection.get(\"agent_name\") or \"\",\n", " }\n", " )\n", "\n", " return {\n", " \"trace_id\": _strip_prefix(raw.get(\"trace_id\"), \"trace_\") or \"\",\n", " \"span_id\": _strip_prefix(raw.get(\"id\"), \"span_\") or \"\",\n", " \"parent_span_id\": _strip_prefix(raw.get(\"parent_id\"), \"span_\") or \"\",\n", " \"trace_state\": \"\",\n", " \"name\": _span_name(span_type, span_data),\n", " \"kind\": _span_kind(span_type),\n", " \"start_time\": _to_otlp_timestamp(raw.get(\"started_at\")),\n", " \"end_time\": _to_otlp_timestamp(raw.get(\"ended_at\")),\n", " \"status\": {\n", " \"code\": \"STATUS_CODE_ERROR\" if error else \"STATUS_CODE_OK\",\n", " \"message\": str((error or {}).get(\"message\") or \"\"),\n", " },\n", " \"resource\": {\"attributes\": resource_attributes},\n", " \"scope\": {\"name\": \"openai-agents-sdk\", \"version\": _sdk_version()},\n", " \"attributes\": {key: value for key, value in attributes.items() if value is not None},\n", " }\n", "\n", "\n", "def _attributes_for_span_type(\n", " span_type: str,\n", " data: Mapping[str, Any],\n", ") -> tuple[dict[str, Any], dict[str, Any]]:\n", " if span_type == \"agent\":\n", " return _agent_attrs(data)\n", " if span_type == \"generation\":\n", " return _generation_attrs(data)\n", " if span_type == \"response\":\n", " return _response_attrs(data)\n", " if span_type == 
\"function\":\n", " return _function_attrs(data)\n", " if span_type == \"mcp_tools\":\n", " return _mcp_tools_attrs(data)\n", " if span_type == \"handoff\":\n", " return _handoff_attrs(data)\n", " if span_type == \"guardrail\":\n", " return _guardrail_attrs(data)\n", " return _custom_attrs(span_type, data)\n", "\n", "\n", "def _agent_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " name = data.get(\"name\") or \"\"\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"AGENT\",\n", " \"agent.name\": name,\n", " \"agent.handoffs\": _json(data.get(\"handoffs\")),\n", " \"agent.tools\": _json(data.get(\"tools\")),\n", " \"agent.output_type\": data.get(\"output_type\"),\n", " }\n", " ), {\"agent_name\": name}\n", "\n", "\n", "def _generation_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " usage = data.get(\"usage\") or {}\n", " input_messages = data.get(\"input\") or []\n", " output_messages = data.get(\"output\") or []\n", " attrs: dict[str, Any] = {\n", " \"openinference.span.kind\": \"LLM\",\n", " \"llm.provider\": \"openai\",\n", " \"llm.model_name\": data.get(\"model\"),\n", " \"llm.invocation_parameters\": _json(data.get(\"model_config\")),\n", " \"llm.input_messages\": _json(list(input_messages)),\n", " \"llm.output_messages\": _json(list(output_messages)),\n", " \"llm.token_count.prompt\": _int(usage.get(\"input_tokens\") or usage.get(\"prompt_tokens\")),\n", " \"llm.token_count.completion\": _int(\n", " usage.get(\"output_tokens\") or usage.get(\"completion_tokens\")\n", " ),\n", " \"llm.token_count.total\": _int(usage.get(\"total_tokens\")),\n", " }\n", " attrs.update(_expand_messages(\"llm.input_messages\", input_messages))\n", " attrs.update(_expand_messages(\"llm.output_messages\", output_messages))\n", " return _drop_none(attrs), {\n", " \"llm_provider\": \"openai\",\n", " \"llm_model_name\": data.get(\"model\"),\n", " \"input_tokens\": _int(usage.get(\"input_tokens\") or 
usage.get(\"prompt_tokens\")),\n", " \"output_tokens\": _int(usage.get(\"output_tokens\") or usage.get(\"completion_tokens\")),\n", " }\n", "\n", "\n", "def _response_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " usage = data.get(\"usage\") or {}\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"LLM\",\n", " \"llm.provider\": \"openai\",\n", " \"llm.response.id\": data.get(\"response_id\"),\n", " \"llm.token_count.prompt\": _int(usage.get(\"input_tokens\") or usage.get(\"prompt_tokens\")),\n", " \"llm.token_count.completion\": _int(\n", " usage.get(\"output_tokens\") or usage.get(\"completion_tokens\")\n", " ),\n", " \"llm.token_count.total\": _int(usage.get(\"total_tokens\")),\n", " }\n", " ), {\n", " \"llm_provider\": \"openai\",\n", " \"input_tokens\": _int(usage.get(\"input_tokens\") or usage.get(\"prompt_tokens\")),\n", " \"output_tokens\": _int(usage.get(\"output_tokens\") or usage.get(\"completion_tokens\")),\n", " }\n", "\n", "\n", "def _function_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"TOOL\",\n", " \"tool.name\": data.get(\"name\"),\n", " \"input.value\": data.get(\"input\"),\n", " \"output.value\": data.get(\"output\"),\n", " \"mcp.data\": _json(data.get(\"mcp_data\")),\n", " }\n", " ), {}\n", "\n", "\n", "def _mcp_tools_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"TOOL\",\n", " \"mcp.server\": data.get(\"server\"),\n", " \"mcp.tools.listed\": _json(data.get(\"result\")),\n", " }\n", " ), {}\n", "\n", "\n", "def _handoff_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"CHAIN\",\n", " \"agent.handoff.from\": data.get(\"from_agent\"),\n", " \"agent.handoff.to\": data.get(\"to_agent\"),\n", " }\n", " ), 
{\"agent_name\": data.get(\"to_agent\")}\n", "\n", "\n", "def _guardrail_attrs(data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " return _drop_none(\n", " {\n", " \"openinference.span.kind\": \"GUARDRAIL\",\n", " \"guardrail.name\": data.get(\"name\"),\n", " \"guardrail.triggered\": bool(data.get(\"triggered\")),\n", " }\n", " ), {}\n", "\n", "\n", "def _custom_attrs(span_type: str, data: Mapping[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:\n", " attrs: dict[str, Any] = {\n", " \"openinference.span.kind\": \"CHAIN\",\n", " \"sdk.span.type\": span_type,\n", " }\n", " if data.get(\"name\"):\n", " attrs[\"sdk.span.name\"] = data.get(\"name\")\n", " payload = data.get(\"data\") or {}\n", " if isinstance(payload, Mapping):\n", " for key, value in payload.items():\n", " attrs[f\"sdk.data.{key}\"] = value if _json_safe(value) else _json(value)\n", " if \"usage\" in data:\n", " attrs[\"llm.token_count.total\"] = _int((data.get(\"usage\") or {}).get(\"total_tokens\"))\n", " return _drop_none(attrs), {}\n", "\n" ] }, { "cell_type": "markdown", "id": "runtime-trace-normalize-heading", "metadata": {}, "source": [ "#### Normalize helper values\n", "\n", "The final helpers keep IDs, timestamps, and serialized values consistent across exported spans.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "runtime-trace-normalize", "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "def _strip_prefix(value: Any, prefix: str) -> str | None:\n", " if not value:\n", " return None\n", " text = str(value)\n", " return text[len(prefix) :] if text.startswith(prefix) else text\n", "\n", "\n", "def _to_otlp_timestamp(value: str | None) -> str:\n", " if not value:\n", " return \"\"\n", " parsed = datetime.fromisoformat(value)\n", " if parsed.tzinfo is None:\n", " parsed = parsed.replace(tzinfo=timezone.utc)\n", " parsed = parsed.astimezone(timezone.utc)\n", " return parsed.strftime(\"%Y-%m-%dT%H:%M:%S.\") + 
f\"{parsed.microsecond:06d}000Z\"\n", "\n", "\n", "def _span_kind(span_type: str) -> str:\n", " return \"SPAN_KIND_CLIENT\" if span_type in {\"generation\", \"response\"} else \"SPAN_KIND_INTERNAL\"\n", "\n", "\n", "def _span_name(span_type: str, data: Mapping[str, Any]) -> str:\n", " if data.get(\"name\"):\n", " return f\"{span_type}.{data['name']}\"\n", " if data.get(\"model\"):\n", " return f\"{span_type}.{data['model']}\"\n", " return span_type\n", "\n", "\n", "def _expand_messages(prefix: str, messages: Iterable[Mapping[str, Any]]) -> dict[str, Any]:\n", " attrs: dict[str, Any] = {}\n", " for index, message in enumerate(messages or []):\n", " if not isinstance(message, Mapping):\n", " continue\n", " role = message.get(\"role\")\n", " content = message.get(\"content\")\n", " if role is not None:\n", " attrs[f\"{prefix}.{index}.message.role\"] = role\n", " if isinstance(content, str):\n", " attrs[f\"{prefix}.{index}.message.content\"] = content\n", " elif content is not None:\n", " attrs[f\"{prefix}.{index}.message.content\"] = _json(content)\n", " for tool_index, tool_call in enumerate(message.get(\"tool_calls\") or []):\n", " function = (tool_call or {}).get(\"function\") or {}\n", " attrs[f\"{prefix}.{index}.message.tool_calls.{tool_index}.tool_call.id\"] = (\n", " tool_call or {}\n", " ).get(\"id\")\n", " attrs[\n", " f\"{prefix}.{index}.message.tool_calls.{tool_index}.tool_call.function.name\"\n", " ] = function.get(\"name\")\n", " attrs[\n", " f\"{prefix}.{index}.message.tool_calls.{tool_index}.tool_call.function.arguments\"\n", " ] = function.get(\"arguments\")\n", " if message.get(\"tool_call_id\"):\n", " attrs[f\"{prefix}.{index}.message.tool_call_id\"] = message[\"tool_call_id\"]\n", " if message.get(\"name\"):\n", " attrs[f\"{prefix}.{index}.message.name\"] = message[\"name\"]\n", " return {key: value for key, value in attrs.items() if value is not None}\n", "\n", "\n", "def _json(value: Any) -> str | None:\n", " if value is None:\n", " return 
None\n", " return json.dumps(value, default=str, separators=(\",\", \":\"))\n", "\n", "\n", "def _json_safe(value: Any) -> bool:\n", " return isinstance(value, (str, int, float, bool)) or value is None\n", "\n", "\n", "def _int(value: Any) -> int | None:\n", " if value is None:\n", " return None\n", " try:\n", " return int(value)\n", " except (TypeError, ValueError):\n", " return None\n", "\n", "\n", "def _drop_none(values: Mapping[str, Any]) -> dict[str, Any]:\n", " return {key: value for key, value in values.items() if value is not None}\n", "\n", "\n", "def _sdk_version() -> str:\n", " try:\n", " return version(\"openai-agents\")\n", " except Exception:\n", " return \"unknown\"\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "runtime-runner-heading", "metadata": {}, "source": [ "### Run the SDK agent\n", "\n", "`run_sdk_agent()` calls the Agents SDK runner directly while handling the repeated setup around each traced run: mounting the data, attaching tracing, executing the agent, and collecting the output artifacts.\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "runtime-agent-runner", "metadata": {}, "outputs": [], "source": [ "async def run_sdk_agent(\n", " dataset_dir: Path,\n", " output_dir: Path,\n", " question: str,\n", " model: str,\n", " agent_config: AgentConfig,\n", " trace_id: str | None = None,\n", " trace_metadata: dict[str, Any] | None = None,\n", " halo_trace_path: str | Path | None = None,\n", " halo_project_id: str = \"financial_diligence_analyst_optimization_context\",\n", ") -> str:\n", " from agents import ModelSettings as SDKModelSettings\n", " from agents import Runner, custom_span, flush_traces, trace\n", " from agents.run import RunConfig\n", " from agents.sandbox import Manifest, SandboxAgent, SandboxRunConfig\n", " from agents.sandbox.entries import Dir, LocalDir\n", " from agents.sandbox.sandboxes.unix_local import UnixLocalSandboxClient\n", " from openai.types.shared import Reasoning\n", "\n", " 
output_dir.mkdir(parents=True, exist_ok=True)\n", " with staged_dataset_mount(dataset_dir) as staged_dataset_dir:\n", " write_runtime_manifest(staged_dataset_dir)\n", " reasoning = Reasoning(effort=agent_config.model_settings.reasoning_effort)\n", " agent = SandboxAgent(\n", " name=\"Synthetic dataroom diligence analyst\",\n", " model=model,\n", " model_settings=SDKModelSettings(reasoning=reasoning),\n", " instructions=agent_config.build_instructions(),\n", " default_manifest=Manifest(\n", " entries={\n", " \"data\": LocalDir(src=staged_dataset_dir),\n", " \"outputs\": Dir(),\n", " }\n", " ),\n", " )\n", " client = UnixLocalSandboxClient()\n", " session = None\n", " halo_processor = None\n", " if halo_trace_path is not None:\n", " halo_processor = setup_halo_tracing(\n", " halo_trace_path,\n", " project_id=halo_project_id,\n", " service_version=agent_config.version,\n", " deployment_environment=\"notebook\" if trace_metadata else None,\n", " extra_resource_attributes={\n", " \"agent.config.version\": agent_config.version,\n", " \"agent.config.path\": str(agent_config.path),\n", " },\n", " )\n", " trace_context = (\n", " trace(\n", " workflow_name=\"Synthetic dataroom diligence\",\n", " trace_id=trace_id,\n", " metadata=trace_metadata,\n", " )\n", " if trace_id\n", " else None\n", " )\n", " if trace_context is not None:\n", " trace_context.__enter__()\n", " try:\n", " with custom_span(\n", " \"sandbox_workspace\",\n", " {\n", " \"tool.name\": \"sandbox_workspace\",\n", " \"tool.input\": {\n", " \"mounted\": \"data\",\n", " \"writable\": \"outputs\",\n", " \"dataset_dir\": str(dataset_dir),\n", " \"staged_dataset_dir\": str(staged_dataset_dir),\n", " \"agent_config\": str(agent_config.path),\n", " \"agent_config_version\": agent_config.version,\n", " },\n", " },\n", " disabled=trace_context is None,\n", " ):\n", " with custom_span(\n", " \"agent_config\",\n", " {\n", " \"tool.name\": \"agent_config\",\n", " \"tool.input\": {\n", " \"version\": 
agent_config.version,\n", " \"required_artifacts\": agent_config.required_artifacts,\n", " },\n", " },\n", " disabled=trace_context is None,\n", " ):\n", " pass\n", " session = await client.create(manifest=agent.default_manifest)\n", " async with session:\n", " result = await Runner.run(\n", " agent,\n", " build_user_prompt(question, agent_config),\n", " run_config=RunConfig(\n", " sandbox=SandboxRunConfig(session=session),\n", " workflow_name=\"Synthetic dataroom diligence\",\n", " trace_id=trace_id,\n", " trace_metadata=trace_metadata,\n", " tracing_disabled=trace_id is None,\n", " ),\n", " max_turns=30,\n", " )\n", " for filename in agent_config.required_artifacts:\n", " try:\n", " with custom_span(\n", " \"artifact_write\",\n", " {\n", " \"tool.name\": \"artifact_write\",\n", " \"tool.input\": {\"filename\": filename},\n", " },\n", " disabled=trace_context is None,\n", " ):\n", " with await session.read(Path(\"outputs\") / filename) as handle:\n", " (output_dir / filename).write_bytes(handle.read())\n", " except Exception:\n", " continue\n", " return str(result.final_output)\n", " finally:\n", " delete = getattr(client, \"delete\", None)\n", " if delete is not None and session is not None:\n", " try:\n", " await delete(session)\n", " except Exception:\n", " pass\n", " if trace_context is not None:\n", " trace_context.__exit__(None, None, None)\n", " if halo_processor is not None:\n", " try:\n", " flush_traces()\n", " except Exception:\n", " pass\n", " try:\n", " halo_processor.shutdown()\n", " except Exception:\n", " pass\n", "\n", "\n", "@contextmanager\n", "def staged_dataset_mount(dataset_dir: Path) -> Iterator[Path]:\n", " \"\"\"Prepare a writable SDK mount copy without mutating the source dataroom.\"\"\"\n", " with tempfile.TemporaryDirectory(prefix=\"synthetic-dataroom-mount-\") as tmp:\n", " staged_dir = Path(tmp) / dataset_dir.name\n", " shutil.copytree(dataset_dir, staged_dir)\n", " write_runtime_tools(staged_dir)\n", " yield staged_dir.resolve()\n", 
"\n", "\n", "def write_runtime_manifest(dataset_dir: Path) -> None:\n", " manifest = {\n", " \"runtime_scope\": \"sdk_agent_visible_dataroom\",\n", " \"files\": sorted(\n", " str(path.relative_to(dataset_dir))\n", " for path in dataset_dir.rglob(\"*\")\n", " if path.is_file() and path.name != \"manifest.json\"\n", " ),\n", " }\n", " (dataset_dir / \"manifest.json\").write_text(\n", " json.dumps(manifest, indent=2) + \"\\n\",\n", " encoding=\"utf-8\",\n", " )\n", "\n" ] }, { "cell_type": "markdown", "id": "step-3-revised", "metadata": {}, "source": [ "## Step 3. Generate traced runs\n", "\n", "The questions are intentionally varied so the eval suite covers several ways the agent can go wrong. The notebook runs five traces by default to keep the live path practical while still covering several distinct behaviors. A larger question bank remains available if you want broader coverage later.\n", "\n", "Each run uses the async Agents SDK path and writes a real trace plus the required artifacts.\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "8ab52b7a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running trace-01/05: What do runway and burn tell us about near-term financing risk?\n", "Running trace-02/05: How strong is revenue quality, and which ARR figure should we rely on?\n", "Running trace-03/05: What is the real customer concentration risk after parent-account rollups?\n", "Running trace-04/05: How ready is the company for enterprise security review?\n", "Running trace-05/05: What unsupported metrics should we refuse to infer from the dataroom?\n", "Trace generation completed in 7m 59s\n", "trace-01: What do runway and burn tell us about near-term financing risk?\n", "Near-term financing risk is elevated. 
Finance reports `$2.9M` monthly cash burn and `11 months` runway, and the board packet corroborates both figures....\n", "\n", "trace-02: How strong is revenue quality, and which ARR figure should we rely on?\n", "**Answer** - Revenue quality is **moderate, not clean**: real scale and 69% gross margin, but ARR definition drift, unvalidated retention, concentration, and renewal risk weaken...\n", "\n", "trace-03: What is the real customer concentration risk after parent-account rollups?\n", "**Answer** - Real concentration risk is **high**: Northstar Bank + Northstar Capital Markets roll up to **Northstar Holdings at $12.4M**, or **33.6% of controlled FY2025 ARR**....\n", "\n", "trace-04: How ready is the company for enterprise security review?\n", "**Answer** - The company is **partially ready, but not ready for frictionless enterprise security review**: SOC 2 Type I is complete, but SOC 2 Type II fieldwork is still in...\n", "\n", "trace-05: What unsupported metrics should we refuse to infer from the dataroom?\n", "**Answer** Refuse to infer these unsupported or conflicted metrics from the dataroom: - `CAC payback`: explicitly `not_provided`; requested but not supplied....\n", "\n" ] } ], "source": [ "QUESTION_BANK = [\n", " \"What do runway and burn tell us about near-term financing risk?\",\n", " \"How strong is revenue quality, and which ARR figure should we rely on?\",\n", " \"What is the real customer concentration risk after parent-account rollups?\",\n", " \"What legal exposure should an acquirer investigate first?\",\n", " \"How ready is the company for enterprise security review?\",\n", " \"Which contradictions appear across the board deck, finance exports, and management narratives?\",\n", " \"What unsupported metrics should we refuse to infer from the dataroom?\",\n", " \"What follow-up questions should management answer before an investment committee review?\",\n", " \"What are the top three diligence risks, ranked by severity?\",\n", " \"Which claims 
in the materials look directionally useful but still need stronger evidence?\",\n", "]\n", "\n", "# Using 5 questions as the default, with more available if you want broader coverage later.\n", "\n", "DEFAULT_TRACE_INDICES = [0, 1, 2, 4, 6]\n", "TRACE_LIMIT = len(DEFAULT_TRACE_INDICES)\n", "QUESTIONS = [QUESTION_BANK[index] for index in DEFAULT_TRACE_INDICES]\n", "\n", "\n", "@dataclass\n", "class TraceRecord:\n", " trace_id: str\n", " sdk_trace_id: str\n", " trace_label: str\n", " question: str\n", " answer: str\n", " output_dir: str\n", " mode: str\n", "\n", "\n", "def sdk_trace_id(label: str) -> str:\n", " # Agents SDK trace uploads expect ids shaped like `trace_`.\n", " return f\"trace_{hashlib.sha256(label.encode('utf-8')).hexdigest()[:32]}\"\n", "\n", "\n", "def exported_trace_id(label: str) -> str:\n", " # The local HALO exporter strips the SDK `trace_` prefix before writing JSONL.\n", " return sdk_trace_id(label).removeprefix(\"trace_\")\n", "\n", "\n", "async def generate_traces(dataset: Path, questions: list[str]) -> list[TraceRecord]:\n", " traces: list[TraceRecord] = []\n", " for index, question in enumerate(questions, start=1):\n", " label = f\"trace-{index:02d}\"\n", " print(f\"Running {label}/{len(questions):02d}: {question}\")\n", " output_dir = TRACE_DIR / f\"trace_{index:02d}\"\n", " output_dir.mkdir(parents=True, exist_ok=True)\n", " real_sdk_trace_id = sdk_trace_id(label)\n", " real_exported_trace_id = exported_trace_id(label)\n", " answer = await run_sdk_agent(\n", " dataset_dir=dataset,\n", " output_dir=output_dir,\n", " question=question,\n", " model=AGENT_MODEL,\n", " agent_config=agent_config,\n", " trace_id=real_sdk_trace_id,\n", " trace_metadata={\"notebook_trace_id\": label},\n", " halo_trace_path=HALO_TRACE_PATH,\n", " )\n", " traces.append(\n", " TraceRecord(\n", " trace_id=real_exported_trace_id,\n", " sdk_trace_id=real_sdk_trace_id,\n", " trace_label=label,\n", " question=question,\n", " answer=answer,\n", " 
output_dir=str(output_dir.relative_to(PROJECT_ROOT)),\n", " mode=\"sdk\",\n", " )\n", " )\n", " return traces\n", "\n", "\n", "trace_generation_started = time.perf_counter()\n", "traces = await generate_traces(dataset, QUESTIONS)\n", "print(f\"Trace generation completed in {format_duration(time.perf_counter() - trace_generation_started)}\")\n", "assert len(traces) == TRACE_LIMIT\n", "\n", "for trace in traces:\n", " print(f\"{trace.trace_label}: {trace.question}\")\n", " print(textwrap.shorten(trace.answer.replace(\"\\n\", \" \"), width=180, placeholder=\"...\"))\n", " print()" ] }, { "cell_type": "markdown", "id": "127f6f6d", "metadata": {}, "source": [ "### Inspect the agent artifacts\n", "\n", "Each traced run writes the full artifact set required by the harness. The first run below shows the files the agent produced so you can inspect the answer, evidence, and open questions together.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "3c78c55f", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "### `summary_answer.md`\n", "```markdown\n", "# Summary Answer\n", "\n", "Runway and burn indicate elevated near-term financing risk. Finance reports FY2025 cash burn of $2.9M per month and 11 months of runway, and the December board packet repeats the same burn and runway figures. (`financials/p_and_l.csv`, `board_deck.md`)\n", "\n", "An 11-month runway is a sub-12-month financing window: unless burn is reduced, revenue conversion accelerates, or additional capital is secured, the company likely needs a financing plan in the near term. (`financials/p_and_l.csv`)\n", "\n", "The financing story is somewhat weakened by ARR quality and source conflicts. The controlled FY2025 ARR bridge shows $36.9M ending ARR, while the board deck reports $43.0M because it includes $2.8M of launch-stage commitments and $1.1M of usage true-ups that finance does not classify as recurring ARR. 
(`financials/arr_bridge.csv`, `financials/revenue_recognition_notes.md`, `board_deck.md`)\n", "\n", "The dataroom does not provide a cash balance, debt schedule, undrawn facility, covenant package, or financing plan, so the exact liquidity cushion and financing path are unknown from the provided evidence. (`financials/p_and_l.csv`, `manifest.json`)\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### `investment_memo.md`\n", "```markdown\n", "# Investment Memo: Runway and Burn\n", "\n", "## Bottom Line\n", "- Near-term financing risk is elevated because finance reports $2.9M of monthly cash burn and only 11 months of runway. (`financials/p_and_l.csv`)\n", "- The board packet corroborates the same $2.9M monthly burn and 11-month runway. (`board_deck.md`)\n", "- The exact liquidity cushion is unknown because the dataroom provides runway and burn but not cash balance, debt availability, covenant terms, or a financing plan. (`financials/p_and_l.csv`, `manifest.json`)\n", "\n", "## Evidence\n", "- FY2025 P&L reports $30.26M revenue, 69% gross margin, $47.71M opex, $2.9M cash burn per month, and 11 months of runway. (`financials/p_and_l.csv`)\n", "- Finance-controlled ARR is $36.9M at FY2025 year-end. (`financials/arr_bridge.csv`)\n", "- The board deck reports $43.0M FY2025 ending ARR, 71% ARR growth, 69% gross margin, $2.9M monthly burn, and 11 months of runway. (`board_deck.md`)\n", "- Finance states the board ARR includes $2.8M signed launch-stage commitments not live by 2025-12-31 and $1.1M usage true-ups that finance does not classify as recurring ARR. (`financials/revenue_recognition_notes.md`)\n", "\n", "## Interpretation\n", "- A company burning $2.9M per month with 11 months of runway has less than one year to reduce burn, convert growth into cash-efficient revenue, or raise capital. 
(`financials/p_and_l.csv`)\n", "- The growth narrative should be underwritten against finance-controlled ARR rather than board headline ARR because finance identifies specific non-recurring or not-yet-live components in the board figure. (`financials/arr_bridge.csv`, `financials/revenue_recognition_notes.md`, `board_deck.md`)\n", "- Current evidence supports a financing-risk concern, but it does not support quantifying exact cash balance, facility availability, covenant headroom, or planned raise timing. (`financials/p_and_l.csv`, `manifest.json`)\n", "\n", "## Diligence View\n", "- Financing risk: High / elevated.\n", "- Key dependency: management must show a credible plan to extend runway beyond the reported 11 months.\n", "- Critical missing evidence: cash balance, monthly cash forecast, debt/facility details, covenant headroom, and financing plan.\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### `risk_register.json`\n", "```json\n", "[\n", " {\n", " \"id\": \"R-001\",\n", " \"risk\": \"Sub-12-month runway\",\n", " \"severity\": \"High\",\n", " \"rationale\": \"Finance reports 11 months of runway and $2.9M of monthly cash burn, which indicates a near-term need to reduce burn, improve cash generation, or secure financing.\",\n", " \"evidence\": [\n", " \"financials/p_and_l.csv\",\n", " \"board_deck.md\"\n", " ],\n", " \"open_questions\": [\n", " \"What is current unrestricted cash?\",\n", " \"What financing actions are planned before runway drops below 6 months?\"\n", " ]\n", " },\n", " {\n", " \"id\": \"R-002\",\n", " \"risk\": \"ARR quality may weaken financing narrative\",\n", " \"severity\": \"Medium\",\n", " \"rationale\": \"Finance-controlled FY2025 ending ARR is $36.9M, while the board deck reports $43.0M ARR because it includes launch-stage commitments and usage true-ups that finance does not classify as recurring ARR.\",\n", " \"evidence\": [\n", " \"financials/arr_bridge.csv\",\n", 
" \"financials/revenue_recognition_notes.md\",\n", " \"board_deck.md\"\n", " ],\n", " \"open_questions\": [\n", " \"Which ARR figure is used in lender or investor materials?\",\n", " \"How much of the launch-stage commitments have since gone live?\"\n", " ]\n", " },\n", " {\n", " \"id\": \"R-003\",\n", " \"risk\": \"Liquidity structure is not evidenced\",\n", " \"severity\": \"Medium\",\n", " \"rationale\": \"The dataroom provides burn and runway but does not provide cash balance, debt availability, covenant headroom, or a financing plan, limiting confidence in the company\\u2019s liquidity path.\",\n", " \"evidence\": [\n", " \"financials/p_and_l.csv\",\n", " \"manifest.json\"\n", " ],\n", " \"open_questions\": [\n", " \"Is there an undrawn revolver or venture debt facility?\",\n", " \"Are there covenants or minimum cash requirements?\"\n", " ]\n", " }\n", "]\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### `open_questions.md`\n", "```markdown\n", "# Open Questions\n", "\n", "- What is current unrestricted cash, and how does it reconcile to the reported 11 months of runway? (`financials/p_and_l.csv`)\n", "- Is there an existing debt facility, undrawn revolver, covenant package, or minimum cash requirement? (`manifest.json`)\n", "- What is management's financing plan, including target timing, amount, and contingency if markets are unavailable? (`manifest.json`)\n", "- What burn reduction actions are available, and how many months of runway would each action add? (`financials/p_and_l.csv`)\n", "- Which ARR figure is used in financing discussions: finance-controlled $36.9M ARR or board headline $43.0M ARR? 
(`financials/arr_bridge.csv`, `financials/revenue_recognition_notes.md`, `board_deck.md`)\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### `citations.json`\n", "```json\n", "[\n", " {\n", " \"claim_id\": \"C-001\",\n", " \"claim\": \"Finance reports FY2025 cash burn of $2.9M per month and 11 months of runway.\",\n", " \"sources\": [\n", " \"financials/p_and_l.csv\"\n", " ]\n", " },\n", " {\n", " \"claim_id\": \"C-002\",\n", " \"claim\": \"The December board packet repeats $2.9M monthly cash burn and 11 months of runway.\",\n", " \"sources\": [\n", " \"board_deck.md\"\n", " ]\n", " },\n", " {\n", " \"claim_id\": \"C-003\",\n", " \"claim\": \"Finance-controlled FY2025 ending ARR is $36.9M.\",\n", " \"sources\": [\n", " \"financials/arr_bridge.csv\"\n", " ]\n", " },\n", " {\n", " \"claim_id\": \"C-004\",\n", " \"claim\": \"The board deck reports $43.0M FY2025 ending ARR and 71% ARR growth.\",\n", " \"sources\": [\n", " \"board_deck.md\"\n", " ]\n", " },\n", " {\n", " \"claim_id\": \"C-005\",\n", " \"claim\": \"Finance states board ARR includes $2.8M of launch-stage commitments not live by 2025-12-31 and $1.1M of usage true-ups that finance does not classify as recurring ARR.\",\n", " \"sources\": [\n", " \"financials/revenue_recognition_notes.md\"\n", " ]\n", " },\n", " {\n", " \"claim_id\": \"C-006\",\n", " \"claim\": \"The dataroom does not provide a separate cash balance, debt schedule, facility availability, covenant package, or financing plan.\",\n", " \"sources\": [\n", " \"financials/p_and_l.csv\",\n", " \"manifest.json\"\n", " ]\n", " }\n", "]\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### `evidence_table.csv`\n", "```csv\n", "claim_id,claim,sources\n", "C-001,\"Finance reports FY2025 cash burn of $2.9M per month and 11 months of runway.\",\"financials/p_and_l.csv\"\n", "C-002,\"The December board packet repeats 
$2.9M monthly cash burn and 11 months of runway.\",\"board_deck.md\"\n", "C-003,\"Finance-controlled FY2025 ending ARR is $36.9M.\",\"financials/arr_bridge.csv\"\n", "C-004,\"The board deck reports $43.0M FY2025 ending ARR and 71% ARR growth.\",\"board_deck.md\"\n", "C-005,\"Finance states board ARR includes $2.8M of launch-stage commitments not live by 2025-12-31 and $1.1M of usage true-ups that finance does not classify as recurring ARR.\",\"financials/revenue_recognition_notes.md\"\n", "C-006,\"The dataroom does not provide a separate cash balance, debt schedule, facility availability, covenant package, or financing plan.\",\"financials/p_and_l.csv; manifest.json\"\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def show_trace_artifacts(trace: TraceRecord) -> None:\n", " output_dir = PROJECT_ROOT / trace.output_dir\n", " for artifact in agent_config.required_artifacts:\n", " path = output_dir / artifact\n", " language = {\n", " \".md\": \"markdown\",\n", " \".json\": \"json\",\n", " \".csv\": \"csv\",\n", " }.get(path.suffix, \"text\")\n", " display(Markdown(f\"### `{artifact}`\\n```{language}\\n{path.read_text(encoding='utf-8').rstrip()}\\n```\"))\n", "\n", "\n", "show_trace_artifacts(traces[0])" ] }, { "cell_type": "markdown", "id": "step-4-revised", "metadata": {}, "source": [ "## Step 4. Generate example human feedback and model insights\n", "\n", "This section simulates a human expert reviewing the traces after the agent runs. In a real diligence workflow, that might be the finance lead or another case expert who knows which details matter for the decision. In this example, the reviewer calls out that a parent-account rollup matters more than legal-entity concentration, that an unvalidated management NRR estimate should not become an official metric, and that \u201cSOC 2 complete\u201d is too vague when the evidence only supports Type I.\n", "\n", "The model-generated insights stay separate. 
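Because downstream eval generation keys off specific fields, it can pay to sanity-check each feedback record before it enters the loop. The sketch below is a hypothetical helper (not part of this notebook's harness) that validates the human-feedback shape used in this example:

```python
# Minimal sketch (hypothetical helper, not part of the notebook's harness):
# fail fast on malformed human-feedback records before they feed eval generation.
REQUIRED_FEEDBACK_KEYS = {
    "feedback_id",
    "trace_id",
    "source_type",
    "summary",
    "required_observations",
    "prohibited_claims",
}


def validate_feedback_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_FEEDBACK_KEYS - set(item))]
    if item.get("source_type") != "human_feedback":
        problems.append(f"unexpected source_type: {item.get('source_type')!r}")
    if not item.get("required_observations") and not item.get("prohibited_claims"):
        problems.append("nothing testable: no required observations or prohibited claims")
    return problems
```

A record that fails validation can be routed back to the reviewer instead of silently producing a weak or broken eval.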
In a fully automated path, an LLM reviews the same traces and proposes recurring issues or missing behaviors. That extra pass improves coverage, while subject-matter expert review adds domain judgment grounded in the work itself.\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "a3720eaa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feedback generation completed in 13s\n", "Human feedback items: 5\n", "LLM insight items: 5\n", "\n", "Example human feedback:\n", "{\n", " \"feedback_id\": \"human-trace-01\",\n", " \"trace_id\": \"43d9b03619a9d2ed4d2f3e3fd17c8bf4\",\n", " \"trace_label\": \"trace-01\",\n", " \"question\": \"What do runway and burn tell us about near-term financing risk?\",\n", " \"source_type\": \"human_feedback\",\n", " \"theme\": \"financial_risk\",\n", " \"summary\": \"State both the 11-month runway and rising burn as financing risk, not just a generic red flag.\",\n", " \"required_observations\": [\n", " \"Name the 11-month runway\",\n", " \"Tie burn to near-term financing pressure\"\n", " ],\n", " \"prohibited_claims\": [\n", " \"Do not imply the company has more than 12 months of runway\"\n", " ]\n", "}\n", "\n", "Example LLM insight:\n", "{\n", " \"insight_id\": \"llm_insight_01\",\n", " \"trace_id\": \"43d9b03619a9d2ed4d2f3e3fd17c8bf4\",\n", " \"question\": \"What do runway and burn tell us about near-term financing risk?\",\n", " \"source_type\": \"llm_insight\",\n", " \"observations\": [\n", " \"Flags elevated financing risk when runway is under 12 months and monthly burn is cited from finance and board sources.\",\n", " \"Prefers finance-controlled ARR over board headline ARR when ARR definitions conflict.\",\n", " \"Explicitly identifies missing liquidity data such as cash balance, debt availability, covenants, and financing plan.\",\n", " \"Includes source citations for key numeric claims and notes validation/artifact completion.\"\n", " ],\n", " \"trace_label\": \"trace-01\"\n", "}\n" ] 
} ], "source": [ "def feedback_item(\n", " trace: TraceRecord,\n", " summary: str,\n", " required: list[str],\n", " prohibited: list[str],\n", " theme: str,\n", ") -> dict[str, Any]:\n", " return {\n", " \"feedback_id\": f\"human-{trace.trace_label}\",\n", " \"trace_id\": trace.trace_id,\n", " \"trace_label\": trace.trace_label,\n", " \"question\": trace.question,\n", " \"source_type\": \"human_feedback\",\n", " \"theme\": theme,\n", " \"summary\": summary,\n", " \"required_observations\": required,\n", " \"prohibited_claims\": prohibited,\n", " }\n", "\n", "\n", "def generate_mock_human_feedback(traces: list[TraceRecord]) -> list[dict[str, Any]]:\n", " specs_by_question = {\n", " \"What do runway and burn tell us about near-term financing risk?\": (\n", " \"State both the 11-month runway and rising burn as financing risk, not just a generic red flag.\",\n", " [\"Name the 11-month runway\", \"Tie burn to near-term financing pressure\"],\n", " [\"Do not imply the company has more than 12 months of runway\"],\n", " \"financial_risk\",\n", " ),\n", " \"How strong is revenue quality, and which ARR figure should we rely on?\": (\n", " \"Use the controlled ARR bridge as the reliable figure and preserve the board-versus-finance contradiction.\",\n", " [\"Prefer finance ARR over board ARR\", \"Preserve the ARR contradiction\"],\n", " [\"Do not silently reconcile the ARR gap\"],\n", " \"revenue_quality\",\n", " ),\n", " \"What is the real customer concentration risk after parent-account rollups?\": (\n", " \"Roll concentration up to Northstar Holdings. 
Legal-entity framing understates the real dependency.\",\n", " [\"Mention parent-account concentration\", \"Use account_hierarchy.csv\"],\n", " [\"Do not stop at legal-entity concentration\"],\n", " \"customer_concentration\",\n", " ),\n", " \"How ready is the company for enterprise security review?\": (\n", " \"Be exact about certification status: Type I is complete; Type II is still in progress.\",\n", " [\"Distinguish Type I from Type II\", \"Treat sales FAQ as weaker evidence\"],\n", " [\"Do not say SOC 2 is simply complete\"],\n", " \"security_readiness\",\n", " ),\n", " \"What unsupported metrics should we refuse to infer from the dataroom?\": (\n", " \"Refuse official NRR and CAC payback when the dataroom does not support them.\",\n", " [\"Mark official NRR unsupported\", \"Mark CAC payback unsupported\"],\n", " [\"Do not promote the management NRR estimate into an official metric\"],\n", " \"unsupported_metrics\",\n", " ),\n", " }\n", " return [feedback_item(trace, *specs_by_question[trace.question]) for trace in traces]\n", "\n", "\n", "def extract_json(text: str) -> Any:\n", " text = text.strip()\n", " fenced = re.search(r\"```(?:json)?\\s*(.*?)```\", text, flags=re.DOTALL)\n", " candidate = fenced.group(1).strip() if fenced else text\n", " return json.loads(candidate)\n", "\n", "\n", "def generate_llm_feedback(traces: list[TraceRecord]) -> list[dict[str, Any]]:\n", " payload = [asdict(trace) for trace in traces]\n", " response = client.responses.create(\n", " model=ANALYSIS_MODEL,\n", " input=f\"\"\"\n", "You are reviewing traces from a financial diligence analyst agent.\n", "Return JSON only: a list of objects with keys `insight_id`, `trace_id`, `question`, `source_type`, and `observations`.\n", "Use `source_type` = `llm_insight`.\n", "For `trace_id`, copy the provided `trace_id` field exactly; do not use `sdk_trace_id` or `trace_label`.\n", "For each trace, identify concise recurring-behavior observations that could help generate evals later.\n", "Do 
not restate the whole answer. Do not invent unavailable evidence.\n", "\n", "Traces:\n", "{json.dumps(payload, indent=2)}\n", "\"\"\".strip(),\n", " )\n", " parsed = extract_json(response.output_text)\n", " if not isinstance(parsed, list):\n", " raise ValueError(\"Expected a JSON list of LLM insights.\")\n", " trace_labels = {trace.trace_id: trace.trace_label for trace in traces}\n", " for item in parsed:\n", " try:\n", " item[\"trace_label\"] = trace_labels[item[\"trace_id\"]]\n", " except KeyError as exc:\n", " raise ValueError(f\"Unknown trace_id in LLM feedback: {item['trace_id']}\") from exc\n", " return parsed\n", "\n", "\n", "feedback_started = time.perf_counter()\n", "human_feedback = generate_mock_human_feedback(traces)\n", "llm_feedback = generate_llm_feedback(traces)\n", "print(f\"Feedback generation completed in {format_duration(time.perf_counter() - feedback_started)}\")\n", "assert len(human_feedback) == TRACE_LIMIT\n", "assert len(llm_feedback) == TRACE_LIMIT\n", "\n", "print(\"Human feedback items:\", len(human_feedback))\n", "print(\"LLM insight items:\", len(llm_feedback))\n", "print(\"\\nExample human feedback:\")\n", "print(json.dumps(human_feedback[0], indent=2))\n", "print(\"\\nExample LLM insight:\")\n", "print(json.dumps(llm_feedback[0], indent=2))" ] }, { "cell_type": "markdown", "id": "step-5-revised", "metadata": {}, "source": [ "## Step 5. Generate Promptfoo evals from traces and feedback\n", "\n", "The eval suite is generated dynamically by an LLM from the evidence collected so far: traced behavior, human feedback, and model-generated observations. This turns comments into tests that the next harness revision can run again later.\n", "\n", "Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. 
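The deterministic side of this scoring is easy to reason about locally. The sketch below paraphrases the three string assertion types this notebook permits in generated evals (`contains`, `icontains`, `not-contains`); it is an illustration for local debugging, not Promptfoo's implementation:

```python
# Illustration only: a local mirror of the three deterministic assertion types
# this notebook allows in generated evals. Promptfoo runs the real checks.
def passes_assertion(output: str, assertion: dict) -> bool:
    kind, value = assertion["type"], str(assertion["value"])
    if kind == "contains":
        return value in output
    if kind == "icontains":
        return value.lower() in output.lower()
    if kind == "not-contains":
        return value not in output
    raise ValueError(f"unsupported assertion type: {kind}")


def deterministic_gate(output: str, assertions: list[dict]) -> bool:
    # The llm-rubric half of a hybrid eval is judged separately by a model.
    return all(passes_assertion(output, a) for a in assertions)
```

A quick local gate like this helps spot over-literal generated assertions (for example, an exact dollar figure that a correct answer might phrase differently) before they enter the long-term suite.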
In this notebook, the generated behaviors become Promptfoo test cases: each one can combine literal assertions with an LLM rubric judge, so the same gate can check both exact requirements and semantic reviewer intent.\n", "\n", "Evals are a good place to invest manual effort from subject-matter experts and developers. A fully automated pass can propose useful evals quickly, but people should still check whether the evals are accurate, representative, and measuring the behavior that actually matters before they become part of the long-term test suite.\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "5daefffe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Eval generation completed in 52s\n" ] }, { "data": { "text/markdown": [ "| title | scoring_method | expected_behavior |\n", "| --- | --- | --- |\n", "| Runway and burn must be translated into near-term financing risk | hybrid | The answer should explicitly state that financing risk is elevated because runway is 11 months and monthly cash burn is material/rising, tying burn to pressure to reduce spend, improve cash conversion, or raise capital before the sub-12-month runway closes. It should not imply the company has more than 12 months of runway. |\n", "| Revenue quality assessment must prefer finance-controlled ARR and preserve ARR contradictions | hybrid | The answer should characterize revenue quality as mixed or moderate rather than clean, rely on finance-controlled FY2025 ending ARR of about $36.9M for underwriting, and explicitly reject or qualify the $43.0M board/headline ARR and $40.8M bookings-adjusted ARR as not equivalent to recurring ARR. It should preserve the contradiction instead of silently reconciling the gap. 
|\n", "| Customer concentration must be assessed after parent-account rollups | hybrid | The answer should roll legal entities up to parent accounts before assessing concentration, specifically recognizing Northstar Holdings as the true parent exposure. It should use finance-controlled ARR as the denominator, cite or reference account hierarchy evidence, and avoid stopping at legal-entity concentration. |\n", "| Enterprise security readiness must distinguish SOC 2 Type I from Type II | hybrid | The answer should state that the company is only partially ready for enterprise security review because SOC 2 Type I is complete but SOC 2 Type II is still in progress and no Type II report has been issued. It should treat sales FAQ language like 'SOC 2 complete' as weaker or potentially misleading evidence and connect missing Type II evidence to enterprise procurement or customer friction. |\n", "| Unsupported metrics must be refused rather than inferred | hybrid | The answer should refuse to infer metrics that are missing, conflicted, partial, or management-estimated. In particular, it must mark CAC payback as unsupported/not provided and official NRR as unsupported because the 122% NRR is only an unvalidated management estimate. It should not promote partial or unofficial metrics into definitive diligence metrics. 
|" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Runway and burn must be translated into near-term financing risk\n", " pass: Near-term financing risk is elevated: the company has only 11 months of runway and meaningful monthly burn, creating pressure to reduce burn, improve cash conversion, or raise capital within a sub-12-month window.\n", " fail: Financing risk appears manageable because the company has enough runway for the next year and should be able to continue operating without near-term funding pressure.\n", "\n", "Revenue quality assessment must prefer finance-controlled ARR and preserve ARR contradictions\n", " pass: Revenue quality is moderate, not clean. For underwriting, use the finance-controlled ARR bridge at $36.9M, while treating the $43.0M board ARR and $40.8M bookings-adjusted view as non-comparable or planning figures because they include items not classified as recurring ARR.\n", " fail: Revenue quality is strong and the company has $43.0M of ARR; the board number can be used because it reconciles to the finance ARR bridge after normal adjustments.\n", "\n", "Customer concentration must be assessed after parent-account rollups\n", " pass: Concentration risk is high after parent rollups: Northstar Bank and Northstar Capital Markets roll up to Northstar Holdings at about $12.4M, roughly one-third of finance-controlled ARR. Looking only at legal entities understates the dependency.\n", " fail: Customer concentration is acceptable because no single legal entity exceeds the threshold after reviewing the top-customer list.\n", "\n", "Enterprise security readiness must distinguish SOC 2 Type I from Type II\n", " pass: The company is partially ready, not frictionless: SOC 2 Type I is complete, but Type II fieldwork is still in progress and no Type II report is available. 
Sales materials saying 'SOC 2 complete' should be treated cautiously because enterprise buyers are waiting on Type II evidence.\n", " fail: The company is ready for enterprise security review because SOC 2 is complete and the sales FAQ confirms there should be no security blocker.\n", "\n", "Unsupported metrics must be refused rather than inferred\n", " pass: Refuse to infer CAC payback because it is not provided. Also refuse to treat 122% NRR as official; it is an unvalidated management estimate and should not be used as a definitive retention metric.\n", " fail: The company has 122% official NRR and CAC payback appears attractive based on its revenue growth, so both can be used in underwriting.\n" ] } ], "source": [ "def generate_feedback_derived_evals(\n", " traces: list[TraceRecord],\n", " human_feedback: list[dict[str, Any]],\n", " llm_feedback: list[dict[str, Any]],\n", ") -> list[dict[str, Any]]:\n", " min_eval_count = min(5, max(2, len(traces)))\n", " max_eval_count = min(7, max(min_eval_count, len(traces) + 2))\n", " response = client.responses.create(\n", " model=EVAL_GENERATION_MODEL,\n", " input=f\"\"\"\n", "You are designing an eval suite for an OpenAI Agents SDK-backed financial diligence analyst.\n", "Use the traces, human feedback, and LLM insights below to generate {min_eval_count} to {max_eval_count} durable eval definitions.\n", "Return JSON only: a list of objects with keys `eval_id`, `title`, `scoring_method`, `expected_behavior`, `source_trace_id`, `rubric`, `deterministic_assertions`, `suggested_pass_example`, and `suggested_fail_example`.\n", "`scoring_method` must be one of `deterministic`, `llm_judge`, or `hybrid`.\n", "`source_trace_id` must exactly match the provided `trace_id` field for the trace whose answer should be scored. 
Do not use `sdk_trace_id` or `trace_label` for this field; those are only for SDK transport and human-readable references.\n", "`rubric` must be a concise pass/fail grading rubric suitable for Promptfoo `llm-rubric`.\n", "`deterministic_assertions` must be a list of Promptfoo-style assertion objects and may use only `contains`, `icontains`, or `not-contains` when a literal check is clearly useful; otherwise return an empty list.\n", "Prefer reusable behaviors over one-off trace restatements.\n", "\n", "Traces:\n", "{json.dumps([asdict(trace) for trace in traces], indent=2)}\n", "\n", "Human feedback:\n", "{json.dumps(human_feedback, indent=2)}\n", "\n", "LLM insights:\n", "{json.dumps(llm_feedback, indent=2)}\n", "\"\"\".strip(),\n", " )\n", " parsed = extract_json(response.output_text)\n", " if not isinstance(parsed, list):\n", " raise ValueError(\"Expected a JSON list of eval definitions.\")\n", " trace_labels = {trace.trace_id: trace.trace_label for trace in traces}\n", " for item in parsed:\n", " try:\n", " item[\"source_trace_label\"] = trace_labels[item[\"source_trace_id\"]]\n", " except KeyError as exc:\n", " raise ValueError(f\"Unknown source_trace_id in generated eval: {item['source_trace_id']}\") from exc\n", " return parsed\n", "\n", "\n", "eval_generation_started = time.perf_counter()\n", "eval_suite = generate_feedback_derived_evals(traces, human_feedback, llm_feedback)\n", "print(f\"Eval generation completed in {format_duration(time.perf_counter() - eval_generation_started)}\")\n", "assert all({\"title\", \"scoring_method\", \"suggested_pass_example\", \"suggested_fail_example\", \"expected_behavior\", \"source_trace_id\", \"rubric\", \"deterministic_assertions\"} <= set(item) for item in eval_suite)\n", "\n", "\n", "def markdown_table(rows: list[dict[str, Any]], columns: list[str]) -> str:\n", " header = \"| \" + \" | \".join(columns) + \" |\"\n", " divider = \"| \" + \" | \".join([\"---\"] * len(columns)) + \" |\"\n", " body = [\"| \" + \" | 
\".join(str(row[column]) for column in columns) + \" |\" for row in rows]\n", " return \"\\n\".join([header, divider, *body])\n", "\n", "\n", "display(Markdown(markdown_table(eval_suite, [\"title\", \"scoring_method\", \"expected_behavior\"])))\n", "\n", "for item in eval_suite:\n", " print(f\"\\n{item['title']}\")\n", " print(\" pass:\", item[\"suggested_pass_example\"])\n", " print(\" fail:\", item[\"suggested_fail_example\"])" ] }, { "cell_type": "markdown", "id": "step-6-revised", "metadata": {}, "source": [ "## Step 6. Validate the current harness with Promptfoo\n", "\n", "Promptfoo runs the generated tests against the current trace outputs. That gives the loop a snapshot of where the harness already behaves well and which expectations still fail. Promptfoo fits this role because it can combine deterministic checks for literal requirements with `llm-rubric` judges for semantic quality.\n", "\n", "In this notebook, the Promptfoo gate scores existing trace outputs. To validate a future harness revision, replace the trace-output provider with a provider that runs the candidate agent. Those Promptfoo results become part of the optimization input passed into HALO below. 
Even when eval generation is automated, humans can still tighten weak evals before letting them steer repeated optimization.\n" ] }, { "cell_type": "markdown", "id": "promptfoo-build-heading", "metadata": {}, "source": [ "### Build the Promptfoo test harness\n", "\n", "The provider serves existing trace outputs back to Promptfoo, and the test builder turns generated eval definitions into runnable Promptfoo cases.\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "promptfoo-build", "metadata": {}, "outputs": [], "source": [ "PROMPTFOO_PROVIDER = r'''from __future__ import annotations\n", "\n", "import json\n", "from pathlib import Path\n", "\n", "\n", "def call_api(prompt: str, options: dict, context: dict) -> dict:\n", " config = options.get(\"config\", {})\n", " trace_outputs = json.loads(Path(config[\"trace_outputs_path\"]).read_text(encoding=\"utf-8\"))\n", " trace_id = (context.get(\"vars\") or {}).get(\"trace_id\")\n", " trace = trace_outputs[trace_id]\n", " return {\n", " \"output\": trace[\"answer\"],\n", " \"metadata\": {\n", " \"trace_id\": trace_id,\n", " \"question\": trace[\"question\"],\n", " },\n", " }\n", "'''\n", "\n", "\n", "def trace_for_eval(item: dict[str, Any], traces: list[TraceRecord]) -> TraceRecord:\n", " trace_by_id = {trace.trace_id: trace for trace in traces}\n", " try:\n", " return trace_by_id[item[\"source_trace_id\"]]\n", " except KeyError as exc:\n", " raise ValueError(f\"Unknown source_trace_id in generated eval: {item['source_trace_id']}\") from exc\n", "\n", "def promptfoo_test_from_eval(item: dict[str, Any], trace: TraceRecord) -> dict[str, Any]:\n", " assertions = [\n", " assertion\n", " for assertion in item.get(\"deterministic_assertions\") or []\n", " if isinstance(assertion, dict)\n", " and assertion.get(\"type\") in {\"contains\", \"icontains\", \"not-contains\"}\n", " and assertion.get(\"value\")\n", " ]\n", " assertions.append({\n", " \"type\": \"llm-rubric\",\n", " \"provider\": f\"openai:{JUDGE_MODEL}\",\n", " 
\"threshold\": 0.8,\n", " \"value\": item[\"rubric\"],\n", " })\n", " return {\n", " \"description\": item[\"title\"],\n", " \"vars\": {\n", " \"question\": trace.question,\n", " \"trace_id\": trace.trace_id,\n", " \"trace_label\": trace.trace_label,\n", " },\n", " \"metadata\": {\n", " \"eval_id\": item[\"eval_id\"],\n", " \"scoring_method\": item[\"scoring_method\"],\n", " },\n", " \"assert\": assertions,\n", " }\n", "\n", "\n", "def write_promptfoo_artifacts(eval_suite: list[dict[str, Any]], traces: list[TraceRecord]) -> dict[str, Path]:\n", " promptfoo_dir = ARTIFACT_DIR / \"promptfoo\"\n", " promptfoo_dir.mkdir(parents=True, exist_ok=True)\n", " provider_path = promptfoo_dir / \"trace_output_provider.py\"\n", " trace_outputs_path = promptfoo_dir / \"trace_outputs.json\"\n", " config_path = promptfoo_dir / \"promptfooconfig.yaml\"\n", " output_path = promptfoo_dir / \"promptfoo_results.json\"\n", "\n", " provider_path.write_text(PROMPTFOO_PROVIDER, encoding=\"utf-8\")\n", " trace_outputs_path.write_text(\n", " json.dumps({trace.trace_id: asdict(trace) for trace in traces}, indent=2) + \"\\n\",\n", " encoding=\"utf-8\",\n", " )\n", " tests = [promptfoo_test_from_eval(item, trace_for_eval(item, traces)) for item in eval_suite]\n", " config = {\n", " \"description\": \"Feedback-derived diligence eval gate\",\n", " \"prompts\": [\"{{question}}\"],\n", " \"providers\": [{\n", " \"id\": \"file://trace_output_provider.py\",\n", " \"label\": \"current-trace-output\",\n", " \"config\": {\"trace_outputs_path\": str(trace_outputs_path)},\n", " }],\n", " \"tests\": tests,\n", " }\n", " # JSON is valid YAML, which keeps the generated config easy to inspect without\n", " # adding another serialization dependency to the notebook.\n", " config_path.write_text(json.dumps(config, indent=2) + \"\\n\", encoding=\"utf-8\")\n", " return {\n", " \"dir\": promptfoo_dir,\n", " \"provider\": provider_path,\n", " \"trace_outputs\": trace_outputs_path,\n", " \"config\": config_path,\n", " 
\"output\": output_path,\n", " }\n", "\n", "\n", "def promptfoo_summary(path: Path) -> dict[str, Any]:\n", " data = json.loads(path.read_text(encoding=\"utf-8\"))\n", " results = (data.get(\"results\") or {}).get(\"outputs\") or (data.get(\"results\") or {}).get(\"results\") or []\n", " rows = []\n", " for result in results:\n", " grading = result.get(\"gradingResult\") or {}\n", " components = grading.get(\"componentResults\") or []\n", " failing_component = next(\n", " (\n", " component\n", " for component in components\n", " if isinstance(component, dict) and component.get(\"pass\") is False\n", " ),\n", " None,\n", " )\n", " reason = str(grading.get(\"reason\") or \"\")\n", " if not reason and failing_component:\n", " reason = str(failing_component.get(\"reason\") or \"\")\n", " if not reason and components and isinstance(components[0], dict):\n", " reason = str(components[0].get(\"reason\") or \"\")\n", " test_case = result.get(\"testCase\") or {}\n", " test_vars = test_case.get(\"vars\") or {}\n", " rows.append({\n", " \"eval_id\": (test_case.get(\"metadata\") or {}).get(\"eval_id\"),\n", " \"title\": test_case.get(\"description\") or \"Untitled\",\n", " \"trace_id\": test_vars.get(\"trace_id\"),\n", " \"trace_label\": test_vars.get(\"trace_label\"),\n", " \"passed\": bool(result.get(\"success\")),\n", " \"score\": result.get(\"score\"),\n", " \"explanation\": reason,\n", " })\n", " return {\n", " \"backend\": \"promptfoo\",\n", " \"total\": len(rows),\n", " \"passed\": sum(row[\"passed\"] for row in rows),\n", " \"failed\": sum(not row[\"passed\"] for row in rows),\n", " \"rows\": rows,\n", " }\n" ] }, { "cell_type": "markdown", "id": "promptfoo-run-heading", "metadata": {}, "source": [ "### Run the Promptfoo gate\n", "\n", "Execute the generated suite and summarize the current harness result.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "promptfoo-run", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"Promptfoo gate completed in 9s\n" ] }, { "data": { "text/markdown": [ "| title | trace_label | passed | score | explanation |\n", "| --- | --- | --- | --- | --- |\n", "| Runway and burn must be translated into near-term financing risk | trace-01 | True | 1 | All assertions passed |\n", "| Revenue quality assessment must prefer finance-controlled ARR and preserve ARR contradictions | trace-02 | True | 1 | All assertions passed |\n", "| Customer concentration must be assessed after parent-account rollups | trace-03 | True | 1 | All assertions passed |\n", "| Enterprise security readiness must distinguish SOC 2 Type I from Type II | trace-04 | True | 1 | All assertions passed |\n", "| Unsupported metrics must be refused rather than inferred | trace-05 | True | 1 | All assertions passed |" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "{'backend': 'promptfoo', 'total': 5, 'passed': 5, 'failed': 0, 'result_path': 'examples/agents_sdk/agent_improvement_loop_artifacts/promptfoo/promptfoo_results.json'}\n" ] } ], "source": [ "def run_promptfoo_feedback_eval_gate(eval_suite: list[dict[str, Any]], traces: list[TraceRecord]) -> dict[str, Any]:\n", " artifacts = write_promptfoo_artifacts(eval_suite, traces)\n", " command = [\n", " \"npx\",\n", " \"--yes\",\n", " f\"promptfoo@{PROMPTFOO_VERSION}\",\n", " \"eval\",\n", " \"--no-cache\",\n", " \"--no-table\",\n", " \"-c\",\n", " str(artifacts[\"config\"]),\n", " \"-o\",\n", " str(artifacts[\"output\"]),\n", " ]\n", " env = os.environ.copy()\n", " env[\"PROMPTFOO_PYTHON\"] = sys.executable\n", " env[\"PROMPTFOO_CONFIG_DIR\"] = str(artifacts[\"dir\"] / \".promptfoo\")\n", " env[\"PROMPTFOO_DISABLE_WAL_MODE\"] = \"true\"\n", " process = subprocess.run(\n", " command,\n", " cwd=artifacts[\"dir\"],\n", " env=env,\n", " text=True,\n", " stdout=subprocess.PIPE,\n", " stderr=subprocess.STDOUT,\n", " check=False,\n", " )\n", " if not 
artifacts[\"output\"].exists():\n", " raise RuntimeError(f\"Promptfoo did not write results. Output:\\n{process.stdout[-4000:]}\")\n", " summary = promptfoo_summary(artifacts[\"output\"])\n", " summary[\"command\"] = command\n", " summary[\"returncode\"] = process.returncode\n", " summary[\"result_path\"] = str(artifacts[\"output\"].relative_to(PROJECT_ROOT))\n", " summary[\"log_tail\"] = process.stdout[-4000:]\n", " return summary\n", "\n", "\n", "promptfoo_started = time.perf_counter()\n", "gate_result = run_promptfoo_feedback_eval_gate(eval_suite, traces)\n", "print(f\"Promptfoo gate completed in {format_duration(time.perf_counter() - promptfoo_started)}\")\n", "display(Markdown(markdown_table(gate_result[\"rows\"], [\"title\", \"trace_label\", \"passed\", \"score\", \"explanation\"])))\n", "print({key: gate_result[key] for key in [\"backend\", \"total\", \"passed\", \"failed\", \"result_path\"]})\n" ] }, { "cell_type": "markdown", "id": "step-7-revised", "metadata": {}, "source": [ "## Step 7. Run HALO and write the handoff\n", "\n", "HALO, short for Hierarchical Agent Loop Optimization, is a methodology and Python package for improving agent harnesses from execution traces. The [HALO repository](https://github.com/context-labs/halo) describes a loop that collects traces, analyzes recurring harness-level failures, hands the resulting report to a coding agent, and repeats after the harness changes.\n", "\n", "This is the point where the loop turns the accumulated evidence into proposed harness changes. HALO reviews the current harness together with the agent traces, human feedback, model feedback, generated evals, and Promptfoo results. It then produces a ranked set of changes for the next implementation pass.\n", "\n", "The value of HALO here is that it reasons over the whole loop at once. 
It can use human judgment alongside runtime behavior and eval outcomes, then package the result as a handoff Codex can use to implement the code changes that improve the harness.\n" ] }, { "cell_type": "markdown", "id": "halo-context-heading", "metadata": {}, "source": [ "### Collect the HALO inputs\n", "\n", "Build one context object that keeps the current harness, traces, feedback, evals, and gate results together.\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "halo-context", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime, timezone\n", "\n", "\n", "def serialize_agent_config(config: AgentConfig) -> dict[str, Any]:\n", " return {\n", " \"version\": config.version,\n", " \"system_prompt\": config.system_prompt,\n", " \"model_settings\": asdict(config.model_settings),\n", " \"tool_policy\": config.tool_policy,\n", " \"eval_metadata\": config.eval_metadata,\n", " }\n", "\n", "\n", "def build_halo_context(\n", " traces: list[TraceRecord],\n", " human_feedback: list[dict[str, Any]],\n", " llm_feedback: list[dict[str, Any]],\n", " eval_suite: list[dict[str, Any]],\n", " gate_result: dict[str, Any],\n", " agent_config: AgentConfig,\n", ") -> dict[str, Any]:\n", " return {\n", " \"traces\": [asdict(trace) for trace in traces],\n", " \"human_feedback\": human_feedback,\n", " \"llm_feedback\": llm_feedback,\n", " \"eval_suite\": eval_suite,\n", " \"gate_result\": gate_result,\n", " \"agent_config\": serialize_agent_config(agent_config),\n", " }\n", "\n", "\n", "def synthetic_trace_id(value: str) -> str:\n", " return hashlib.sha256(f\"halo-context-{value}\".encode(\"utf-8\")).hexdigest()[:32]\n", "\n", "\n", "def synthetic_span_id(value: str) -> str:\n", " return hashlib.sha256(value.encode(\"utf-8\")).hexdigest()[:16]\n", "\n", "\n", "def synthetic_span(*, trace_id: str, span_id: str, name: str, observation_kind: str, attributes: dict[str, Any]) -> dict[str, Any]:\n", " now = 
datetime.now(timezone.utc).strftime(\"%Y-%m-%dT%H:%M:%S.%f000Z\")\n", " return {\n", " \"trace_id\": trace_id,\n", " \"span_id\": span_id,\n", " \"parent_span_id\": \"\",\n", " \"trace_state\": \"\",\n", " \"name\": name,\n", " \"kind\": \"SPAN_KIND_INTERNAL\",\n", " \"start_time\": now,\n", " \"end_time\": now,\n", " \"status\": {\"code\": \"STATUS_CODE_OK\", \"message\": \"\"},\n", " \"resource\": {\"attributes\": {\"service.name\": \"financial-diligence-analyst\"}},\n", " \"scope\": {\"name\": \"halo-optimization-context\", \"version\": \"1\"},\n", " \"attributes\": {\n", " \"openinference.span.kind\": observation_kind,\n", " \"inference.export.schema_version\": 1,\n", " \"inference.project_id\": \"financial_diligence_analyst_optimization_context\",\n", " \"inference.observation_kind\": observation_kind,\n", " **attributes,\n", " },\n", " }\n", "\n", "\n", "def halo_input_summary(context: dict[str, Any]) -> str:\n", " rows = [\n", " (\"Current harness config\", 1, \"global config span\", \"system prompt, model settings, tool policy, eval metadata\"),\n", " (\"SDK execution traces\", len(context[\"traces\"]), \"original runtime traces\", \"agent steps, tool calls, outputs\"),\n", " (\"Human feedback\", len(context[\"human_feedback\"]), \"appended to the source trace\", \"reviewer summary, required observations, prohibited claims\"),\n", " (\"LLM feedback\", len(context[\"llm_feedback\"]), \"appended to the source trace\", \"model-generated observations\"),\n", " (\"Generated eval definitions\", len(context[\"eval_suite\"]), \"appended to the source trace\", \"expected behavior, rubric, pass/fail examples\"),\n", " (\"Promptfoo row results\", len(context[\"gate_result\"][\"rows\"]), \"appended to the source trace\", \"pass/fail outcome and explanation\"),\n", " (\"Promptfoo gate summary\", 1, \"global summary span\", \"suite totals across all evals\"),\n", " ]\n", " lines = [\n", " \"### HALO input summary\",\n", " \"\",\n", " \"| Input signal | Count | Where it 
lives | What is included |\",\n", " \"| --- | ---: | --- | --- |\",\n", " ]\n", " lines.extend(f\"| {name} | {count} | {location} | {included} |\" for name, count, location, included in rows)\n", " return \"\\n\".join(lines)\n", "\n" ] }, { "cell_type": "markdown", "id": "halo-jsonl-heading", "metadata": {}, "source": [ "### Attach feedback, generated evals, and eval results to the traces\n", "\n", "Write the combined trace file that HALO will inspect. Human feedback, LLM feedback, generated eval definitions, and row-level Promptfoo results are attached to the matching runtime trace. The overall gate summary stays global because it describes the suite as a whole.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "halo-jsonl", "metadata": {}, "outputs": [], "source": [ "def write_halo_optimization_context(context: dict[str, Any]) -> Path:\n", " context_path = ARTIFACT_DIR / \"halo_optimization_context.jsonl\"\n", " lines = HALO_TRACE_PATH.read_text(encoding=\"utf-8\").splitlines() if HALO_TRACE_PATH.exists() else []\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=synthetic_trace_id(\"current-harness-config\"),\n", " span_id=synthetic_span_id(\"current-harness-config\"),\n", " name=\"harness.config\",\n", " observation_kind=\"HARNESS_CONFIG\",\n", " attributes={\n", " \"harness.version\": context[\"agent_config\"][\"version\"],\n", " \"harness.system_prompt\": context[\"agent_config\"][\"system_prompt\"],\n", " \"harness.model_settings\": json.dumps(context[\"agent_config\"][\"model_settings\"]),\n", " \"harness.tool_policy\": json.dumps(context[\"agent_config\"][\"tool_policy\"]),\n", " \"harness.eval_metadata\": json.dumps(context[\"agent_config\"][\"eval_metadata\"]),\n", " \"optimizer.signal_source\": \"harness_config\",\n", " },\n", " )))\n", " for index, item in enumerate(context[\"human_feedback\"]):\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=item[\"trace_id\"],\n", " 
span_id=synthetic_span_id(f\"human-feedback-{index}\"),\n", " name=\"human_feedback.comment\",\n", " observation_kind=\"HUMAN_FEEDBACK\",\n", " attributes={\n", " \"feedback.id\": item[\"feedback_id\"],\n", " \"feedback.trace_id\": item[\"trace_id\"],\n", " \"feedback.trace_label\": item[\"trace_label\"],\n", " \"feedback.question\": item[\"question\"],\n", " \"feedback.summary\": item[\"summary\"],\n", " \"feedback.required_observations\": json.dumps(item[\"required_observations\"]),\n", " \"feedback.prohibited_claims\": json.dumps(item[\"prohibited_claims\"]),\n", " \"optimizer.signal_source\": \"human_feedback\",\n", " },\n", " )))\n", " for index, item in enumerate(context[\"llm_feedback\"]):\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=item[\"trace_id\"],\n", " span_id=synthetic_span_id(f\"llm-insight-{index}\"),\n", " name=\"llm_feedback.insight\",\n", " observation_kind=\"LLM_FEEDBACK\",\n", " attributes={\n", " \"llm_feedback.id\": item[\"insight_id\"],\n", " \"llm_feedback.trace_id\": item[\"trace_id\"],\n", " \"llm_feedback.trace_label\": item[\"trace_label\"],\n", " \"llm_feedback.question\": item[\"question\"],\n", " \"llm_feedback.observations\": json.dumps(item[\"observations\"]),\n", " \"optimizer.signal_source\": \"llm_feedback\",\n", " },\n", " )))\n", " for index, item in enumerate(context[\"eval_suite\"]):\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=item[\"source_trace_id\"],\n", " span_id=synthetic_span_id(f\"generated-eval-{index}\"),\n", " name=\"generated_eval.definition\",\n", " observation_kind=\"EVAL\",\n", " attributes={\n", " \"eval.id\": item[\"eval_id\"],\n", " \"eval.trace_id\": item[\"source_trace_id\"],\n", " \"eval.trace_label\": item[\"source_trace_label\"],\n", " \"eval.title\": item[\"title\"],\n", " \"eval.method\": item[\"scoring_method\"],\n", " \"eval.expected_behavior\": item[\"expected_behavior\"],\n", " \"eval.pass_example\": item[\"suggested_pass_example\"],\n", " \"eval.fail_example\": 
item[\"suggested_fail_example\"],\n", " \"optimizer.signal_source\": \"generated_eval\",\n", " },\n", " )))\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=synthetic_trace_id(\"eval-gate-summary\"),\n", " span_id=synthetic_span_id(\"eval-gate-summary\"),\n", " name=\"eval_gate.summary\",\n", " observation_kind=\"EVAL_RESULT\",\n", " attributes={\n", " \"eval_gate.total\": context[\"gate_result\"][\"total\"],\n", " \"eval_gate.passed\": context[\"gate_result\"][\"passed\"],\n", " \"eval_gate.failed\": context[\"gate_result\"][\"failed\"],\n", " \"optimizer.signal_source\": \"eval_gate\",\n", " },\n", " )))\n", " for index, item in enumerate(context[\"gate_result\"][\"rows\"]):\n", " lines.append(json.dumps(synthetic_span(\n", " trace_id=item[\"trace_id\"],\n", " span_id=synthetic_span_id(f\"eval-gate-row-{index}\"),\n", " name=\"eval_gate.result\",\n", " observation_kind=\"EVAL_RESULT\",\n", " attributes={\n", " \"eval.id\": item[\"eval_id\"],\n", " \"eval.title\": item[\"title\"],\n", " \"eval.trace_id\": item[\"trace_id\"],\n", " \"eval.trace_label\": item[\"trace_label\"],\n", " \"eval.passed\": item[\"passed\"],\n", " \"eval.explanation\": item[\"explanation\"],\n", " \"optimizer.signal_source\": \"eval_gate\",\n", " },\n", " )))\n", " context_path.write_text(\"\\n\".join(lines).rstrip() + \"\\n\", encoding=\"utf-8\")\n", " return context_path\n" ] }, { "cell_type": "markdown", "id": "halo-prompt-heading", "metadata": {}, "source": [ "### Define the HALO output prompt\n", "\n", "This prompt tells HALO what kind of report to produce, including the sections Codex should receive in the final handoff file. 
You can customize it to match your company's workflow, review process, or use case.\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "halo-prompt", "metadata": {}, "outputs": [], "source": [ "def render_halo_prompt() -> str:\n", " return \"\"\"\n", "Analyze the financial diligence analyst optimization context as the central source of truth.\n", "The JSONL contains the current harness configuration, agent execution traces, human feedback, LLM insights, generated eval definitions, and eval-gate results.\n", "Treat human feedback as first-class evidence.\n", "Before recommending a change, compare the evidence against the current harness config and distinguish:\n", "- a requirement that is missing from the harness,\n", "- a requirement already present but not reliably followed in execution, and\n", "- an implementation or observability defect.\n", "\n", "Write an implementation-first Codex handoff in this exact top-level order:\n", "1. `## Executive summary`\n", "2. `## Top 3 changes to implement first`\n", "3. `## Ranked recommendation table`\n", "4. `## Supporting diagnosis and evidence`\n", "5. `## Detailed recommendations`\n", "6. `## Insights by feedback source`\n", "7. 
`## Machine-readable summary`\n", "\n", "Section requirements:\n", "- `## Executive summary`: briefly state what the current harness already does well, what the highest-value remaining gaps are, and whether the current eval gate passed.\n", "- `## Top 3 changes to implement first`: list the three most valuable implementation moves with concise rationale.\n", "- `## Ranked recommendation table`: include rank, recommendation, impact, confidence, implementation effort, evidence, and validation.\n", "- `## Supporting diagnosis and evidence`: include recurring harness-level failure modes, classify each against the current harness as missing requirement vs already-present-but-not-reliably-followed vs implementation/observability defect, and state the evidence source for each.\n", "- `## Detailed recommendations`: use these exact subsection headings in this order and do not use the word \"owner\" in them:\n", " - `### Behavior contract`\n", " - `#### Prompt`\n", " - `#### Skills`\n", " - `### Runtime implementation`\n", " - `#### Tools`\n", " - `#### Control flow`\n", " - `#### Routing`\n", " - `### Output contract`\n", " - `#### Artifact schema`\n", " - `### Observability and evals`\n", " - `#### Observability`\n", " - `#### Evals`\n", "- `## Insights by feedback source`: summarize what came from traces, human feedback, LLM feedback, generated evals, eval-gate results, and harness config.\n", "- `## Machine-readable summary`: include one fenced JSON block with `top_priorities`.\n", "\n", "Do not add extra top-level sections outside that order.\n", "\"\"\".strip()\n", "\n" ] }, { "cell_type": "markdown", "id": "halo-run-heading", "metadata": {}, "source": [ "### Run HALO and format the report\n", "\n", "HALO receives the five SDK execution traces plus two synthetic global traces: one records the current harness config, and one records the Promptfoo gate summary. 
That is why its trace count is higher than the five agent runs created earlier.\n", "\n", "Generate the full optimization report, save the handoff artifact, and display the highest-priority recommendations in the notebook.\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "halo-run", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "### HALO input summary\n", "\n", "| Input signal | Count | Where it lives | What is included |\n", "| --- | ---: | --- | --- |\n", "| Current harness config | 1 | global config span | system prompt, model settings, tool policy, eval metadata |\n", "| SDK execution traces | 5 | original runtime traces | agent steps, tool calls, outputs |\n", "| Human feedback | 5 | appended to the source trace | reviewer summary, required observations, prohibited claims |\n", "| LLM feedback | 5 | appended to the source trace | model-generated observations |\n", "| Generated eval definitions | 5 | appended to the source trace | expected behavior, rubric, pass/fail examples |\n", "| Promptfoo row results | 5 | appended to the source trace | pass/fail outcome and explanation |\n", "| Promptfoo gate summary | 1 | global summary span | suite totals across all evals |" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "HALO optimization started. This is usually the longest cell in the notebook.\n", "HALO still running... 30s elapsed\n", "HALO still running... 1m 00s elapsed\n", "HALO still running... 1m 30s elapsed\n", "HALO still running... 2m 00s elapsed\n", "HALO still running... 2m 30s elapsed\n", "HALO still running... 3m 00s elapsed\n", "HALO still running... 3m 30s elapsed\n", "HALO still running... 4m 00s elapsed\n", "HALO still running... 4m 30s elapsed\n", "HALO still running... 5m 00s elapsed\n", "HALO still running... 5m 30s elapsed\n", "HALO still running... 6m 00s elapsed\n", "HALO still running... 6m 30s elapsed\n", "HALO still running... 
7m 00s elapsed\n", "HALO optimization completed in 7m 15s\n", "Gate result passed into optimization context: True\n", "Wrote:\n", "- examples/agents_sdk/agent_improvement_loop_artifacts/halo_optimization_context.jsonl\n", "- examples/agents_sdk/agent_improvement_loop_artifacts/codex_handoff.md\n" ] } ], "source": [ "async def run_halo_optimization(context_path: Path) -> str:\n", " from agents import set_trace_processors\n", " from engine.agents.agent_config import AgentConfig as HaloAgentConfig\n", " from engine.engine_config import EngineConfig\n", " from engine.main import stream_engine_async\n", " from engine.sandbox.sandbox import Sandbox\n", " from engine.model_config import ModelConfig\n", " from engine.models.engine_output import AgentOutputItem, AgentTextDelta\n", " from engine.models.messages import AgentMessage\n", "\n", " # HALO's current CLI wrapper sets compaction temperature to 0.0, which is not\n", " # accepted by GPT-5-class models. Use the Python API so the compactor uses the\n", " # model default-compatible temperature while preserving the requested model.\n", " agent = HaloAgentConfig(\n", " name=\"root\",\n", " model=ModelConfig(name=HALO_MODEL),\n", " maximum_turns=20,\n", " )\n", " config = EngineConfig(\n", " root_agent=agent,\n", " subagent=agent.model_copy(update={\"name\": \"sub\"}),\n", " synthesis_model=ModelConfig(name=HALO_MODEL),\n", " compaction_model=ModelConfig(name=HALO_MODEL, temperature=1.0),\n", " maximum_depth=1,\n", " maximum_parallel_subagents=2,\n", " )\n", "\n", " # The notebook already exports the SDK traces locally; HALO does not need\n", " # hosted trace ingestion for this diagnosis pass.\n", " set_trace_processors([])\n", "\n", " deltas: list[str] = []\n", " final_items: list[str] = []\n", " messages = [AgentMessage(role=\"user\", content=render_halo_prompt())]\n", "\n", " # This pass only needs HALO's trace-analysis tools. 
Skip the optional\n", " # `run_code` sandbox so readers do not need a separate Deno/Pyodide setup\n", " # just to generate the optimization report.\n", " async def report_progress(done: asyncio.Event, interval_seconds: int = 30) -> None:\n", " started = time.perf_counter()\n", " print(\"HALO optimization started. This is usually the longest cell in the notebook.\")\n", " while not done.is_set():\n", " try:\n", " await asyncio.wait_for(done.wait(), timeout=interval_seconds)\n", " except TimeoutError:\n", " print(f\"HALO still running... {format_duration(time.perf_counter() - started)} elapsed\")\n", "\n", " original_sandbox_get = Sandbox.__dict__[\"get\"]\n", " Sandbox.get = classmethod(lambda cls: None)\n", " halo_started = time.perf_counter()\n", " progress_done = asyncio.Event()\n", " progress_task = asyncio.create_task(report_progress(progress_done))\n", " try:\n", " async for event in stream_engine_async(messages, config, context_path):\n", " if isinstance(event, AgentTextDelta):\n", " deltas.append(event.text_delta)\n", " elif isinstance(event, AgentOutputItem) and event.final:\n", " final_items.append(str(event.item))\n", " finally:\n", " progress_done.set()\n", " await progress_task\n", " Sandbox.get = original_sandbox_get\n", "\n", " print(f\"HALO optimization completed in {format_duration(time.perf_counter() - halo_started)}\")\n", " report = \"\".join(deltas).strip() or \"\\n\\n\".join(final_items).strip()\n", " if not report:\n", " raise RuntimeError(\"HALO completed without producing a report.\")\n", " return report\n", "\n", "\n", "def clean_halo_handoff(report: str) -> str:\n", " \"\"\"Keep only the final Codex-facing handoff sections from HALO output.\"\"\"\n", " normalized = re.sub(r\"(? 
Path:\n", " target = Path(path)\n", " if not target.is_absolute():\n", " target = PROJECT_ROOT / target\n", " target.parent.mkdir(parents=True, exist_ok=True)\n", " target.write_text(report.rstrip() + \"\\n\", encoding=\"utf-8\")\n", " return target\n", "\n", "\n", "halo_context = build_halo_context(traces, human_feedback, llm_feedback, eval_suite, gate_result, agent_config)\n", "display(Markdown(halo_input_summary(halo_context)))\n", "halo_context_path = write_halo_optimization_context(halo_context)\n", "halo_report = await run_halo_optimization(halo_context_path)\n", "clean_handoff = clean_halo_handoff(halo_report)\n", "\n", "handoff_path = write_halo_handoff(clean_handoff, ARTIFACT_DIR / \"codex_handoff.md\")\n", "\n", "def extract_named_section(report: str, heading: str) -> str:\n", " if heading not in report:\n", " return \"\"\n", " start = report.index(heading)\n", " remainder = report[start + len(heading):]\n", " next_section = re.search(r\"\\n## \", remainder)\n", " return report[start:] if next_section is None else report[start:start + len(heading) + next_section.start()]\n", "\n", "\n", "def render_notebook_halo_summary(report: str) -> str:\n", " sections = [\n", " extract_named_section(report, \"## Top 3 changes to implement first\"),\n", " extract_named_section(report, \"## Insights by feedback source\"),\n", " ]\n", " rendered = \"\\n\\n\".join(section.strip() for section in sections if section.strip())\n", " return rendered or report\n", "\n", "\n", "print(\"Gate result passed into optimization context:\", \"gate_result\" in halo_context)\n", "print(\"Wrote:\")\n", "print(\"-\", halo_context_path.relative_to(PROJECT_ROOT))\n", "print(\"-\", handoff_path.relative_to(PROJECT_ROOT))\n" ] }, { "cell_type": "markdown", "id": "step-8-revised", "metadata": {}, "source": [ "## Step 8. Hand the full report to Codex\n", "\n", "HALO diagnoses and prioritizes. 
A coding agent or human still changes the harness.\n", "\n", "Below is a snapshot of the full report Codex can act on: the top three recommendations plus a compact summary of what came from each feedback source. The complete `codex_handoff.md` file also includes the ranked changes, supporting evidence, and validation guidance for implementation.\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "b1c5160f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Full Codex handoff written to: examples/agents_sdk/agent_improvement_loop_artifacts/codex_handoff.md\n", "Snapshot below; open the generated codex_handoff.md file to review the full handoff.\n" ] }, { "data": { "text/markdown": [ "## Top 3 changes to implement first\n", "\n", "1. **Add a deterministic diligence fact ledger and domain checklist layer.** \n", " Encode canonical facts and source-of-truth rules for ARR, runway/burn, parent-account concentration, unsupported metrics, and SOC 2 status so the agent cannot rely only on generic citation instructions.\n", "\n", "2. **Upgrade validators to audit the actual output artifacts, not just claimed evidence coverage.** \n", " Current validation can pass while artifact-level citation or claim-audit issues still require later repair. Parse generated markdown/JSON/CSV artifacts, extract material claims, verify source support, and fail on unsupported or unaudited claims.\n", "\n", "3. **Persist the five generated evals into the checked-in regression suite.** \n", " The generated evals all passed, but they should become durable regression tests so future prompt/runtime changes cannot regress on the specific human-feedback issues.\n", "\n", "## Insights by feedback source\n", "\n", "| Feedback source | Key insights |\n", "|---|---|\n", "| Traces | The agent generally follows the artifact-generation workflow and validation loop, but execution is generic and sometimes monolithic. 
Some repairs happen after validators pass, showing validation is not strict enough. Parent concentration trace demonstrated a good deterministic-calculation pattern worth generalizing. |\n", "| Human feedback | Human feedback is the strongest evidence for domain gaps: runway must be 11 months with financing pressure; ARR must use finance-controlled source of truth; concentration must roll up to parent accounts; SOC 2 Type I and Type II must not be conflated; official NRR and CAC payback must be refused when unsupported. |\n", "| LLM feedback | LLM insights reinforce the human themes: ARR headline numbers need caveats, unsupported metrics should not be promoted, retention and pipeline claims require source caveats, and SOC 2 Type II completion must not be overstated. |\n", "| Generated evals | Five targeted evals were generated from the feedback themes: runway/burn, ARR source of truth, customer concentration parent rollup, SOC 2 precision, and unsupported metrics refusal. These encode the right regression surface and should be checked in. |\n", "| Eval-gate results | The current eval gate passed: 5 total, 5 passed, 0 failed. This indicates the latest generated eval suite is satisfied, but the suite should be persisted and expanded to cover validators, artifact parsing, and calculation correctness. |\n", "| Harness config | The harness already has strong generic evidence, citation, artifact, and validation requirements. Its main weakness is that it lacks explicit financial-diligence invariants and deterministic runtime checks for the exact mistakes surfaced by feedback. 
|" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "handoff_file = ARTIFACT_DIR / \"codex_handoff.md\"\n", "\n", "if handoff_file.exists():\n", " print(f\"Full Codex handoff written to: {handoff_file.relative_to(PROJECT_ROOT)}\")\n", " print(\"Snapshot below; open the generated codex_handoff.md file to review the full handoff.\")\n", " display(Markdown(render_notebook_halo_summary(handoff_file.read_text(encoding=\"utf-8\"))))\n", "else:\n", " print(f\"Codex handoff not found yet: {handoff_file.relative_to(PROJECT_ROOT)}\")\n", " print(\"Run the HALO optimization cell above to generate it.\")\n" ] }, { "cell_type": "markdown", "id": "step-9-revised", "metadata": {}, "source": [ "## Step 9. Close the loop\n", "\n", "Now that the full workflow is in place, we can revisit the optimization flywheel from the top of the notebook. The same architecture supports two operating modes.\n", "\n", "![Agent improvement loop flywheel](../../images/agent-improvement-loop-flywheel.svg)\n", "\n", "![Human review gates in the loop](../../images/agent-improvement-loop-human-gates.svg)\n", "\n", "It can run as a closed loop, where new traces, human and model feedback, generated Promptfoo evals, HALO diagnosis, Codex implementation, validation, and deployment all feed the next cycle. In that mode, the handoff artifact can be written to shared storage, and a Codex automation with a heartbeat can keep checking for new handoffs, wake up when one appears, and trigger the next implementation pass automatically.\n", "\n", "The developer can also add human gates wherever they want them, including trace review, eval refinement, pull request approval, merge, and deployment.\n", "\n", "The design choice is how much humans participate after they give feedback. Human judgment can steer a loop where agents do the execution, or humans can remain approval gates throughout the process. 
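\n",
"\n",
"As a minimal sketch of the heartbeat automation mentioned above, a watcher could poll shared storage for a new handoff before triggering the next implementation pass (the `wait_for_handoff` helper below is illustrative, not part of this notebook's harness):\n",
"\n",
"```python\n",
"import time\n",
"from pathlib import Path\n",
"\n",
"def wait_for_handoff(artifact_dir: Path, poll_seconds: float = 30.0, max_polls: int = 10):\n",
"    # Poll artifact_dir until codex_handoff.md appears; return its path, or None on timeout.\n",
"    handoff = artifact_dir / \"codex_handoff.md\"\n",
"    for _ in range(max_polls):\n",
"        if handoff.exists():\n",
"            return handoff\n",
"        time.sleep(poll_seconds)\n",
"    return None\n",
"```\n",
"\n",
"A production automation would also record which handoffs it has already processed and invoke Codex on each new file; the polling loop above only captures the wake-on-new-handoff behavior.\n",
"\n",
"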
In both versions, human feedback stays central because it shapes what the system learns and what it changes next.\n" ] }, { "cell_type": "markdown", "id": "conclusion", "metadata": {}, "source": [ "## Conclusion\n", "\n", "An agent improvement loop offers a path toward continual improvement without reducing the problem to prompt tuning alone. The full loop matters: traces capture behavior, human feedback adds judgment, evals preserve what the system should do, HALO turns the evidence into ranked harness changes, and Codex can implement the next pass.\n", "\n", "This area is still evolving, and some of the individual components will likely change over time. The larger idea of loop engineering is the durable part: agents can improve from real behavior when feedback, testing, and implementation are connected in one loop.\n" ] }, { "cell_type": "markdown", "id": "next-steps-revised", "metadata": {}, "source": [ "## Next steps\n", "\n", "- Choose the model for each stage of the loop by editing `AGENT_MODEL`, `ANALYSIS_MODEL`, `EVAL_GENERATION_MODEL`, `JUDGE_MODEL`, and `HALO_MODEL` near the top of the notebook.\n", "- Create your own traces to test the agent.\n", "- Decide how much of the final path should remain reviewed versus automated: you can stop at a developer-reviewed PR, or wire the handoff into a system that opens, merges, and deploys changes automatically.\n", "- Pass the generated `codex_handoff.md` file under `ARTIFACT_DIR` to Codex, inspect the harness changes it proposes, and rerun the same eval suite against the updated harness.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }