{ "cells": [ { "cell_type": "markdown", "id": "md-001", "metadata": {}, "source": [ "# Build iterative repair loops with Codex\n", "\n", "This cookbook is about closed-loop agent workflows: agents that produce an output, validate it, and use the feedback to improve the next pass.\n", "\n", "We'll explore a documentation reliability workflow that detects, repairs, and validates stale or broken API and SDK examples. The worked example uses intentionally stale notebooks adapted from this Cookbook repository.\n", "\n", "We'll build this agent loop with Codex. Codex reviews the current state, applies focused changes, runs validation, and repeats when the feedback shows remaining issues.\n", "\n", "The notebook task is only the example. The pattern applies wherever agent output can be measured with trustworthy feedback.\n", "\n", "The workflow has three phases:\n", "\n", "- **Review:** inspect the current artifact and return structured findings without editing files.\n", "- **Repair:** apply focused edits to a copied artifact using the findings and the latest validation feedback.\n", "- **Validate:** run the relevant checks and report what still needs work.\n", "\n", "Validation closes the loop. The repaired notebook has to satisfy the checks that matter, and any remaining issues become the next repair input.\n" ] }, { "cell_type": "markdown", "id": "md-002", "metadata": {}, "source": [ "

\n", " \"Codex\n", "

\n" ] }, { "cell_type": "markdown", "id": "md-003", "metadata": {}, "source": [ "## Setup\n", "\n", "This notebook uses [Codex CLI](https://developers.openai.com/codex/cli) in headless mode, so the repair steps can run from Python cells instead of a chat UI. The first code cell installs the CLI; if you already have it, you can skip that cell.\n", "\n", "Before you run the live repair loop, set `OPENAI_API_KEY` in your environment.\n", "\n", "The notebook defaults to a fast repair model so the full example can finish in a reasonable amount of time. To experiment with a different model, set `REPAIR_MODEL` before you start. The install cell pins a known Codex CLI version for reproducibility; update that version intentionally when you want newer CLI behavior.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "code-008", "metadata": {}, "outputs": [], "source": [ "!npm install -g @openai/codex@0.130.0" ] }, { "cell_type": "code", "execution_count": 2, "id": "code-009", "metadata": {}, "outputs": [], "source": [ "import concurrent.futures\n", "import json\n", "import os\n", "import shlex\n", "import shutil\n", "import subprocess\n", "import tempfile\n", "from pathlib import Path\n", "from typing import Any\n", "\n", "CANDIDATE_EXAMPLE_DIRS = [Path(\".\"), Path(\"examples/codex\")]\n", "EXAMPLE_DIR = next((base for base in CANDIDATE_EXAMPLE_DIRS if (base / \"data\" / \"docs\").exists()), None)\n", "\n", "if EXAMPLE_DIR is None:\n", " raise RuntimeError(\n", " \"This notebook needs its companion sample notebooks. \"\n", " \"Download the data folder that ships with this example and place it next to \"\n", " \"this notebook as ./data/docs, or run from a checkout where examples/codex/data/docs exists.\"\n", " )\n", "\n", "DATA_DIR = EXAMPLE_DIR / \"data\" / \"docs\"\n", "DEFAULT_RUNS_DIR = Path(tempfile.gettempdir()) / \"codex_iterative_repair_loop_outputs\"\n", "RUNS_DIR = Path(os.getenv(\"CODEX_REPAIR_RUNS_DIR\", str(DEFAULT_RUNS_DIR))).expanduser()\n", "RUNS_DIR.mkdir(parents=True, exist_ok=True)\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "code-010", "metadata": {}, "outputs": [], "source": [ "MODEL = os.getenv(\"REPAIR_MODEL\", \"gpt-5.4-mini\")\n", "COOKBOOK_CHAT_MODEL = os.getenv(\"COOKBOOK_CHAT_MODEL\", \"gpt-5.5\")\n", "REPAIR_REASONING_EFFORT = os.getenv(\"REPAIR_REASONING_EFFORT\", \"low\")\n", "\n", "if not os.environ.get(\"OPENAI_API_KEY\"):\n", " raise ValueError(\"Set the OPENAI_API_KEY environment variable before running the live Codex repair loop.\")\n", "\n", "CODEX_CLI = shutil.which(\"codex\")\n", "if CODEX_CLI is None:\n", " raise RuntimeError(\"Run the install cell before continuing; Codex CLI is not on PATH.\")\n" ] }, { "cell_type": "markdown", "id": "md-012", "metadata": {}, "source": [ "## Load the sample artifacts\n", "\n", "The cells below load the three companion notebooks and summarize the metadata that drives the repair loop.\n", "\n", "The samples are small on purpose. They run quickly, but they still exercise the architecture: review finds substantive issues, repair makes focused edits, and validation produces feedback for the next pass.\n", "\n", "If you download this notebook by itself, also download the companion `data/docs/` folder and place it next to the notebook before running the cells below. The code expects those sample notebooks to be available locally.\n", "\n", "In this example, validation executes each repaired notebook end to end. 
In another domain, validation might be a unit test, policy check, schema validator, simulation, or human approval step. The important part is that failures become structured feedback instead of a dead end.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "code-013", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'notebook': 'qdrant_embeddings_search_pre_repair.ipynb',\n", " 'cells': 5,\n", " 'code_cells': 4,\n", " 'source': 'examples/vector_databases/qdrant/Using_Qdrant_for_embeddings_search.ipynb',\n", " 'target_iteration': 1,\n", " 'repair_depth': 'One-pass cleanup: modernize the local Qdrant query path and clarify the sampled fixture framing.'},\n", " {'notebook': 'getting_started_evals_pre_repair.ipynb',\n", " 'cells': 5,\n", " 'code_cells': 4,\n", " 'source': 'examples/evaluation/Getting_Started_with_OpenAI_Evals.ipynb',\n", " 'target_iteration': 2,\n", " 'repair_depth': 'Two-pass cleanup: first modernize the obvious stale Evals flow, then use validation feedback to remove result-log brittleness.'},\n", " {'notebook': 'knowledge_retrieval_pre_repair.ipynb',\n", " 'cells': 5,\n", " 'code_cells': 4,\n", " 'source': 'examples/How_to_call_functions_for_knowledge_retrieval.ipynb',\n", " 'target_iteration': 3,\n", " 'repair_depth': 'Three-pass cleanup: modernize model/API shape, then tighten runnable local setup, then restore the full retrieval teaching flow.'}]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "NOTEBOOKS = [\n", " DATA_DIR / \"qdrant_embeddings_search_pre_repair.ipynb\",\n", " DATA_DIR / \"getting_started_evals_pre_repair.ipynb\",\n", " DATA_DIR / \"knowledge_retrieval_pre_repair.ipynb\",\n", "]\n", "\n", "\n", "def read_notebook(path: Path) -> dict[str, Any]:\n", " return json.loads(path.read_text(encoding=\"utf-8\"))\n", "\n", "\n", "def case_metadata(path: Path) -> dict[str, Any]:\n", " return read_notebook(path).get(\"metadata\", {}).get(\"codex_case_study\", {})\n", "\n", "\n", "cases = []\n", "for notebook_path in NOTEBOOKS:\n", " notebook = read_notebook(notebook_path)\n", " metadata = notebook.get(\"metadata\", {}).get(\"codex_case_study\", {})\n", " repair_story = metadata.get(\"repair_story\", {})\n", " cases.append(\n", " {\n", " \"notebook\": notebook_path.name,\n", " \"cells\": len(notebook[\"cells\"]),\n", " \"code_cells\": sum(cell[\"cell_type\"] == \"code\" for cell in notebook[\"cells\"]),\n", " \"source\": metadata.get(\"source_path\"),\n", " \"target_iteration\": repair_story.get(\"target_iteration\"),\n", " \"repair_depth\": repair_story.get(\"repair_depth\", \"\"),\n", " }\n", " )\n", "\n", "cases\n" ] }, { "cell_type": "markdown", "id": "md-016", "metadata": {}, "source": [ "## Define business rules and issue taxonomy\n", "\n", "Before asking Codex to review or repair an artifact, give it a small shared contract. That keeps the loop focused on the issues that matter, instead of asking the model to infer every product and style rule from scratch.\n", "\n", "The rules below define what \"good\" means for these example notebooks: current API patterns, clear setup, runnable local samples, and preservation of the original teaching goal. 
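The shape ports directly to other content types; as a sketch, a support-article workflow might pin product facts instead of model names (every value below is invented for illustration):\n",
"\n",
"```python\n",
"# Hypothetical contract for a support-article workflow; nothing here is used elsewhere in this notebook.\n",
"support_rules = {\n",
"    \"current_product_names\": [\"Acme Cloud Console\"],\n",
"    \"required_sections\": [\"Prerequisites\", \"Resolution steps\", \"Escalation path\"],\n",
"    \"forbidden_content\": [\"screenshots of the retired UI\", \"internal-only links\"],\n",
"}\n",
"```\n",
"\n",
"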
In another workflow, this contract would describe that domain's source of truth.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "code-017", "metadata": {}, "outputs": [], "source": [ "business_rules = {\n", " \"preferred_chat_model\": COOKBOOK_CHAT_MODEL,\n", " \"preferred_embedding_model\": \"text-embedding-3-large\",\n", " \"modernize\": [\n", " \"client.chat.completions.create -> client.responses.create\",\n", " \"legacy function-calling schemas -> current tools schema\",\n", " \"qdrant.search -> qdrant.query_points\",\n", " \"oaieval CLI examples -> current Evals API workflow\",\n", " ],\n", " \"reader_experience\": [\n", " \"Make fresh-environment setup explicit.\",\n", " \"Keep the included examples runnable with local data and the standard library.\",\n", " \"Keep sample repairs self-contained unless the notebook explicitly teaches external setup.\",\n", " \"Remove manual result-file placeholders.\",\n", " \"State runtime prerequisites and side effects before readers run cells.\",\n", " \"Preserve the original teaching goal while modernizing the implementation.\",\n", " ],\n", "}\n", "\n", "business_rules\n" ] }, { "cell_type": "markdown", "id": "md-018", "metadata": {}, "source": [ "## Define structured outputs\n", "\n", "Each phase returns structured data so the next phase has something concrete to use.\n", "\n", "Review returns findings. Repair returns a change summary and the path to the updated artifact. Validation returns the remaining delta for the next pass. With structured handoffs, the loop is easier to debug, rerun, and adapt to other artifact types.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "code-019", "metadata": {}, "outputs": [], "source": [ "def object_schema(properties: dict[str, Any], required: list[str] | None = None) -> dict[str, Any]:\n", " return {\n", " \"type\": \"object\",\n", " \"properties\": properties,\n", " \"required\": required or list(properties),\n", " \"additionalProperties\": False,\n", " }\n", "\n", "\n", "def string_array() -> dict[str, Any]:\n", " return {\"type\": \"array\", \"items\": {\"type\": \"string\"}}\n", "\n", "\n", "finding_schema = object_schema(\n", " {\n", " \"artifact\": {\"type\": \"string\"},\n", " \"issue_type\": {\"type\": \"string\"},\n", " \"severity\": {\"type\": \"string\"},\n", " \"description\": {\"type\": \"string\"},\n", " \"suggested_fix_direction\": {\"type\": \"string\"},\n", " }\n", ")\n", "\n", "review_schema = object_schema(\n", " {\"findings\": {\"type\": \"array\", \"items\": finding_schema}}\n", ")\n", "\n", "fix_schema = object_schema(\n", " {\n", " \"artifact\": {\"type\": \"string\"},\n", " \"iteration\": {\"type\": \"integer\"},\n", " \"changes_made\": string_array(),\n", " \"unresolved_items\": string_array(),\n", " \"updated_artifact_path\": {\"type\": \"string\"},\n", " }\n", ")\n", "\n", "validation_case_schema = object_schema(\n", " {\n", " \"name\": {\"type\": \"string\"},\n", " \"passed\": {\"type\": \"boolean\"},\n", " \"severity\": {\"type\": \"string\"},\n", " \"evidence\": {\"type\": \"string\"},\n", " \"feedback\": {\"type\": \"string\"},\n", " }\n", ")\n", "\n", "validation_schema = object_schema(\n", " {\n", " \"overall_passed\": {\"type\": \"boolean\"},\n", " \"cases\": {\"type\": \"array\", \"items\": validation_case_schema},\n", " \"remaining_delta\": string_array(),\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "md-020", "metadata": {}, "source": [ "## Review phase\n", "\n", "The review phase reads the artifact and returns structured findings. 
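A single finding might look like this (values invented; the field names match `finding_schema` from the previous section):\n",
"\n",
"```python\n",
"# Illustrative shape only; real findings come back from the Codex review call.\n",
"example_finding = {\n",
"    \"artifact\": \"getting_started_evals_pre_repair.ipynb\",\n",
"    \"issue_type\": \"deprecated_api\",\n",
"    \"severity\": \"high\",\n",
"    \"description\": \"Teaches a stale oaieval CLI flow instead of the current Evals API.\",\n",
"    \"suggested_fix_direction\": \"Rewrite the walkthrough around the current Evals API workflow.\",\n",
"}\n",
"```\n",
"\n",
"The review step itself stays narrow. 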
It does not run validation and it does not edit files. That separation keeps the first step focused: identify likely problems before changing anything.\n", "\n", "We send the review prompt to `codex exec` with a JSON schema. The schema keeps the result machine-readable, so later cells can pass findings directly into the repair prompt instead of scraping prose from a previous answer.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "code-021", "metadata": {}, "outputs": [], "source": [ "def notebook_text(path: Path, max_chars: int = 7000) -> str:\n", " chunks = []\n", " for index, cell in enumerate(read_notebook(path)[\"cells\"]):\n", " source = \"\".join(cell.get(\"source\", []))\n", " chunks.append(f\"cell {index} ({cell['cell_type']})\\n{source}\")\n", " text = \"\\n\\n\".join(chunks)\n", " if len(text) <= max_chars:\n", " return text\n", " return text[:max_chars] + \"\\n\\n[truncated for prompt size]\"\n", "\n", "\n", "def run_command(command: str, *, stdin: str | None = None, cwd: Path | None = None, timeout: int | None = None):\n", " cwd = Path.cwd() if cwd is None else cwd\n", " return subprocess.run(\n", " shlex.split(command),\n", " input=stdin,\n", " cwd=cwd,\n", " capture_output=True,\n", " text=True,\n", " timeout=timeout,\n", " check=False,\n", " )\n", "\n", "\n", "def run_codex_json(prompt: str, schema: dict[str, Any], run_dir: Path) -> dict[str, Any]:\n", " run_dir.mkdir(parents=True, exist_ok=True)\n", " prompt_file = run_dir / \"prompt.txt\"\n", " schema_file = run_dir / \"schema.json\"\n", " answer_file = run_dir / \"answer.json\"\n", "\n", " prompt_file.write_text(prompt, encoding=\"utf-8\")\n", " schema_file.write_text(json.dumps(schema, indent=2), encoding=\"utf-8\")\n", "\n", " command = f\"\"\"\n", " {CODEX_CLI} exec\n", " --model {MODEL}\n", " --sandbox workspace-write\n", " --ask-for-approval never\n", " --config model_reasoning_effort={REPAIR_REASONING_EFFORT}\n", " --output-schema {schema_file}\n", " --output-last-message {answer_file}\n", " -\n", " \"\"\"\n", " result = run_command(command, stdin=prompt)\n", " (run_dir / \"stdout.txt\").write_text(result.stdout, encoding=\"utf-8\")\n", " (run_dir / \"stderr.txt\").write_text(result.stderr, encoding=\"utf-8\")\n", "\n", " if result.returncode != 0:\n", " raise RuntimeError(f\"Codex exited with {result.returncode}. 
See {run_dir / 'stderr.txt'}.\")\n", "\n", " return json.loads(answer_file.read_text(encoding=\"utf-8\"))\n", "\n", "\n", "def review_notebook(path: Path, run_dir: Path) -> list[dict[str, Any]]:\n", " prompt = \"\\n\".join(\n", " [\n", " \"You are reviewing a public OpenAI Cookbook notebook before publication.\",\n", " f\"Artifact: {path.name}\",\n", " \"Find issues that would make the notebook stale, hard to run, or confusing for a developer reader.\",\n", " \"Do not execute the notebook or edit files.\",\n", " \"Use concise issue_type labels such as stale_model, deprecated_api, setup_gap, runtime_risk, or clarity_issue.\",\n", " f\"Business rules: {json.dumps(business_rules)}\",\n", " \"Base findings only on the notebook content below.\",\n", " \"Keep the findings focused; three strong findings are better than a long list.\",\n", " \"\",\n", " notebook_text(path),\n", " ]\n", " )\n", " return run_codex_json(prompt, review_schema, run_dir)[\"findings\"]\n" ] }, { "cell_type": "code", "execution_count": null, "id": "code-022", "metadata": {}, "outputs": [], "source": [ "def run_initial_review(path: Path) -> tuple[str, list[dict[str, Any]]]:\n", " return path.name, review_notebook(path, RUNS_DIR / \"initial_review\" / path.stem)\n", "\n", "\n", "with concurrent.futures.ThreadPoolExecutor(max_workers=min(3, len(NOTEBOOKS))) as executor:\n", " initial_reviews = dict(executor.map(run_initial_review, NOTEBOOKS))\n", "\n", "initial_reviews\n" ] }, { "cell_type": "markdown", "id": "md-023", "metadata": {}, "source": [ "## Repair phase\n", "\n", "The repair phase gets the current artifact, review findings, business rules, and any validation feedback from the previous pass. The prompt gets more specific as the loop learns.\n", "\n", "Codex edits a copy inside the iteration directory and returns a short summary of what changed. The loop does not assume the edit worked; validation decides that in the next step.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "code-024", "metadata": {}, "outputs": [], "source": [ "def repair_prompt(path: Path, updated_path: Path, findings: list[dict[str, Any]], remaining_delta: list[str], iteration: int) -> str:\n", " repair_story = case_metadata(path).get(\"repair_story\", {})\n", " return \"\\n\".join(\n", " [\n", " \"You are repairing a copy of a public OpenAI Cookbook notebook.\",\n", " f\"Source notebook: {path}\",\n", " f\"Editable copy: {updated_path}\",\n", " f\"Iteration: {iteration}\",\n", " \"Make the smallest useful edits that address the review findings and validation delta.\",\n", " \"Preserve the notebook's teaching flow and original purpose.\",\n", " \"Keep sample repairs self-contained unless the notebook explicitly teaches external setup.\",\n", " \"For staged examples, focus on the most important remaining issue for this pass instead of rewriting everything at once.\",\n", " \"Edit only the editable copy. 
Do not claim the notebook passes validation.\",\n", " f\"Repair depth: {json.dumps(repair_story, indent=2)}\",\n", " f\"Business rules: {json.dumps(business_rules, indent=2)}\",\n", " f\"Review findings: {json.dumps(findings, indent=2)}\",\n", " f\"Remaining validation delta: {json.dumps(remaining_delta, indent=2)}\",\n", " ]\n", " )\n", "\n", "\n", "def repair_notebook(path: Path, iteration: int, findings: list[dict[str, Any]], remaining_delta: list[str], case_dir: Path) -> dict[str, Any]:\n", " updated_path = case_dir / \"updated.ipynb\"\n", " updated_path.parent.mkdir(parents=True, exist_ok=True)\n", " shutil.copy2(path, updated_path)\n", "\n", " prompt = repair_prompt(path, updated_path, findings, remaining_delta, iteration)\n", " return run_codex_json(prompt, fix_schema, case_dir / \"repair\")\n" ] }, { "cell_type": "markdown", "id": "md-025", "metadata": {}, "source": [ "## Validation phase\n", "\n", "Validation works like a small eval. We define the behavior we want, run the relevant check, and ask a judge to score the result against that rubric.\n", "\n", "For the documentation example, execution comes first. Many notebook problems only appear at runtime: a missing import, a stale file path, a cell that depends on an old API response, or setup guidance that was clear to the author but not to a fresh reader.\n", "\n", "If validation fails, the failure becomes evidence for the next repair pass. This keeps the next repair grounded in observed behavior, not just what looked right in the diff.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "code-026", "metadata": {}, "outputs": [], "source": [ "VALIDATION_CASES = [\n", " {\n", " \"name\": \"api_modernization\",\n", " \"question\": \"Does the notebook avoid stale OpenAI API patterns, legacy function-calling syntax, and outdated model names?\",\n", " },\n", " {\n", " \"name\": \"setup_reproducibility\",\n", " \"question\": \"Could a reader run the notebook from a fresh environment without hidden manual steps?\",\n", " },\n", " {\n", " \"name\": \"artifact_integrity\",\n", " \"question\": \"Did the update preserve the notebook's teaching flow and avoid deleting substantive cells?\",\n", " },\n", "]\n", "\n", "\n", "def short_output(value: Any, limit: int = 1200) -> str:\n", " if value is None:\n", " return \"\"\n", " if isinstance(value, bytes):\n", " value = value.decode(\"utf-8\", errors=\"replace\")\n", " return str(value)[-limit:]\n", "\n", "\n", "def execute_notebook(path: Path) -> dict[str, Any]:\n", " code_cells = sum(cell[\"cell_type\"] == \"code\" for cell in read_notebook(path)[\"cells\"])\n", " command = f\"jupyter nbconvert --to notebook --execute --inplace {path.name}\"\n", "\n", " try:\n", " result = run_command(\n", " command,\n", " cwd=path.parent,\n", " timeout=int(os.getenv(\"SAMPLE_NOTEBOOK_TIMEOUT_SECONDS\", \"300\")),\n", " )\n", " except FileNotFoundError:\n", " return {\n", " \"status\": \"failed\",\n", " \"executed_code_cells\": 0,\n", " \"error\": \"Jupyter or nbconvert is not installed or is not available on PATH.\",\n", " \"summary\": \"Install Jupyter with nbconvert before running the validation loop.\",\n", " }\n", " except subprocess.TimeoutExpired as exc:\n", " return {\n", " \"status\": \"failed\",\n", " \"executed_code_cells\": 0,\n", " \"error\": f\"Notebook execution timed out after {exc.timeout} seconds.\",\n", " \"summary\": short_output(exc.stderr or exc.stdout),\n", " }\n", "\n", " output = result.stderr or result.stdout\n", " return {\n", " \"status\": \"passed\" if result.returncode == 
0 else \"failed\",\n", " \"executed_code_cells\": code_cells if result.returncode == 0 else 0,\n", " \"error\": \"\" if result.returncode == 0 else f\"Notebook execution exited with code {result.returncode}.\",\n", " \"summary\": short_output(output),\n", " }\n", "\n", "def validation_prompt(updated_path: Path, before_path: Path, execution: dict[str, Any], iteration: int) -> str:\n", " repair_story = case_metadata(before_path).get(\"repair_story\", {})\n", " return \"\\n\".join(\n", " [\n", " \"You are judging a repaired OpenAI Cookbook notebook.\",\n", " f\"Iteration: {iteration}\",\n", " \"Score each validation case independently and give concise feedback for the next repair pass.\",\n", " \"Set overall_passed to false when execution failed or any case has a material issue.\",\n", " \"When execution failed, include the failure in remaining_delta so the next repair pass can address it.\",\n", " \"Use the business rules as the source of truth for current model names and API targets.\",\n", " \"Do not mark the preferred embedding model or preferred chat model as stale.\",\n", " \"For local examples, do not require extra services or package installs when the notebook says it is intentionally self-contained.\",\n", " f\"Repair depth: {json.dumps(repair_story, indent=2)}\",\n", " f\"Business rules: {json.dumps(business_rules, indent=2)}\",\n", " f\"Validation cases: {json.dumps(VALIDATION_CASES, indent=2)}\",\n", " f\"Execution evidence: {json.dumps(execution, indent=2)}\",\n", " f\"Original cell count: {len(read_notebook(before_path)['cells'])}\",\n", " f\"Updated cell count: {len(read_notebook(updated_path)['cells'])}\",\n", " \"\",\n", " notebook_text(updated_path),\n", " ]\n", " )\n", "\n", "\n", "def staged_delta(before_path: Path, iteration: int) -> list[str]:\n", " repair_story = case_metadata(before_path).get(\"repair_story\", {})\n", " target = int(repair_story.get(\"target_iteration\") or 1)\n", " if iteration >= target:\n", " return []\n", " depth = repair_story.get(\"repair_depth\", \"This case is intentionally staged across multiple repair passes.\")\n", " return [f\"Continue to iteration {iteration + 1}: {depth}\"]\n", "\n", "\n", "def evaluate_notebook(updated_path: Path, before_path: Path, run_dir: Path, iteration: int) -> dict[str, Any]:\n", " execution = execute_notebook(updated_path)\n", " judged = run_codex_json(validation_prompt(updated_path, before_path, execution, iteration), validation_schema, run_dir)\n", " failed_cases = [case for case in judged[\"cases\"] if not case[\"passed\"]]\n", " execution_delta = []\n", " if execution[\"status\"] != \"passed\":\n", " execution_delta.append(f\"Execution failed: {execution.get('error') or execution.get('summary')}\")\n", "\n", " stage_delta = staged_delta(before_path, iteration)\n", " return {\n", " \"passed\": judged[\"overall_passed\"] and execution[\"status\"] == \"passed\" and not stage_delta,\n", " \"execution_status\": execution[\"status\"],\n", " \"executed_code_cells\": execution[\"executed_code_cells\"],\n", " \"execution_summary\": execution[\"summary\"],\n", " \"findings\": failed_cases,\n", " \"remaining_delta\": execution_delta + stage_delta + judged[\"remaining_delta\"],\n", " }\n" ] }, { "cell_type": "markdown", "id": "md-027", "metadata": {}, "source": [ "## Save per-iteration outputs\n", "\n", "Each iteration writes a `record.json` file and, for this example, a repaired notebook under `CODEX_REPAIR_RUNS_DIR/iteration_N//`. 
If you do not set `CODEX_REPAIR_RUNS_DIR`, the notebook writes to your system temp directory so a normal repo checkout stays clean.\n", "\n", "Those files are the audit trail. You can see what the review found, what Codex changed, whether execution passed, and what feedback carried into the next iteration.\n", "\n", "A `record.json` file is the receipt for one loop attempt. It keeps the handoff between phases in one place:\n", "\n", "```json\n", "{\n", " \"review\": [{\"issue_type\": \"deprecated_api\", \"severity\": \"high\"}],\n", " \"repair\": {\n", " \"changes_made\": [\"Updated the notebook to use the current API pattern.\"],\n", " \"updated_artifact_path\": \"/tmp/codex_iterative_repair_loop_outputs/iteration_1/sample/updated.ipynb\"\n", " },\n", " \"validation\": {\n", " \"passed\": false,\n", " \"remaining_delta\": [\"One setup instruction is still unclear.\"]\n", " }\n", "}\n", "```\n", "\n", "That compact record is what lets a maintainer review the loop without reconstructing the whole run from notebook diffs and terminal logs.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "code-028", "metadata": {}, "outputs": [], "source": [ "def save_json(payload: Any, path: Path) -> None:\n", " path.parent.mkdir(parents=True, exist_ok=True)\n", " path.write_text(json.dumps(payload, indent=2) + \"\\n\", encoding=\"utf-8\")\n", "\n", "\n", "def iteration_dir(number: int) -> Path:\n", " path = RUNS_DIR / f\"iteration_{number}\"\n", " path.mkdir(parents=True, exist_ok=True)\n", " return path\n" ] }, { "cell_type": "markdown", "id": "md-029", "metadata": {}, "source": [ "## Run iteration 1\n", "\n", "Each notebook case is independent, so we process the cases concurrently. This keeps the demo fast while preserving the same review, repair, and validation flow for every sample.\n", "\n", "Iteration 1 reuses the initial review findings from the earlier review cell. 
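Each case also leaves its `record.json` on disk, so you can open the handoff for any fixture once the run finishes; for example:\n",
"\n",
"```python\n",
"# Illustrative: load one bundled case's record after iteration 1 completes.\n",
"record_path = RUNS_DIR / \"iteration_1\" / \"getting_started_evals_pre_repair\" / \"record.json\"\n",
"record = json.loads(record_path.read_text(encoding=\"utf-8\"))\n",
"record[\"validation\"][\"remaining_delta\"]\n",
"```\n",
"\n",
"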
After this pass, inspect the returned booleans: passing cases can stop, and failing cases carry their validation feedback into the next pass.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "code-030", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'qdrant_embeddings_search_pre_repair.ipynb': True,\n", " 'getting_started_evals_pre_repair.ipynb': False,\n", " 'knowledge_retrieval_pre_repair.ipynb': False}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "current_notebooks = {path.name: path for path in NOTEBOOKS}\n", "history: dict[int, dict[str, Any]] = {}\n", "\n", "\n", "def review_findings_for(original: Path, current_path: Path, case_dir: Path, previous_results: dict[str, Any] | None) -> list[dict[str, Any]]:\n", " if previous_results is None:\n", " return initial_reviews[original.name]\n", " return review_notebook(current_path, case_dir / \"review\")\n", "\n", "\n", "def run_case(number: int, original: Path, run_dir: Path, previous_results: dict[str, Any] | None) -> tuple[str, dict[str, Any], Path]:\n", " name = original.name\n", " case_dir = run_dir / original.stem\n", " current_path = current_notebooks[name]\n", "\n", " findings = review_findings_for(original, current_path, case_dir, previous_results)\n", " delta = [] if previous_results is None else previous_results[name][\"validation\"][\"remaining_delta\"]\n", " repair = repair_notebook(current_path, number, findings, delta, case_dir)\n", " updated_path = Path(repair[\"updated_artifact_path\"])\n", " validation = evaluate_notebook(updated_path, current_path, case_dir / \"evaluation\", number)\n", "\n", " record = {\"review\": findings, \"repair\": repair, \"validation\": validation}\n", " save_json(record, case_dir / \"record.json\")\n", " return name, record, updated_path\n", "\n", "\n", "def run_iteration(number: int, previous_results: dict[str, Any] | None = None) -> dict[str, Any]:\n", " results = {}\n", " updates = {}\n", " run_dir = iteration_dir(number)\n", "\n", " with concurrent.futures.ThreadPoolExecutor(max_workers=min(3, len(NOTEBOOKS))) as executor:\n", " futures = [executor.submit(run_case, number, original, run_dir, previous_results) for original in NOTEBOOKS]\n", " for future in concurrent.futures.as_completed(futures):\n", " name, record, updated_path = future.result()\n", " results[name] = record\n", " updates[name] = updated_path\n", "\n", " current_notebooks.update(updates)\n", " history[number] = results\n", " return results\n", "\n", "\n", "iteration_1 = run_iteration(1)\n", "{name: result[\"validation\"][\"passed\"] for name, result in iteration_1.items()}\n" ] }, { "cell_type": "markdown", "id": "md-031", "metadata": {}, "source": [ "## Run iteration 2\n", "\n", "Iteration 2 is where the loop starts to pay off. Codex is no longer working only from the original review; it also sees what happened during validation.\n", "\n", "That changes the task. 
Instead of asking for a broad rewrite, we ask for the next useful repair based on evidence from the last run: what executed, what passed, and what still needs attention.\n", "\n", "For the included staged fixtures, this pass is designed to clear the medium-depth Evals case while the deeper Knowledge Retrieval case continues with a smaller, more specific delta.\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "code-032", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'getting_started_evals_pre_repair.ipynb': True,\n", " 'qdrant_embeddings_search_pre_repair.ipynb': True,\n", " 'knowledge_retrieval_pre_repair.ipynb': False}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iteration_2 = run_iteration(2, iteration_1)\n", "{name: result[\"validation\"][\"passed\"] for name, result in iteration_2.items()}\n" ] }, { "cell_type": "markdown", "id": "md-033", "metadata": {}, "source": [ "## Run iteration 3\n", "\n", "Iteration 3 focuses on the deepest documentation case.\n", "\n", "The Knowledge Retrieval fixture has to modernize the API shape, stay runnable with local data, and preserve the retrieval teaching flow. Those requirements can pull against each other: a repair that makes the notebook modern might accidentally make it less runnable, while a repair that keeps it local might remove too much of the original lesson.\n", "\n", "The third pass gives Codex the latest notebook plus the final validation delta. This is the part of the demo that shows why iteration matters: the agent responds to the specific issue that remained, rather than trying to anticipate everything up front.\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "code-034", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'qdrant_embeddings_search_pre_repair.ipynb': True,\n", " 'getting_started_evals_pre_repair.ipynb': True,\n", " 'knowledge_retrieval_pre_repair.ipynb': True}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iteration_3 = run_iteration(3, iteration_2)\n", "{name: result[\"validation\"][\"passed\"] for name, result in iteration_3.items()}\n" ] }, { "cell_type": "markdown", "id": "md-035", "metadata": {}, "source": [ "## Summarize improvement\n", "\n", "Now we can look at the whole run instead of opening every intermediate artifact by hand. The summary below shows the signal that matters most: which artifacts passed, how many validation findings remained, and whether any delta carried forward.\n", "\n", "For the included fixtures, the intended shape is simple: one notebook clears in iteration 1, another clears in iteration 2, and the deepest one clears in iteration 3. In a real maintenance workflow, this table tells you whether the loop is converging or needs a clearer constraint or human review.\n", "\n", "This summary is also useful for human review. 
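Surfacing the cases that still carry work is a short comprehension over the run history, for example:\n",
"\n",
"```python\n",
"# Illustrative filter: (iteration, artifact) pairs that still have a delta.\n",
"unresolved = [\n",
"    (iteration, artifact)\n",
"    for iteration, results in history.items()\n",
"    for artifact, record in results.items()\n",
"    if record[\"validation\"][\"remaining_delta\"]\n",
"]\n",
"```\n",
"\n",
"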
A maintainer can start with the pass/fail pattern, open records for anything that still has a delta, and inspect only the repaired artifacts that are ready for review.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "code-036", "metadata": {}, "outputs": [], "source": [ "summary = []\n", "for iteration, results in history.items():\n", " for artifact, record in results.items():\n", " validation = record[\"validation\"]\n", " summary.append(\n", " {\n", " \"iteration\": iteration,\n", " \"artifact\": artifact,\n", " \"passed\": validation[\"passed\"],\n", " \"findings\": len(validation[\"findings\"]),\n", " \"remaining_delta\": len(validation[\"remaining_delta\"]),\n", " }\n", " )\n", "\n", "summary\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "code-037", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "iteration=1 artifact=qdrant_embeddings_search_pre_repair.ipynb passed=True findings=0 delta=0\n", "iteration=1 artifact=getting_started_evals_pre_repair.ipynb passed=False findings=0 delta=1\n", "iteration=1 artifact=knowledge_retrieval_pre_repair.ipynb passed=False findings=1 delta=3\n", "iteration=2 artifact=getting_started_evals_pre_repair.ipynb passed=True findings=0 delta=0\n", "iteration=2 artifact=qdrant_embeddings_search_pre_repair.ipynb passed=True findings=0 delta=0\n", "iteration=2 artifact=knowledge_retrieval_pre_repair.ipynb passed=False findings=0 delta=1\n", "iteration=3 artifact=qdrant_embeddings_search_pre_repair.ipynb passed=True findings=0 delta=0\n", "iteration=3 artifact=getting_started_evals_pre_repair.ipynb passed=True findings=0 delta=0\n", "iteration=3 artifact=knowledge_retrieval_pre_repair.ipynb passed=True findings=0 delta=0\n" ] } ], "source": [ "for row in summary:\n", " print(\n", " f\"iteration={row['iteration']} artifact={row['artifact']} \"\n", " f\"passed={row['passed']} findings={row['findings']} delta={row['remaining_delta']}\"\n", " )\n" ] }, { "cell_type": "markdown", "id": "md-038", "metadata": {}, "source": [ "## What the summary tells us\n", "\n", "The important signal is not that Codex made edits. The important signal is that the remaining validation delta gets smaller as the loop runs.\n", "\n", "| Pass | Signal to look for | Why it matters |\n", "| --- | --- | --- |\n", "| Iteration 1 | The simplest fixture passes; deeper fixtures keep a small delta. | The loop can make an initial repair while carrying forward the cases that still need evidence. |\n", "| Iteration 2 | The medium-depth fixture clears after seeing validation feedback. | Runtime and judge feedback become useful repair instructions. |\n", "| Iteration 3 | The deepest fixture clears or leaves a focused final delta. | The loop converges, or it produces a clear handoff for a human reviewer. |\n", "\n", "The `record.json` files are where this becomes auditable. A useful record answers four questions: what did the review find, what did Codex change, did the notebook execute, and what remains? That is the difference between an impressive-looking edit and a repair workflow a maintainer can trust.\n" ] }, { "cell_type": "markdown", "id": "md-039", "metadata": {}, "source": [ "## Generalize to a continuous loop\n", "\n", "The fixed three-pass run above is useful for teaching the pattern. 
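One refinement it leaves out is stall detection, sketched here (the helper is illustrative and is not wired into the loop below):\n",
"\n",
"```python\n",
"def delta_stalled(previous_delta: list[str], current_delta: list[str]) -> bool:\n",
"    # A non-empty delta that did not change between passes suggests more\n",
"    # iterations will not help; escalate to a human instead.\n",
"    return bool(current_delta) and current_delta == previous_delta\n",
"```\n",
"\n",
"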
A production loop should decide when to stop on its own.\n", "\n", "A good loop usually stops for one of four reasons: validation passes, the loop reaches a maximum number of attempts, the remaining delta stops changing, or the next decision needs human review. Those stop conditions are just as important as the repair prompt.\n", "\n", "The other production detail is the audit trail. Keep the review findings, repaired artifact, validation result, validation judgment, and remaining delta for every pass. That record lets a maintainer understand why the loop continued, why it stopped, and which artifact is ready for review.\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "code-039", "metadata": {}, "outputs": [], "source": [ "def repair_until_done(max_iterations: int = 3) -> dict[int, dict[str, Any]]:\n", " current_notebooks.update({path.name: path for path in NOTEBOOKS})\n", " previous = None\n", " loop_history = {}\n", "\n", " for number in range(1, max_iterations + 1):\n", " previous = run_iteration(number, previous)\n", " loop_history[number] = previous\n", " if all(record[\"validation\"][\"passed\"] for record in previous.values()):\n", " break\n", "\n", " return loop_history\n" ] }, { "cell_type": "markdown", "id": "md-040", "metadata": {}, "source": [ "## Where else this applies\n", "\n", "The notebook walkthrough is just one way to teach the architecture. The same pattern helps whenever an agent changes a file or process that needs more than subjective review before it is accepted.\n", "\n", "A few high-value examples:\n", "\n", "- **Protocol optimization:** Draft an update for expert review, then validate it against dosing rules, timing constraints, or required safety checks.\n", "- **Regulatory remediation:** Draft updates to regulated content, then check that required language, citations, approvals, and jurisdiction-specific terms remain intact.\n", "- **Support knowledge refresh:** Update an article, test it against current product behavior or known resolutions, and carry mismatches into the next pass.\n", "- **Code modernization:** Replace deprecated APIs, run tests or static checks, and use remaining failures to guide the next repair.\n", "\n", "The common thread is that the change matters, and each pass needs evidence. Whether the target is a notebook, a policy, a protocol, a support article, a pipeline, or a codebase, the loop gives the agent a way to improve it with evidence a maintainer can review.\n" ] }, { "cell_type": "markdown", "id": "md-041", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Iterative repair loops make agentic maintenance easier to review and operate because they separate judgment from proof.\n", "\n", "Review finds candidate issues. Repair makes focused edits. Validation executes the artifact and produces the next delta. When those phases exchange structured outputs, the workflow becomes easier to inspect, repeat, and adapt.\n", "\n", "The main idea is simple: instead of relying on a single pass, give the workflow a way to learn from the artifact, make a bounded repair, and react to real validation feedback. 
That small change makes agentic maintenance much more practical.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }