{ "cells": [ { "cell_type": "markdown", "id": "6a1c28f4", "metadata": {}, "source": [ "# Building Reliable Agents with Memory and Compaction" ] }, { "cell_type": "markdown", "id": "359e49f9", "metadata": {}, "source": [ "This Cookbook shows how to build an evidence review agent for a synthetic compliance investigation using the OpenAI Agents SDK.\n", "\n", "You will start with a simple sandbox agent, then add two reliability primitives:\n", "\n", "- **Compaction** lets you support long-running conversations despite finite context windows by carrying forward the state needed for later turns while reducing context size.\n", "- **Memory** lets future sandbox-agent runs reuse workflow lessons from prior runs without replaying every previous turn.\n", "\n", "The reliability pattern is straightforward: compaction helps the current run continue, memory helps later runs start with useful workflow guidance, and the generated memo remains the human-reviewed source of truth for the investigation.\n", "\n", "References:\n", "\n", "- [OpenAI Agents SDK](https://openai.github.io/openai-agents-python/)\n", "- [Agents SDK sessions](https://openai.github.io/openai-agents-python/sessions/)\n", "- [Agents SDK sandboxing](https://openai.github.io/openai-agents-python/ref/sandbox/)" ] }, { "cell_type": "markdown", "id": "52ed83a2", "metadata": {}, "source": [ "## Use Case: Evidence Review Agent for a Compliance Investigation\n", "\n", "A compliance team is investigating whether a vendor exception followed internal policy. The evidence arrives as a small set of files: policy language, exception notes, audit observations, approval records, and remediation plans.\n", "\n", "The agent's job is not to become the investigation record. 
Its job is to help a reviewer move through the evidence, keep track of what changed, and write a concise memo that separates supported findings from open questions.\n", "\n", "This makes the example useful for memory and compaction because the investigation has three traits that show up in real work:\n", "\n", "- **The record changes over time.** Later documents may narrow or supersede an earlier assumption.\n", "- **The conversation can become long-running.** A reviewer may ask follow-up questions, request revisions, and return to the same work later.\n", "- **The final artifact needs provenance.** The memo should cite evidence and preserve uncertainty instead of flattening the review into a confident but unsupported conclusion." ] }, { "cell_type": "markdown", "id": "a94ed949", "metadata": {}, "source": [ "## Where This Pattern Applies\n", "\n", "Although this notebook uses a compliance review, the same pattern applies anywhere knowledge workers review evolving context and produce a human-auditable artifact.\n", "\n", "Good fits include:\n", "\n", "- Customer support teams applying new policy updates to open escalations.\n", "- Security teams reviewing incident evidence and writing incident summaries.\n", "- Finance teams reconciling exceptions across policies, approvals, and audit notes.\n", "- Product teams updating competitive positioning after new launches or model releases.\n", "- Legal or procurement teams reviewing contracts, emails, and approval histories.\n", "- M&A teams absorbing new business rules, operating procedures, and diligence notes.\n", "\n", "In each case, compaction keeps the active review viable as context grows, memory carries reusable workflow lessons forward, and the final artifact remains the reviewed output." ] }, { "cell_type": "markdown", "id": "536c2bbf", "metadata": {}, "source": [ "## What You'll Build\n", "\n", "The use case is a compliance evidence review. 
A team receives policy documents, exception notes, audit findings, approvals, and remediation plans over time. The agent helps review the evidence, preserve uncertainty, and produce a concise memo with citations.\n", "\n", "You will build:\n", "\n", "1. A synthetic evidence workspace with a folder structure, manifest, and output directory.\n", "2. A simple `SandboxAgent` that can inspect files and write a memo.\n", "3. A compaction checkpoint for long-running work.\n", "4. SDK memory generation for reusable workflow lessons.\n", "5. A combined run that uses sandbox tools, compaction, memory, and generated artifacts together.\n", "\n", "You can inspect the notebook without making model calls because `RUN_AGENT` defaults to `False`. Set `RUN_AGENT = True` only when you want to execute the live sandbox workflow.\n", "\n", "### Table of Contents\n", "\n", "- [Use Case](#use-case-evidence-review-agent-for-a-compliance-investigation)\n", "- [Where This Pattern Applies](#where-this-pattern-applies)\n", "- [Prerequisites](#prerequisites)\n", "- [Setup](#setup)\n", "- [Using Agents SDK in This Notebook](#using-agents-sdk-in-this-notebook)\n", "- [Memory vs. 
Compaction](#memory-vs-compaction)\n", "- [Folder Structure and Manifest](#folder-structure-and-manifest)\n", "- [Prepare a Small Evidence Workspace](#prepare-a-small-evidence-workspace)\n", "- [Step 1: Start With a Simple Agent Configuration](#step-1-start-with-a-simple-agent-configuration)\n", "- [Step 2: Add Compaction](#step-2-add-compaction)\n", "- [Step 3: Attach Memory](#step-3-attach-memory)\n", "- [Step 4: Run With Both Compaction and Memory](#step-4-run-with-both-compaction-and-memory)\n", "- [Inspect Generated Artifacts](#inspect-generated-artifacts)" ] }, { "cell_type": "markdown", "id": "1d22dbee", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "To run the live agent workflow, you need:\n", "\n", "- Python 3.10 or later.\n", "- The `openai-agents` package.\n", "- An OpenAI API key available as `OPENAI_API_KEY`.\n", "- A local Unix-like environment for `UnixLocalSandboxClient`. The notebook uses synthetic files created from the sandbox `Manifest`, so no external dataset is required.\n", "\n", "The notebook is safe to inspect without credentials because `RUN_AGENT` defaults to `False`. Set `RUN_AGENT = True` only when you want to execute the model-backed sandbox run.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "install-agents-sdk", "metadata": {}, "outputs": [], "source": [ "# Install or upgrade the OpenAI Agents SDK.\n", "%pip install --upgrade openai-agents" ] }, { "cell_type": "markdown", "id": "ea5a9645", "metadata": {}, "source": [ "## Setup\n", "\n", "The notebook writes all synthetic files under `examples/agents_sdk/.tmp/evidence_review_memory_compaction/`. It does not require external data.\n", "\n", "By default, tracing is disabled because some organizations use Zero Data Retention (ZDR), where trace ingestion may be blocked. For synthetic data or non-ZDR environments, you can set `DISABLE_TRACING = False` in the configuration cell below to inspect traces while developing."
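\n", "\n", "If you configure tracing outside this notebook, the same environment-variable convention applies. This is a minimal configuration sketch, assuming (as the configuration cell below does) that the Agents SDK honors `OPENAI_AGENTS_DISABLE_TRACING`:\n", "\n", "```python\n", "import os\n", "\n", "# ZDR-friendly default: keep trace export off unless explicitly enabled.\n", "# setdefault leaves any value you exported before launching the notebook intact.\n", "os.environ.setdefault(\"OPENAI_AGENTS_DISABLE_TRACING\", \"true\")\n", "```\n", "\n", "Remove the variable (or set `DISABLE_TRACING = False` below) when you want traces during development.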
] }, { "cell_type": "code", "execution_count": null, "id": "e4909ee3", "metadata": {}, "outputs": [], "source": [ "from __future__ import annotations\n", "\n", "import json\n", "import os\n", "import textwrap\n", "from pathlib import Path\n", "\n", "RUN_AGENT = False\n", "MODEL = \"gpt-5.5\"\n", "COMPACTION_MODEL = \"gpt-5.4-mini\"\n", "WORKFLOW_NAME = \"evidence-review-memory-compaction\"\n", "FORCE_COMPACTION_CHECKPOINT = True\n", "DISABLE_TRACING = True\n", "\n", "if DISABLE_TRACING:\n", " os.environ[\"OPENAI_AGENTS_DISABLE_TRACING\"] = \"true\"\n", "else:\n", " os.environ.pop(\"OPENAI_AGENTS_DISABLE_TRACING\", None)\n", "\n", "if RUN_AGENT and not os.environ.get(\"OPENAI_API_KEY\"):\n", " raise RuntimeError(\"Set OPENAI_API_KEY before running the live sandbox workflow.\")\n", "\n", "print({\n", " \"run_agent\": RUN_AGENT,\n", " \"model\": MODEL,\n", " \"compaction_model\": COMPACTION_MODEL,\n", " \"force_compaction_checkpoint\": FORCE_COMPACTION_CHECKPOINT,\n", " \"tracing_disabled\": DISABLE_TRACING,\n", "})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Agents SDK in This Notebook\n", "\n", "A **sandbox agent** is an Agents SDK agent that runs with a controlled workspace. In this notebook, that workspace contains synthetic evidence files, a `manifest.csv`, an output folder, and SDK-generated memory files.\n", "\n", "The sandbox gives the agent a bounded place to inspect files and write artifacts. Instead of pasting every document into the prompt, the application creates a workspace and lets the agent use capabilities such as:\n", "\n", "- `Filesystem()` to read and write workspace files.\n", "- `Shell()` to list files, inspect documents, and search across batches.\n", "- `Compaction()` to support long-running reviews when the active context grows.\n", "- `Memory()` to store reusable workflow lessons for future sandbox-agent runs.\n", "\n", "The memo remains the human-reviewed artifact. 
Tools help the agent work, compaction helps it continue, memory helps future runs improve, and generated artifacts hold the reviewable output." ] }, { "cell_type": "markdown", "id": "460cf890", "metadata": {}, "source": [ "## Memory vs. Compaction\n", "\n", "A useful way to separate the concepts is to ask what each one is allowed to carry forward.\n", "\n", "| Question | Compaction | Memory |\n", "|---|---|---|\n", "| What does it help with? | Continuing one long-running run when context grows. | Improving future runs with reusable workflow lessons. |\n", "| What does it summarize? | The active conversation and working state. | Patterns, preferences, and process lessons worth reusing. |\n", "| Should it store investigation conclusions? | No. It can preserve working state, but the memo is the reviewed artifact. | No. Store workflow lessons, not case-specific facts. |\n", "| When is it useful? | Mid-review, especially before later batches or follow-up turns. | Across repeated reviews of similar evidence workflows. |\n", "\n", "For this notebook, the compliance memo is the source of truth for the investigation output. Memory is intentionally scoped to reviewer preferences and workflow habits, such as using the manifest first, preserving uncertainty, and keeping superseded assumptions visible." ] }, { "cell_type": "markdown", "id": "dae0de62", "metadata": {}, "source": [ "## Folder Structure and Manifest\n", "\n", "The agent works from a small file workspace. The folder structure is simple on purpose: evidence files are grouped by batch, generated outputs go under `outputs/`, and the manifest gives the agent a compact map of the available documents." ] }, { "cell_type": "markdown", "id": "ac6a82ee", "metadata": {}, "source": [ "### The Manifest Feature\n", "\n", "In the Agents SDK, a `Manifest` is the fresh-session workspace contract for a sandbox agent. 
It describes the files, directories, mounts, environment, users, groups, and related workspace configuration that should exist when a new sandbox session starts.\n", "\n", "The local SDK implementation defines these core fields:\n", "\n", "| Manifest field | What it controls | How to use it in this Cookbook |\n", "|---|---|---|\n", "| `root` | Workspace root path. Defaults to `/workspace`. | Keep the default unless a sandbox provider expects a different root. |\n", "| `entries` | Files, directories, local files, local directories, repos, or mounts to materialize. | Put `README.md`, `manifest.csv`, input documents, and `outputs/` here. |\n", "| `environment` | Environment variables available when the sandbox starts. | Use only for non-secret runtime configuration. Keep credentials out of prompts and committed notebooks. |\n", "| `users` / `groups` | Sandbox-local OS accounts and groups for providers that support them. | Usually unnecessary for a Cookbook, useful for production isolation. |\n", "| `extra_path_grants` | Additional path grants, especially useful for Unix-local workflows. | Use sparingly when a sandbox needs scoped read/write access to host paths. |\n", "| `remote_mount_command_allowlist` | Commands allowed against remote mounts. | Keep narrow when mounting external storage or data rooms. |\n", "\n", "Manifest entry paths should be workspace-relative. 
Avoid absolute paths and `..` escapes so the same agent can move between Unix-local, Docker, and hosted sandbox providers.\n" ] }, { "cell_type": "markdown", "id": "20487620", "metadata": {}, "source": [ "### Folder and Manifest Best Practices\n", "\n", "- Put source documents, manifests, helper files, and output directories in the `Manifest` instead of pasting large content into the prompt.\n", "- Put longer task instructions in workspace files such as `README.md`, `task.md`, or `AGENTS.md`; keep agent instructions focused on behavior and boundaries.\n", "- Use stable document IDs and a machine-readable manifest file so generated memos can cite sources and reviewers can inspect the path back to evidence.\n", "- Let `Memory()` manage its own memory artifacts. By default, sandbox memory uses `memories/` and `sessions/` under the workspace.\n", "- Keep generated artifacts under `outputs/` so the application can inspect, copy, validate, or archive them after the run.\n", "- Keep mount scopes narrow. 
If you mount a data room, mount only what the agent should read or write.\n", "- Treat secrets as runtime configuration injected by your application or sandbox provider, not as prompt text or committed manifest content.\n", "- Prefer a small synthetic `File(...)` or `Dir(...)` entry for a tutorial, then switch to `LocalDir`, `GitRepo`, or storage mounts for production-sized datasets.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "78335b14", "metadata": {}, "outputs": [], "source": [ "# This is a visual preview of the sandbox workspace structure.\n", "# The next cell builds the actual Manifest entries manually.\n", "WORKSPACE_TREE = \"\"\"\n", "/workspace/\n", " README.md\n", " manifest.csv\n", " docs/\n", " batch_1/\n", " batch_2/\n", " batch_3/\n", " outputs/\n", " memories/ # Generated by Memory()\n", " sessions/ # Generated by Memory()\n", "\"\"\".strip()\n", "\n", "print(WORKSPACE_TREE)\n" ] }, { "cell_type": "markdown", "id": "e6fde84a", "metadata": {}, "source": [ "## Prepare a Small Evidence Workspace\n", "\n", "A `Manifest` describes the starting files in a fresh sandbox workspace. For this tutorial, the workspace includes:\n", "\n", "- a `manifest.csv` listing documents by batch and document ID,\n", "- three small document batches,\n", "- an output directory for the review memo.\n", "\n", "The only memory primitive we attach later is the SDK's `Memory()` capability. 
Investigation findings stay in the generated reviewer memo, where they can be cited and inspected.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dff0023e", "metadata": {}, "outputs": [], "source": [ "from agents.sandbox import Manifest\n", "from agents.sandbox.entries import Dir, File\n", "\n", "\n", "def workspace_file(text: str) -> File:\n", " return File(content=textwrap.dedent(text).strip().encode(\"utf-8\") + b\"\\n\")\n", "\n", "\n", "def build_evidence_manifest() -> Manifest:\n", " return Manifest(\n", " entries={\n", " \"README.md\": workspace_file(\n", " \"\"\"\n", " # Evidence Review Workspace\n", "\n", " Review the documents in batch order. Cite document IDs from\n", " `manifest.csv` when making findings. Write the final memo to\n", " `outputs/compliance_review_memo.md`.\n", " \"\"\"\n", " ),\n", " \"manifest.csv\": workspace_file(\n", " \"\"\"\n", " doc_id,batch,path,description\n", " ACME-B1-001,1,docs/batch_1/payment_policy.txt,Baseline payment policy\n", " ACME-B1-002,1,docs/batch_1/vendor_exception.txt,Vendor exception note\n", " ACME-B2-001,2,docs/batch_2/audit_followup.txt,Audit follow-up request\n", " ACME-B2-002,2,docs/batch_2/approval_thread.txt,Approval clarification\n", " ACME-B3-001,3,docs/batch_3/remediation_plan.txt,Remediation plan\n", " \"\"\"\n", " ),\n", " \"docs/batch_1/payment_policy.txt\": workspace_file(\n", " \"\"\"\n", " doc_id: ACME-B1-001\n", " ACME requires two approvals for payments over $50,000. Exceptions must\n", " be logged with Finance Ops and reviewed within five business days.\n", " \"\"\"\n", " ),\n", " \"docs/batch_1/vendor_exception.txt\": workspace_file(\n", " \"\"\"\n", " doc_id: ACME-B1-002\n", " A vendor onboarding exception was approved verbally for Northwind\n", " Logistics because the renewal was time-sensitive. 
The note does not show\n", " a Finance Ops log entry.\n", " \"\"\"\n", " ),\n", " \"docs/batch_2/audit_followup.txt\": workspace_file(\n", " \"\"\"\n", " doc_id: ACME-B2-001\n", " Internal Audit asked Finance Ops to confirm whether Northwind Logistics\n", " received post-approval review. The request says missing exception logs\n", " should be treated as a control gap until resolved.\n", " \"\"\"\n", " ),\n", " \"docs/batch_2/approval_thread.txt\": workspace_file(\n", " \"\"\"\n", " doc_id: ACME-B2-002\n", " The approval thread says Legal approved the vendor exception, but Finance\n", " Ops approval was still pending when the payment was released.\n", " \"\"\"\n", " ),\n", " \"docs/batch_3/remediation_plan.txt\": workspace_file(\n", " \"\"\"\n", " doc_id: ACME-B3-001\n", " The remediation plan requires Finance Ops to reconcile all verbal vendor\n", " exceptions from Q4 and add retrospective control attestations.\n", " \"\"\"\n", " ),\n", " \"outputs\": Dir(),\n", " }\n", " )\n", "\n", "\n", "manifest = build_evidence_manifest()\n", "print(f\"Workspace entries: {len(manifest.entries)}\")\n" ] }, { "cell_type": "markdown", "id": "2b22e23c", "metadata": {}, "source": [ "## Step 1: Start With a Simple Agent Configuration\n", "\n", "First, build the agent without memory or compaction. The goal is to make the baseline behavior clear before adding primitives.\n", "\n", "One subtle point: `SandboxAgent` defaults can include built-in capabilities. To keep this baseline explicit, pass the exact capability list you want. Here we include only the workspace tools the agent needs to inspect files and write an artifact: `Filesystem()` and `Shell()`. We intentionally do **not** attach `Compaction()` or `Memory()` yet.\n", "\n", "- `Filesystem()` gives the sandbox agent file-oriented workspace access so it can read staged evidence and write the memo artifact. 
In the [Sandbox Agents guide](https://developers.openai.com/api/docs/guides/agents/sandboxes#give-the-agent-capabilities), capabilities are described as the way to attach sandbox-native behavior and tools to a `SandboxAgent`.\n", "- `Shell()` lets the agent inspect the workspace with terminal commands such as listing files, opening evidence documents, and searching for terms across batches. The Sandbox Agents guide notes that `Shell()` is one of the default capabilities, and the [Shell tool guide](https://developers.openai.com/api/docs/guides/tools-shell) explains that shell gives models a terminal environment for hosted or local execution.\n", "- For this baseline, these two capabilities are enough: `Filesystem()` handles workspace reads and writes, while `Shell()` handles deterministic inspection and search. Memory and compaction are added only after the baseline harness is clear.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d71ee371", "metadata": {}, "outputs": [], "source": [ "from agents.sandbox import SandboxAgent\n", "from agents.sandbox.capabilities import Filesystem, Shell\n", "\n", "BASELINE_INSTRUCTIONS = \"\"\"\n", "You are an evidence review agent for a compliance investigation.\n", "\n", "Review documents in batch order. 
Keep these boundaries clear:\n", "- Cite document IDs from `manifest.csv` for each finding.\n", "- If evidence is incomplete, record an open question instead of guessing.\n", "- Write a concise reviewer memo to `outputs/compliance_review_memo.md`.\n", "- Use the generated memo as the reviewer-facing investigation artifact.\n", "\"\"\".strip()\n", "\n", "\n", "def build_baseline_agent() -> SandboxAgent:\n", " return SandboxAgent(\n", " name=\"Evidence Review Agent\",\n", " model=MODEL,\n", " instructions=BASELINE_INSTRUCTIONS,\n", " default_manifest=build_evidence_manifest(),\n", " capabilities=[\n", " Filesystem(),\n", " Shell(),\n", " ],\n", " )\n", "\n", "\n", "baseline_agent = build_baseline_agent()\n", "print([type(capability).__name__ for capability in baseline_agent.capabilities])\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1bffe17c", "metadata": {}, "outputs": [], "source": [ "from agents import Runner\n", "from agents.run import RunConfig\n", "from agents.sandbox import SandboxRunConfig\n", "from agents.sandbox.sandboxes.unix_local import UnixLocalSandboxClient\n", "\n", "BASELINE_TASK = \"\"\"\n", "Review Batch 1 only, then draft `outputs/compliance_review_memo.md` for a\n", "compliance reviewer. 
Include cited findings and open questions.\n", "\"\"\".strip()\n", "\n", "\n", "async def _read_text_file(session, path: str) -> str | None:\n", " try:\n", " handle = await session.read(Path(path))\n", " except Exception as exc:\n", " if \"NotFound\" not in type(exc).__name__:\n", " raise\n", " return None\n", " try:\n", " return handle.read().decode(\"utf-8\", errors=\"replace\")\n", " finally:\n", " handle.close()\n", "\n", "\n", "async def _list_workspace_files(session) -> str:\n", " result = await session.exec(\n", " \"find outputs memories sessions -maxdepth 4 -type f 2>/dev/null | sort || true\",\n", " timeout=30,\n", " )\n", " return result.stdout.decode(\"utf-8\", errors=\"replace\").strip()\n", "\n", "\n", "async def _read_memory_artifacts(session) -> dict[str, str]:\n", " memory_artifacts = {}\n", " for path in [\"memories/MEMORY.md\", \"memories/memory_summary.md\"]:\n", " text = await _read_text_file(session, path)\n", " if text:\n", " memory_artifacts[path] = text\n", " return memory_artifacts\n", "\n", "\n", "async def run_in_unix_sandbox(agent: SandboxAgent, task: str, *, sdk_session=None) -> dict[str, object]:\n", " client = UnixLocalSandboxClient()\n", " session = await client.create(manifest=agent.default_manifest)\n", " try:\n", " await session.start()\n", " result = await Runner.run(\n", " agent,\n", " task,\n", " max_turns=12,\n", " run_config=RunConfig(\n", " sandbox=SandboxRunConfig(session=session),\n", " workflow_name=WORKFLOW_NAME,\n", " tracing_disabled=DISABLE_TRACING,\n", " ),\n", " session=sdk_session,\n", " )\n", "\n", " compaction_checkpoint = None\n", " if sdk_session is not None and FORCE_COMPACTION_CHECKPOINT:\n", " compaction_checkpoint = await force_compaction_checkpoint(sdk_session)\n", "\n", " # Memory generation runs as a sandbox pre-stop hook. 
Flush it before reading artifacts\n", " # so `memories/MEMORY.md` and `memories/memory_summary.md` are available here.\n", " await session.run_pre_stop_hooks()\n", "\n", " memo = await _read_text_file(session, \"outputs/compliance_review_memo.md\")\n", " files = await _list_workspace_files(session)\n", " memory_artifacts = await _read_memory_artifacts(session)\n", "\n", " return {\n", " \"result\": result,\n", " \"final_output\": str(result.final_output),\n", " \"memo\": memo,\n", " \"workspace_files\": files,\n", " \"memory_artifacts\": memory_artifacts,\n", " \"compaction_checkpoint\": compaction_checkpoint,\n", " }\n", " finally:\n", " await session.aclose()\n", "\n", "\n", "if RUN_AGENT:\n", " baseline_run = await run_in_unix_sandbox(baseline_agent, BASELINE_TASK)\n", " print(baseline_run[\"final_output\"])\n", " print(\"\\n----- END AGENT OUTPUT -----\")\n", "else:\n", " print(\"RUN_AGENT is False. Baseline agent is configured but not executed.\")\n" ] }, { "cell_type": "markdown", "id": "884105fb", "metadata": {}, "source": [ "## Step 2: Add Compaction\n", "\n", "Compaction is for long-running work. As a conversation grows, compaction reduces context size while preserving the state needed for later turns. There are three useful ways to think about it:\n", "\n", "1. **Automatic compaction with `Compaction()`**: attach the capability and let the SDK compact when context pressure requires it.\n", "2. **Threshold-based compaction with `StaticCompactionPolicy`**: set an explicit threshold for environments where you want more predictable context-size behavior.\n", "3. **Forced checkpoint compaction with `OpenAIResponsesCompactionSession.run_compaction({\"force\": True})`**: compact at an application-defined phase boundary, such as after a major review phase and before the next evidence batch.\n", "\n", "This notebook uses a forced checkpoint because the synthetic dataset is intentionally small. 
In production, automatic compaction is often the simplest starting point, and threshold-based compaction is useful when you want a tighter operational policy.\n", "\n", "> **Best practices**\n", ">\n", "> - Compact at meaningful workflow boundaries, not after every turn.\n", "> - Preserve enough working state for the next phase to make sense.\n", "> - Keep cited facts in generated artifacts, not only in compacted conversation state." ] }, { "cell_type": "code", "execution_count": null, "id": "9e403cd1", "metadata": {}, "outputs": [], "source": [ "from agents.sandbox.capabilities import Compaction, StaticCompactionPolicy\n", "\n", "\n", "def build_compaction_agent(*, demo_threshold: int | None = None) -> SandboxAgent:\n", " if demo_threshold is None:\n", " compaction = Compaction()\n", " else:\n", " compaction = Compaction(policy=StaticCompactionPolicy(threshold=demo_threshold))\n", "\n", " return SandboxAgent(\n", " name=\"Evidence Review Agent with Compaction\",\n", " model=MODEL,\n", " instructions=(\n", " BASELINE_INSTRUCTIONS\n", " + \"\\n\\nWhen context is compacted, preserve the current batch, cited facts, open \"\n", " \"questions, artifact paths, and unresolved reviewer concerns.\"\n", " ),\n", " default_manifest=build_evidence_manifest(),\n", " capabilities=[\n", " Filesystem(),\n", " Shell(),\n", " compaction,\n", " ],\n", " )\n", "\n", "\n", "compaction_agent = build_compaction_agent()\n", "threshold_compaction_agent = build_compaction_agent(demo_threshold=8_000)\n", "print({\n", " \"automatic\": [type(capability).__name__ for capability in compaction_agent.capabilities],\n", " \"threshold_policy\": [type(capability).__name__ for capability in threshold_compaction_agent.capabilities],\n", "})\n" ] }, { "cell_type": "markdown", "id": "672f08b6", "metadata": {}, "source": [ "### How Compaction Gets Triggered\n", "\n", "With the `Compaction()` capability, server-side compaction is eligible to run when the active context grows large enough. 
That is the current default behavior: attach the capability and let the SDK manage context pressure.\n", "\n", "For small tutorials, automatic compaction can be hard to see because the run may never get close to the model context limit. A lower `StaticCompactionPolicy` can help, but it still depends on the rendered context crossing the threshold.\n", "\n", "For a small evidence set, a forced checkpoint is the clearest operational pattern. The `OpenAIResponsesCompactionSession` wrapper stores session history and lets the application call `run_compaction({\"force\": True})` at a phase boundary. That makes compaction visible without inflating the evidence set.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9f63750c", "metadata": {}, "outputs": [], "source": [ "from agents.memory import OpenAIResponsesCompactionSession, SQLiteSession\n", "\n", "\n", "def build_compaction_session() -> OpenAIResponsesCompactionSession:\n", " underlying = SQLiteSession(\"evidence_review_session.sqlite\")\n", " return OpenAIResponsesCompactionSession(\n", " session_id=\"evidence-review-demo\",\n", " underlying_session=underlying,\n", " model=COMPACTION_MODEL,\n", " compaction_mode=\"input\",\n", " )\n", "\n", "\n", "async def force_compaction_checkpoint(session: OpenAIResponsesCompactionSession) -> dict[str, int]:\n", " items_before = await session.get_items()\n", " await session.run_compaction({\"force\": True, \"compaction_mode\": \"input\"})\n", " items_after = await session.get_items()\n", " return {\"items_before\": len(items_before), \"items_after\": len(items_after)}\n", "\n", "\n", "print(\"Compaction session helper defined. The final run uses it to show an explicit phase checkpoint.\")\n" ] }, { "cell_type": "markdown", "id": "46b352cd", "metadata": {}, "source": [ "## Step 3: Attach Memory\n", "\n", "Memory is for reuse across runs. 
In this example, memory should capture **workflow lessons**, not investigation facts.\n", "\n", "Good memory candidates include:\n", "\n", "- Use the manifest first when reviewing a file-based evidence workspace.\n", "- Preserve uncertainty in the memo instead of guessing.\n", "- Keep earlier assumptions visible when later evidence narrows them.\n", "\n", "Bad memory candidates include:\n", "\n", "- \"Northwind Logistics violated policy.\"\n", "- \"ACME's Finance Ops process is deficient.\"\n", "- Any case-specific conclusion that belongs in the memo.\n", "\n", "> **Best practices**\n", ">\n", "> - Use memory for stable process lessons and user preferences.\n", "> - Keep case-specific facts in reviewed artifacts such as the memo.\n", "> - Inspect generated memory before relying on it in future runs." ] }, { "cell_type": "code", "execution_count": null, "id": "b6ccf4b9", "metadata": {}, "outputs": [], "source": [ "from agents.sandbox import MemoryGenerateConfig\n", "from agents.sandbox.capabilities import Memory\n", "\n", "MEMORY_GENERATION_PROMPT = \"\"\"\n", "Store reusable workflow lessons only.\n", "Do not store ACME-specific compliance findings, document facts, evidence citations,\n", "or memo conclusions. 
Those belong in outputs/compliance_review_memo.md.\n", "Memory should help future evidence-review workflows behave better; it should not\n", "become a second investigation record.\n", "\"\"\".strip()\n", "\n", "\n", "def workflow_memory() -> Memory:\n", " return Memory(\n", " generate=MemoryGenerateConfig(\n", " extra_prompt=MEMORY_GENERATION_PROMPT,\n", " )\n", " )\n", "\n", "\n", "def build_memory_agent() -> SandboxAgent:\n", " return SandboxAgent(\n", " name=\"Evidence Review Agent with Memory\",\n", " model=MODEL,\n", " instructions=BASELINE_INSTRUCTIONS,\n", " default_manifest=build_evidence_manifest(),\n", " capabilities=[\n", " Filesystem(),\n", " Shell(),\n", " workflow_memory(),\n", " ],\n", " )\n", "\n", "\n", "memory_agent = build_memory_agent()\n", "print([type(capability).__name__ for capability in memory_agent.capabilities])\n" ] }, { "cell_type": "markdown", "id": "d39c8f63", "metadata": {}, "source": [ "## Step 4: Run With Both Compaction and Memory\n", "\n", "Now combine the pieces:\n", "\n", "- `Filesystem()` and `Shell()` let the agent navigate the evidence workspace.\n", "- `Compaction()` keeps the active review viable as context grows.\n", "- `Memory()` captures reusable workflow lessons after the run.\n", "- The final memo remains the investigation artifact.\n", "\n", "The task below asks the agent to review the synthetic evidence, write a memo, then read the memo back to verify it preserved the required structure and uncertainty." ] }, { "cell_type": "code", "execution_count": null, "id": "de806493", "metadata": {}, "outputs": [], "source": [ "FINAL_REVIEW_TASK = \"\"\"\n", "Review all three document batches in order.\n", "\n", "For each batch:\n", "1. Read the manifest and relevant documents.\n", "2. Preserve cited findings and uncertainty in your working notes and final memo.\n", "3. 
Preserve any superseded or narrowed assumption instead of silently deleting it.\n", "\n", "After Batch 3, write `outputs/compliance_review_memo.md` with:\n", "- executive summary,\n", "- cited findings table,\n", "- open questions,\n", "- recommended next steps for the reviewer.\n", "\n", "Reviewer preference for future runs: keep the memo concise, preserve uncertainty\n", "instead of guessing, and separate reusable workflow lessons from document-specific\n", "compliance findings.\n", "\"\"\".strip()\n", "\n", "\n", "def build_reliable_evidence_agent() -> SandboxAgent:\n", " return SandboxAgent(\n", " name=\"Reliable Evidence Review Agent\",\n", " model=MODEL,\n", " instructions=(\n", " BASELINE_INSTRUCTIONS\n", " + \"\\n\\nUse compaction as working context. Use SDK memory for reusable \"\n", " \"workflow lessons across runs. Do not treat memory as the system of record \"\n", " \"for ACME-specific findings; those belong in the cited memo artifact.\"\n", " ),\n", " default_manifest=build_evidence_manifest(),\n", " capabilities=[\n", " Filesystem(),\n", " Shell(),\n", " Compaction(),\n", " workflow_memory(),\n", " ],\n", " )\n", "\n", "\n", "reliable_agent = build_reliable_evidence_agent()\n", "print([type(capability).__name__ for capability in reliable_agent.capabilities])\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bdef7e0d", "metadata": {}, "outputs": [], "source": [ "EXAMPLE_OUTPUT = \"\"\"\n", "# Compliance Review Memo\n", "\n", "## Executive Summary\n", "\n", "The current record supports a control-gap finding for the Northwind Logistics\n", "vendor exception, not a final conclusion that policy was intentionally violated.\n", "The strongest evidence is that ACME required two approvals and Finance Ops\n", "logging for payment exceptions, while the Northwind exception appears to have\n", "been released before Finance Ops approval was complete.\n", "\n", "Later evidence narrows the initial concern. 
The record no longer points only to\n", "an undocumented verbal exception; it now points to a specific process weakness:\n", "Legal approval may have been obtained, but Finance Ops review and exception-log\n", "reconciliation remained incomplete at the time of release.\n", "\n", "## Cited Findings\n", "\n", "| Finding | Support | Status |\n", "|---|---|---|\n", "| Payments over $50,000 required two approvals, and exceptions had to be logged with Finance Ops. | ACME-B1-001 | Supported |\n", "| The Northwind Logistics exception was approved verbally, but the initial note does not show a Finance Ops log entry. | ACME-B1-002 | Supported |\n", "| Internal Audit treated missing exception logs as a control gap until resolved. | ACME-B2-001 | Supported |\n", "| The approval thread indicates Legal approved the exception, but Finance Ops approval was still pending when payment was released. | ACME-B2-002 | Supported |\n", "| The remediation plan requires Finance Ops to reconcile verbal vendor exceptions from Q4 and add retrospective control attestations. | ACME-B3-001 | Supported |\n", "\n", "## Open Questions\n", "\n", "- Was Finance Ops approval completed after the payment release, and if so, when?\n", "- How many other Q4 verbal vendor exceptions lack retrospective attestations?\n", "- Did any compensating control apply to the Northwind payment before remediation began?\n", "\n", "## Recommended Next Steps\n", "\n", "1. Reconcile Northwind Logistics against the Finance Ops exception log and payment-release timestamp.\n", "2. Pull the full Q4 population of verbal vendor exceptions into the same review workflow.\n", "3. 
Classify the issue as a control gap unless later evidence shows timely Finance Ops approval or an approved compensating control.\n", "\"\"\".strip()\n", "\n", "if not RUN_AGENT:\n", " print(\"RUN_AGENT is False, so this cell shows an example memo shape rather than running the model.\\n\")\n", " print(EXAMPLE_OUTPUT)\n" ] }, { "cell_type": "markdown", "id": "inspect-generated-artifacts", "metadata": {}, "source": [ "## Inspect Generated Artifacts\n", "\n", "The final agent response is useful, but the reliability pattern becomes clearer when you inspect the files the sandbox run produced. This section makes the normally hidden state visible:\n", "\n", "- the reviewer-facing memo in `outputs/compliance_review_memo.md`,\n", "- generated SDK memory files such as `memories/MEMORY.md` and `memories/memory_summary.md`,\n", "- the workspace files produced by the run, including the session log.\n", "\n", "The generated memory artifact is not the compliance memo and should not be treated as investigation truth. It is reusable workflow memory. 
The `Task Group` heading is the memory system's own grouping label, and the memory generator is steered with `MemoryGenerateConfig.extra_prompt` so it stores workflow lessons rather than ACME-specific findings.\n", "\n", "If `RUN_AGENT = False`, this section displays the expected output shape instead of live sandbox artifacts.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "inspect-generated-artifacts-code", "metadata": {}, "outputs": [], "source": [ "try:\n", " from IPython.display import Markdown, display\n", "except ImportError:\n", " Markdown = None\n", " display = None\n", "\n", "memo_text = final_run.get(\"memo\") if \"final_run\" in globals() else None\n", "workspace_files = final_run.get(\"workspace_files\") if \"final_run\" in globals() else None\n", "memory_artifacts = final_run.get(\"memory_artifacts\", {}) if \"final_run\" in globals() else {}\n", "compaction_checkpoint = final_run.get(\"compaction_checkpoint\") if \"final_run\" in globals() else None\n", "\n", "print(\"Generated workspace files:\")\n", "print(workspace_files or \"No generated workspace files were captured.\")\n", "\n", "print(\"\\nMemory and compaction configuration:\")\n", "print({\n", " \"final_agent_capabilities\": [type(capability).__name__ for capability in reliable_agent.capabilities],\n", " \"sandbox_compaction\": \"Compaction() is attached to the final agent for automatic context management\",\n", " \"sandbox_memory\": \"Memory(generate=MemoryGenerateConfig(...)) is attached to the final agent\",\n", " \"memory_write_policy\": \"Store reusable workflow lessons, not ACME-specific compliance findings\",\n", " \"forced_checkpoint\": \"OpenAIResponsesCompactionSession.run_compaction({force: True}) after the final review\",\n", " \"compaction_model_for_checkpoint\": COMPACTION_MODEL,\n", " \"checkpoint_result\": compaction_checkpoint,\n", "})\n", "\n", "if memory_artifacts:\n", " for path, text in memory_artifacts.items():\n", " heading = f\"### Generated SDK memory 
artifact: `{path}`\\n\\n\"\n", " explanation = (\n", " \"This block is generated by the SDK `Memory()` primitive. It is reusable \"\n", " \"workflow memory, not the compliance memo and not the investigation system \"\n", " \"of record.\\n\\n\"\n", " \"```text\\n----- BEGIN GENERATED SDK MEMORY ARTIFACT -----\\n\"\n", " )\n", " closing = \"\\n----- END GENERATED SDK MEMORY ARTIFACT -----\\n```\"\n", " if display is not None and Markdown is not None:\n", " display(Markdown(heading + explanation + text + closing))\n", " else:\n", " print(f\"\\nGenerated SDK memory artifact: {path}\\n\")\n", " print(\"----- BEGIN GENERATED SDK MEMORY ARTIFACT -----\")\n", " print(text)\n", " print(\"----- END GENERATED SDK MEMORY ARTIFACT -----\")\n", "else:\n", " print(\"\\nNo generated memory artifacts were captured. Set RUN_AGENT = True and rerun the final workflow.\")\n", "\n", "if memo_text:\n", " if display is not None and Markdown is not None:\n", " display(Markdown(\"## Generated memo: `outputs/compliance_review_memo.md`\\n\\n\" + memo_text))\n", " else:\n", " print(\"\\nGenerated memo:\\n\")\n", " print(memo_text)\n", "else:\n", " print(\"\\nNo memo was captured. Set RUN_AGENT = True and rerun the final workflow.\")\n" ] }, { "cell_type": "markdown", "id": "1eaf9e31", "metadata": {}, "source": [ "**Common Pitfall**\n", "\n", "Do not treat `Memory()` as an unreviewed fact database.\n", "\n", "Memory should help the next run remember how to work. It should not become a shadow compliance record. If a conclusion matters, write it into a reviewed artifact with citations." 
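, "\n", "\n", "As a quick sketch of that rule (the `is_workflow_lesson` helper and the marker list are illustrative, not part of the SDK), a pre-write filter can route case-specific notes toward the cited memo instead of memory:\n", "\n", "```python\n", "CASE_MARKERS = ('acme', 'northwind', 'finance ops')\n", "\n", "def is_workflow_lesson(note: str) -> bool:\n", "    # Notes that name the specific case belong in the cited memo, not memory.\n", "    lowered = note.lower()\n", "    return not any(marker in lowered for marker in CASE_MARKERS)\n", "\n", "notes = [\n", "    'Read the manifest before opening individual documents.',\n", "    'The Northwind exception lacked a Finance Ops log entry.',\n", "]\n", "print([note for note in notes if is_workflow_lesson(note)])\n", "# -> ['Read the manifest before opening individual documents.']\n", "```"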
] }, { "cell_type": "markdown", "id": "597a291b", "metadata": {}, "source": [ "## Conclusion\n", "\n", "You now have the building blocks for a reliable long-running agent workflow:\n", "\n", "- A sandbox workspace for controlled file access.\n", "- A manifest that helps the agent route across documents.\n", "- Compaction for finite context windows.\n", "- Memory for reusable workflow lessons.\n", "- A generated memo as the reviewed investigation artifact.\n", "\n", "The main design choice is separation of responsibility: context helps the agent work, memory helps future agents work better, and reviewed artifacts hold the facts that people will rely on." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }