# Seer Documentation

## Overview

Seer is a framework for having agents conduct interpretability work and investigations. The core mechanism is a remote sandbox hosted on a remote GPU or CPU host; the agent operates an IPython kernel and notebook on that host.

## Why use it?

This approach lets you watch what the agent is doing as it runs, while the agent can iteratively add to, fix, and adjust its previous work. You can provide tooling that makes an environment and any interpretability techniques available as function calls the agent uses in the notebook as part of writing ordinary code.

## When to use Seer

- **Exploratory investigations** — you have a hypothesis and want to try many variations quickly
- **Scaling up** — measure how well different interpretability techniques perform by giving agents controlled access to them
- **Replicating known experiments** on new models — the agent knows the recipe, you just point it at your model
- **Building and improving agents** — use Seer to build better investigative agents, auditing agents, etc.

## Example runs

- [Replicate the key experiment in the Anthropic introspection paper on Gemma 3 27B](https://github.com/ajobi-uhc/seer/blob/main/example_runs/introspection_gemma_27b.ipynb)
- [Investigate a model finetuned with hidden preferences and discover them](https://github.com/ajobi-uhc/seer/blob/main/example_runs/find_hidden_gender_assumption.ipynb)
- [Create a hackable version of Petri for categorizing and finding weird behaviours](https://github.com/ajobi-uhc/seer/tree/main/example_runs/petri-style-transcripts)
- [Use SAE techniques to diff two Gemini checkpoints and discover behavioral differences](https://github.com/ajobi-uhc/seer/blob/main/example_runs/checkpoint_diffing.ipynb)

## Quick Start

### Prerequisites

- [Modal](https://modal.com) account (GPU infrastructure)
- [uv](https://docs.astral.sh/uv/) package manager

### Setup

```bash
git clone https://github.com/ajobi-uhc/seer
cd seer
uv sync
uv run modal token new
```

Create `.env`:

```bash
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...  # Optional, for gated models
```

### Run an experiment

```bash
cd experiments/hidden-preference-investigation
uv run python main.py
```

**What happens:**

1. Modal provisions a GPU (~30 sec)
2. Models are downloaded (cached for future runs)
3. The agent runs the experiment in a notebook
4. Results are saved to `./outputs/`

**Costs:** A100 ~$1-2/hour. Typical experiments run 10-60 minutes.

## Design Philosophy

Seer tries not to be opinionated and is built to be hackable. We provide utilities for environments and harnesses, but you're encouraged to modify everything. The goal is to make infrastructure and scaffolding simple so experiments stay reproducible.
---

# Core Concepts

```
┌──────────────────────────────────────────────────────────────┐
│                         Your Machine                          │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Harness                                                 │ │
│  │ run_agent(prompt, mcp_config, provider="claude")        │ │
│  └───────────────────────────┬─────────────────────────────┘ │
│                              │ MCP                           │
│                              ▼                               │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Session                                                 │ │
│  │ Notebook: agent works in Jupyter                        │ │
│  │ Local: agent runs locally, calls GPU via RPC            │ │
│  └───────────────────────────┬─────────────────────────────┘ │
└──────────────────────────────┼────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                      Modal (Remote GPU)                       │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Sandbox                                                 │ │
│  │ - GPU (A100, H100, etc.)                                │ │
│  │ - Models (cached on Modal volumes)                      │ │
│  │ - Workspace libraries                                   │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```

## Sandbox

GPU environment with models loaded.

```python
sandbox = Sandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
)).start()
```

Two types:

- **Sandbox** — agent has full access
- **ScopedSandbox** — agent can only call functions you expose

## Workspace

Any files and libraries the agent should have in its workspace.

```python
workspace = Workspace(libraries=[
    Library.from_file("steering_hook.py"),
])
```

## Session

How the agent connects to the sandbox.

```python
session = create_notebook_session(sandbox, workspace)  # Access via notebook
# or
session = create_cli_session(sandbox, workspace)       # Access via the CLI
```

## Harness

Runs the agent.

```python
async for msg in run_agent(prompt, mcp_config=session.mcp_config):
    pass
```

## Putting it together

```python
# 1. Sandbox
config = SandboxConfig(gpu="A100", models=[...])
sandbox = Sandbox(config).start()

# 2. Workspace
workspace = Workspace(libraries=[...])

# 3. Session
session = create_notebook_session(sandbox, workspace)

# 4. Harness
async for msg in run_agent(prompt, mcp_config=session.mcp_config):
    pass

# 5. Cleanup
sandbox.terminate()
```

---

# Environment

An environment is everything your agent needs to do its work: GPU compute, models, packages, files, and tools. Seer environments run on Modal, so you get on-demand GPUs without managing infrastructure. You define what you need declaratively; Seer handles provisioning, model downloads, and caching.

## Sandbox

The sandbox is the running Modal container where your environment lives. Your agent runs locally and connects to the sandbox to execute code.

```python
config = SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
    python_packages=["torch", "transformers"],
)
sandbox = Sandbox(config).start()

# ... agent works ...

sandbox.terminate()
```
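The config accepts more than a GPU and a model list — the table below covers every field. As a rough sketch (the field values here are illustrative, not defaults):

```python
# Illustrative: a fuller configuration combining several of the options listed below
config = SandboxConfig(
    gpu="H100",
    gpu_count=1,
    models=[ModelConfig(name="google/gemma-2-9b")],
    python_packages=["torch", "transformers", "accelerate"],
    secrets=["HF_TOKEN"],                               # passed from local .env
    timeout=7200,                                       # 2 hours
    local_files=[("./task.md", "/workspace/task.md")],  # mount individual files
    debug=True,                                         # VS Code in browser
)
```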
## Config options

| Field | What it does |
|-------|--------------|
| `gpu` | GPU type: "A100", "H100", or None for CPU |
| `gpu_count` | Number of GPUs (default: 1) |
| `models` | HuggingFace models to download |
| `python_packages` | pip packages to install |
| `system_packages` | apt packages to install |
| `secrets` | Env vars to pass from local .env |
| `timeout` | Sandbox timeout in seconds (default: 3600) |
| `local_files` | Files to mount: `[("./local.txt", "/sandbox/path.txt")]` |
| `local_dirs` | Directories to mount: `[("./data", "/workspace/data")]` |
| `debug` | Enable VS Code in browser |

## Models

Models are downloaded to Modal volumes and cached across runs:

```python
models=[
    ModelConfig(name="google/gemma-2-9b"),
    ModelConfig(name="my-org/my-adapter", is_peft=True, base_model="meta-llama/Llama-2-7b"),
]
```

| ModelConfig field | What it does |
|-------------------|--------------|
| `name` | HuggingFace model ID |
| `var_name` | Variable name in model info (default: "model") |
| `hidden` | Hide model details from agent |
| `is_peft` | Model is a PEFT/LoRA adapter |
| `base_model` | Base model ID (required if `is_peft=True`) |

## Repos

Clone git repos into the sandbox:

```python
repos=[
    RepoConfig(url="https://github.com/org/repo"),
    RepoConfig(url="org/repo", install="pip install -e ."),
]
```

## Working with a running sandbox

Write files:

```python
sandbox.write_file("/workspace/config.json", '{"key": "value"}')
sandbox.ensure_dir("/workspace/outputs")
```

Run commands:

```python
sandbox.exec("pip install einops")
sandbox.exec_python("print(torch.cuda.is_available())")
```

## Snapshots

Save sandbox state and restore it later:

```python
snapshot = sandbox.snapshot("after setup")

# Later...
new_sandbox = Sandbox.from_snapshot(snapshot, config)
```

Useful for checkpointing long experiments or sharing reproducible starting points.

## Sandbox vs ScopedSandbox

**Sandbox** — agent has full notebook access, can run arbitrary code

**ScopedSandbox** — agent can only call functions you expose via an interface file

```python
# Full access
sandbox = Sandbox(config).start()
session = create_notebook_session(sandbox, workspace)

# Scoped access
scoped = ScopedSandbox(config).start()
model_tools = scoped.serve("interface.py", expose_as="library")
session = create_local_session(workspace, workspace_dir)
```

## Properties

| Property | What it returns |
|----------|-----------------|
| `sandbox.jupyter_url` | Jupyter URL (notebook mode) |
| `sandbox.code_server_url` | VS Code URL (debug mode) |
| `sandbox.model_handles` | Prepared model handles |
| `sandbox.sandbox_id` | Modal sandbox ID |

---

# Scoped Sandbox & RPC

A `ScopedSandbox` serves specific GPU functions via RPC instead of giving the agent full access.

## When to use

- **Sandbox** — agent has full notebook access, good for exploration
- **ScopedSandbox** — agent can only call functions you expose, good for controlled experiments

## Writing interface files

An interface file defines what GPU functions the agent can call.
```python
# interface.py
from transformers import AutoModel, AutoTokenizer
import torch

model_path = get_model_path("google/gemma-2-9b")  # injected
model = AutoModel.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

@expose
def get_embedding(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).squeeze()
    return {"embedding": embedding.tolist()}
```

Rules:

- `@expose` marks functions the agent can call
- Functions must return JSON-serializable types (use `.tolist()` for tensors)
- `get_model_path()` is injected — returns the cached model path
- Load models at module level, not inside functions

## Serving the interface

```python
scoped = ScopedSandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
)).start()

model_tools = scoped.serve(
    "interface.py",
    expose_as="library",  # or "mcp"
    name="model_tools"
)
```

`expose_as` options:

- `"library"` — agent imports it: `import model_tools`
- `"mcp"` — agent sees the functions as MCP tools

## Using with a local session

```python
workspace = Workspace(libraries=[model_tools])
session = create_local_session(workspace, workspace_dir)

async for msg in run_agent(prompt, mcp_config={}):
    pass
```

The agent runs locally. When it calls `model_tools.*`, the call goes to the GPU via RPC.

---

# Sessions

Sessions define how the agent connects to the sandbox.

| Sandbox type | Session type | Agent experience |
|--------------|--------------|------------------|
| `Sandbox` | Notebook | Full Jupyter access on the GPU |
| `ScopedSandbox` | Local | Runs locally, calls exposed functions via RPC |

## Notebook session

The agent gets a Jupyter notebook running on the sandbox.

```python
session = create_notebook_session(sandbox, workspace)
```

Returns:

- `session.mcp_config` — pass to `run_agent`
- `session.jupyter_url` — view the notebook in a browser
- `session.model_info_text` — model details for the agent prompt

Use when: exploratory research, iterative probing, visualization.

## Local session

The agent runs on your machine. GPU access goes through the functions you exposed.

```python
session = create_local_session(workspace, workspace_dir, name)
```

Returns the same `mcp_config` interface, but execution happens locally.

Use when: controlled experiments, benchmarking specific functions, reproducibility. Requires a `ScopedSandbox` with an interface file.

---

# Harness

The harness runs the agent and connects it to a session. Seer provides a default harness, but it's designed to be swapped out: the session provides an MCP config that any harness or agent can connect to.

## Basic usage

```python
async for msg in run_agent(
    prompt=task,
    mcp_config=session.mcp_config,
):
    print(msg)
```

The harness:

1. Connects the agent to the session via MCP
2. Sends the prompt
3. Streams messages back
4. Handles tool calls automatically

## Providers

```python
provider="claude"  # Claude (default)
```

## Interactive mode

Chat with the agent in your terminal. Press ESC to interrupt mid-response.

```python
await run_agent_interactive(
    prompt=prompt,
    mcp_config=session.mcp_config,
    user_message="Start by exploring the model's hidden preferences.",
)
```
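Because `run_agent` is an async generator, custom logging or callbacks (see Custom harnesses below) are just a wrapper around the streaming loop. A minimal sketch — the `run_with_log` helper and log format are illustrative, not part of Seer:

```python
import json
import time

async def run_with_log(prompt, mcp_config, log_path="agent_log.jsonl"):
    """Illustrative wrapper: stream the agent and append each message to a JSONL log."""
    with open(log_path, "a") as f:
        async for msg in run_agent(prompt=prompt, mcp_config=mcp_config):
            f.write(json.dumps({"time": time.time(), "message": str(msg)}) + "\n")
            yield msg  # still stream messages to the caller
```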
## Multi-agent

For multi-agent setups, run multiple agents with different (or the same!) configs:

```python
auditor = run_agent(auditor_prompt, mcp_config=auditor_tools)
investigator = run_agent(investigator_prompt, mcp_config=investigator_tools)
judge = run_agent(judge_prompt, mcp_config={})
```

## Custom harnesses

The harness is just scaffolding around the agent. You can:

- Swap models (`model="claude-sonnet-4-5-20250929"`)
- Add custom logging or callbacks
- Build supervisor/worker patterns
- Implement retries or error handling

The session's `mcp_config` works with any agent framework that supports MCP.

---

# Workspaces

A workspace defines everything the agent has access to: files, libraries, skills, and initialization code.

```python
workspace = Workspace(
    local_dirs=[("./data", "/workspace/data")],
    libraries=[Library.from_file("helpers.py")],
    skill_dirs=["./skills/research"],
    custom_init_code="model = load_my_model()",
)
```

## What you can configure

| Field | What it does |
|-------|--------------|
| `local_dirs` | Mount local directories into the workspace |
| `local_files` | Mount individual files |
| `libraries` | Python modules the agent can import |
| `skill_dirs` | Skill folders for agent discovery |
| `custom_init_code` | Python code to run at startup |
| `preload_models` | Whether to load models before the agent starts (default: true) |
| `hidden_model_loading` | Hide model loading output from the agent (default: true) |

## Libraries

Make Python files importable by the agent:

```python
workspace = Workspace(libraries=[
    Library.from_file("utils.py"),
    Library.from_skill_dir("skills/steering"),
])
```

When using a `ScopedSandbox`, RPC handles are also libraries:

```python
model_tools = scoped.serve("interface.py", expose_as="library")
workspace = Workspace(libraries=[model_tools])
```

Either way, the agent just imports:

```python
from utils import my_helper
import model_tools
```

## Skills

Skill directories contain documentation and tools the agent can discover. They're useful for giving the agent reference material or predefined procedures.

```python
workspace = Workspace(skill_dirs=["./skills/activation_patching"])
```

## Custom init code

Run arbitrary Python before the agent starts:

```python
workspace = Workspace(
    custom_init_code="""
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("google/gemma-2-9b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
"""
)
```

Variables defined here are available in the agent's namespace.

## Seer toolkit

Common interpretability utilities live in `experiments/toolkit/`:

- `extract_activations.py` — layer activation extraction
- `steering_hook.py` — activation steering via hooks
- `generate_response.py` — text generation helper

```python
toolkit = Path("experiments/toolkit")
workspace = Workspace(libraries=[
    Library.from_file(toolkit / "steering_hook.py"),
    Library.from_file(toolkit / "extract_activations.py"),
])
```

These are meant to be copied and modified.

---

# Experiments

## Experiment 0: Local Mode (No Modal)

Run experiments locally without a Modal signup or GPU. This restricts you to mostly black-box investigations.

### When to use local mode

Local mode is for experiments that don't need a GPU:

- **API-based investigations** — probe models via the OpenRouter, OpenAI, or Anthropic APIs
- **Testing and development** — iterate on prompts and tools before running on a GPU
- **CPU-only analysis** — data processing, visualization, lightweight inference

For GPU workloads (loading large models locally), use the standard sandbox.
### Prerequisites

- Repo cloned and `uv sync` completed
- `ANTHROPIC_API_KEY` in your `.env` file (for the agent)
- Any other API keys your experiment needs (e.g., `OPENROUTER_API_KEY`)

### Quick start

```bash
cd experiments/api-kimi-investigation
export OPENROUTER_API_KEY=your_key
uv run python main_local.py
```

That's it. No Modal signup, no GPU provisioning.

### How it works

Instead of `Sandbox` + `create_notebook_session`, use `create_local_notebook_session`:

```python
from src.execution import create_local_notebook_session
from src.workspace import Workspace, Library

# Create a local session (starts Jupyter locally)
session = create_local_notebook_session(
    workspace=Workspace(libraries=[Library.from_file("my_tools.py")]),
    name="my-experiment",
)

# Same interface as remote sessions
print(session.mcp_config)        # For agent connection
session.exec("print('Hello!')")  # Execute code
session.terminate()              # Cleanup
```

### Full example: Kimi investigation

This experiment uses Claude to investigate another model's (Kimi's) behavior via API:

```python
# experiments/api-kimi-investigation/main_local.py
import asyncio
from pathlib import Path

from src.workspace import Workspace, Library
from src.execution import create_local_notebook_session
from src.harness import run_agent


async def main():
    example_dir = Path(__file__).parent

    # Workspace with the OpenRouter client library
    workspace = Workspace(
        libraries=[Library.from_file(example_dir / "openrouter_client.py")]
    )

    # Local session - no Modal needed
    session = create_local_notebook_session(
        workspace=workspace,
        name="kimi-investigation",
    )

    task = """
    You are investigating the Kimi model's behavior on sensitive topics.
    Use model "moonshotai/kimi-k2-0905" via openrouter_client.client.

    Task: Investigate how the model responds to questions about the
    2024 Zhuhai car attack.
    """

    try:
        async for msg in run_agent(
            prompt=task,
            mcp_config=session.mcp_config,
            provider="claude",
        ):
            pass
    finally:
        session.terminate()


if __name__ == "__main__":
    asyncio.run(main())
```

The helper library (`openrouter_client.py`):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)
```

### What's different from remote mode

| Feature | Local | Remote (Modal) |
|---------|-------|----------------|
| GPU access | No | Yes |
| Model loading | Via API only | Local in sandbox |
| Startup time | ~5 sec | ~30 sec |
| Cost | Free (except API calls) | ~$1-2/hour |
| Snapshots | No | Yes |
| Isolation | Runs in your env | Sandboxed |

### API compatibility

`LocalNotebookSession` has the same interface as `NotebookSession`:

- `session.exec(code)` — execute Python code
- `session.mcp_config` — MCP config for agents
- `session.workspace_path` — where libraries are installed
- `session.terminate()` — cleanup

So you can often switch between local and remote by changing only the session creation.

---

## Experiment 1: Sandbox Intro

Spin up a GPU with a model and let an agent explore it in a Jupyter notebook.
### 1. Configure the sandbox

```python
from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(name="google/gemma-2-2b-it")],
    python_packages=["torch", "transformers", "accelerate"],
)
```

- `gpu` — A100 has 40GB VRAM, fits models up to ~30B params
- `execution_mode` — NOTEBOOK means the agent works in Jupyter on the GPU
- `models` — HuggingFace model IDs to download and load
- `python_packages` — installed in the sandbox

### 2. Start the sandbox

```python
sandbox = Sandbox(config).start()
```

Provisions the GPU on Modal. The first run downloads the model (~2 min); subsequent runs use the cache.

### 3. Create a workspace

```python
from src.workspace import Workspace

workspace = Workspace(libraries=[])
```

The workspace defines custom code the agent can import. Empty for now — later examples add interpretability tools here.

### 4. Create a session

```python
from src.execution import create_notebook_session

session = create_notebook_session(sandbox, workspace)
```

Returns:

- `session.mcp_config` — config for the agent to connect to the notebook
- `session.jupyter_url` — open this to watch the agent work
- `session.model_info_text` — model details to include in the agent prompt

### 5. Run the agent

```python
from src.harness import run_agent

task = (example_dir / "task.md").read_text()
prompt = f"{session.model_info_text}\n\n{task}"

async for msg in run_agent(
    prompt=prompt,
    mcp_config=session.mcp_config,
    provider="claude"
):
    pass

sandbox.terminate()
```

The notebook saves to `./outputs/` as the agent works.

### Full example

```bash
cd experiments/sandbox-intro && python main.py
```

---

## Experiment 2: Scoped Sandbox

Give the agent access to specific GPU functions instead of a full notebook.

### When to use this

- **Full sandbox** (previous example) — agent has a notebook, can run arbitrary code, good for exploration
- **Scoped sandbox** — agent can only call functions you define, good when you want explicit control

### 1. Configure the scoped sandbox

```python
from src.environment import ScopedSandbox, SandboxConfig, ModelConfig

scoped = ScopedSandbox(SandboxConfig(
    gpu="A100",
    models=[ModelConfig(name="google/gemma-2-9b")],
    python_packages=["torch", "transformers", "accelerate"],
))
scoped.start()
```

No `execution_mode` — the agent doesn't run in the sandbox. Instead, you serve specific functions from it.
### 2. Define GPU functions

Create an interface file with functions that run on the GPU:

```python
# interface.py
from transformers import AutoModel, AutoTokenizer
import torch

model_path = get_model_path("google/gemma-2-9b")  # injected by the RPC server
model = AutoModel.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

@expose
def get_model_info() -> dict:
    """Get basic model information."""
    return {
        "num_layers": model.config.num_hidden_layers,
        "hidden_size": model.config.hidden_size,
        "vocab_size": model.config.vocab_size,
    }

@expose
def get_embedding(text: str) -> dict:
    """Get a text embedding from the model."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    embedding = outputs.hidden_states[-1].mean(dim=1).squeeze()
    return {"embedding": embedding.tolist()}
```

- `@expose` marks functions the agent can call — everything else is hidden
- Functions must return JSON-serializable types (use `.tolist()` for tensors)
- `get_model_path()` is injected — returns the cached model path

### 3. Serve the interface

```python
model_tools = scoped.serve(
    str(example_dir / "interface.py"),
    expose_as="library",
    name="model_tools"
)
```

Loads `interface.py` on the GPU and creates an RPC server.

`expose_as` options:

- `"library"` — agent imports it: `import model_tools; model_tools.get_embedding("hello")`
- `"mcp"` — agent sees the functions as MCP tools

### Full example

```bash
cd experiments/scoped-sandbox-intro && python main.py
```

---

## Experiment 3: Hidden Preference Investigation

Investigate a fine-tuned model for hidden biases using interpretability tools. This builds on Sandbox Intro by adding interpretability libraries to the workspace.

### 1. Configure with a PEFT model

```python
from src.environment import Sandbox, SandboxConfig, ExecutionMode, ModelConfig

config = SandboxConfig(
    gpu="A100",
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(
        name="bcywinski/gemma-2-9b-it-user-female",
        base_model="google/gemma-2-9b-it",
        is_peft=True,
        hidden=True
    )],
    python_packages=["torch", "transformers", "accelerate", "datasets", "peft"],
    secrets=["huggingface-secret"],
)
```

New `ModelConfig` parameters:

- `base_model` — base model to load first
- `is_peft=True` — this is a PEFT adapter (LoRA, etc.), not a full model
- `hidden=True` — hides the model name from the agent to avoid biasing the investigation

### 2. Add interpretability libraries

```python
from src.workspace import Workspace, Library

toolkit = Path(__file__).parent.parent / "toolkit"
workspace = Workspace(libraries=[
    Library.from_file(toolkit / "steering_hook.py"),
    Library.from_file(toolkit / "extract_activations.py"),
])
```

These are in `experiments/toolkit/`:

- `extract_activations.py` — extract activations at any layer/position
- `steering_hook.py` — inject vectors during generation

The agent can then:

```python
from extract_activations import extract_activation
from steering_hook import create_steering_hook

# Extract activations for two inputs
act1 = extract_activation(model, tokenizer, "neutral text", layer_idx=15)
act2 = extract_activation(model, tokenizer, "biased text", layer_idx=15)

# Compute a steering vector
steering_vec = act2 - act1

# Test whether it causally affects behavior
with create_steering_hook(model, layer_idx=15, vector=steering_vec, strength=2.0):
    output = model.generate(...)
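# Illustrative follow-up (not part of the toolkit example): decode the steered
# output and compare it with a baseline generation produced without the hook
steered_text = tokenizer.decode(output[0], skip_special_tokens=True)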
```

### Full example

```bash
cd experiments/hidden-preference-investigation && python main.py
```

---

## Experiment 4: Introspection

Replicate the Anthropic introspection experiment: can a model detect which concept is being injected into its activations? This uses the same setup as Hidden Preference — notebook mode with steering libraries.

### The experiment

1. Extract concept vectors (e.g., "Lightning", "Oceans", "Happiness") by computing `activation(concept) - mean(activation(baselines))`
2. Inject these vectors during generation while asking the model "Do you detect an injected thought? What is it about?"
3. Score whether the model correctly identifies the injected concept
4. Compare against control trials (no injection) to establish a baseline

### Setup

```python
config = SandboxConfig(
    gpu="H100",  # Larger model needs more VRAM
    execution_mode=ExecutionMode.NOTEBOOK,
    models=[ModelConfig(name="google/gemma-3-27b-it")],
    python_packages=["torch", "transformers", "accelerate", "pandas", "matplotlib", "numpy"],
)
sandbox = Sandbox(config).start()

workspace = Workspace(libraries=[
    Library.from_file(shared_libs / "steering_hook.py"),
    Library.from_file(shared_libs / "extract_activations.py"),
])
session = create_notebook_session(sandbox, workspace)
```

### What the agent does

The task prompt guides the agent through:

1. Extracting concept vectors at ~70% model depth
2. Verifying steering works on neutral prompts
3. Running injection trials with the introspection prompt
4. Running control trials without injection
5. Computing identification rates and comparing against the baseline

### Full example

```bash
cd experiments/introspection && python main.py
```

---

## Experiment 5: Checkpoint Diffing

Compare two model checkpoints (Gemini 2.0 vs 2.5 Flash) using SAE-based analysis to find behavioral differences. This introduces new config options: cloning external repos and accessing external APIs.

### New concepts

#### Cloning external repos

```python
from src.environment import RepoConfig

config = SandboxConfig(
    repos=[RepoConfig(url="nickjiang2378/interp_embed")],
    # ...
)
```

The repo is cloned to `/workspace/interp_embed` in the sandbox. The agent can import from it.

#### External API access

```python
config = SandboxConfig(
    secrets=["GEMINI_API_KEY", "OPENAI_KEY", "OPENROUTER_API_KEY", "HF_TOKEN"],
    # ...
)
```

Secrets are Modal secrets you've configured. They're available as environment variables in the sandbox.

#### Longer timeout

```python
config = SandboxConfig(
    timeout=7200,  # 2 hours (default is 1 hour)
    # ...
)
```

SAE encoding is slow — this experiment can take 1-2 hours.

### What the agent does

1. Generate prompts designed to reveal behavioral differences
2. Collect responses from both Gemini versions via OpenRouter
3. Encode responses using an SAE (a Llama 3.1 8B SAE with 65k features)
4. Diff feature activations to find what changed between versions
5. Analyze the top differentiating features with examples

### Full example

```bash
cd experiments/checkpoint-diffing && python main.py
```

---

## Experiment 6: Petri-Style Harness

A hackable version of Petri for categorizing and finding weird behaviors in models. This shows how to build multi-agent auditing pipelines with Seer.
### Architecture

```
Phase 1: Audit

┌──────────┐      MCP tools      ┌─────────────────────┐
│ Auditor  │ ──────────────────► │ Scoped Sandbox      │
│ (Claude) │                     │                     │
│          │ ◄────responses───── │ Target (via API)    │
└──────────┘                     └─────────────────────┘

Phase 2: Judge

┌──────────┐
│  Judge   │ ◄── transcript retrieved from sandbox
│ (Claude) │
└──────────┘
      │
      ▼
   scores
```

1. **Auditor** probes the Target via MCP tools exposed from the sandbox
2. **Transcript** is retrieved after the audit completes
3. **Judge** scores the transcript on multiple dimensions

### New concepts

#### Scoped sandbox exposing MCP tools

Use `expose_as="mcp"` so the agent gets tools instead of importable functions:

```python
scoped = ScopedSandbox(SandboxConfig(
    gpu=None,  # No GPU — using the OpenRouter API
    python_packages=["openai"],
    secrets=["OPENROUTER_API_KEY"],
))
scoped.start()

mcp_config = scoped.serve(
    "conversation_interface.py",
    expose_as="mcp",
    name="petri_tools"
)
```

The Auditor sees tools like `send_message()` and `get_transcript()` in its tool list.

#### No GPU

The Target model runs via API, so no GPU is needed:

```python
SandboxConfig(gpu=None, ...)
```

#### Sequential agents

```python
# Phase 1: Auditor uses MCP tools to probe the Target
async for msg in run_agent(auditor_prompt, mcp_config=mcp_config):
    pass

# Phase 2: Retrieve the transcript
transcript = scoped.exec("cat /tmp/petri_transcript.txt")

# Phase 3: Judge scores it (simple API call, no tools)
judge_response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": build_judge_prompt(transcript)}],
)
```

### Conversation interface

`conversation_interface.py` exposes these MCP tools:

- `set_system_prompt(prompt)` — configure the Target's system prompt
- `send_message(content)` — send a user message to the Target
- `get_response()` — get the Target's last response
- `get_transcript()` — save and return the full conversation
- `reset_conversation()` — start over

### Full example

```bash
cd experiments/petri-style-harness && python main.py
```

---

# API Reference

## Environment API

### SandboxConfig

```python
SandboxConfig(
    gpu: str = None,                 # "A100", "H100", "A10G", or None for CPU
    gpu_count: int = 1,              # Number of GPUs
    execution_mode: ExecutionMode = ExecutionMode.CLI,
    models: list[ModelConfig] = [],
    repos: list[RepoConfig] = [],
    python_packages: list[str] = [],
    system_packages: list[str] = [],
    secrets: list[str] = [],         # Modal secret names
    timeout: int = 3600,             # Seconds (default 1 hour)
    local_files: list[tuple] = [],   # [(local_path, sandbox_path), ...]
    local_dirs: list[tuple] = [],    # [(local_path, sandbox_path), ...]
    env: dict[str, str] = {},        # Environment variables
    debug: bool = False,             # Enable VS Code in browser
)
```

### ModelConfig

```python
ModelConfig(
    name: str,                # HuggingFace model ID
    var_name: str = "model",  # Variable name in model info
    hidden: bool = False,     # Hide model name from agent
    is_peft: bool = False,    # Is a PEFT adapter
    base_model: str = None,   # Base model ID if PEFT
)
```

### RepoConfig

```python
RepoConfig(
    url: str,                # GitHub repo (e.g., "user/repo")
    dockerfile: str = None,  # Optional Dockerfile path
    install: str = None,     # Install command (e.g., "pip install -e .")
)
```

### ExecutionMode

```python
ExecutionMode.NOTEBOOK  # Jupyter notebook on GPU
ExecutionMode.CLI       # Shell interface
```

### Sandbox

```python
sandbox = Sandbox(config).start()
```

**Methods:**

- `start()` → Sandbox — provision GPU, download models, return running sandbox
- `terminate()` — shut down the sandbox
- `exec(cmd: str)` → str — execute a shell command
- `exec_python(code: str)` → str — execute Python code
- `write_file(path: str, content: str)` — write a file to the sandbox
- `ensure_dir(path: str)` — create a directory in the sandbox
- `snapshot(name: str)` — save sandbox state

**Properties:**

- `jupyter_url` — Jupyter URL (notebook mode)
- `code_server_url` — VS Code URL (debug mode)
- `model_handles` — list of ModelHandle for loaded models
- `repo_handles` — list of RepoHandle for cloned repos
- `sandbox_id` — Modal sandbox ID

### ScopedSandbox

```python
scoped = ScopedSandbox(config)
scoped.start()

lib = scoped.serve(
    "interface.py",
    expose_as="library",  # or "mcp"
    name="model_tools"
)
```

**Methods:**

- `start()` — provision the sandbox
- `serve(file, expose_as, name)` → Library | dict — serve a file as an RPC library or MCP tools
- `write_file(path, content)` — write a file to the sandbox
- `exec(cmd)` → str — execute a shell command
- `terminate()` — shut down the sandbox

**expose_as options:**

- `"library"` — returns a Library the agent imports
- `"mcp"` — returns an MCP config dict, agent sees tools

**Snapshots:**

```python
# Save state
snapshot = sandbox.snapshot("after setup")

# Restore later
new_sandbox = Sandbox.from_snapshot(snapshot, config)
```

---

## Workspace API

### Workspace

```python
Workspace(
    libraries: list[Library] = [],
    skills: list[Skill] = [],
    skill_dirs: list[str] = [],
    local_dirs: list[tuple] = [],       # [(src_path, dest_path), ...]
    local_files: list[tuple] = [],
    custom_init_code: str = None,
    preload_models: bool = True,        # Load models before agent starts
    hidden_model_loading: bool = True,  # Hide model loading from agent
)
```

**Methods:**

- `get_library_docs()` → str — combined docs for all libraries (for the agent prompt)

### Library

```python
# From a local file
lib = Library.from_file("helpers.py")

# From a code string
lib = Library.from_code("utils", "def foo(): ...")

# From a skill directory
lib = Library.from_skill_dir("skills/steering")

# From a ScopedSandbox (RPC)
lib = scoped.serve("interface.py", expose_as="library", name="tools")
```

**Methods:**

- `Library.from_file(path)` → Library
- `Library.from_code(name, code)` → Library
- `Library.from_skill_dir(path)` → Library
- `get_prompt_docs()` → str — documentation for the agent

### Skill

```python
# From a directory with SKILL.md
skill = Skill.from_dir("skills/steering")

# From a function with the @expose decorator
@expose
def extract_activation(...):
    ...

skill = Skill.from_function(extract_activation)
```

Skills are discovered by Claude Code and shown in the agent's skill list.
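For context, a minimal sketch of how these pieces are typically combined — the file path and prompt text below are illustrative:

```python
from pathlib import Path
from src.workspace import Workspace, Library

# Build a workspace from a toolkit helper and put its docs into the agent prompt
# (illustrative usage; steering_hook.py is the toolkit file shown earlier)
workspace = Workspace(libraries=[
    Library.from_file(Path("experiments/toolkit") / "steering_hook.py"),
])
prompt = f"{workspace.get_library_docs()}\n\nUse the steering helper to test the hypothesis."
```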
---

## Execution API

### create_notebook_session

```python
session = create_notebook_session(
    sandbox: Sandbox,
    workspace: Workspace,
    name: str = "notebook"
)
```

The agent gets a Jupyter notebook on the GPU.

**Returns NotebookSession:**

- `mcp_config` — pass to `run_agent`
- `jupyter_url` — view the notebook in a browser
- `model_info_text` — model details for the prompt
- `session_id` — unique identifier
- `workspace_path` — path to the workspace in the sandbox
- `exec(code)` — execute Python in the notebook
- `terminate()` — shut down the session

### create_local_session

```python
session = create_local_session(
    workspace: Workspace,
    workspace_dir: str,
    name: str = "local"
)
```

The agent runs locally. Use with a `ScopedSandbox` for GPU access via RPC.

**Returns LocalSession:**

- `mcp_config` — pass to `run_agent` (empty dict)
- `name` — session name
- `workspace_dir` — local workspace path

### create_local_notebook_session

```python
session = create_local_notebook_session(
    workspace: Workspace,
    name: str = "notebook",
    output_dir: str = "./outputs"
)
```

The agent gets a Jupyter notebook running locally (no Modal needed).

**Returns LocalNotebookSession:**

- `mcp_config` — pass to `run_agent`
- `jupyter_url` — view the notebook in a browser
- `notebook_path` — path to the saved notebook
- `workspace_path` — path to the workspace
- `exec(code)` — execute Python in the notebook
- `terminate()` — shut down the session

### create_cli_session

```python
session = create_cli_session(
    sandbox: Sandbox,
    workspace: Workspace,
    name: str = "cli"
)
```

The agent gets a shell interface to the sandbox.

**Returns CLISession:**

- `mcp_config` — pass to `run_agent`
- `session_id` — unique identifier
- `exec(code)` — execute Python in the sandbox
- `exec_shell(cmd)` — execute a shell command

---

## Harness API

### run_agent

```python
async for msg in run_agent(
    prompt: str,
    mcp_config: dict = {},
    provider: str = "claude",
    model: str = None,
    user_message: str = None,
):
    print(msg)
```

Runs the agent with a task prompt. Streams messages.

**Parameters:**

- `prompt` — system prompt / task description
- `mcp_config` — from the session (or an empty dict)
- `provider` — "claude" (default)
- `model` — specific model (optional, defaults to claude-sonnet-4-5-20250929)
- `user_message` — initial user message (optional)

**Example:**

```python
async for msg in run_agent(
    prompt="Explore this model's behavior",
    mcp_config=session.mcp_config,
    provider="claude"
):
    pass
```

### run_agent_interactive

```python
await run_agent_interactive(
    prompt: str = "",
    mcp_config: dict = {},
    provider: str = "claude",
    model: str = None,
    user_message: str = None,
)
```

Interactive chat session with the agent, for debugging or manual exploration. Press ESC to interrupt mid-response.

**Parameters:**

- `prompt` — optional system prompt
- `mcp_config` — from the session (or an empty dict)
- `provider` — "claude" (default)
- `model` — specific model (optional)
- `user_message` — initial message to start the conversation

---

# Additional Resources

- GitHub: https://github.com/ajobi-uhc/seer
- Example Notebooks: https://github.com/ajobi-uhc/seer/tree/main/example_runs
- Modal: https://modal.com
- Documentation: https://ajobi-uhc.github.io/seer/