# Aegis Experiments Archive

[English](README.md) · [繁體中文](README.zh-TW.md)

Empirical validation runs for the [Aegis](../README.md) judgment-free LLM-agent fact layer. Nine rounds of paired A/B comparisons across Anthropic Haiku / Sonnet, OpenAI GPT-5 / Codex / GPT-5.4-mini, and Google Gemini 2.5/3 Flash family on four task shapes. Each round runs the **same task twice with the same model**: variant **A** without the Aegis MCP server, variant **B** with it.

---

## Overview chart

```
Total round directories: 52  (26 paired × 2 variants)

Rounds by task shape:

         ┌────────────────────────────────────────────────────┐
Plan A   │██████████████    14 dirs (7 pairs)  ambiguous spec │
         ├────────────────────────────────────────────────────┤
Plan B   │████████████████  16 dirs (8 pairs)  brownfield     │
         ├────────────────────────────────────────────────────┤
Plan C   │████████████████  16 dirs (8 pairs)  multi-module   │
         ├────────────────────────────────────────────────────┤
Initial  │██████            6 dirs (3 pairs)   (Round 1-2)    │
         └────────────────────────────────────────────────────┘

Models tested: 11
  Haiku · Sonnet · GPT-5.2 · GPT-5.3-codex · GPT-5.4 (codex)
  GPT-5.4-mini · Gemini 2.5 Flash · 2.5 Flash-Lite
  3 Flash · 3.1 Flash-Lite · (Gemma family: skipped due to API errors)
```

---

## Chart 1 - Plan B brownfield: 0/3 → 3/3 fix rate

The headline result. Each starting `auth.py` has **3 planted SEC bugs**: `md5` password hash, timing-unsafe `==` compare, weak RNG for session token.

Cells show **bugs remaining** out of 3 (lower = better, 0 = all fixed). The right column shows how many bugs Aegis pointed out during the run; the agent then fixed those plus often more.

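
As a rough illustration of the three bug classes named above, the sketch below pairs each vulnerable pattern with one possible standard-library fix. It is not the fixture's actual `auth.py`: every function name is hypothetical, and the PBKDF2 parameters (16-byte salt, 210k iterations) simply mirror the GPT-5.2 result reported later in this section.

```python
import hashlib
import hmac
import os
import random
import secrets
from typing import Optional, Tuple

# --- Vulnerable patterns (hypothetical stand-ins for the planted bugs) ---

def weak_hash(password: str) -> str:
    # Bug class 1: md5 is far too fast and broken for password storage.
    return hashlib.md5(password.encode()).hexdigest()

def weak_compare(a: str, b: str) -> bool:
    # Bug class 2: `==` may return at the first mismatch, leaking timing.
    return a == b

def weak_token() -> str:
    # Bug class 3: `random` is a non-cryptographic PRNG.
    return "".join(random.choice("0123456789abcdef") for _ in range(32))

# --- One possible set of fixes, standard library only ---

def strong_hash(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    # PBKDF2-HMAC-SHA256 with a fresh 16-byte salt unless one is supplied.
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 210_000)
    return salt, digest

def safe_compare(a: bytes, b: bytes) -> bool:
    # Constant-time comparison of two digests.
    return hmac.compare_digest(a, b)

def strong_token() -> str:
    # 16 random bytes from the OS CSPRNG, hex-encoded to 32 characters.
    return secrets.token_hex(16)
```

Verifying a stored password under this sketch means re-deriving the digest with the stored salt, `strong_hash(candidate, salt)`, and passing both digests to `safe_compare`.
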
```
Model              │ A (no aegis)     │ B (with aegis)   │ Δ      │ aegis hits
                   │ bugs left /3     │ bugs left /3     │        │
───────────────────┼──────────────────┼──────────────────┼────────┼────────
Haiku              │  1  ▓░░             1  ▓░░              0       1*
Sonnet             │  3  ▓▓▓             1  ▓░░             +2       3
Gemini 2.5 Flash   │  3  ▓▓▓             3  ▓▓▓              0       2*
GPT-5.2            │  3  ▓▓▓             0  ░░░             +3       3
GPT-5.3-codex      │  3  ▓▓▓             0  ░░░             +3       3
GPT-5.4-mini       │  3  ▓▓▓             0  ░░░             +3       3
GPT-5.4-mini Go    │  3  ▓▓▓             2  ▓▓░             +1       2** (md5 missed)
GPT-5.4-mini Java  │  2  ▓▓░             0  ░░░             +2       1** (only SEC012 fired)
                   │
───────────────────┴──────────────────┴──────────────────┴────────┴────────

 *  = Plan B was a re-run after rules were tightened; the original Haiku run was on a slimmer rule set.
 ** = Aegis ran with a coverage gap (SEC009 multi-language, SEC010 Java enclosing-context).
      Both are now fixed in PRs #11 / #12.
```

**Two of the most striking observations:**

1. **GPT-5.4-mini Java**: Aegis only flagged 1 out of 3 bugs (SEC012 timing-unsafe), but the agent fixed all 3. The remaining md5 → SHA-256 and `new Random()` → `SecureRandom` were **self-driven** once Aegis put the agent into security-review mode. (The "one finding triggers the cascade" mechanism.)
2. **GPT-5.2**: Went well beyond the prompt, replacing md5 with OWASP-recommended **PBKDF2-HMAC-SHA256 + 16-byte salt + 210k iterations** instead of plain SHA-256. Aegis only said "md5 is weak"; the agent decided what to escalate to.

---

## Chart 2 - Plan C multi-module: anti-paralysis ritual

Plan C (add a `notifications` feature to a 5-module Python project) has **clean starting code**, with no planted bugs. Aegis can't show off its rule library here, but a third ROI mechanism surfaced anyway.

```
Model                  │ A (no aegis)         │ B (with aegis)
                       │                      │
codex (GPT-5.4)        │ ✓ 5 tests pass       │ ✓ 5 tests pass
Gemini 2.5 Flash       │ ✓ 5 tests pass       │ ✓ 5 tests pass
Gemini 2.5 Flash-Lite  │ ✓ 4 tests pass       │ ✓ 4 tests pass
Gemini 3.1 Flash-Lite  │ ✗ task abandoned     │ ✗ task abandoned   ← preview-mode planning loop, both stuck
Gemini 3 Flash         │ ✓ 5 tests pass       │ ✓ 5 tests pass
GPT-5.2                │ ✓ 5 tests pass       │ ✓ 5 tests pass
GPT-5.3-codex          │ ✓ 4 tests pass       │ ✓ 4 tests pass
GPT-5.4-mini           │ ✗ task abandoned     │ ✓ 5 tests pass     ← THIS PAIR is the finding
                       │  (no notifications.py,
                       │   no tests.py;
                       │   24k tokens spent
                       │   on design proposals)

Cycle introductions    16/16 = 0   (clean architecture is self-stabilizing)
Public symbol breaks   16/16 = 0   (no agent removed an existing public name)
```

**Key data point**: **GPT-5.4-mini's A variant abandoned the task.** It spent 24,051 tokens describing two design alternatives and asking "approve this design?" without ever writing code. The B variant of the **same model** completed the task because the prompt's `REQUIRED workflow: run aegis_validate.py after every .py file you write` made file-writing mandatory. Even though Aegis surfaced 0 security findings on the (clean) starting code, the **ritual itself** kept the weak model action-oriented.

---

## Chart 3 - The three Aegis ROI mechanisms

Discovered empirically across the 9 rounds. Only mechanism 1 was the designer's stated intent.

```mermaid
flowchart TD
    Trigger[Agent edits a file] --> MCP[validate_file MCP call]
    MCP --> Findings{Findings emitted?}

    Findings -->|Security: SEC009/010/012/etc.| ROI1[Mechanism 1: rule-hit → fix<br/>Plan B: 0/3 → 3/3 across 3 models]
    Findings -->|Workspace: cycle / symbol_removed| ROI2[Mechanism 2: structural guardrail<br/>0/14 hits - dead weight on clean code<br/>but would catch real cycles]
    Findings -->|nothing - empty result| ROI3[Mechanism 3: anti-paralysis ritual<br/>Forced write-then-validate cycle<br/>prevents weak-model planning loops]

    ROI1 --> Cascade[Often triggers cascade:<br/>1 finding → agent rewrites whole file<br/>g52: md5 → PBKDF2+salt+210k iter]
    ROI3 --> Saved[Plan C: g54mini-mc-a abandoned<br/>g54mini-mc-b completed same task]

    style ROI1 fill:#d4edda,stroke:#155724,color:#000
    style ROI2 fill:#fff3cd,stroke:#856404,color:#000
    style ROI3 fill:#d1ecf1,stroke:#0c5460,color:#000
```

| Mechanism | Trigger | Evidence | Designed in? |
|:---:|---|---|:---:|
| **1. Rule-hit → fix** | brownfield + planted SEC bug | Plan B 3/3 models 0/3 → 3/3 | ✅ |
| **2. Structural guardrail** | cycle / public_symbol_removed | 0/14 hits (clean code = silent) | ✅ |
| **3. Anti-paralysis ritual** | weak model + any task | Plan C g54mini A abandoned vs B completed | ❌ emergent |

---

## Chart 4 - Direct lineage: experiment finding → Aegis code change

Every recent SEC PR has a specific experiment trigger. The dogfooding loop in action:

```
Round 8 codex Plan A
   │
   ▼  FP discovered: SEC010 fires on `secrets.choice` (the SECURE choice)
   │     (agent spent a turn "fixing" already-secure code)
   │
   ▼─────────────────────────► PR #9: secrets./os.urandom/crypto. allowlist
                                      +2 regression tests

Round 9 Go brownfield
   │
   ▼  FN discovered: SEC009 doesn't fire on Go `md5.Sum(...)`
   │     (agent kept md5 because aegis didn't surface it)
   │
   ▼─────────────────────────► PR #12: SEC009 language-aware dispatch
                                       +8 multi-language tests
                                       + enclosing_security_context function-name check

Round 9 Java brownfield
   │
   ▼  FN discovered: SEC010 inner-block `break` hides
   │     `int idx = new Random().nextInt(...)` inside
   │     `generateSessionToken()`
   │
   ▼─────────────────────────► PR #11: enclosing_token_context walks past inner blocks
                                       + reads function name
                                       +3 regression tests
```

| Round | Discovered | Fixed in |
|---|---|---|
| Round 8 codex Plan A | SEC010 false-positive on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hides production case | PR #11 |
| Plan A 32+ runs | SEC010 needles too narrow (URL shorteners) | PR #6 |
| Plan A entropy bypass | SEC002 misses placeholder-shaped strings | PR #6 |
| Plan B 6 runs | "What aegis is NOT" missing in README | PR #6 |

PRs #6 through #12 (the post-experiment SEC coverage and B-class rule batches) all trace back to specific findings in this archive.

---

## Files

- [`comparison-report.md`](comparison-report.md): the 1199-line rolling Round 1 → Round 9 analysis
- `starting-code/`: Plan B Python brownfield fixture (3 planted SEC bugs)
- `starting-go/`: Round 9 Go brownfield fixture
- `starting-java/`: Round 9 Java brownfield fixture
- `starting-multi/`: Plan C 5-module fixture
- `prompt-*.txt`: the prompts handed to each agent. `-a.txt` is the no-aegis variant; `-b.txt` adds the `REQUIRED workflow: run aegis_validate.py after every write` ritual instruction
- `aegis_validate.py`: Python wrapper around `aegis-mcp` stdio JSON-RPC. The agents run it after every file write
- `eval_round_*.sh`: analysis scripts
- `<model>-<task>-<variant>/`: one directory per agent run. Naming convention:
  - models: `haiku` / `sonnet` / `flash` (Gemini 2.5) / `25flash` / `25fl` (Gemini 2.5 Flash-Lite) / `3flash` / `31flashlite` (Gemini 3.1 Flash-Lite Preview) / `codex` (GPT-5.4) / `g52` (GPT-5.2) / `g53codex` (GPT-5.3-codex) / `g54mini` (GPT-5.4-mini)
  - tasks: `amb` (Plan A) / `bf` (Plan B Python) / `mc` (Plan C multi-module) / `bf-go` (Round 9 Go) / `bf-java` (Round 9 Java)
  - variants: `a` (no Aegis) / `b` (with Aegis MCP)

## Reproducing a run

```bash
# 1. Build aegis-mcp (from repo root)
cargo install --path crates/aegis-mcp --force

# 2. Set up a paired run (from this directory)
mkdir my-round-a my-round-b
cp starting-code/*.py my-round-a/
cp starting-code/*.py my-round-b/

# 3. Hand each variant to your agent CLI of choice
cd my-round-a && claude code < ../prompt-bf-a.txt      # no-aegis
cd ../my-round-b && claude code < ../prompt-bf-b.txt   # with-aegis

# 4. Compare against planted bugs
cd .. && bash eval_round_8.sh
```

The exact commands used for codex-driven rounds appear at the top of each round's `run.log`.
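
When post-processing a batch of runs, the run-directory naming convention listed under Files can be parsed mechanically. The sketch below is illustrative and not part of this archive. It assumes names join model, task, and variant with hyphens, as in the `g54mini-mc-a` directory named in Chart 3. `RunDir`, `parse_run_dir`, and `pair_runs` are hypothetical helper names; the model and task codes are copied from the convention list.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

# Codes copied from the naming convention in the Files section.
MODELS = {"haiku", "sonnet", "flash", "25flash", "25fl", "3flash",
          "31flashlite", "codex", "g52", "g53codex", "g54mini"}
TASKS = {"amb", "bf", "mc", "bf-go", "bf-java"}
VARIANTS = {"a", "b"}

class RunDir(NamedTuple):
    model: str
    task: str
    variant: str

def parse_run_dir(name: str) -> Optional[RunDir]:
    """Split 'model-task-variant' into its parts; None if it does not match."""
    # Model codes contain no hyphen, so split once from the left.
    model, sep, rest = name.rstrip("/").partition("-")
    if not sep or model not in MODELS:
        return None
    # Task codes may contain a hyphen (bf-go), so split the variant from the right.
    task, sep, variant = rest.rpartition("-")
    if not sep or task not in TASKS or variant not in VARIANTS:
        return None
    return RunDir(model, task, variant)

def pair_runs(names: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Group directory names into {(model, task): {variant: name}}."""
    pairs: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for name in names:
        run = parse_run_dir(name)
        if run is not None:  # skip fixtures, prompts, and other files
            pairs[(run.model, run.task)][run.variant] = name
    return dict(pairs)
```

Feeding `os.listdir(".")` to `pair_runs` would then expose any model/task pair that is missing its `a` or `b` half. Non-run entries such as `starting-code` or `comparison-report.md` are ignored because they fail the model or task lookup.
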