# Aegis Experiments Archive
[English](README.md) · [繁體中文](README.zh-TW.md)
Empirical validation runs for the [Aegis](../README.md) judgment-free
LLM-agent fact layer. Nine rounds of paired A/B comparisons across
Anthropic Haiku / Sonnet, OpenAI GPT-5 / Codex / GPT-5.4-mini, and
Google Gemini 2.5/3 Flash family on four task shapes.
Each round runs the **same task twice with the same model**: variant
**A** without the Aegis MCP server, variant **B** with it.
---
## Overview chart
```
Total round directories: 52 (26 pairs × 2 variants)
Rounds by task shape:
┌──────────────────────────────┐
Plan A │██████████████ 14 dirs (7 pairs)
ambiguous spec │
├──────────────────────────────┤
Plan B │████████████████ 16 dirs (8 pairs)
brownfield │
├──────────────────────────────┤
Plan C │████████████████ 16 dirs (8 pairs)
multi-module │
├──────────────────────────────┤
Initial │██████ 6 dirs (3 pairs)
(Round 1-2) │
└──────────────────────────────┘
Models tested: 11
Haiku · Sonnet · GPT-5.2 · GPT-5.3-codex · GPT-5.4 (codex)
GPT-5.4-mini · Gemini 2.5 Flash · 2.5 Flash-Lite
3 Flash · 3.1 Flash-Lite · (Gemma family — skipped due to API errors)
```
---
## Chart 1 — Plan B brownfield: 0/3 → 3/3 fix rate
The headline result. Each starting `auth.py` has **3 planted SEC bugs**:
`md5` password hash, timing-unsafe `==` compare, weak RNG for session
token. Cells show **bugs remaining** out of 3 (lower = better, 0 = all
fixed). The right column shows how many bugs Aegis pointed out during
the run; the agent then fixed those plus often more.
```
Model              │ A (no aegis)     │ B (with aegis)   │ Δ      │ aegis hits
                   │ bugs left /3     │ bugs left /3     │        │
───────────────────┼──────────────────┼──────────────────┼────────┼────────
Haiku              │ 1 ▓░░            │ 1 ▓░░            │  0     │ 1*
Sonnet             │ 3 ▓▓▓            │ 1 ▓░░            │ +2     │ 3
Gemini 2.5 Flash   │ 3 ▓▓▓            │ 3 ▓▓▓            │  0     │ 2*
GPT-5.2            │ 3 ▓▓▓            │ 0 ░░░            │ +3     │ 3
GPT-5.3-codex      │ 3 ▓▓▓            │ 0 ░░░            │ +3     │ 3
GPT-5.4-mini       │ 3 ▓▓▓            │ 0 ░░░            │ +3     │ 3
GPT-5.4-mini Go    │ 3 ▓▓▓            │ 2 ▓▓░            │ +1     │ 2** (md5 missed)
GPT-5.4-mini Java  │ 2 ▓▓░            │ 0 ░░░            │ +2     │ 1** (only SEC012 fired)
───────────────────┴──────────────────┴──────────────────┴────────┴────────
* = Plan B was a re-run after rules were tightened — original Haiku
run was on a slimmer rule set.
** = Aegis ran with a coverage gap (SEC009 multi-language, SEC010
Java enclosing-context). Both are now fixed in PRs #11 / #12.
```
**Two of the most striking observations:**
1. **GPT-5.4-mini Java** — Aegis only flagged 1 out of 3 bugs (SEC012
timing-unsafe), but the agent fixed all 3. The remaining md5 →
SHA-256 and `new Random()` → `SecureRandom` were **self-driven**
once Aegis put the agent into security-review mode. (The "one
finding triggers the cascade" mechanism.)
2. **GPT-5.2** — Went well beyond the prompt: replaced md5 with
OWASP-recommended **PBKDF2-HMAC-SHA256 + 16-byte salt + 210k
iterations** instead of plain SHA-256. Aegis only said "md5 is
weak"; the agent decided what to escalate to.
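
The three fix classes discussed above map onto standard-library calls in Python. The sketch below is illustrative only and is not taken from any agent's output: the function names `hash_password`, `verify_password`, and `new_session_token` are hypothetical, and the salt size and iteration count simply mirror the numbers reported for the GPT-5.2 run.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

PBKDF2_ITERATIONS = 210_000  # iteration count reported for the GPT-5.2 run
SALT_BYTES = 16              # salt size reported for the GPT-5.2 run


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a PBKDF2-HMAC-SHA256 digest; returns (salt, digest)."""
    if salt is None:
        salt = secrets.token_bytes(SALT_BYTES)  # CSPRNG-backed salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Compare digests in constant time instead of with a timing-unsafe `==`."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, expected)


def new_session_token() -> str:
    """Draw the session token from `secrets` rather than a weak RNG."""
    return secrets.token_urlsafe(32)
```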
---
## Chart 2 — Plan C multi-module: anti-paralysis ritual
Plan C (add a `notifications` feature to a 5-module Python project)
has **clean starting code** — no planted bugs. Aegis can't show off
its rule library here, but a third ROI mechanism surfaced anyway.
```
Model                  │ A (no aegis)          │ B (with aegis)
                       │                       │
codex (GPT-5.4)        │ ✓ 5 tests pass        │ ✓ 5 tests pass
Gemini 2.5 Flash       │ ✓ 5 tests pass        │ ✓ 5 tests pass
Gemini 2.5 Flash-Lite  │ ✓ 4 tests pass        │ ✓ 4 tests pass
Gemini 3.1 Flash-Lite  │ ✗ task abandoned      │ ✗ task abandoned  ← preview-mode planning loop, both stuck
Gemini 3 Flash         │ ✓ 5 tests pass        │ ✓ 5 tests pass
GPT-5.2                │ ✓ 5 tests pass        │ ✓ 5 tests pass
GPT-5.3-codex          │ ✓ 4 tests pass        │ ✓ 4 tests pass
GPT-5.4-mini           │ ✗ task abandoned      │ ✓ 5 tests pass    ← THIS PAIR is the finding
                       │ (no notifications.py  │
                       │  no tests.py;         │
                       │  24k tokens spent     │
                       │  on design proposals) │

Cycle introductions    16/16 = 0  (clean architecture is self-stabilizing)
Public symbol breaks   16/16 = 0  (no agent removed an existing public name)
```
**Key data point**: **GPT-5.4-mini's A variant abandoned the task.** It
spent 24,051 tokens describing two design alternatives and asking
"approve this design?" without ever writing code. The B variant of
the **same model** completed the task because the prompt's `REQUIRED
workflow: run aegis_validate.py after every .py file you write`
instruction made file-writing mandatory. Even though Aegis surfaced 0
security findings on the (clean) starting code, the **ritual itself**
kept the weak model action-oriented.
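
The write-then-validate ritual can be pictured as a loop in which every file write is immediately followed by a validation call, whether or not the validator has anything to report. The sketch below is hypothetical: `write_then_validate` and the `validator` callable stand in for the agent's write step and for `aegis_validate.py`, whose real interface is not shown in this README.

```python
from pathlib import Path
from typing import Callable, Dict, List


def write_then_validate(
    files: Dict[str, str],
    workdir: Path,
    validator: Callable[[Path], List[str]],
) -> Dict[str, List[str]]:
    """Write each file, validate it right away, and return findings per file."""
    findings: Dict[str, List[str]] = {}
    for name, source in files.items():
        target = workdir / name
        # The write always happens first; there is no "propose only" branch.
        target.write_text(source, encoding="utf-8")
        # The validator may return an empty list, as on the clean Plan C code.
        findings[name] = validator(target)
    return findings
```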
---
## Chart 3 — The three Aegis ROI mechanisms
Discovered empirically across the 9 rounds. Only mechanism 1 was the
designer's stated intent.
```mermaid
flowchart TD
    Trigger["Agent edits a file"] --> MCP["validate_file MCP call"]
    MCP --> Findings{"Findings emitted?"}
    Findings -->|"Security: SEC009/010/012/etc."| ROI1["Mechanism 1: rule-hit → fix<br/>Plan B: 0/3 → 3/3 across 3 models"]
    Findings -->|"Workspace: cycle / symbol_removed"| ROI2["Mechanism 2: structural guardrail<br/>0/14 hits, dead weight on clean code<br/>but would catch real cycles"]
    Findings -->|"nothing (empty result)"| ROI3["Mechanism 3: anti-paralysis ritual<br/>Forced write-then-validate cycle<br/>prevents weak-model planning loops"]
    ROI1 --> Cascade["Often triggers cascade:<br/>1 finding → agent rewrites whole file<br/>g52: md5 → PBKDF2+salt+210k iter"]
    ROI3 --> Saved["Plan C: g54mini-mc-a abandoned<br/>g54mini-mc-b completed same task"]
    style ROI1 fill:#d4edda,stroke:#155724,color:#000
    style ROI2 fill:#fff3cd,stroke:#856404,color:#000
    style ROI3 fill:#d1ecf1,stroke:#0c5460,color:#000
```
| Mechanism | Trigger | Evidence | Designed in? |
|:---:|---|---|:---:|
| **1. Rule-hit → fix** | brownfield + planted SEC bug | Plan B 3/3 models 0/3 → 3/3 | ✅ |
| **2. Structural guardrail** | cycle / public_symbol_removed | 0/14 hits (clean code = silent) | ✅ |
| **3. Anti-paralysis ritual** | weak model + any task | Plan C g54mini A abandoned vs B completed | ❌ emergent |
---
## Chart 4 — Direct lineage: experiment finding → Aegis code change
Every recent SEC PR has a specific experiment trigger. The
dogfooding loop in action:
```
Round 8 codex Plan A
  │
  ▼ FP discovered: SEC010 fires on `secrets.choice` (the SECURE choice);
  │ the agent spent a turn "fixing" already-secure code
  │
  ▼─────────────────────────► PR #9: secrets./os.urandom/crypto. allowlist
                              +2 regression tests

Round 9 Go brownfield
  │
  ▼ FN discovered: SEC009 doesn't fire on Go `md5.Sum(...)`;
  │ the agent kept md5 because aegis didn't surface it
  │
  ▼─────────────────────────► PR #12: SEC009 language-aware dispatch
                              +8 multi-language tests
                              + enclosing_security_context
                                function-name check

Round 9 Java brownfield
  │
  ▼ FN discovered: SEC010 inner-block `break` hides
  │ `int idx = new Random().nextInt(...)` inside
  │ `generateSessionToken()`
  │
  ▼─────────────────────────► PR #11: enclosing_token_context
                              walks past inner blocks +
                              reads function name
                              +3 regression tests
```
| Round | Discovered | Fixed in |
|---|---|---|
| Round 8 codex Plan A | SEC010 false-positive on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hides production case | PR #11 |
| Plan A 32+ runs | SEC010 needles too narrow (URL shorteners) | PR #6 |
| Plan A entropy bypass | SEC002 misses placeholder-shaped strings | PR #6 |
| Plan B 6 runs | "What aegis is NOT" missing in README | PR #6 |
PRs #6 through #12 (the post-experiment SEC coverage and B-class rule
batches) all trace back to specific findings in this archive.
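
The Round 8 false positive and its allowlist fix can be illustrated with a small sketch. This is not Aegis's rule code, which is not shown in this README: the regex, the `WEAK_NAMES` set, and the function `flags_weak_rng` are hypothetical stand-ins, and only the three allowlisted prefixes come from the PR #9 description above.

```python
import re

# Hypothetical needle: a dotted name followed by an opening parenthesis.
CALL = re.compile(r"([A-Za-z_][\w.]*)\s*\(")
# Hypothetical set of function names treated as weak-RNG calls.
WEAK_NAMES = {"choice", "randint", "random", "randrange"}
# Prefixes named in PR #9 as already-secure sources.
SECURE_PREFIXES = ("secrets.", "os.urandom", "crypto.")


def flags_weak_rng(line: str) -> bool:
    """Return True if the line calls a weak RNG that is not allowlisted."""
    for match in CALL.finditer(line):
        dotted = match.group(1)  # e.g. "random.choice"
        if dotted.startswith(SECURE_PREFIXES):
            continue  # allowlisted: secure source, do not report
        if dotted.rsplit(".", 1)[-1] in WEAK_NAMES:
            return True
    return False
```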
---
## Files
- [`comparison-report.md`](comparison-report.md) — the 1199-line
rolling Round 1 → Round 9 analysis
- `starting-code/` — Plan B Python brownfield fixture (3 planted SEC bugs)
- `starting-go/` — Round 9 Go brownfield fixture
- `starting-java/` — Round 9 Java brownfield fixture
- `starting-multi/` — Plan C 5-module fixture
- `prompt-*.txt` — the prompts handed to each agent. `-a.txt` is the
no-aegis variant; `-b.txt` adds the `REQUIRED workflow: run
aegis_validate.py after every write` ritual instruction
- `aegis_validate.py` — Python wrapper around `aegis-mcp` stdio
JSON-RPC. The agents run it after every file write
- `eval_round_*.sh` — analysis scripts
- `<model>-<task>-<variant>/`: one directory per agent run (for example `g54mini-mc-a/`).
Naming convention:
- models: `haiku` / `sonnet` / `flash` (Gemini 2.5) /
`25flash` / `25fl` (Gemini 2.5 Flash-Lite) / `3flash` /
`31flashlite` (Gemini 3.1 Flash-Lite Preview) / `codex`
(GPT-5.4) / `g52` (GPT-5.2) / `g53codex` (GPT-5.3-codex) /
`g54mini` (GPT-5.4-mini)
- tasks: `amb` (Plan A) / `bf` (Plan B Python) / `mc` (Plan C
multi-module) / `bf-go` (Round 9 Go) / `bf-java` (Round 9 Java)
- variants: `a` (no Aegis) / `b` (with Aegis MCP)
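
A short parser makes it easier to pair A and B directories when analysing runs. The sketch below assumes names shaped like `g54mini-mc-a` (model, then task, then variant, joined by hyphens) and assumes that the Go and Java rounds embed `bf-go` / `bf-java` as the task segment. The helper names `parse_run_dir` and `pair_runs` are hypothetical and are not part of the archive's scripts.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple

# Task codes from the naming convention, longest first so that
# "bf-java" / "bf-go" are tried before the shorter "bf".
TASKS = ("bf-java", "bf-go", "amb", "bf", "mc")
VARIANTS = ("a", "b")


class RunDir(NamedTuple):
    model: str
    task: str
    variant: str


def parse_run_dir(name: str) -> RunDir:
    """Split e.g. 'g54mini-mc-a' into (model, task, variant)."""
    stem, _, variant = name.rstrip("/").rpartition("-")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant in {name!r}")
    for task in TASKS:
        suffix = "-" + task
        if stem.endswith(suffix):
            return RunDir(stem[: -len(suffix)], task, variant)
    raise ValueError(f"unknown task in {name!r}")


def pair_runs(names: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Group directory names into {(model, task): {variant: name}}."""
    pairs: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for name in names:
        run = parse_run_dir(name)
        pairs[(run.model, run.task)][run.variant] = name
    return dict(pairs)
```

A pair is complete when both the `a` and `b` keys are present for a given `(model, task)` entry.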
## Reproducing a run
```bash
# 1. Build aegis-mcp (from repo root)
cargo install --path crates/aegis-mcp --force
# 2. Set up a paired run
mkdir my-round-a my-round-b
cp starting-code/*.py my-round-a/
cp starting-code/*.py my-round-b/
# 3. Hand each variant to your agent CLI of choice
cd my-round-a && claude code < ../prompt-bf-a.txt # no-aegis
cd ../my-round-b && claude code < ../prompt-bf-b.txt # with-aegis
# 4. Compare against planted bugs
cd .. && bash eval_round_8.sh
```
The exact commands used for codex-driven rounds appear at the top
of each round's `run.log`.
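
For readers who want to see what a wrapper like `aegis_validate.py` has to send, the sketch below builds a single MCP `tools/call` request as one line of JSON-RPC 2.0, the framing used by MCP's stdio transport. It is an assumption-laden illustration: the tool name `validate_file` comes from the Chart 3 diagram, but the `path` argument key and the helper name `build_validate_request` are hypothetical, and the `initialize` handshake that an MCP session requires before any tool call is omitted.

```python
import json
from typing import Any, Dict


def build_validate_request(path: str, request_id: int = 1) -> str:
    """Serialize one JSON-RPC 2.0 `tools/call` request as a single line."""
    request: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "validate_file",       # tool name as shown in Chart 3
            "arguments": {"path": path},   # argument key is an assumption
        },
    }
    # The stdio transport delimits messages with newlines, so the payload
    # must not contain embedded newlines; json.dumps emits none by default.
    return json.dumps(request) + "\n"
```

A real wrapper would write this line to the `aegis-mcp` process's stdin and read the matching response (same `id`) from its stdout.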