# Aegis Experiments Archive

[English](README.md) · [繁體中文](README.zh-TW.md)

Empirical validation runs for the [Aegis](../README.md) judgment-free
LLM-agent fact layer. The archive covers nine rounds of paired A/B
comparisons across Anthropic Haiku / Sonnet, OpenAI GPT-5.2 /
GPT-5.3-codex / GPT-5.4 (codex) / GPT-5.4-mini, and the Google
Gemini 2.5 / 3 Flash family, on four task shapes.

Each round runs the **same task twice with the same model**: variant
**A** without the Aegis MCP server, variant **B** with it.

---

## Overview chart

```
Total round directories: 52  (26 paired × 2 variants)

Rounds by task shape:
                  ┌──────────────────────────────┐
  Plan A          │██████████████ 14 dirs (7 pairs)
  ambiguous spec  │
                  ├──────────────────────────────┤
  Plan B          │████████████████ 16 dirs (8 pairs)
  brownfield      │
                  ├──────────────────────────────┤
  Plan C          │████████████████ 16 dirs (8 pairs)
  multi-module    │
                  ├──────────────────────────────┤
  Initial         │██████ 6 dirs (3 pairs)
  (Round 1-2)     │
                  └──────────────────────────────┘

Models tested: 11
  Haiku · Sonnet · GPT-5.2 · GPT-5.3-codex · GPT-5.4 (codex)
  GPT-5.4-mini · Gemini 2.5 Flash · 2.5 Flash-Lite
  3 Flash · 3.1 Flash-Lite · (Gemma family — skipped due to API errors)
```

---

## Chart 1 — Plan B brownfield: 0/3 → 3/3 fix rate

The headline result. Each starting `auth.py` has **3 planted SEC bugs**:
an `md5` password hash, a timing-unsafe `==` compare, and a weak RNG for
the session token. Cells show **bugs remaining** out of 3 (lower =
better, 0 = all fixed). Δ is bugs left in A minus bugs left in B, i.e.
the extra bugs fixed with Aegis. The right column shows how many of the
planted bugs Aegis flagged during the run; the agent usually fixed
those, and sometimes more.

```
Model              │ A (no aegis)     │ B (with aegis)   │ Δ      │ aegis hits
                   │ bugs left  /3    │ bugs left  /3    │        │
───────────────────┼──────────────────┼──────────────────┼────────┼────────
Haiku              │ 1 ▓░░             1 ▓░░              0       1*
Sonnet             │ 3 ▓▓▓             1 ▓░░             +2       3
Gemini 2.5 Flash   │ 3 ▓▓▓             3 ▓▓▓              0       2*
GPT-5.2            │ 3 ▓▓▓             0 ░░░             +3       3
GPT-5.3-codex      │ 3 ▓▓▓             0 ░░░             +3       3
GPT-5.4-mini       │ 3 ▓▓▓             0 ░░░             +3       3
GPT-5.4-mini Go    │ 3 ▓▓▓             2 ▓▓░             +1       2** (md5 missed)
GPT-5.4-mini Java  │ 2 ▓▓░             0 ░░░             +2       1** (only SEC012 fired)
                   │                                              
───────────────────┴──────────────────┴──────────────────┴────────┴────────

  *  = Plan B was a re-run after rules were tightened — original Haiku
       run was on a slimmer rule set.
  ** = Aegis ran with a coverage gap (SEC009 multi-language, SEC010
       Java enclosing-context). Both are now fixed in PRs #11 / #12.
```

**Two of the most striking observations:**

1. **GPT-5.4-mini Java** — Aegis only flagged 1 out of 3 bugs (SEC012
   timing-unsafe), but the agent fixed all 3. The remaining md5 →
   SHA-256 and `new Random()` → `SecureRandom` were **self-driven**
   once Aegis put the agent into security-review mode. (The "one
   finding triggers the cascade" mechanism.)
2. **GPT-5.2** — Went well beyond the prompt: replaced md5 with
   OWASP-recommended **PBKDF2-HMAC-SHA256 + 16-byte salt + 210k
   iterations** instead of plain SHA-256. Aegis only said "md5 is
   weak"; the agent decided what to escalate to.
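
The three fixes discussed above map onto Python standard-library calls roughly as follows. This is a minimal sketch, not the fixture's or any agent's actual code: the function names are hypothetical, and only the PBKDF2 parameters (SHA-256, 16-byte salt, 210k iterations) come from the GPT-5.2 observation.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

PBKDF2_ITERATIONS = 210_000  # iteration count reported for the GPT-5.2 run
SALT_BYTES = 16              # salt size reported for the GPT-5.2 run


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Salted PBKDF2-HMAC-SHA256 instead of md5; returns (salt, digest)."""
    if salt is None:
        salt = secrets.token_bytes(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Constant-time digest comparison instead of `==`."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, expected)


def new_session_token() -> str:
    """Session token drawn from the OS CSPRNG instead of `random`."""
    return secrets.token_urlsafe(32)
```

`hmac.compare_digest` addresses the timing-unsafe compare, and the `secrets` module addresses the weak session-token RNG.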

---

## Chart 2 — Plan C multi-module: anti-paralysis ritual

Plan C (add a `notifications` feature to a 5-module Python project)
has **clean starting code** — no planted bugs. Aegis can't show off
its rule library here, but a third ROI mechanism surfaced anyway.

```
Model                  │ A (no aegis)       │ B (with aegis)
                       │                    │
codex (GPT-5.4)        │ ✓ 5 tests pass     │ ✓ 5 tests pass
Gemini 2.5 Flash       │ ✓ 5 tests pass     │ ✓ 5 tests pass
Gemini 2.5 Flash-Lite  │ ✓ 4 tests pass     │ ✓ 4 tests pass
Gemini 3.1 Flash-Lite  │ ✗ task abandoned   │ ✗ task abandoned   ← preview-mode planning loop, both stuck
Gemini 3 Flash         │ ✓ 5 tests pass     │ ✓ 5 tests pass
GPT-5.2                │ ✓ 5 tests pass     │ ✓ 5 tests pass
GPT-5.3-codex          │ ✓ 4 tests pass     │ ✓ 4 tests pass
GPT-5.4-mini           │ ✗ task abandoned   │ ✓ 5 tests pass     ← THIS PAIR is the finding
                       │ (no notifications.py    
                       │  no tests.py;          
                       │  24k tokens spent      
                       │  on design proposals)  
                       
Cycle introductions   0 across 16/16 runs  (clean architecture is self-stabilizing)
Public symbol breaks  0 across 16/16 runs  (no agent removed an existing public name)
```

**Key data point**: **GPT-5.4-mini's A variant abandoned the task**
— spent 24,051 tokens describing two design alternatives and asking
"approve this design?" without ever writing code. The B variant of
the **same model** completed the task because the prompt's `REQUIRED
workflow: run aegis_validate.py after every .py file you write` made
file-writing mandatory. Even though Aegis surfaced 0 security
findings on the (clean) starting code, the **ritual itself** kept
the weak model action-oriented.

---

## Chart 3 — The three Aegis ROI mechanisms

Observed empirically across the 9 rounds. Mechanisms 1 and 2 were
designed in; mechanism 3 was emergent.

```mermaid
flowchart TD
    Trigger[Agent edits a file] --> MCP[validate_file MCP call]
    MCP --> Findings{Findings emitted?}

    Findings -->|Security: SEC009/010/012/etc.| ROI1[Mechanism 1: rule-hit → fix<br/>Plan B: 0/3 → 3/3 across 3 models]
    Findings -->|Workspace: cycle / symbol_removed| ROI2[Mechanism 2: structural guardrail<br/>0/14 hits — dead weight on clean code<br/>but would catch real cycles]
    Findings -->|nothing — empty result| ROI3[Mechanism 3: anti-paralysis ritual<br/>Forced write-then-validate cycle<br/>prevents weak-model planning loops]

    ROI1 --> Cascade[Often triggers cascade:<br/>1 finding → agent rewrites whole file<br/>g52: md5 → PBKDF2+salt+210k iter]
    ROI3 --> Saved[Plan C: g54mini-mc-a abandoned<br/>g54mini-mc-b completed same task]

    style ROI1 fill:#d4edda,stroke:#155724,color:#000
    style ROI2 fill:#fff3cd,stroke:#856404,color:#000
    style ROI3 fill:#d1ecf1,stroke:#0c5460,color:#000
```

| Mechanism | Trigger | Evidence | Designed in? |
|:---:|---|---|:---:|
| **1. Rule-hit → fix** | brownfield + planted SEC bug | Plan B: 3 models went 0/3 → 3/3 fixed | ✅ |
| **2. Structural guardrail** | cycle / public_symbol_removed | 0/14 hits (clean code = silent) | ✅ |
| **3. Anti-paralysis ritual** | weak model + any task | Plan C g54mini A abandoned vs B completed | ❌ emergent |
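
For orientation, the `validate_file` MCP call at the top of the flowchart is a JSON-RPC 2.0 `tools/call` request sent over stdio, one JSON message per line. The sketch below only builds such a request and reads a response. It assumes a `path` argument and text content blocks in the reply, neither of which is confirmed by this archive, and it omits the `initialize` handshake that MCP requires before any tool call. The helper names are hypothetical; `aegis_validate.py` is the wrapper actually used in the runs.

```python
import json
from typing import Any, Dict, List


def build_validate_request(request_id: int, path: str) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "validate_file",
            # Assumption: the real tool's argument schema may differ.
            "arguments": {"path": path},
        },
    }
    return json.dumps(message) + "\n"


def extract_text_blocks(response_line: str) -> List[str]:
    """Return the text content blocks from a `tools/call` response line."""
    response: Dict[str, Any] = json.loads(response_line)
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "JSON-RPC error"))
    content = response.get("result", {}).get("content", [])
    return [block["text"] for block in content if block.get("type") == "text"]
```

An empty list of findings in the reply corresponds to the "nothing, empty result" branch in the flowchart.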

---

## Chart 4 — Direct lineage: experiment finding → Aegis code change

Every recent SEC PR has a specific experiment trigger. The
dogfooding loop in action:

```
Round 8 codex Plan A
     │
     ▼ FP discovered: SEC010 fires on `secrets.choice` (the SECURE choice)
     │ — agent spent a turn "fixing" already-secure code
     │
     ▼─────────────────────────►  PR #9: secrets./os.urandom/crypto. allowlist
                                  +2 regression tests


Round 9 Go brownfield
     │
     ▼ FN discovered: SEC009 doesn't fire on Go `md5.Sum(...)`
     │ — agent kept md5 because aegis didn't surface it
     │
     ▼─────────────────────────►  PR #12: SEC009 language-aware dispatch
                                  +8 multi-language tests
                                  + enclosing_security_context
                                    function-name check


Round 9 Java brownfield
     │
     ▼ FN discovered: SEC010 inner-block `break` hides
     │ `int idx = new Random().nextInt(...)` inside
     │ `generateSessionToken()`
     │
     ▼─────────────────────────►  PR #11: enclosing_token_context
                                  walks past inner blocks +
                                  reads function name
                                  +3 regression tests
```

| Round | Discovered | Fixed in |
|---|---|---|
| Round 8 codex Plan A | SEC010 false-positive on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hides production case | PR #11 |
| Plan A 32+ runs | SEC010 needles too narrow (URL shorteners) | PR #6 |
| Plan A entropy bypass | SEC002 misses placeholder-shaped strings | PR #6 |
| Plan B 6 runs | "What aegis is NOT" missing in README | PR #6 |

PRs #6 through #12 (the post-experiment SEC coverage and B-class rule
batches) all trace back to specific findings in this archive.
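
The Round 8 false positive can be illustrated with a toy rule. This is not Aegis's implementation: the needle list and the line-based matching are hypothetical, and only the allowlist prefixes (`secrets.`, `os.urandom`, `crypto.`) are taken from the PR #9 summary above.

```python
import re
from typing import List

# Hypothetical needles for a weak-RNG rule; not Aegis's actual pattern list.
WEAK_RNG_CALL = re.compile(r"\b(?:choice|randint|randrange|random)\s*\(")
# Prefixes named in the PR #9 summary as the allowlist.
SECURE_PREFIXES = ("secrets.", "os.urandom", "crypto.")


def flag_weak_rng(source: str, use_allowlist: bool) -> List[int]:
    """Return the 1-based line numbers this toy rule would flag."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        if not WEAK_RNG_CALL.search(line):
            continue
        # Crude line-level check: any allowlisted prefix silences the line.
        if use_allowlist and any(prefix in line for prefix in SECURE_PREFIXES):
            continue
        flagged.append(number)
    return flagged


SAMPLE = (
    "token = ''.join(secrets.choice(ALPHABET) for _ in range(32))\n"
    "legacy = ''.join(random.choice(ALPHABET) for _ in range(32))\n"
)
```

Without the allowlist the toy rule flags both lines of `SAMPLE`, including the already-secure `secrets.choice` call; with it, only the `random.choice` line remains.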

---

## Files

- [`comparison-report.md`](comparison-report.md) — the 1199-line
  rolling Round 1 → Round 9 analysis
- `starting-code/` — Plan B Python brownfield fixture (3 planted SEC bugs)
- `starting-go/` — Round 9 Go brownfield fixture
- `starting-java/` — Round 9 Java brownfield fixture
- `starting-multi/` — Plan C 5-module fixture
- `prompt-*.txt` — the prompts handed to each agent. `-a.txt` is the
  no-aegis variant; `-b.txt` adds the `REQUIRED workflow: run
  aegis_validate.py after every write` ritual instruction
- `aegis_validate.py` — Python wrapper around `aegis-mcp` stdio
  JSON-RPC. The agents run it after every file write
- `eval_round_*.sh` — analysis scripts
- `<model>-<task>-<variant>/` — one directory per agent run.
  Naming convention:
  - models: `haiku` / `sonnet` / `flash` (Gemini 2.5) /
    `25flash` / `25fl` (Gemini 2.5 Flash-Lite) / `3flash` /
    `31flashlite` (Gemini 3.1 Flash-Lite Preview) / `codex`
    (GPT-5.4) / `g52` (GPT-5.2) / `g53codex` (GPT-5.3-codex) /
    `g54mini` (GPT-5.4-mini)
  - tasks: `amb` (Plan A) / `bf` (Plan B Python) / `mc` (Plan C
    multi-module) / `bf-go` (Round 9 Go) / `bf-java` (Round 9 Java)
  - variants: `a` (no Aegis) / `b` (with Aegis MCP)
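
The naming convention above can be parsed mechanically, for example when grouping runs into A/B pairs. The helper below is a hypothetical sketch (it is not one of the archive's scripts) and assumes that model tokens never contain a hyphen, which holds for the tokens listed above.

```python
from typing import NamedTuple

KNOWN_TASKS = {"amb", "bf", "mc", "bf-go", "bf-java"}
KNOWN_VARIANTS = {"a", "b"}


class RunDir(NamedTuple):
    model: str
    task: str
    variant: str


def parse_run_dir(name: str) -> RunDir:
    """Split `<model>-<task>-<variant>`; the task itself may contain a hyphen."""
    model, _, rest = name.partition("-")      # model is the first segment
    task, _, variant = rest.rpartition("-")   # variant is the last segment
    if not model or task not in KNOWN_TASKS or variant not in KNOWN_VARIANTS:
        raise ValueError(f"not a run directory name: {name!r}")
    return RunDir(model, task, variant)
```

For example, `parse_run_dir("g54mini-mc-a")` yields model `g54mini`, task `mc`, variant `a`, and a name following the convention for the Java fixture, such as `g54mini-bf-java-b`, yields task `bf-java`.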

## Reproducing a run

```bash
# 1. Build aegis-mcp (from repo root)
cargo install --path crates/aegis-mcp --force

# 2. Set up a paired run
mkdir my-round-a my-round-b
cp starting-code/*.py my-round-a/
cp starting-code/*.py my-round-b/

# 3. Hand each variant to your agent CLI of choice
cd my-round-a && claude code < ../prompt-bf-a.txt   # no-aegis
cd ../my-round-b && claude code < ../prompt-bf-b.txt # with-aegis

# 4. Compare against planted bugs
cd .. && bash eval_round_8.sh
```

The exact commands used for codex-driven rounds appear at the top
of each round's `run.log`.