# Hermes Agent Self-Evolution — Evolutionary Self-Improvement for Hermes Agent

## Vision

A standalone optimization pipeline that systematically improves Hermes Agent's performance by evolving skills, prompts, tool descriptions, and agent configurations using automated optimization loops. Lives in its own repo (`NousResearch/hermes-agent-self-evolution`), operates ON hermes-agent — not part of it.

Three complementary engines, unified under one workflow:

| Engine | What It Optimizes | License | Integration |
|--------|-------------------|---------|-------------|
| **DSPy + GEPA** | Skills, prompts, instructions, tool descriptions | MIT | Native Python, primary engine |
| **Darwinian Evolver** | Code files, algorithms, tool implementations | AGPL v3 | External CLI only |
| **DSPy MIPROv2** | Few-shot examples, instruction text | MIT | Native Python, fallback optimizer |

GEPA is the star — it's integrated into DSPy, reads execution traces to understand WHY things fail (not just that they fail), and works with as few as 3 examples. It outperforms both RL and previous DSPy optimizers.

**Important: No GPU training required.** Everything in this plan operates via API calls only. DSPy+GEPA and MIPROv2 optimize the *text* of prompts, instructions, and few-shot examples — they mutate and evaluate strings, not model weights. The Darwinian Evolver evolves code files (also text). The only DSPy component that trains weights (`BootstrapFinetune`) is explicitly excluded from this plan. All evaluation runs through batch_runner making standard LLM API calls.

---

## What Can Be Improved

### Tier 1: Skill Files (Highest Value, Lowest Risk)

- **What:** SKILL.md files — procedural instructions the agent follows
- **How:** Wrap skill text as a DSPy module, evaluate on test tasks via batch_runner, evolve with GEPA
- **Why it works:** Skills are pure text, easily mutated, and directly measurable (did the agent complete the task correctly when following this skill?)
- **Example:** Evolve the `github-code-review` skill to produce better reviews by testing against a dataset of known-good code reviews
### Tier 2: Tool Descriptions (Medium Value, Low Risk)

- **What:** The `description` field in tool schemas (what the agent sees when deciding which tool to use)
- **How:** GEPA evolves descriptions, evaluates whether the agent picks the right tool for given tasks
- **Why it works:** Tool selection is a classification problem — perfect for DSPy optimization
- **Example:** Evolve the `search_files` description so the agent picks it over `terminal(grep)` more reliably

### Tier 3: System Prompt Components (High Value, Higher Risk)

- **What:** Sections of the system prompt (persona, policies, formatting instructions)
- **How:** Parameterize prompt_builder.py sections as DSPy Signatures, optimize with GEPA
- **Why it works:** System prompt quality directly determines agent behavior quality
- **Risk:** Must be careful not to break prompt caching — only optimize offline, deploy as new versions
- **Example:** Evolve the "tool usage guidelines" section to reduce unnecessary tool calls

### Tier 4: Code Evolution (High Value, Highest Risk)

- **What:** Tool implementation code, helper functions
- **How:** Darwinian Evolver with GitBasedOrganism, test via pytest + batch_runner
- **Why it works:** Some tool implementations have subtle bugs or inefficiencies that evolutionary search can find
- **Risk:** Code changes can break things — requires strong test suites as guardrails
- **Example:** Evolve `file_tools.py` patch matching to handle more edge cases

---

## Architecture

### The Optimization Loop

```
┌────────────────────────────────────────────────┐
│ 1. SELECT TARGET                               │
│    - Pick a skill, prompt section, or tool     │
│    - Load current version as baseline          │
│                                                │
│ 2. BUILD EVALUATION DATASET                    │
│    - Mine session_db for real usage examples   │
│    - Or use hand-crafted test cases            │
│    - Split: train / validation / test          │
│                                                │
│ 3. WRAP AS DSPy MODULE                         │
│    - Skill text → dspy.Signature               │
│    - Agent workflow → dspy.ReAct               │
│    - Tool selection → dspy.Predict             │
│                                                │
│ 4. RUN OPTIMIZER                               │
│    - Primary: dspy.GEPA (reflective evolution) │
│    - Fallback: dspy.MIPROv2 (bayesian opt)     │
│    - Code: Darwinian Evolver (external CLI)    │
│                                                │
│ 5. EVALUATE & COMPARE                          │
│    - Run optimized version on held-out test    │
│    - Compare: accuracy, cost, latency          │
│    - Statistical significance check            │
│                                                │
│ 6. DEPLOY (with approval)                      │
│    - Git commit the improved version           │
│    - A/B test in production (optional)         │
│    - Rollback mechanism via git revert         │
└────────────────────────────────────────────────┘
```
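To make the loop concrete, here is a minimal driver sketch. Every name in it is a hypothetical stand-in (the real pipeline delegates step 4 to `dspy.GEPA` and step 5 to batch_runner); only the control flow is the point:

```python
from typing import Callable

Example = dict[str, str]                        # {"task": ..., "expected_behavior": ...}
Scorer = Callable[[str, list[Example]], float]  # (variant text, examples) -> 0..1
Mutator = Callable[[str], str]                  # variant text -> mutated text

def optimize(baseline: str, train: list[Example], holdout: list[Example],
             score: Scorer, mutate: Mutator, iterations: int = 10) -> str | None:
    """Steps 3-6 of the loop; steps 1-2 (target + dataset) happen before this."""
    best_text, best_score = baseline, score(baseline, train)
    for _ in range(iterations):                    # step 4: propose variants
        candidate = mutate(best_text)
        candidate_score = score(candidate, train)  # step 5: batch_runner evaluation
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
    # Step 5 continued: held-out comparison. Step 6: propose a PR only if the
    # winner actually beats the baseline outside its training data.
    if best_text != baseline and score(best_text, holdout) > score(baseline, holdout):
        return best_text
    return None
```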
### Integration Points with Existing Hermes Infrastructure

| Hermes Component | Role in Self-Improvement |
|------------------|--------------------------|
| `batch_runner.py` | Evaluation harness — run agent on test tasks in parallel |
| `agent/trajectory.py` | Collect execution traces for GEPA's reflective analysis |
| `hermes_state.py` (SessionDB) | Mine real usage data for evaluation datasets |
| `skills/` directory | The primary optimization targets |
| `tools/registry.py` | Tool descriptions to optimize |
| `agent/prompt_builder.py` | System prompt components to optimize |
| `tests/` | Guardrails — evolved code must pass all tests |
| Git history | Track all evolution lineage, enable rollback |

### Data Flow

```
SessionDB (real conversations)
   │
   ▼
Evaluation Dataset Builder
   │
   ├──► DSPy Module Wrapper (wraps skill/prompt/tool as optimizable module)
   │        │
   │        ▼
   │   GEPA Optimizer ◄── Execution Traces (from batch_runner)
   │        │                       ▲
   │        │                       │
   │        ▼                       │
   │   Candidate Variants ──► batch_runner (parallel evaluation)
   │        │
   │        ├──► Constraint Validation (tests, char limits, caching compat)
   │        │
   │        ▼
   │   Best Valid Variant
   │        │
   ▼        ▼
Git Branch + PR (with diff, metrics, before/after comparison)
   │
   ▼
Human Review & Merge
```
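The first stage of that flow — turning mined or synthetic (task, response) pairs into scored train/val/holdout splits — might look like the sketch below. The judge call is a stub (the real one is an LLM rubric call), and the 60/20/20 ratio mirrors the splits used later in this plan:

```python
import json
import random
from pathlib import Path

def judge_score(task: str, response: str) -> float:
    """Stand-in for an LLM-as-judge rubric call returning 0-1."""
    return 1.0 if response else 0.0

def build_splits(pairs: list[tuple[str, str]], out_dir: str,
                 seed: int = 0) -> dict[str, list[dict]]:
    """Score mined (task, response) pairs and write train/val/holdout JSONL."""
    rows = [{"task": t, "response": r, "score": judge_score(t, r)}
            for t, r in pairs]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    splits = {
        "train": rows[: int(n * 0.6)],
        "val": rows[int(n * 0.6): int(n * 0.8)],
        "holdout": rows[int(n * 0.8):],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        (out / f"{name}.jsonl").write_text(
            "".join(json.dumps(row) + "\n" for row in split)
        )
    return splits
```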
---

## Implementation Structure

### Where It Lives

Hermes Agent Self-Evolution lives in its own repo (`NousResearch/hermes-agent-self-evolution`), separate from hermes-agent. It pip-installs or clones hermes-agent to access its infrastructure, and outputs PRs against the hermes-agent repo.

```
hermes-agent-self-evolution/       # Standalone repo
├── PLAN.md                        # This file
├── README.md                      # Setup, usage, examples
├── pyproject.toml                 # Package config + dependencies (dspy, gepa)
│
├── evolution/                     # Main package
│   ├── core/                      # Shared infrastructure
│   │   ├── __init__.py
│   │   ├── dataset_builder.py     # Eval dataset generation (synthetic, SessionDB mining)
│   │   ├── fitness.py             # Fitness functions (LLM-as-judge, rubrics, length penalties)
│   │   ├── constraints.py         # Constraint validators (char limits, caching compat, test suite)
│   │   ├── benchmark_gate.py      # Benchmark gating (run TBLite/YC-Bench, check regression)
│   │   └── pr_builder.py          # Auto-generate PR with metrics, diffs, comparison
│   │
│   ├── skills/                    # Phase 1: Skill evolution
│   │   ├── __init__.py
│   │   ├── evolve_skill.py        # Main entry: python -m evolution.skills.evolve_skill --skill <name>
│   │   └── skill_module.py        # Wraps SKILL.md as DSPy module
│   │
│   ├── tools/                     # Phase 2: Tool description evolution
│   ├── prompts/                   # Phase 3: System prompt evolution
│   ├── code/                      # Phase 4: Code evolution (Darwinian Evolver)
│   └── monitor/                   # Phase 5: Continuous loop
│
├── datasets/                      # Generated eval datasets (gitignored, local)
│   ├── skills/
│   └── tools/
│
└── tests/                         # Test suite
```

### How It's Invoked

```bash
# Clone and install
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
pip install -e ".[dev]"

# Point at hermes-agent repo (auto-detected from ~/.hermes/hermes-agent or env var)
export HERMES_AGENT_REPO=~/.hermes/hermes-agent

# Phase 1: Evolve a skill
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 10 \
    --eval-source synthetic   # or: sessiondb, golden, auto

# Phase 2: Evolve tool descriptions
python -m evolution.tools.evolve_tool_descriptions \
    --iterations 5 \
    --benchmark-gate tblite-fast

# Phase 3: Evolve a system prompt section
python -m evolution.prompts.evolve_prompt_section \
    --section MEMORY_GUIDANCE \
    --iterations 5

# Phase 4: Evolve tool code (uses Darwinian Evolver CLI)
python -m evolution.code.evolve_tool_code \
    --tool file_tools \
    --bug-issue 742 \
    --iterations 10

# All commands output a PR branch + summary against hermes-agent. Human merges.
```

### Relationship to hermes-agent

**hermes-agent-self-evolution operates ON hermes-agent, not inside it.** Zero changes to the agent repo are needed. It reads from the hermes-agent codebase and writes evolved versions to git branches, creating PRs for human review.

| hermes-agent Component | How Self-Evolution Uses It |
|------------------------|----------------------------|
| `batch_runner.py` | Run agent on eval tasks in parallel |
| `environments/benchmarks/tblite/` | Benchmark gating |
| `environments/benchmarks/yc_bench/` | Coherence checks |
| `hermes_state.py` (SessionDB) | Mine real usage for eval data |
| `agent/prompt_builder.py` | Read current prompt sections (read-only) |
| `tools/registry.py` | Read current tool descriptions (read-only) |
| `skills/` directory | Read current skills, write evolved versions to branch |

---

## Execution Plan

### How Phases Work

Phases are sequential — each one builds on infrastructure from the previous one and must prove itself before we move on. The flow is:

```
Phase 1 ──► Validation Gate ──► Phase 2 ──► Validation Gate ──► Phase 3 ──► ...
Build       "Did it actually    Build       "Did it work        Build
& test       make things        & test       without breaking   & test
             better?"                        anything?"
```

**Between every phase:**

1. Run full benchmark suite (TBLite + YC-Bench fast_test) to establish a new baseline
2. Review all evolved artifacts — are the changes sensible to a human?
3. Merge the proven improvements via PR
4. Retrospective: what worked, what didn't, adjust approach for next phase

**Each phase has three stages:**

- **Build** (~1-2 weeks): Write the optimization infrastructure for that tier
- **Run** (~1 week): Execute optimization on real targets, iterate on eval datasets
- **Validate** (~1 week): Benchmark, review, merge. Decide if results justify moving to next phase

If a phase doesn't produce meaningful improvements (evolved variants aren't better than baseline), we stop and reassess before moving on. No point optimizing tool descriptions if we can't even improve skills.

### Timeline Overview

| Phase | What | Duration | Depends On | Gate to Next |
|-------|------|----------|------------|--------------|
| **Phase 1** | Skill evolution | 3-4 weeks | Nothing — starts here | ≥1 skill measurably improved, no benchmark regression |
| **Phase 2** | Tool descriptions | 2-3 weeks | Phase 1 infra (GEPA runner, eval framework) | Tool selection accuracy improved, no benchmark regression |
| **Phase 3** | System prompt | 2-3 weeks | Phase 1-2 infra + validated benchmark gating | Behavioral tests pass, benchmarks hold or improve |
| **Phase 4** | Code evolution | 3-4 weeks | Phases 1-3 + strong eval pipeline | Bugs fixed, tests pass, benchmarks hold |
| **Phase 5** | Continuous loop | 2 weeks | All above working | Automated pipeline runs unattended |

**Total: ~13-17 weeks if all phases prove valuable.** But we may stop at Phase 1 or 2 if the returns diminish — no obligation to do all five.

### Detailed Phase Breakdown

---

### Phase 1: Skill Evolution via DSPy+GEPA (Core Capability)

**Goal:** The agent can optimize any SKILL.md file by running it through GEPA.
**Week 1-2 (Build):**
- Install DSPy + GEPA, verify they work in Hermes' .venv
- Build the skill-as-DSPy-module wrapper (takes a SKILL.md → DSPy module)
- Build the eval dataset generator (strong model reads skill → generates test cases)
- Build the GEPA optimization runner (wraps dspy.GEPA with Hermes config)
- Unit tests for all components

**Week 2-3 (Run):**
- Pick 2-3 target skills: `github-code-review`, `systematic-debugging`, `arxiv`
- Generate eval datasets for each (15-30 examples per skill)
- Run GEPA optimization (5-10 iterations per skill)
- Compare baseline vs evolved on holdout set
- Iterate on eval dataset quality if results are noisy

**Week 3-4 (Validate):**
- Run TBLite + YC-Bench fast_test with evolved skills vs baseline
- Human review of all evolved skill diffs — do the changes make sense?
- Create PRs for improvements that pass all gates
- Document what worked and what didn't

**Done when:**
- ≥1 skill shows measurable improvement on its eval dataset (≥10% score increase)
- No benchmark regression (TBLite score holds within 2%)
- The evolved skill diff reads sensibly to a human reviewer
- The optimization pipeline is reusable (can point it at any skill and run)

**What to build:**

1. **Skill-as-DSPy-Module wrapper** — Takes a SKILL.md, creates a DSPy module (see the sketch after this list) that:
   - Injects the skill text as the system prompt
   - Runs the agent on a test task
   - Returns the result for scoring

2. **Evaluation dataset builder** — Creates train/val/holdout splits from multiple sources:

   **Source A: Synthetic generation (primary, bootstrapping)**
   Use a strong model (e.g., Claude Opus) to generate test cases for a skill:
   - Read the skill file → understand what it does
   - Generate 15-30 realistic (task_input, expected_behavior) pairs
   - Expected_behavior is a rubric, not exact text — e.g., "should identify the SQL injection on line 42" not "output this exact string"
   - Split: 10 train / 5 val / 5-10 holdout
   - GEPA works with as few as 3 examples, so this is sufficient to start

   **Source B: SessionDB mining (real usage, LLM-as-judge scored)**
   - Query SessionDB for sessions where the skill was loaded (search for skill name in messages)
   - Extract the task the user gave and the agent's full response
   - Use LLM-as-judge to score each (task, response) pair on a rubric
   - High-scoring pairs become "good" examples; low-scoring pairs become failure cases for GEPA's reflective analysis
   - This improves over time as more real usage accumulates

   **Source C: Hand-curated golden sets (optional, high-value skills)**
   - Manually written test cases with expected outputs
   - Stored as JSONL in `~/.hermes/evolution/datasets/<skill>/golden.jsonl`
   - Highest quality signal but requires manual effort — reserve for critical skills

   **Source D: Skill-specific auto-evaluation (where applicable)**
   - `systematic-debugging`: Plant a bug, run the skill, check if tests pass after
   - `arxiv`: Search for known papers, check if they're found
   - `github-code-review`: Create a PR with planted issues, check if they're caught
   - Not all skills have natural auto-eval — this is a bonus, not a requirement

   **Scoring: LLM-as-judge with rubrics**
   For most skills, there's no binary right/wrong — quality is subjective. The fitness function uses an LLM judge that scores on a rubric:
   - Did the agent follow the skill's procedure? (0-1)
   - Was the output correct/useful? (0-1)
   - Was it concise (within token budget)? (0-1)
   - Rubrics are skill-specific and stored alongside the eval dataset

3. **GEPA optimization runner** — Wraps `dspy.GEPA` with Hermes-specific config:
   - Uses batch_runner for parallel evaluation
   - Captures execution traces (trajectories) for GEPA's reflective analysis
   - Saves snapshots for pause/resume

4. **Comparison & deployment** — Side-by-side evaluation:
   - Runs baseline vs optimized on held-out test set
   - Shows diff of what changed
   - Commits improved version with evolution metadata
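A minimal sketch of items 1-3 in DSPy follows. The model names, file path, and judge signature are illustrative, and the GEPA metric-with-feedback shape follows recent DSPy releases; verify against the installed `dspy` version before building on this:

```python
import pathlib

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative model choice

# The skill text becomes the signature *instructions*, which is the part of
# a predictor that GEPA actually mutates during optimization.
skill_text = pathlib.Path("skills/github-code-review/SKILL.md").read_text()
program = dspy.Predict(dspy.Signature("task -> result", skill_text))

# LLM-as-judge rubric scorer (illustrative string signature).
judge = dspy.Predict("task, result, rubric -> score: float, critique: str")

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    scored = judge(task=gold.task, result=pred.result, rubric=gold.rubric)
    # The critique is the textual feedback GEPA reflects on to propose
    # better skill text: not just a number, but WHY it scored that way.
    return dspy.Prediction(score=scored.score, feedback=scored.critique)

# In practice the dataset builder produces these; hardcoded here for shape.
trainset = [
    dspy.Example(
        task="Review this diff: ...",
        rubric="Should identify the SQL injection on line 42",
    ).with_inputs("task"),
]

optimizer = dspy.GEPA(
    metric=metric,
    auto="light",                            # small budget; GEPA needs few examples
    reflection_lm=dspy.LM("openai/gpt-4o"),  # stronger model proposes mutations
)
evolved = optimizer.compile(program, trainset=trainset)
print(evolved.signature.instructions)        # the evolved SKILL.md candidate
```

Note that GEPA evolves signature *instructions*, so the skill text has to live there rather than in an input field; the real `skill_module.py` wrapper would handle reading and writing SKILL.md around this.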
**CLI interface:**

```bash
# Evolve a skill with auto-generated eval data from session history
hermes evolve skill github-code-review --iterations 10

# Evolve with a custom evaluation dataset
hermes evolve skill arxiv --dataset eval_tasks.jsonl --iterations 5

# Compare baseline vs evolved
hermes evolve compare github-code-review --version latest

# Deploy evolved version
hermes evolve deploy github-code-review --version 3
```

**Or as agent tool calls:**

```
The agent can self-invoke optimization:
"I notice this skill could be improved. Let me run GEPA optimization on it."
  → Uses execute_code to run DSPy+GEPA
  → Evaluates results
  → Proposes the improved skill for human approval
```

### Phase 2: Tool Description Optimization

**Goal:** Optimize the natural language descriptions in tool schemas so the agent picks the right tools more reliably and uses them correctly.

**Prerequisite:** Phase 1 gate passed — GEPA optimization loop proven to work on skills.

**Week 1 (Build):** Adapt Phase 1's GEPA runner for tool descriptions. Build tool selection evaluator and synthetic dataset generator. The hard part is cross-tool evaluation — ensuring one tool's improvement doesn't steal from another.

**Week 2 (Run):** Generate tool selection dataset (~200-400 triples). Run GEPA on all tool descriptions simultaneously. Mine SessionDB for misselection patterns.

**Week 3 (Validate):** Benchmark gate. Human review of evolved descriptions — do they still accurately describe the tools? PR.

**Done when:**
- Tool selection accuracy improves on holdout set (≥5% improvement)
- No individual tool's selection rate regresses
- Benchmarks hold (TBLite within 2%)
- Evolved descriptions are factually accurate and ≤500 chars

**What gets evolved:**

Tool descriptions are hardcoded string constants in `tools/*.py` files, registered via `registry.register()`. Each tool has:
- A top-level `description` field (what the tool does, when to use it, behavioral guidance)
- Per-parameter `description` fields (what each parameter means, valid values)
- Some tools have separate description constants (e.g., `TERMINAL_TOOL_DESCRIPTION`)

These descriptions are sent with every API call as part of the tool schema — every extra character multiplies across the entire conversation.

**What to build:**

1. **Tool selection evaluator** — Given a task description, does the agent pick the right tool?
   - Build a dataset of (task_description, correct_tool, correct_params) triples
   - Example: "find all Python files containing 'import os'" → `search_files` (not `terminal(grep)`)
   - Example: "read lines 50-100 of config.py" → `read_file` (not `terminal(cat)`)
   - Score: tool_selection_accuracy + parameter_correctness

2. **Description optimizer** — GEPA evolves description text to improve selection accuracy
   - Wrap each tool description as a DSPy Signature parameter
   - Mutate descriptions, evaluate on tool selection dataset
   - GEPA reads traces of WRONG tool selections to understand why the agent was confused

3. **Cross-tool evaluation** — Ensure improving one description doesn't hurt others (see the fitness sketch below)
   - Always evaluate ALL tool descriptions together (not in isolation)
   - Fitness function penalizes regressions on any tool's selection rate
   - This prevents a `search_files` description from "stealing" selections from `read_file`
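The cross-tool fitness idea reduces to scoring the description *set* jointly and punishing any per-tool regression. A sketch (selection rates come from running the agent on the eval dataset via batch_runner; the penalty weight is an illustrative choice):

```python
def cross_tool_fitness(
    candidate_rates: dict[str, float],  # per-tool selection accuracy, candidate set
    baseline_rates: dict[str, float],   # per-tool selection accuracy, current set
    regression_penalty: float = 2.0,    # illustrative weight
) -> float:
    """Score a full set of tool descriptions, never one in isolation."""
    overall = sum(candidate_rates.values()) / len(candidate_rates)
    # Any tool that got worse drags the score down hard, so a variant
    # cannot win by "stealing" selections from a sibling tool.
    regressions = sum(
        max(0.0, baseline_rates[tool] - candidate_rates[tool])
        for tool in baseline_rates
    )
    return overall - regression_penalty * regressions
```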
**Evaluation data sources:**

**Source A: Synthetic tool selection dataset**
Generate (task, correct_tool, correct_params) triples using a strong model:
- For each tool, generate 10-20 tasks where that tool is clearly the right choice
- Include 10-20 "confusing" tasks where two tools could work but one is better
- Include 10 tasks where the agent should use NO tool (just respond directly)
- Total: ~200-400 triples, split 60/20/20 train/val/holdout

**Source B: SessionDB mining — tool selection patterns**
- Find conversations where the agent used a tool
- Identify cases where the agent used `terminal(grep)` when `search_files` was better (or similar mismatches)
- LLM-as-judge scores whether the tool choice was optimal
- Misselections become high-value training examples

**Source C: Benchmark-derived tool selection**
- Run TBLite with baseline descriptions, log every tool call
- Identify tasks where wrong tool selection caused failures
- These become hard examples in the eval dataset

**Constraints specific to tool descriptions:**
- Max 500 chars per tool description (sent every API call)
- Max 200 chars per parameter description
- Must remain factually accurate (can't claim a tool does something it doesn't)
- Schema structure (parameter names, types, required fields) is FROZEN — only text evolves
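These constraints are mechanical enough to check before any LLM evaluation. A sketch of the validator, using plain dicts shaped like standard JSON tool schemas (hermes-agent's registry format may differ):

```python
def validate_description_variant(baseline: dict, variant: dict) -> list[str]:
    """Return a list of violations; an empty list means the variant is valid."""
    errors: list[str] = []
    if len(variant.get("description", "")) > 500:
        errors.append("tool description exceeds 500 chars")
    base_params = baseline["parameters"]["properties"]
    var_params = variant["parameters"]["properties"]
    if set(base_params) != set(var_params):
        return errors + ["parameter names changed (schema is frozen)"]
    for name, spec in var_params.items():
        if len(spec.get("description", "")) > 200:
            errors.append(f"parameter '{name}' description exceeds 200 chars")
        if spec.get("type") != base_params[name].get("type"):
            errors.append(f"parameter '{name}' type changed (schema is frozen)")
    if baseline["parameters"].get("required") != variant["parameters"].get("required"):
        errors.append("required fields changed (schema is frozen)")
    return errors
```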
### Phase 3: System Prompt Evolution

**Goal:** Optimize the sections of the system prompt that guide agent behavior.

**Prerequisite:** Phase 2 gate passed — benchmark gating validated, GEPA producing sensible text mutations.

**Week 1 (Build):** Build section-as-DSPy-parameter wrapper for the 5 evolvable prompt sections. Build behavioral test suite generator. This is the riskiest tier so far — system prompt changes affect everything.

**Week 2 (Run):** Generate behavioral test scenarios (~60-80 total across all sections). Run GEPA on each section independently first, then jointly. Run benchmarks after each optimization round.

**Week 2-3 (Validate):** Full benchmark suite (TBLite + YC-Bench). Extra scrutiny here — system prompt changes have the widest blast radius. Multiple human reviewers if possible.

**Done when:**
- Behavioral test scores improve (≥10% on targeted sections)
- Benchmarks hold or improve (zero tolerance for regression here)
- The agent's personality/tone hasn't drifted noticeably
- Prompt stays within caching boundaries

**What gets evolved:**

The system prompt is assembled in `run_agent.py` / `agent/prompt_builder.py` from 8 distinct sections:

| Section | Location | What It Does | Evolvable? |
|---------|----------|--------------|------------|
| `DEFAULT_AGENT_IDENTITY` | prompt_builder.py | Core persona, behavioral traits | ✅ Yes — tone, priorities, approach |
| `MEMORY_GUIDANCE` | prompt_builder.py | How to use persistent memory | ✅ Yes — when to save, what to save |
| `SESSION_SEARCH_GUIDANCE` | prompt_builder.py | When to search past sessions | ✅ Yes — trigger conditions |
| `SKILLS_GUIDANCE` | prompt_builder.py | When to save/load skills | ✅ Yes — trigger conditions |
| `PLATFORM_HINTS` | prompt_builder.py | Per-platform formatting guidance | ✅ Yes — per platform |
| Memory block | memory_store.py | User's actual memories | ❌ No — user data |
| Skills index | prompt_builder.py | Auto-generated skill list | ❌ No — auto-generated |
| Context files | prompt_builder.py | AGENTS.md, .cursorrules | ❌ No — project-specific |

**What to build:**

1. **Section-as-DSPy-parameter wrapper** — Each evolvable section becomes a DSPy Signature field
   - The optimizer can mutate each section independently
   - Sections are evaluated together (the full system prompt matters, not individual sections)

2. **Behavioral evaluator** — Does the agent behave correctly with this system prompt?
   - Measure: tool usage patterns, response quality, memory usage, skill loading
   - Use batch_runner to run the agent on diverse tasks with the evolved prompt

3. **Benchmark-gated validation** — Evolved prompts must not regress on benchmarks
   - Run TBLite (fast, ~1-2 hours) as a regression check
   - If TBLite score drops, reject the variant regardless of other metrics

**Evaluation data sources:**

**Source A: Behavioral test suite (synthetic)**
Generate scenarios that test specific prompt sections:
- Memory guidance: "Does the agent save important user preferences?" (10 scenarios)
- Session search: "Does the agent search history when the user says 'like last time'?" (10 scenarios)
- Skills guidance: "Does the agent load relevant skills before starting?" (10 scenarios)
- Identity: "Is the response helpful, direct, and not overly verbose?" (20 scenarios)
- Platform hints: "Does CLI output avoid markdown? Does Telegram use formatting?" (10 per platform)

**Source B: Benchmark scores as fitness signal**
- TBLite (100 tasks, ~1-2 hours, binary pass/fail) — primary regression check
- YC-Bench fast_test preset (~50 turns, composite score) — tests long-term coherence
- These don't measure specific prompt sections, but they catch broad regressions
- A prompt variant that scores higher on behavioral tests but lower on TBLite is rejected

**Source C: SessionDB — behavioral pattern mining**
- Find sessions where the agent failed to search memory when it should have
- Find sessions where the agent was too verbose or used wrong formatting
- These become targeted test cases for the relevant prompt section

**Constraints specific to system prompt sections:**
- Each section must not exceed its current size by >20% (prevents prompt bloat)
- Total system prompt must stay under the model's prompt caching boundary
- Identity section must retain core traits (helpful, direct, admits uncertainty)
- Platform hints must remain platform-accurate (don't tell Telegram to use ANSI codes)
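The accept/reject decision for a section variant then reduces to a small pure function. A sketch, assuming the behavioral suite and TBLite runs (both driven by batch_runner) have already produced scores:

```python
def accept_prompt_section(
    baseline_text: str,
    variant_text: str,
    baseline_behavioral: float,  # behavioral suite score, current section
    variant_behavioral: float,   # behavioral suite score, evolved section
    baseline_tblite: float,
    variant_tblite: float,
) -> bool:
    """Apply Phase 3's gates: size cap, hard benchmark gate, real gain."""
    if len(variant_text) > 1.2 * len(baseline_text):
        return False  # >20% growth means prompt bloat; reject outright
    if variant_tblite < baseline_tblite:
        return False  # zero tolerance for benchmark regression on prompts
    return variant_behavioral >= 1.10 * baseline_behavioral  # the >=10% target
```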
### Phase 4: Code Evolution via Darwinian Evolver

**Goal:** Evolve tool implementation code for better performance and fewer bugs.

**Prerequisite:** Phases 1-3 complete — strong evaluation pipeline, validated benchmark gating, confidence in the optimization loop.

**Week 1-2 (Build):** Set up Darwinian Evolver as an external CLI. Build code-as-organism wrapper mapping tool files to GitBasedOrganism. Build composite fitness function (pytest + benchmarks + bug reproduction). This phase uses a different engine (Darwinian Evolver instead of DSPy+GEPA), so there's new infrastructure to build.

**Week 2-3 (Run):** Start with known bugs from GitHub issues — create reproduction scripts, run evolution to find fixes. Then try edge case hardening on 1-2 tools (e.g., `file_tools.py`, `search_files`).

**Week 3-4 (Validate):** Full test suite + full benchmark suite (including TerminalBench2 for thorough validation). Strictest human review — every line of evolved code reviewed before merge.

**Done when:**
- ≥1 known bug fixed by evolution (validated by reproduction script)
- Full test suite passes (2550+ tests)
- Full benchmark suite holds (TBLite + TerminalBench2 + YC-Bench)
- No function signatures or registry calls changed
- Human reviewer approves all code changes

**What gets evolved:**

Actual Python source code in `tools/*.py` files. This is the highest-risk tier — code changes can break everything.

**What to build:**

1. **Code-as-organism wrapper** — Maps tool source files to Darwinian Evolver's `GitBasedOrganism`
   - Each tool file is a separate organism
   - Mutations are proposed by the LLM based on specific failure cases
   - All mutations are committed to a git branch for traceability

2. **Test-driven fitness function** — Composite score from multiple signals (see the sketch below):
   - pytest results (hard gate — must pass 100%)
   - Benchmark scores (TBLite pass rate)
   - Specific failure case resolution (did the mutation fix the bug it targeted?)
   - Code quality heuristics (no regressions in error handling, no removed safety checks)

3. **Safety guardrails** — Strictest of all tiers:
   - Full test suite must pass
   - No changes to function signatures (would break callers)
   - No changes to registry.register() calls (would break tool discovery)
   - No removal of error handling or safety checks
   - Human review required on every PR

**Evaluation data sources:**

**Source A: pytest suite (2550+ tests, primary gate)**
- Every code mutation must pass the full test suite
- Test failures = immediate rejection, no exceptions
- This is the hard floor — it prevents regressions

**Source B: Benchmark scores (broad capability check)**
- Run TBLite with evolved code to verify tool implementations still work end-to-end
- TerminalBench2 for thorough validation (89 tasks, but expensive — use selectively)
- YC-Bench for long-horizon coherence (does the agent still handle 200-turn sessions?)

**Source C: Known bug reproduction datasets**
- Collect GitHub issues that report tool bugs
- Create reproduction scripts that trigger the bug
- Fitness: does the evolved code fix the bug while passing all tests?
- This is the most targeted and efficient use of code evolution

**Source D: Edge case generation**
- Use a strong model to generate adversarial inputs for each tool
- Example: `read_file` with symlinks, binary files, huge files, missing files, permission errors
- Example: `search_files` with regex edge cases, unicode, very large repos
- Score: does the tool handle all edge cases gracefully?
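These signals compose into a single fitness value with pytest as a hard floor. A sketch (the pytest invocation is standard; the TBLite scores are assumed to come from a separate benchmark runner):

```python
import subprocess

def tests_pass(repo_dir: str) -> bool:
    """Hard gate: the full suite must pass, no exceptions."""
    result = subprocess.run(
        ["python", "-m", "pytest", "tests/", "-q"],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def bug_fixed(repo_dir: str, repro_script: str) -> bool:
    """The reproduction script exits 0 once the targeted bug is gone."""
    result = subprocess.run(
        ["python", repro_script], cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def code_fitness(repo_dir: str, repro_script: str,
                 tblite_score: float, baseline_tblite: float) -> float:
    if not tests_pass(repo_dir):
        return 0.0                         # GATE: dead on arrival
    if not bug_fixed(repo_dir, repro_script):
        return 0.1                         # alive, but missed the target
    if tblite_score < baseline_tblite - 0.02:
        return 0.2                         # fixed the bug, regressed broadly
    return 1.0 + (tblite_score - baseline_tblite)  # reward net improvement
```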
**Constraints specific to code evolution:**
- Full test suite (2550+ tests) must pass — zero tolerance
- Function signatures frozen (no breaking API changes)
- registry.register() calls frozen (no tool discovery changes)
- Error handling coverage must not decrease
- Darwinian Evolver runs as external CLI only (AGPL v3)
- All PRs require human review — no auto-merge for code changes

### Phase 5: Continuous Self-Improvement Loop

**Goal:** The agent automatically identifies its weakest areas and improves them over time.

**Prerequisite:** Phases 1-4 proven — manual optimization works reliably for skills, tools, prompts, and code. Now we automate it.

**Week 1 (Build):** Build performance monitor (tracks skill success rates, tool selection accuracy, benchmark scores over time). Build auto-triage logic (ranks optimization targets by impact × frequency). Wire up to Hermes cron scheduler.

**Week 2 (Deploy & Monitor):** Set up weekly benchmark runs via cron. Set up threshold-triggered optimization (when a skill's failure rate exceeds X%, auto-trigger GEPA). All automated PRs still require human merge.

**Done when:**
- Weekly benchmark runs execute unattended and report scores
- Auto-triage correctly identifies underperforming skills
- At least one optimization cycle runs end-to-end (detect problem → optimize → PR) without manual intervention
- Human still reviews and merges every PR — this phase automates detection and optimization, not deployment

**What to build:**

1. **Performance monitor** — Track metrics from real usage:
   - Per-skill success rates (from SessionDB — was the skill loaded? did the task succeed?)
   - Tool selection accuracy (from trajectories — did the agent pick the right tools?)
   - Benchmark scores over time (periodic TBLite + YC-Bench runs)
   - User corrections (when the user says "no, use X instead" — that's a signal)

2. **Auto-triage** — Identify what to optimize next (see the ranking sketch below):
   - Skills with declining success rates or high failure rates
   - Tools that are frequently misselected
   - Benchmark categories with low pass rates
   - Rank by (potential improvement × usage frequency)

3. **Scheduled optimization** — Cron job pipeline:
   - Weekly: Run TBLite + YC-Bench fast_test, log scores
   - When scores drop or skill failure rate exceeds threshold: trigger GEPA optimization
   - Generate PR with evolved improvements
   - Notify for human review

4. **Feedback loop** — Real usage improves evaluation datasets:
   - User corrections are logged and added to eval datasets
   - High-quality sessions become positive examples
   - Failed sessions become failure cases for GEPA's reflective analysis
   - Evaluation datasets grow organically over time
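The triage step is simple enough to sketch directly; the metric record shape is illustrative (the real numbers come from the performance monitor above):

```python
from dataclasses import dataclass

@dataclass
class TargetStats:
    name: str
    kind: str            # "skill" | "tool" | "prompt_section"
    failure_rate: float  # 0-1, from SessionDB / trajectory analysis
    weekly_uses: int     # usage frequency

def triage(targets: list[TargetStats],
           failure_threshold: float = 0.2) -> list[TargetStats]:
    """Rank optimization targets by (potential improvement x usage frequency)."""
    candidates = [t for t in targets if t.failure_rate >= failure_threshold]
    return sorted(candidates,
                  key=lambda t: t.failure_rate * t.weekly_uses,
                  reverse=True)

# The weekly cron job would call triage() and trigger GEPA on the top item.
```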
---

## Benchmarks as Fitness Signals

The three existing benchmarks serve different roles in the optimization pipeline:

| Benchmark | What It Tests | Speed | Cost | Role in Self-Improvement |
|-----------|---------------|-------|------|--------------------------|
| **TBLite** | Coding/sysadmin (100 tasks, calibrated difficulty) | ~1-2 hours | ~$20-50 | **Primary regression gate** — fast enough to run on every candidate |
| **TerminalBench2** | Coding/sysadmin (89 harder tasks, Docker sandboxes) | ~2-4 hours | ~$50-200 | **Thorough validation** — run on final candidates before PR |
| **YC-Bench** | Long-horizon strategic coherence (100-500 turns) | ~3-6 hours | ~$50-200 | **Coherence check** — ensures evolved prompts don't break multi-turn behavior |

**How benchmarks fit into the optimization loop:**

```
Candidate Variant
   │
   ├──► pytest (must pass 100%) ───────────── GATE 1: functional correctness
   │
   ├──► TBLite fast subset (20 tasks) ─────── GATE 2: quick capability check (~20 min)
   │
   ├──► Task-specific eval dataset ────────── FITNESS: skill/tool/prompt quality score
   │
   ▼
Top Candidates Only (top 3)
   │
   ├──► Full TBLite (100 tasks) ───────────── GATE 3: thorough regression check
   │
   ├──► YC-Bench fast_test ────────────────── GATE 4: coherence check
   │
   ▼
Best Candidate → PR with full metrics
```

**Key principle:** Benchmarks are GATES, not fitness functions. The fitness function is task-specific (did the skill/tool/prompt do its job better?). Benchmarks ensure the improvement didn't break something else. A variant that improves skill quality by 20% but drops TBLite by 5% is REJECTED.

---

## Constraints & Guardrails

Every candidate variant must pass ALL of these before it can be considered valid. Variants that fail any constraint are discarded — GEPA/MIPROv2 never see them as successful.

### 1. Full Test Suite

```
python -m pytest tests/ -q   # Must pass 100% — zero tolerance
```

Every evolved variant (skill text, tool description, code) triggers the full test suite. If any test fails, the variant is rejected. This is the hard floor — nothing ships that breaks existing functionality.

### 2. Character/Token Limits

Evolved text must stay within strict size budgets:

| Target | Max Size | Why |
|--------|----------|-----|
| Skill files (SKILL.md) | Configurable per skill, default 15KB | Skills are injected as user messages — bloated skills waste context window |
| Tool descriptions | 500 chars | Tool schemas are sent every turn — every extra char multiplies across the entire conversation |
| System prompt sections | Must not exceed current section size by >20% | Prevents prompt bloat that degrades model attention and increases cost |

The optimizer's fitness function applies a **length penalty** — variants that approach the limit get scored lower even if they're otherwise better. This prevents evolutionary drift toward verbose solutions.
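One way to implement the penalty (the 80% knee and the half-score cap are illustrative choices, not measured values):

```python
def penalized_score(quality: float, text: str, max_chars: int) -> float:
    """Quality dominates far from the budget; the penalty ramps up near it."""
    usage = len(text) / max_chars
    if usage > 1.0:
        return 0.0                        # over budget: hard reject
    ramp = max(0.0, (usage - 0.8) / 0.2)  # 0 below 80%, rising to 1 at the cap
    return quality * (1.0 - 0.5 * ramp)   # lose up to half the score at the cap
```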
### 3. Prompt Caching Compatibility

Hermes relies on prompt caching to keep costs manageable. Evolved content must not break this:

- **Skills**: Injected as user messages at conversation start. Evolved skills are deployed as new versions — they take effect on NEW sessions only, never mid-conversation.
- **Tool descriptions**: Part of the tool schema sent with every API call. Changes take effect on next session start. Schema structure (parameter names, types) must NOT change — only the description text.
- **System prompt sections**: Rebuilt once at session start. Evolved sections deploy as config updates, applied on next session. No mid-session prompt rebuilds.

**Rule: No evolved content is ever hot-swapped into an active conversation.** All changes take effect on the next fresh session.

### 4. Semantic Preservation

The optimizer must preserve the core behavior/intent of what it's evolving:

- A skill for "GitHub code review" must still perform code reviews, not drift into something else
- Tool descriptions must still accurately describe what the tool does
- System prompt sections must maintain their functional role

This is enforced by including **semantic similarity checks** in the fitness function — the evolved text is compared against the original to ensure it hasn't drifted too far in meaning, only improved in effectiveness.
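A sketch of such a check using embedding cosine similarity; the embedding function is passed in (any text-embeddings API works) and the threshold is an illustrative starting point, not a tuned value:

```python
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def preserves_intent(
    original: str,
    evolved: str,
    embed: Callable[[str], list[float]],  # any text-embedding function
    min_similarity: float = 0.75,         # illustrative threshold
) -> bool:
    """Reject variants whose meaning drifts too far from the original."""
    return cosine(embed(original), embed(evolved)) >= min_similarity
```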
### 5. Deployment via PR (Never Direct Commit)

All evolved changes go through a pull request (placeholders in `<angle brackets>` are filled in per run):

```bash
git checkout -b evolve/<target>-<run-id>
# Apply evolved changes
git add <changed-files>
git commit -m "evolve: <target> — score improved X% → Y%

Optimizer: GEPA (N iterations, M candidates evaluated)
Eval dataset: <source> (K examples)
Before: <baseline-score>  After: <evolved-score>  Holdout: <holdout-score>"
git push -u origin evolve/<target>-<run-id>
gh pr create --title "evolve: <target>" --body "<metrics summary>"
```

The PR body includes:
- Before/after scores on train, validation, AND holdout sets
- The full diff of what changed
- Cost of the optimization run
- Any constraint violations that were caught and rejected during evolution

---

## Practical Considerations

### Cost

- GEPA optimization: ~$2-10 per run (depending on eval dataset size)
- Darwinian Evolver: ~$2-9 per task
- Batch evaluation: depends on number of test cases and model cost
- Recommendation: Start with small eval sets (10-20 examples), scale up for important skills

### Safety

See the **Constraints & Guardrails** section above for the full enforcement list. Summary:

- **Human approval required** — all changes deploy via PR, never direct commit
- **Full test suite gate** — zero tolerance, every variant must pass 100%
- **Character/token budgets** — prevents evolutionary bloat
- **Caching compatibility** — no mid-conversation changes, ever
- **Semantic preservation** — evolved text must not drift from its original purpose
- **Git-tracked lineage** — every evolution step is a commit, rollback is trivial
- **Holdout test sets** — separate from training data to catch overfitting

### Licensing

- DSPy: MIT ✓ (can import and integrate freely)
- GEPA: MIT ✓ (integrated into DSPy, also standalone `pip install gepa`)
- Darwinian Evolver: AGPL v3 ⚠️ (external CLI only, no Python imports)
- All Hermes-native code: MIT ✓

---

## Relationship to Existing Issues

| Issue | Status | Relationship |
|-------|--------|--------------|
| #336 (Darwinian Evolver Skill) | Open | Subsumed by Phase 4 of this plan |
| #337 (Evolutionary Self-Improvement) | Open | This plan IS the implementation of #337 |
| #339 (PR: Darwinian Evolver skill) | Open | Close — replaced by this unified approach |

---

## Open Questions

1. Should the optimization skill live in the repo (bundled) or Skills Hub (optional install)?
   - Recommendation: Core orchestration in repo, optimization engines as optional dependencies

2. How do we build evaluation datasets for skills that don't have much usage history?
   - Option A: LLM-generated synthetic test cases
   - Option B: Manual curation by skill authors
   - Option C: Community-contributed eval sets

3. Should evolved skills be versioned separately from the main repo?
   - Recommendation: Git branches per evolution run, merge winning variants to main

4. What's the minimum viable first target?
   - Recommendation: Pick 2-3 well-used skills with clear success metrics (e.g., arxiv paper search, github-code-review, systematic-debugging)