---
name: eval-creator
description: "[Beta] Creates permanent eval cases from promoted learnings and runs regression checks against them. Turns failures into test cases that prevent silent regression. This is the outer loop's regress-test step. Use when a learning is promoted and has a clear pass/fail condition, or on cadence to verify promoted rules still hold."
user-invocable: true
argument-hint: "[create | run | list] [--pattern-key KEY]"
---

# Eval Creator

Turns promoted learnings into permanent eval cases. Runs regression checks to verify promoted rules hold. This is the outer loop's **regress-test** step.

The blog says: "If a failure taught you something important, it should become a permanent test case. Otherwise the knowledge is still fragile."

## When to Use

- **After harness-updater promotes a pattern** — create an eval for it
- **On cadence** — run all evals to check for regression
- **Before major releases** — verify the harness is holding
- **When a promoted rule seems to have stopped working** — diagnose with targeted eval run

## Eval Directory Structure

```
.evals/
  EVAL_INDEX.md          # Index of all eval cases with status
  cases/
    eval-YYYYMMDD-001.md # Individual eval case
    eval-YYYYMMDD-002.md
    ...
```

## Creating an Eval Case

### Input

From harness-updater or manually:
- Pattern-Key of the promoted learning
- The rule that was added to CLAUDE.md / AGENTS.md
- What to test (the assertion)
- Verification method

### Eval Case Format

```markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: [from learning]
source: [LRN-YYYYMMDD-001, ERR-YYYYMMDD-003]
promoted-rule: "[the rule text in CLAUDE.md]"
promoted-to: CLAUDE.md
created: YYYY-MM-DD
last-run: YYYY-MM-DD
last-result: pass | fail | skip
---

## What This Tests

[One sentence: what failure this eval prevents from recurring]

## Precondition

[What must be true for this eval to be runnable]
- File X exists
- Project uses framework Y
- etc.

## Verification Method

[One of: grep-check, command-check, file-check, rule-check]

### grep-check
Search for a pattern that should (or should not) exist:
```
target: src/**/*.ts
pattern: "hardcoded-secret-pattern"
expect: not_found
```

### command-check
Run a command and check the exit code or output:
```
command: npm run typecheck
expect_exit: 0
```

### file-check
Verify a file or section exists:
```
target: CLAUDE.md
section: "## Verification"
expect: exists
```

### rule-check
Verify a rule exists in an instruction file:
```
target: CLAUDE.md
contains: "[the promoted rule text or key phrase]"
expect: found
```

## Expected Result

**Pass:** [What "good" looks like]
**Fail:** [What regression looks like]

## Recovery Action

If this eval fails:
1. [Specific step to diagnose]
2. [Specific step to fix]
3. Re-run this eval to verify
```

## Running Evals

### Run All
Read `.evals/EVAL_INDEX.md`, iterate through all cases, execute each verification method.

### Run by Pattern-Key
Filter to evals matching a specific pattern.

### Run by Area
Filter to evals whose source files match an area (frontend, backend, etc.).

### Execution

For each eval case:

1. **Check precondition** — if not met, mark as `skip`
2. **Execute verification method:**
   - `grep-check`: Use Grep tool to search target files for the pattern
   - `command-check`: Run the command via Bash, check exit code and/or output
   - `file-check`: Use Read/Glob to verify file/section existence
   - `rule-check`: Read the target file, search for the expected content
   - `skill-check`: Run `quick_validate.py` on a skill directory (see Skill Validation below)
   - `script-check`: Run a custom mcp-script by name (see Custom Verification Methods)
3. **Compare result** to expected
4. **Update `last-run` and `last-result`** in the eval case file
5. **Update `EVAL_INDEX.md`** with the result

### Regression Report

```markdown
## Eval Run: YYYY-MM-DD

**Total:** N evals
**Passed:** N
**Failed:** N
**Skipped:** N

### Failures

#### eval-YYYYMMDD-001 — [pattern-key]
- **What regressed:** [description]
- **Expected:** [X]
- **Got:** [Y]
- **Recovery action:** [from eval case]

### Summary
[All green / N regressions need attention]
```

## Eval Index Format

`.evals/EVAL_INDEX.md`:

```markdown
# Eval Index

| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |
|----|-------------|-------------|----------|--------|---------|
| eval-YYYYMMDD-001 | auth-middleware-lock | Run migrations on test DB first | YYYY-MM-DD | pass | YYYY-MM-DD |
| eval-YYYYMMDD-002 | pnpm-not-npm | Use pnpm in this repo | YYYY-MM-DD | fail | YYYY-MM-DD |
```

## Integration

### Upstream
- **harness-updater** flags eval candidates after promoting a pattern
- **learning-aggregator** identifies patterns with clear pass/fail conditions

### Downstream
- Regression failures feed back into **self-improvement** as new error entries
- Persistent failures may indicate the promoted rule needs refinement → feed back to **harness-updater**

### Scheduled Use
For projects with a CI pipeline, eval-creator can run as a scheduled check:
- Weekly: run all evals
- Per-PR: run evals related to changed files
- Post-promotion: run the newly created eval immediately

## Custom Verification Methods (mcp-scripts) — Extension Point

> These are examples of what's possible with gh-aw mcp-scripts, not features shipped with this plugin. Projects define their own scripts.

Beyond the four built-in methods (grep-check, command-check, file-check, rule-check), projects can define custom verification tools as mcp-scripts for complex assertions that the built-ins can't express.

Example — an eval that verifies a promoted auth rule is enforced:

```yaml
# In gh-aw workflow config
mcp-scripts:
  check-auth-middleware:
    lang: javascript
    description: "Verify all /admin routes have auth middleware"
    run: |
      const routes = require('./src/routes/admin');
      const unprotected = routes.filter(r => !r.auth);
      if (unprotected.length) {
        console.error('Unprotected admin routes:', unprotected.map(r => r.path));
        process.exit(1);
      }
```

Reference the script in an eval case as `verification_method: script-check` with the mcp-script name. This is an extension point — the built-in methods cover most cases, but mcp-scripts handle project-specific behavioral assertions.

## Persistence

Eval cases live in `.evals/` in the working directory. This interactive skill does not integrate with external memory backends. For CI-side durable storage that survives across workflow runs, see `eval-creator-ci`, which can optionally back its run history and created-pattern ledger with gh-aw's `repo-memory`.

## Skill Validation (skill-check)

The Anthropic `/skill-creator` skill includes two validation systems that eval-creator can use:

### Structural validation via `quick_validate.py`

The `skill-check` verification method runs the skill-creator's `quick_validate.py` script on a skill directory. It checks:

- SKILL.md exists with valid YAML frontmatter
- Only allowed frontmatter keys (`name`, `description`, `license`, `allowed-tools`, `metadata`, `compatibility`)
- Name is kebab-case, max 64 chars, no leading/trailing/consecutive hyphens
- Description has no angle brackets, max 1024 chars
- Compatibility field max 500 chars if present

Eval case example:

```markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: skill-quality.verify-gate
verification_method: skill-check
target: skills/verify-gate
expect: valid
---

## What This Tests
Verify that the verify-gate skill passes structural validation after harness updates.
```

Execution: `python .claude/skills/skill-creator/scripts/quick_validate.py <target>`. Exit 0 = pass, exit 1 = fail.

### Behavioral validation via `run_eval.py`

For deeper validation, the skill-creator's `run_eval.py` tests whether a skill's description causes Claude to invoke it for given queries. This is useful when harness-updater modifies a skill's description or the outer loop creates a new skill — the eval verifies the skill still triggers correctly.

This requires Claude CLI access and is expensive. Use it for high-value skills only, not as a routine CI check.

### When to create skill-check evals

Two scenarios connect the outer loop to skill validation:

1. **Harness-updater modifies a skill**: When a promoted rule is inserted into a SKILL.md (rather than CLAUDE.md), create a `skill-check` eval to verify the skill remains structurally valid after the edit.

2. **Self-improvement identifies a skill gap**: When learning-aggregator classifies a pattern as `skill_gap` and recommends "create a new skill", the new skill should pass `quick_validate.py` before being committed. Create a `skill-check` eval for it that persists as a regression test.

This closes the loop: failure → learning → new/updated skill → eval verifies skill quality → regression prevents quality drift.

## What This Skill Does NOT Do

- Does not fix regressions (reports them for the agent or human to fix)
- Does not promote learnings (that's harness-updater)
- Does not analyze patterns (that's learning-aggregator)
- Does not replace project test suites — evals test the harness, not the code