---
name: skill-eval
description: Test and iterate on a skill using parallel eval runs. Spawns with-skill and baseline runs, grades results, and helps improve the skill based on feedback. Use when testing skill quality or iterating on skill improvements.
allowed-tools: Read Write Edit Bash Glob Grep AskUserQuestion Agent
---

# Skill Eval

Test a skill with parallel eval runs and iterate based on results.

## Input

```
SKILL_PATH = $ARGUMENTS
```

If empty: `AskUserQuestion("Which skill do you want to test? Provide the path.")`

## Workflow

### Step 1: Validate Skill Exists

```
If not exists(SKILL_PATH/SKILL.md):
  ERROR "No SKILL.md found at SKILL_PATH"

SKILL_CONTENT = Read(SKILL_PATH/SKILL.md)
SKILL_NAME = extract name from frontmatter
SKILL_DESCRIPTION = extract description from frontmatter
```

### Step 2: Load or Create Evals

```
EVALS_PATH = SKILL_PATH/evals/evals.json

If exists(EVALS_PATH):
  EVALS = Read(EVALS_PATH)
  Present evals to user, ask: "Use these evals, modify, or create new ones?"
Else:
  Generate 2-3 realistic test prompts based on SKILL_CONTENT
  Each prompt should be what a real user would actually say
  Include enough detail: file paths, context, specific requests
  Present to user: "Here are test cases I'd like to try. Look right, or want changes?"
```

Wait for user confirmation before proceeding.

**Eval format:**

```json
{
  "skill_name": "SKILL_NAME",
  "evals": [
    {
      "id": 1,
      "prompt": "User's realistic task prompt",
      "expected_output": "Description of expected result",
      "expectations": [
        "The output includes X",
        "The skill used approach Y"
      ]
    }
  ]
}
```

Save confirmed evals to EVALS_PATH.

### Step 3: Set Up Workspace

```
WORKSPACE = SKILL_PATH-workspace
ITERATION = find highest iteration-N in WORKSPACE + 1, or 1 if none
ITER_DIR = WORKSPACE/iteration-ITERATION

mkdir -p ITER_DIR
```

### Step 4: Spawn All Runs

For each EVAL in EVALS, spawn TWO subagents in the SAME turn:

**With-skill run:**

```
Agent(prompt: """
Execute this task:
- Read the skill at: SKILL_PATH/SKILL.md
- Follow the skill's instructions to complete this task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/with_skill/outputs/
""")
```

**Baseline run (no skill):**

```
Agent(prompt: """
Execute this task (no special instructions):
- Task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/without_skill/outputs/
""")
```

Launch ALL runs (with-skill + baseline for every eval) in parallel.

### Step 5: Draft Assertions While Runs Execute

While runs are in progress, draft quantitative assertions for each eval:

- Assertions must be objectively verifiable
- Give each a descriptive name
- Skip assertions for subjective outcomes (writing style, design quality)

Write `eval_metadata.json` for each eval directory:

```json
{
  "eval_id": 1,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": ["Output includes X", "File is valid JSON"]
}
```

Explain assertions to user while waiting.

### Step 6: Grade Results

Once all runs complete:

For each run (with_skill and without_skill):

```
Agent(prompt: """
Read agents/skill-grader.md and follow its instructions.
- expectations: EVAL.expectations
- transcript_path: ITER_DIR/eval-EVAL.id/CONFIG/transcript.md
- outputs_dir: ITER_DIR/eval-EVAL.id/CONFIG/outputs/
- eval_prompt: EVAL.prompt
""", subagent_type: "majestic-tools:agents:skill-grader")
```

### Step 7: Present Results

For each eval, show:
- Prompt
- With-skill pass rate vs baseline pass rate
- Key differences in outputs
- Any eval feedback from grader

Summary table:

```
| Eval | With Skill | Baseline | Delta |
|------|-----------|----------|-------|
| 1    | 85%       | 35%      | +50%  |
```

Ask user: "How do these look? Any specific feedback on the outputs?"

### Step 8: Iterate (if needed)

```
If user has feedback:
  1. Analyze feedback — generalize, don't overfit to specific cases
  2. Improve the skill (Edit SKILL.md)
  3. Explain changes made and why
  4. Ask: "Want to rerun the evals with the updated skill?"
  5. If yes: goto Step 3 (new iteration)

If user is satisfied:
  "Skill looks good!"
```

## Improvement Guidelines

When rewriting skills based on feedback:

- **Generalize** — don't overfit to the 2-3 test cases; the skill will be used many times
- **Keep it lean** — remove instructions that aren't pulling their weight
- **Explain the why** — tell the model why things matter instead of rigid MUSTs
- **Look for repeated work** — if all runs wrote similar helper scripts, bundle the script in the skill
- **Read transcripts** — check if the skill wastes time on unproductive steps

## Error Handling

| Condition | Action |
|-----------|--------|
| Subagent run times out | Note timeout, grade available outputs |
| No outputs produced | FAIL all expectations for that run |
| Skill has syntax errors | Fix before running evals |
| User wants to stop iterating | Accept and summarize current state |