---
name: evaluating-skills-with-models
description: Evaluate skills by executing them across sonnet, opus, and haiku models using sub-agents. Use when testing if a skill works correctly, comparing model performance, or finding the cheapest compatible model. Returns numeric scores (0-100) to differentiate model capabilities.
license: Apache-2.0
metadata:
  author: Softgraphy GK
  version: "0.2.1"
---

# Evaluating Skills with Models

Evaluate skills across multiple Claude models using sub-agents with quality-based scoring.

> **Requirement:** Claude Code CLI only. Not available in Claude.ai.

## Why Quality-Based Scoring

Binary pass/fail ("did it do X?") fails to differentiate models: all models can "do the steps." The difference is **how well** they do them. This skill uses weighted scoring to reveal capability differences.

## Workflow

### Step 1: Load Test Scenarios

Check for `tests/scenarios.md` in the target skill directory.

**Default to difficult scenarios:** When multiple scenarios exist, prioritize **Hard** or **Medium** difficulty scenarios for evaluation. Easy scenarios often don't show meaningful differences between models and aren't realistic for production use.

**Required scenario format:**

```markdown
## Scenario: [Name]

**Difficulty:** Easy | Medium | Hard | Edge-case
**Query:** User request that triggers this skill

**Expected behaviors:**
1. [Action description]
   - **Minimum:** What counts as "did it"
   - **Quality criteria:** What "did it well" looks like
   - **Haiku pitfall:** Common failure mode
   - **Weight:** 1-5

**Output validation:** (optional)
- Pattern: `regex`
- Line count: `< N`
```

**If scenarios.md is missing or uses an old format:** Ask the user to update it following [references/evaluation-structure.md](references/evaluation-structure.md).

### Step 2: Execute with Sub-Agents (Phase 1)

Spawn Task sub-agents for each model in parallel.

**Prompt template:**

```
Execute the skill at {skill_path} with this query:

{evaluation_query}

IMPORTANT:
- Actually execute the skill, don't just describe what you would do.
- Create the output directory under Claude Code's working directory ($PWD):
  $PWD/.ai_text/{yyyyMMdd}/tmp/{skill_name}-{model}-{hhmmss}/
  (Example: If $PWD=/path/to/project, create /path/to/project/.ai_text/20250101/tmp/formatting-tables-haiku-143052/)
- Create all output files under that directory.
- If the skill asks questions, record the exact questions, then assume reasonable answers and proceed.

Return ONLY (keep it brief to minimize tokens):
- Questions skill asked: [list exact questions the skill asked you, or "none"]
- Assumed answers: [your assumed answers to those questions, or "n/a"]
- Key decisions: [1-2 sentences on freedom level, structure choices]
- Files created: [paths only, no content]
- Errors: [any errors, or "none"]

Do NOT include file contents or detailed explanations.
```

Use the Task tool with the `model` parameter: `haiku`, `sonnet`, or `opus`.

**After sub-agents complete:** Read the created files directly using Glob + Read to evaluate file quality (naming, structure, content). The minimal report provides process info (questions, decisions) that can't be inferred from the files.

### Step 3: Score Each Behavior

For each expected behavior, score 0-100:

| Score | Meaning |
|-------|---------|
| 0 | Not attempted or completely wrong |
| 25 | Attempted but below minimum |
| 50 | Meets minimum criteria |
| 75 | Meets most quality criteria |
| 100 | Meets all quality criteria |

**Scoring checklist per behavior:**

1. Did it meet the **Minimum**? (No → score ≤ 25)
2. How many **Quality criteria** were met? (Calculate the proportion)
3. Did it hit the **Haiku pitfall**? (Deduct points)
4. Apply the **Weight** in the final calculation

### Step 4: Calculate Weighted Scores

```
Behavior Score = base_score  // after applying deductions (e.g., Haiku pitfalls)
Total = Σ(behavior_score × weight) / Σ(weights)
```
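To make the arithmetic concrete, here is a minimal Python sketch of the weighted-total calculation. The `Behavior` dataclass and the `weighted_total` helper are illustrative names, not part of this skill.

```python
from dataclasses import dataclass

@dataclass
class Behavior:
    name: str
    weight: int  # 1-5, taken from the scenario's Expected behaviors
    score: int   # 0-100, after any Haiku-pitfall deductions

def weighted_total(behaviors: list[Behavior]) -> int:
    """Total = Σ(behavior_score × weight) / Σ(weights), rounded to an integer."""
    total_weight = sum(b.weight for b in behaviors)
    weighted_sum = sum(b.score * b.weight for b in behaviors)
    return round(weighted_sum / total_weight)

# Worked example: the Haiku column from the sample output in Step 6.
haiku = [
    Behavior("Asks clarifying questions", weight=4, score=25),
    Behavior("Determines freedom level", weight=3, score=50),
    Behavior("Creates proper SKILL.md", weight=5, score=50),
]
print(weighted_total(haiku))  # 500 / 12 ≈ 42 → ⚠️ Marginal
```

The same calculation yields 85 for the Sonnet column and 100 for Opus, matching the sample tables in Steps 5 and 6.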
**Rating thresholds:**

| Score | Rating | Meaning |
|-------|--------|---------|
| 90-100 | ✅ Excellent | Production ready |
| 75-89 | ✅ Good | Acceptable |
| 50-74 | ⚠️ Partial | Quality issues |
| 25-49 | ⚠️ Marginal | Significant problems |
| 0-24 | ❌ Fail | Does not work |

### Step 5: Add Results to README

After evaluation, add a table to the skill's README documenting the results:

**README section format:**

```markdown
## Evaluation Results

| Date | Scenario | Difficulty | Model | Score | Rating |
|------|----------|------------|-------|-------|--------|
| 2025-01-15 | Standard workflow | Hard | claude-haiku-4-5-20250101 | 42 | ⚠️ Marginal |
| 2025-01-15 | Standard workflow | Hard | claude-sonnet-4-5-20250929 | 85 | ✅ Good |
| 2025-01-15 | Standard workflow | Hard | claude-opus-4-5-20251101 | 100 | ✅ Excellent |
```

**Table requirements:**

- Include full model IDs (e.g., `claude-sonnet-4-5-20250929`), not just short names
- Show the evaluation date in YYYY-MM-DD format
- Indicate the scenario difficulty level (Easy/Medium/Hard/Edge-case)
- Include both the numeric score and the rating emoji
- Append new evaluations (don't overwrite previous results)

This creates a historical record of how the skill performs across models and improvements over time.

### Step 6: Output Summary

```markdown
## Model Evaluation Results

**Skill:** {skill_path}
**Scenario:** {scenario_name} ({difficulty})
**Date:** {YYYY-MM-DD}

### Scores by Behavior

| Behavior | Weight | claude-haiku-4-5-20250101 | claude-sonnet-4-5-20250929 | claude-opus-4-5-20251101 |
|----------|--------|---------------------------|----------------------------|--------------------------|
| Asks clarifying questions | 4 | 25 | 75 | 100 |
| Determines freedom level | 3 | 50 | 75 | 100 |
| Creates proper SKILL.md | 5 | 50 | 100 | 100 |

### Total Scores

| Model | Score | Rating |
|-------|-------|--------|
| claude-haiku-4-5-20250101 | 42 | ⚠️ Marginal |
| claude-sonnet-4-5-20250929 | 85 | ✅ Good |
| claude-opus-4-5-20251101 | 100 | ✅ Excellent |

### Observations

- Haiku: Skipped justification for freedom level (pitfall)
- Haiku: Asked only 1 generic question vs 3 specific
- Sonnet: Met all quality criteria except verbose output

### Next Steps

- Add these results to the skill's README (see Step 5)
- Consider model selection based on your quality requirements and budget
```

## Common Pitfalls by Model

| Model | Pitfall | Detection |
|-------|---------|-----------|
| haiku | Shallow questions | Count specificity |
| haiku | Skips justification | Check reasoning present |
| haiku | Misses references | Check files read |
| sonnet | Over-engineering | Check scope creep |
| sonnet | Verbose reporting | High token count vs output |
| opus | Over-verbose output | Token count |

> **Note:** Token usage includes both skill execution AND reporting overhead. Sonnet tends to produce detailed reports, which inflates token count. Compare tool uses for execution efficiency.

## Quick Reference

```
Load scenarios (prioritize Hard) → Execute (parallel) → Score behaviors → Calculate totals → Add to README → Output summary
```

For detailed scoring guidelines, see [references/evaluation-structure.md](references/evaluation-structure.md).
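If you script the README update from Step 5, a minimal Python sketch like the one below maps a weighted total to its rating label and formats one results-table row. The `rating_for` and `readme_row` helpers are hypothetical, not part of this skill.

```python
from datetime import date

# Thresholds from Step 4's rating table: (lower bound, label).
RATINGS = [
    (90, "✅ Excellent"),
    (75, "✅ Good"),
    (50, "⚠️ Partial"),
    (25, "⚠️ Marginal"),
    (0, "❌ Fail"),
]

def rating_for(score: int) -> str:
    """Map a 0-100 weighted total to its rating label."""
    for lower_bound, label in RATINGS:
        if score >= lower_bound:
            return label
    return "❌ Fail"

def readme_row(scenario: str, difficulty: str, model_id: str, score: int) -> str:
    """Format one row for the README 'Evaluation Results' table (Step 5)."""
    return (f"| {date.today().isoformat()} | {scenario} | {difficulty} "
            f"| {model_id} | {score} | {rating_for(score)} |")

print(readme_row("Standard workflow", "Hard", "claude-sonnet-4-5-20250929", 85))
# e.g. | 2025-01-15 | Standard workflow | Hard | claude-sonnet-4-5-20250929 | 85 | ✅ Good |
```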