---
name: skill-grader
description: Evaluate skill test run outputs against expectations and extract implicit claims.
allowed-tools: Read Grep Glob Write
---

# Skill Grader

Evaluate skill test run outputs against expectations and extract implicit claims.

## Input Schema

```yaml
expectations:        # List of verifiable statements
  - "Output includes X"
  - "Skill used script Y"
transcript_path:     # Path to execution transcript
outputs_dir:         # Directory containing output files
eval_prompt:         # Original task prompt
```

## Grading Process

### Step 1: Read Context

```
TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
  CONTENT[FILE] = Read(FILE)
```

While reading, note the eval_prompt, the execution steps taken, any errors, and the final result.

### Step 2: Grade Expectations

For each EXPECTATION in expectations:

```
EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
  verdict = PASS
Else:
  verdict = FAIL
```

**PASS criteria:**
- Clear evidence in transcript or outputs
- Evidence reflects genuine task completion, not surface compliance
- A correct filename with wrong content is FAIL, not PASS

**FAIL criteria:**
- No evidence found
- Evidence contradicts expectation
- Evidence is superficial (right format, wrong substance)
- Cannot be verified from available information

When uncertain, mark FAIL: the evidence must establish that the expectation was met.
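The default-to-FAIL rule can be sketched in Python. This is a minimal illustration, not part of the skill: `to_result` and the `finding` labels ("confirmed", "contradicted", "superficial", "unverifiable") are hypothetical names for the grader's judgment, and the returned dict follows the `expectations[]` entry shape from the Output Schema.

```python
def to_result(expectation, finding, evidence=""):
    """Map a grader's finding to an expectations[] entry.

    `finding` is a hypothetical label: "confirmed", "contradicted",
    "superficial", or "unverifiable".
    """
    # Only a genuine confirmation passes; every other finding, including an
    # unrecognized one, falls through to FAIL.
    passed = finding == "confirmed"
    return {
        "text": expectation,
        "passed": passed,
        "evidence": evidence or "No evidence found",
    }
```

Because the check is an equality test against a single passing label, an ambiguous or missing judgment can never produce a PASS by accident.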

### Step 3: Extract Claims

Beyond predefined expectations, find implicit claims:

```
For each CLAIM in (TRANSCRIPT + CONTENT):
  CLAIM.type = "factual" | "process" | "quality"
  CLAIM.verified = verify(CLAIM, available_evidence)
  CLAIM.evidence = supporting_or_contradicting_text
```

- Factual: "The form has 12 fields" — check against outputs
- Process: "Used pypdf to fill the form" — verify from transcript
- Quality: "All fields filled correctly" — evaluate if justified

Flag unverifiable claims.

### Step 4: Critique the Evals

After grading, assess whether the evals themselves could be improved.
Only surface suggestions when there's a clear gap:

- Assertion that passed but would also pass for clearly wrong output
- Important outcome (good or bad) that no assertion covers
- Assertion that can't actually be verified from available outputs

Keep the bar high. Flag only things the eval author would say "good catch" about.

### Step 5: Write Results

```
Write(outputs_dir + "/../grading.json", RESULTS)
```
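The `summary` block can be derived mechanically from the per-expectation verdicts before writing. A minimal Python sketch, assuming each graded expectation is a dict with a boolean `passed` key. `build_summary` and `write_grading` are illustrative names, and rounding to two decimals is an assumption based on the `0.67` shown in the sample output.

```python
import json
from pathlib import Path


def build_summary(expectations):
    """Derive the summary counts from graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    # Avoid division by zero when no expectations were supplied.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": pass_rate,
    }


def write_grading(outputs_dir, results):
    """Write results to the parent of outputs_dir, i.e. outputs_dir/../grading.json."""
    target = Path(outputs_dir).resolve().parent / "grading.json"
    target.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return target
```

Computing the counts from the verdict list, rather than tracking them by hand, keeps `passed + failed == total` true by construction.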

## Output Schema

```json
{
  "expectations": [
    {
      "text": "The output includes X",
      "passed": true,
      "evidence": "Found in transcript Step 3: '...'"
    }
  ],
  "summary": {
    "passed": 2,
    "failed": 1,
    "total": 3,
    "pass_rate": 0.67
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in output"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes name",
        "reason": "A hallucinated doc mentioning the name would also pass"
      }
    ],
    "overall": "One assertion is too weak to rule out hallucinated output."
  }
}
```

**Field requirements:**
- expectations[].text, .passed, .evidence — all required (viewer depends on exact names)
- summary.pass_rate — float 0.0 to 1.0
- claims[] — optional but encouraged
- eval_feedback — include only when warranted; "No suggestions" is fine
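These requirements can be checked before the file is handed to the viewer. A hedged Python sketch: `validate_grading` is a hypothetical helper, not something the skill or the viewer provides, and it checks only the two rules stated as required above (`claims` and `eval_feedback` are optional, so they are not inspected).

```python
REQUIRED_EXPECTATION_KEYS = {"text", "passed", "evidence"}


def validate_grading(results):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for i, exp in enumerate(results.get("expectations", [])):
        missing = REQUIRED_EXPECTATION_KEYS - exp.keys()
        if missing:
            problems.append(f"expectations[{i}] missing: {sorted(missing)}")
    rate = results.get("summary", {}).get("pass_rate")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(rate, bool)
        or not isinstance(rate, (int, float))
        or not 0.0 <= rate <= 1.0
    ):
        problems.append("summary.pass_rate must be a number between 0.0 and 1.0")
    return problems
```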

## Error Handling

| Condition | Action |
|-----------|--------|
| Transcript not found | FAIL all expectations, note in evidence |
| Output files empty | FAIL expectations requiring output content |
| Binary files in outputs | Note as unreadable, skip content check |
| Malformed JSON in outputs | FAIL expectations about JSON structure |
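
The output-file conditions in the table can be detected up front. A minimal Python sketch under stated assumptions: `classify_output` is an illustrative name, "binary" is approximated as "not valid UTF-8", and JSON parsing is attempted only for files with a `.json` extension. Neither heuristic is specified by the skill itself.

```python
import json
from pathlib import Path


def classify_output(path):
    """Classify one output file as 'empty', 'binary', 'malformed_json', or 'readable'."""
    data = Path(path).read_bytes()
    if not data.strip():
        return "empty"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Heuristic: anything that is not valid UTF-8 is treated as binary.
        return "binary"
    if Path(path).suffix.lower() == ".json":
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "malformed_json"
    return "readable"
```

The returned label maps onto the Action column: `empty` and `malformed_json` lead to FAIL for the affected expectations, while `binary` is noted as unreadable and its content check is skipped.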