---
name: test-runner
description: Runs systematic tests on Claude Code customizations. Executes sample queries, validates responses, generates test reports, and identifies edge cases for agents, commands, skills, and hooks.
allowed-tools: [Read, Write, Glob, Grep, Bash, Skill]
# model: inherit
---

## Reference Files

This skill uses reference materials:

- [examples.md](examples.md) - Concrete test case examples for different customization types
- [common-failures.md](common-failures.md) - Catalog of common failure patterns

## Focus Areas

- **Sample Query Generation** - Creating realistic test queries based on descriptions
- **Expected Behavior Validation** - Verifying outputs match specifications
- **Regression Testing** - Ensuring changes don't break existing functionality
- **Edge Case Identification** - Finding unusual scenarios and boundary conditions
- **Integration Testing** - Validating customizations work together
- **Performance Assessment** - Analyzing context usage and efficiency

## Test Framework

### Test Types

#### Functional Tests

**Purpose**: Verify core functionality works as specified

**Process**:
1. Generate test queries from the description/documentation
2. Execute the customization with test input
3. Compare actual output to expected behavior
4. Record PASS/FAIL for each test case

#### Integration Tests

**Purpose**: Ensure customizations work together

**Process**:
1. Test hook interactions (PreToolUse, PostToolUse chains)
2. Verify skills can invoke sub-agents
3. Check commands delegate correctly
4. Validate settings.json configuration
5. Test tool permission boundaries

#### Usability Tests

**Purpose**: Assess user experience quality

**Process**:
1. Evaluate error messages (are they helpful?)
2. Check documentation completeness
3. Test edge cases (what breaks it?)
4. Assess output clarity
5. Verify examples work as shown

### Test Execution Strategy

#### For Skills

1. **Discovery Test**: Generate queries that should trigger the skill
2. **Invocation Test**: Actually invoke the skill with a sample query
3. **Output Test**: Verify the skill produces expected results
4. **Tool Test**: Confirm only allowed tools are used
5. **Reference Test**: Check that references load correctly

#### For Agents

1. **Frontmatter Test**: Validate YAML structure
2. **Invocation Test**: Invoke the agent with a test prompt
3. **Tool Test**: Verify the agent uses appropriate tools
4. **Output Test**: Check output format and quality
5. **Context Test**: Measure context usage

#### For Commands

1. **Delegation Test**: Verify the command invokes the correct agent/skill
2. **Usage Test**: Test with valid and invalid arguments
3. **Documentation Test**: Verify usage instructions are accurate
4. **Output Test**: Check output format and clarity

#### For Hooks

1. **Input Test**: Verify JSON stdin handling
2. **Exit Code Test**: Confirm 0 (allow) and 2 (block) work correctly
3. **Error Handling Test**: Verify graceful degradation
4. **Performance Test**: Check execution speed
5. **Integration Test**: Test hook chain behavior
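As a concrete illustration of the Input, Exit Code, and Performance tests above, here is a minimal harness sketch in Python. The hook path, case list, and payload shape are illustrative assumptions (real hook input includes additional fields such as `session_id`); only the exit-code convention (0 = allow, 2 = block) comes from the hook contract.

```python
#!/usr/bin/env python3
"""Sketch of an exit-code test harness for a PreToolUse hook."""
import json
import subprocess

HOOK = "hooks/validate-bash.py"  # hypothetical hook under test

CASES = [
    # (description, simplified stdin payload, expected exit code)
    ("safe command passes", {"tool_name": "Bash", "tool_input": {"command": "ls"}}, 0),
    ("destructive command blocks", {"tool_name": "Bash", "tool_input": {"command": "rm -rf /"}}, 2),
]

for name, payload, expected in CASES:
    result = subprocess.run(
        ["python3", HOOK],
        input=json.dumps(payload),  # hooks read the tool-call JSON from stdin
        capture_output=True,
        text=True,
        timeout=10,  # doubles as a crude Performance Test bound
    )
    status = "PASS" if result.returncode == expected else "FAIL"
    print(f"{status}: {name} (expected {expected}, got {result.returncode})")
```

Feeding a malformed payload (for example, truncated JSON) through the same harness covers the Error Handling test: a robust hook should fail predictably or exit cleanly rather than crash.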
## Test Process

### Step 1: Identify Customization Type

Determine what to test:
- Agent (in agents/)
- Command (in commands/)
- Skill (in skills/)
- Hook (in hooks/)

### Step 2: Read Documentation

Use the Read tool to examine:
- Primary file content
- Frontmatter/configuration
- Usage instructions
- Examples (if provided)

### Step 3: Generate Test Cases

Based on the description and documentation:

**For Skills**:
- Extract trigger phrases from the description
- Create 5-10 sample queries that should trigger
- Create 3-5 queries that should NOT trigger
- Identify edge cases from the description

**For Agents**:
- Create prompts based on focus areas
- Generate scenarios the agent should handle
- Identify scenarios outside the agent's scope

**For Commands**:
- Test with documented arguments
- Test with no arguments
- Test with invalid arguments

**For Hooks**:
- Create sample tool inputs that should pass
- Create inputs that should block
- Create malformed inputs to test error handling

### Step 4: Execute Tests

**Read-Only Testing** (default):
- Analyze whether the customization would work
- Check configurations and settings
- Verify documentation accuracy
- Assess expected behavior

**Active Testing** (when appropriate):
- Actually invoke skills with sample queries
- Run commands with test arguments
- Trigger hooks with test inputs
- Record actual outputs

### Step 5: Compare Results

For each test:
- **Expected**: What should happen (from docs/description)
- **Actual**: What did happen (from testing)
- **Status**: PASS (matched) / FAIL (didn't match) / EDGE CASE (unexpected)

### Step 6: Generate Test Report

Create a structured report following the output format below.

## Output Format

```markdown
# Test Report: {name}

**Type**: {agent|command|skill|hook}
**File**: {path}
**Tested**: {YYYY-MM-DD HH:MM}
**Test Mode**: {read-only|active}

## Summary

{1-2 sentence overview of what was tested and overall results}

## Test Results

**Total Tests**: {count}
**Passed**: {count} ({percentage}%)
**Failed**: {count} ({percentage}%)
**Edge Cases**: {count}

## Functional Tests

### Test 1: {test name}
- **Input**: {test input/query}
- **Expected**: {expected behavior}
- **Actual**: {actual behavior}
- **Status**: PASS | FAIL | EDGE CASE
- **Notes**: {observations}

### Test 2: {test name}
...

## Integration Tests

{If applicable - tests with other customizations}

### Integration 1: {integration name}
- **Components**: {what was tested together}
- **Expected**: {expected interaction}
- **Actual**: {actual interaction}
- **Status**: PASS | FAIL
- **Notes**: {observations}

## Usability Assessment

- **Documentation**: CLEAR | UNCLEAR | MISSING
- **Error Messages**: HELPFUL | CONFUSING | ABSENT
- **Examples**: WORKING | BROKEN | MISSING
- **Overall UX**: EXCELLENT | GOOD | NEEDS WORK | POOR

## Edge Cases Discovered

{Unusual scenarios or boundary conditions found during testing}

1. {edge case 1}
   - **Impact**: {severity}
   - **Recommendation**: {how to handle}
2. {edge case 2}
...

## Performance Metrics

{If applicable}
- **Context Usage**: ~{token estimate}
- **Execution Time**: {estimate or N/A}
- **Resource Impact**: LOW | MEDIUM | HIGH

## Failures and Issues

{Detailed analysis of failed tests}

### Failure 1: {test name}
- **Why It Failed**: {root cause}
- **Expected vs Actual**: {specific diff}
- **Fix Recommendation**: {how to resolve}

### Failure 2: {test name}
...

## Recommendations

### Priority 1 (Critical)
{Must-fix issues that prevent functionality}
1. {critical fix 1}
2. {critical fix 2}

### Priority 2 (Important)
{Should-fix issues that impact usability}
1. {important fix 1}
2. {important fix 2}

### Priority 3 (Nice-to-Have)
{Could-fix issues that polish the experience}
1. {enhancement 1}
2. {enhancement 2}

## Next Steps

{Specific actions to address failures and improve tests}
```
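The counts and percentages in the Test Results header can be derived mechanically from the per-test records. A minimal sketch, assuming each result is a dict with a `status` field (the record shape is an assumption, not part of the report format):

```python
from collections import Counter

def _pct(n: int, total: int) -> int:
    """Percentage as a whole number; 0 when there are no tests."""
    return round(100 * n / total) if total else 0

def summarize(results: list[dict]) -> dict:
    """Derive the Test Results counts for the report header."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    return {
        "total": total,
        "passed": counts["PASS"],
        "passed_pct": _pct(counts["PASS"], total),
        "failed": counts["FAIL"],
        "failed_pct": _pct(counts["FAIL"], total),
        "edge_cases": counts["EDGE CASE"],
    }

# 2 of 3 tests passed -> total 3, passed 2 (67%), failed 1 (33%)
print(summarize([{"status": "PASS"}, {"status": "PASS"}, {"status": "FAIL"}]))
```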
## Test Case Examples

For detailed test case examples for skills, agents, commands, and hooks, see [examples.md](examples.md).

## Best Practices for Testing

1. **Test Both Paths**: Success cases AND failure cases
2. **Edge Cases Matter**: Test boundaries and unusual inputs
3. **Clear Expected Behavior**: Document what should happen
4. **Realistic Queries**: Use natural language users would actually type
5. **Integration Testing**: Don't just test in isolation
6. **Performance Aware**: Note if tests are slow or heavy
7. **Regression Testing**: Re-test after changes
8. **Document Failures**: Explain why tests failed, not just that they did
9. **Actionable Recommendations**: Provide specific fixes
10. **Version Testing**: Note which version was tested

## Common Test Failures

For a catalog of common failure patterns by customization type, see [common-failures.md](common-failures.md).

## Tools Used

This skill uses these tools for testing:

- **Read** - Examine customization files
- **Write** - Generate and save test reports to files
- **Grep** - Search for patterns and configurations
- **Glob** - Find files and customizations
- **Bash** - Execute read-only commands for analysis
- **Skill** - Invoke skills for active testing

In read-only mode (default), no customizations are actually invoked - the skill analyzes configurations and documentation to assess expected behavior. In active mode, skills are invoked with test queries to verify actual behavior.

Test reports are written to `~/.claude/logs/evaluations/tests/` using the Write tool.
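A report filename can be derived from the customization name and the test timestamp. A sketch of the path construction, assuming a `{name}-{timestamp}.md` naming convention (the convention is illustrative; only the base directory is specified above):

```python
from datetime import datetime
from pathlib import Path

def report_path(name: str) -> Path:
    """Build a timestamped report path under the evaluations directory."""
    base = Path.home() / ".claude" / "logs" / "evaluations" / "tests"
    base.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M")
    return base / f"{name}-{stamp}.md"  # naming convention is an assumption

# e.g. ~/.claude/logs/evaluations/tests/test-runner-2025-01-15-0930.md
print(report_path("test-runner"))
```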