---
name: prompt-testing
description: A/B testing and performance metrics for prompts
allowed-tools: Read, Write, Bash
---

# Prompt Testing

Skill for testing, comparing, and measuring prompt performance.

## Documentation

- [metrics.md](docs/metrics.md) - Performance metrics definition
- [methodology.md](docs/methodology.md) - A/B testing protocol

## Testing Workflow

```text
1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if existing)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations
```

## Performance Metrics

### Quality

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Accuracy** | Correct responses | Correct / Total |
| **Compliance** | Format adherence | Compliant / Total |
| **Consistency** | Response stability | 1 - Variance |
| **Relevance** | Meeting the need | Average score (1-5) |

### Efficiency

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Tokens Input** | Prompt size | Token count |
| **Tokens Output** | Response size | Token count |
| **Latency** | Response time | ms |
| **Cost** | Price per request | Tokens × Price |

### Robustness

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Edge Cases** | Edge case handling | Passed / Total |
| **Jailbreak Resist** | Bypass resistance | Blocked / Attempts |
| **Error Recovery** | Error recovery | Recovered / Errors |

## Test Format

### Test Dataset

```json
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
```

### Test Report

```markdown
# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence**: High / Medium / Low

**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]
```

## Commands

```bash
# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002
```

## Decision Criteria

### When to adopt variant B?

```text
IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
```

## Best Practices

1. **Minimum 20 test cases** for significance
2. **Include edge cases** (15-20% of dataset)
3. **Test multiple runs** for consistency
4. **Document hypotheses** before testing
5. **Version the prompts** being tested