---
name: prompt-lab
description: >
  Systematic LLM prompt engineering: analyzes existing prompts for failure modes,
  generates structured variants (direct, few-shot, chain-of-thought), designs evaluation
  rubrics with weighted criteria, and produces test case suites for comparing prompt
  performance. Triggers on: "prompt engineering", "prompt lab", "generate prompt variants",
  "A/B test prompts", "evaluate prompt", "optimize prompt", "write a better prompt",
  "prompt design", "prompt iteration", "few-shot examples", "chain-of-thought prompt",
  "prompt failure modes", "improve this prompt".
  Use this skill when designing, improving, or evaluating LLM prompts specifically.
  NOT for evaluating Claude Code skills or SKILL.md files — use skill-evaluator instead.
metadata:
  version: 1.1.0
---

# Prompt Lab

Replaces trial-and-error prompt engineering with structured methodology: objective
definition, current prompt analysis, variant generation (instruction clarity, example
strategies, output format specification), evaluation rubric design, test case creation,
and failure mode identification.

## Reference Files

| File                               | Contents                                                                       | Load When                  |
| ---------------------------------- | ------------------------------------------------------------------------------ | -------------------------- |
| `references/prompt-patterns.md`    | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always                     |
| `references/evaluation-metrics.md` | Quality metrics (accuracy, format compliance, completeness), rubric design     | Evaluation needed          |
| `references/failure-modes.md`      | Common prompt failure taxonomy, detection strategies, mitigations              | Failure analysis requested |
| `references/output-constraints.md` | Techniques for constraining LLM output format, JSON mode, schema enforcement   | Format control needed      |

## Prerequisites

- Clear objective: what should the prompt accomplish?
- Target model (GPT-4, Claude, open-source) — prompting techniques vary by model
- Current prompt (if improving) or task description (if creating)

## Workflow

### Phase 1: Define Objective

1. **Task specification** — What should the LLM produce? Be specific: "Classify customer
   support tickets into 5 categories" not "Handle support tickets."
2. **Success criteria** — How do you know the output is correct? Define measurable criteria
   before writing any prompt.
3. **Failure modes** — What does a bad output look like? Missing information? Wrong format?
   Hallucinated content? Refusal to answer?

### Phase 2: Analyze Current Prompt

If an existing prompt is provided:

1. **Structure assessment** — Is the instruction clear? Are examples provided? Is the
   output format specified?
2. **Ambiguity detection** — Where could the model misinterpret the instruction?
3. **Missing components** — What's not specified that should be? (output format, tone,
   length constraints, edge case handling)
4. **Failure mode mapping** — Which known failure patterns (see `references/failure-modes.md`)
   apply to this prompt?

### Phase 3: Generate Variants

Create 2-4 prompt variants, each testing a different hypothesis:

| Variant Type       | Hypothesis                           | When to Use                      |
| ------------------ | ------------------------------------ | -------------------------------- |
| Direct instruction | Clear instruction is sufficient      | Simple tasks, capable models     |
| Few-shot           | Examples improve output consistency  | Pattern-following tasks          |
| Chain-of-thought   | Reasoning improves accuracy          | Multi-step logic, math, analysis |
| Persona/role       | Role framing improves tone/expertise | Domain-specific tasks            |
| Structured output  | Format specification prevents errors | JSON, CSV, specific templates    |

For each variant:

- State the hypothesis (why this variant might work)
- Identify the risk (what could go wrong)
- Provide the complete prompt text

### Phase 4: Design Evaluation

1. **Rubric** — Define weighted criteria:

   | Criterion         | What It Measures               | Typical Weight |
   | ----------------- | ------------------------------ | -------------- |
   | Correctness       | Output matches expected answer | 30-50%         |
   | Format compliance | Follows specified structure    | 15-25%         |
   | Completeness      | All required elements present  | 15-25%         |
   | Conciseness       | No unnecessary content         | 5-15%          |
   | Tone/style        | Matches requested voice        | 5-10%          |

2. **Test cases** — Minimum 5 cases covering:
   - Happy path (standard input)
   - Edge cases (unusual but valid input)
   - Adversarial cases (inputs designed to confuse)
   - Boundary cases (minimum/maximum input)

### Phase 5: Output

Present variants, rubric, and test cases in a structured format ready for execution.

## Output Format

```text
## Prompt Lab: {Task Name}

### Objective
{What the prompt should achieve — specific and measurable}

### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}

### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}

### Variants

#### Variant A: {Strategy Name}
```

{Complete prompt text}

```text
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

#### Variant B: {Strategy Name}
```

{Complete prompt text}

```text
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

#### Variant C: {Strategy Name}
```

{Complete prompt text}

```text
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}

### Evaluation Rubric

| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |

### Test Cases

| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |

### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}

### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
```

## Calibration Rules

1. **One variable per variant.** Each variant should change ONE thing from the baseline.
   Changing instruction style AND examples AND format simultaneously makes results
   uninterpretable.
2. **Test before declaring success.** A prompt that works on 3 examples may fail on the
   4th. Minimum 5 diverse test cases before concluding a variant works.
3. **Failure modes are more valuable than successes.** Understanding WHY a prompt fails
   guides improvement more than confirming it works.
4. **Model-specific optimization.** A prompt optimized for GPT-4 may not work for Claude
   or Llama. Always note the target model.
5. **Simplest effective prompt wins.** If a zero-shot prompt scores as well as a few-shot
   prompt, use the zero-shot. Fewer tokens = lower cost + latency.

## Error Handling

| Problem                                               | Resolution                                                                                    |
| ----------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| No clear objective                                    | Ask the user to define what "good output" looks like with 2-3 examples.                       |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing.                |
| Too many variables to test                            | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze                         | Start with the simplest possible prompt. The first variant IS the baseline.                   |
| Output format requirements are strict                 | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints.  |

## When NOT to Use

Push back if:

- The task doesn't need an LLM (deterministic rules, regex, SQL) — use the right tool
- The user wants prompt execution, not design — this skill designs and evaluates, it doesn't run prompts
- The prompt is for safety-critical decisions without human review — LLM output should not be the sole input