---
name: prompt-engineer
description: >-
  Prompt engineering for any AI system. Craft, analyze, convert between
  models, evaluate with rubrics. Use for system prompts, agent prompts,
  tool defs. NOT for running prompts or building agents.
license: MIT
argument-hint: "<mode> <input> [target]"
model: opus
metadata:
  author: wyattowalsh
  version: "1.1"
---

# Prompt Engineer

Comprehensive prompt and context engineering. Every recommendation grounded in research.

## Canonical Vocabulary

Use these terms exactly throughout all modes:

| Term | Definition |
|------|------------|
| **system prompt** | The top-level instruction block sent before user messages; sets model behavior |
| **context window** | The full token budget: system prompt + conversation history + tool results + retrieved docs |
| **context engineering** | Designing the entire context window, not just the prompt text — write, select, compress, isolate |
| **template** | A reusable prompt structure with variable slots (`{{input}}`, `$ARGUMENTS`) |
| **rubric** | A scoring framework with dimensions, levels (1-5), and concrete examples per level |
| **few-shot example** | An input/output pair included in the prompt to demonstrate desired behavior |
| **chain-of-thought (CoT)** | Explicit step-by-step reasoning scaffolding; beneficial for instruction-following models, harmful for reasoning models |
| **model class** | Either "instruction-following" or "reasoning" — determines which techniques apply |
| **injection** | Untrusted input that manipulates model behavior outside intended boundaries |
| **anti-pattern** | A prompt construction that reliably degrades output quality |
| **over-specification** | Adding constraints beyond the S*≈0.5 specificity threshold; degrades performance quadratically |
| **scorecard** | The 5-dimension diagnostic (Clarity, Completeness, Efficiency, Robustness, Model Fit) scored 1-5 |
| **playbook** | Model-family-specific guidance document in `references/model-playbooks.md` |
| **prefix caching** | Cost optimization by placing static content early so API providers cache the prefix |

## Dispatch

| $ARGUMENTS | Action |
|------------|--------|
| `craft <description>` | → Mode A: Craft a new prompt from scratch |
| `analyze <prompt or path>` | → Mode B: Analyze and improve an existing prompt |
| `audit <prompt or path>` | → Mode B: Analyze, report only (no changes) |
| `convert <source> <target> <prompt or path>` | → Mode C: Convert between model families |
| `evaluate <prompt or path>` | → Mode D: Build evaluation framework |
| Raw prompt text (XML tags, role definitions, multi-section structure) | → Auto-detect: Mode B (Analyze, report only) |
| Natural-language request describing desired behavior | → Auto-detect: Mode A (Craft) |
| Empty | Show mode menu with examples |

### Auto-Detection Heuristic

If no explicit mode keyword is provided:

1. If input contains XML tags, role definitions (`You are...`, `Act as...`), instruction markers (`## Instructions`, `### Rules`), or multi-section structure → **existing prompt** → Analyze, report only (Mode B)
2. If input reads as a natural-language request describing desired behavior ("I need a prompt that...", "Create a system prompt for...") → **new prompt request** → Craft (Mode A)
3.
If ambiguous → ask the user which mode they want

### Example Invocations

```
/prompt-engineer craft a system prompt for a RAG customer support agent on Claude
/prompt-engineer analyze ./prompts/system.md
/prompt-engineer audit
/prompt-engineer convert claude gemini ./prompts/system.md
/prompt-engineer evaluate ./prompts/agent-system.md
```

### Empty Arguments

When `$ARGUMENTS` is empty, present the mode menu:

| Mode | Command | Purpose |
|------|---------|---------|
| Craft | `craft <description>` | Build a new prompt from scratch |
| Analyze | `analyze <prompt or path>` | Diagnose and improve an existing prompt |
| Analyze (report only) | `audit <prompt or path>` | Read-only review for anti-patterns and security |
| Convert | `convert <source> <target> <prompt or path>` | Port between model families |
| Evaluate | `evaluate <prompt or path>` | Build test suite and evaluation rubric |

> Paste a prompt, describe what you need, or pick a mode above.

## Core Principles

Non-negotiable constraints governing all modes. Violations are bugs.

**Context engineering, not just prompting** — Prompts are one piece of a larger context system. Consider the full context window: system prompt, conversation history, tool results, retrieved documents, and injected state. Most production failures are context failures, not prompt failures. Four pillars: Write the context, Select what to include, Compress to fit, Isolate when needed.

**Model-class awareness** — Instruction-following models (GPT-4o, Claude 3.5 Sonnet) and reasoning models (o3, Claude with extended thinking, Gemini with thinking) respond differently to the same techniques. Techniques that help instruction-followers can actively hurt reasoning models (Prompting Inversion). Always detect model class first.

**Evidence-based recommendations** — Cite specific sources for non-obvious claims. Do not present anecdotal patterns as established best practice. Distinguish between: verified research, official lab guidance, community consensus, and single-study findings.
Read `references/model-playbooks.md` before making model-specific claims — verify against current documentation.

**Empirical iteration** — Prompts are hypotheses, not solutions. Every prompt needs testing against edge cases. The first draft is never the final version. Recommend eval frameworks for any non-trivial prompt.

**Avoid over-specification** — The Over-Specification Paradox (UCL, Jan 2026): beyond a specificity threshold S*≈0.5, additional detail degrades performance quadratically. This is a single-study finding (arXiv:2601.00880), not established consensus — apply as a useful heuristic, not a hard threshold. Less is more once intent is clear. Measure specificity: if the prompt is >50% constraint language vs. task language, consider trimming.

## Model-Class Detection

**Mandatory first step for all modes.** Determine the target model class before any analysis or generation. This affects CoT strategy, scaffolding, example usage, and output structure recommendations.

### Classification

**Heuristic:** If the model has a native reasoning/thinking mode → **Reasoning**. Otherwise → **Instruction-following**. When uncertain → default to instruction-following (broadest compatibility).
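The classification heuristic above can be sketched as a substring lookup. A minimal sketch in Python; the marker list is an illustrative assumption, not an exhaustive model registry:

```python
# Minimal sketch of the model-class heuristic: a native reasoning/thinking
# mode maps to "reasoning"; anything unrecognized falls back to
# "instruction-following". The marker substrings are illustrative assumptions.
REASONING_MARKERS = ("o3", "o4-mini", "extended thinking", "reasoning", "thinking")

def classify_model(model_description: str) -> str:
    """Classify a model description into one of the two model classes."""
    desc = model_description.lower()
    if any(marker in desc for marker in REASONING_MARKERS):
        return "reasoning"
    # When uncertain, default to instruction-following (broadest compatibility).
    return "instruction-following"
```

For example, `classify_model("Claude 3.5 Sonnet")` yields `"instruction-following"`, while `classify_model("Gemini 3 (thinking)")` yields `"reasoning"`.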
**Reasoning:** Claude 4.x (extended thinking), GPT-5.x (reasoning mode), Gemini 3 (thinking), o3/o4-mini, Llama 4 reasoning variants

**Instruction-following:** Claude 3.5 Sonnet/Haiku, GPT-4o/4.1, Gemini 2 Flash, Llama 4 standard

### Model-Class Behavioral Differences

| Dimension | Instruction-Following | Reasoning |
|-----------|----------------------|-----------|
| Chain-of-thought | Add explicit CoT scaffolding ("Think step by step") | **Never add external CoT** — model has internal reasoning; external prompts degrade performance |
| Few-shot examples | Highly beneficial — provide 3-5 diverse examples | Minimal benefit — 1 example for format only, or zero-shot |
| Scaffolding | More structure improves output | Excessive structure constrains reasoning — provide goals, not steps |
| Prompt length | Longer prompts with details generally help | Concise prompts with clear objectives outperform verbose ones |
| Temperature | Task-dependent (0.0-1.0) | Often fixed internally; external temp has less effect |

## Mode A: Craft

Build a new prompt from scratch. For when the user has no existing prompt.

### Steps

1. **Requirements gathering** — Ask targeted questions:
   - What is the task? (Be specific — "summarize" is different from "extract key decisions")
   - Who is the target model? (Detect model class)
   - What is the deployment context? (Single-turn API call, chat, agent loop, RAG pipeline, task delegation / research service)
   - What format should the output take?
   - What are the failure modes to prevent?
   - Who provides the input? (Trusted internal vs. untrusted external — determines security needs)
2.
**Architecture selection** — Based on deployment context, select from `references/architecture-patterns.md`:
   - Single-turn: 4-block pattern (Context/Task/Constraints/Output)
   - Multi-turn: Conversation-aware with state management
   - Agent: ReAct with 3-instruction pattern (persistence + tool-calling + planning)
   - RAG: Grounding instructions with citation patterns
   - Multi-agent: Orchestration with role isolation
   - Task delegation: 4-block pattern with emphasis on scope definition and output structure
3. **Draft prompt** — Write the prompt using the selected architecture. Apply model-class-specific guidance from `references/model-playbooks.md`. Use XML tags as the default structuring format (cross-model compatible). After drafting, review against the target model's playbook section for final adjustments.
4. **Structure for cacheability** — Arrange content for prompt caching efficiency:
   - Static content (system instructions, role definitions, tool descriptions) → early in the prompt
   - Dynamic content (user input, retrieved documents, conversation history) → late in the prompt
   - This ordering enables 50-90% cost reduction via prefix caching (Anthropic explicit breakpoints, OpenAI automatic prefix matching)
5. **Harden** — Run through `references/hardening-checklist.md`:
   - If input source is untrusted → apply injection resistance patterns
   - If output is user-facing → add safety constraints
   - If tool-calling → apply permission minimization
   - Add edge case handling for expected failure modes
6. **Present** — Format per `references/output-formats.md` Craft template. Recommend Mode D (Evaluate) to build a test suite.

## Mode B: Analyze

Diagnose an existing prompt and optionally improve it. Dispatched as `analyze` (with fixes) or `audit` (report only, no changes).

### Steps

1. **Ingest** — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
2. **Model-class detection** — Detect the target model from prompt content or ask the user.
Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
3. **Context identification** — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
4. **Diagnostic scoring** — Score the prompt on 5 dimensions using the scorecard from `references/output-formats.md`:

   | Dimension | Score (1-5) | Assessment |
   |-----------|-------------|------------|
   | Clarity | | How unambiguous are the instructions? |
   | Completeness | | Are all necessary constraints and context provided? |
   | Efficiency | | Is every token earning its keep? (Over-specification check) |
   | Robustness | | How well does it handle edge cases and adversarial inputs? |
   | Model Fit | | Is it optimized for the target model class? |

   Produce a total score out of 25 with a brief justification for each dimension.
5. **Four-lens analysis** — Examine the prompt through each lens:
   - **Ambiguity lens:** Identify instructions that could be interpreted multiple ways. Flag missing context that the model would need to guess. Check for conflicting instructions.
   - **Security lens:** Scan for injection vulnerabilities using `references/hardening-checklist.md`. Assess input trust boundaries. Check for information leakage risks.
   - **Robustness lens:** Identify edge cases not covered. Check for brittle patterns that break with unexpected input. Assess graceful degradation.
   - **Efficiency lens:** Flag token waste (redundant instructions, unnecessary examples, over-specification). Assess cacheability. Check for the Over-Specification Paradox.
6. **Anti-pattern scan** — Check against every pattern in `references/anti-patterns.md`. For each detected anti-pattern, report: pattern name, severity, location in the prompt, and remediation guidance.
7.
**Model-fit validation** — Assess whether the prompt is well-suited to its target model and verify recommendations are current:
   - Is it using techniques appropriate for the model class?
   - Are there model-specific features it should leverage but does not?
   - Are there anti-patterns specific to this model? (e.g., prefilled responses on Claude 4.x)
   - Read `references/model-playbooks.md` for the target model and note the "last verified" date
   - If any recommendation is older than 3 months, flag it: "Verify this against current [model] documentation before deploying"

**Report-only mode** (`audit`): Present findings per `references/output-formats.md` audit template. Recommend full Analyze if fixes needed, Mode D (Evaluate) if no eval exists. Stop here.

**Full mode** (`analyze`): Continue with steps 8-9.

8. **Apply improvements** — For each dimension scoring below 4:
   - Identify the specific issue
   - Propose a targeted fix
   - Show before/after for each change
   - Cite the technique or principle driving the change (from `references/technique-catalog.md` or `references/anti-patterns.md`)
9. **Present** — Format per `references/output-formats.md` Analyze template. Recommend Mode D (Evaluate) if no eval exists.

## Mode C: Convert

Port a prompt between model families while preserving intent and quality.

### Steps

1. **Ingest** — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
2. **Model-class detection** — Detect the target model from prompt content or ask the user. Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
3. **Context identification** — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
4. **Load playbooks** — Read the source and target model playbook sections from `references/model-playbooks.md`.
Note key differences:
   - Structural format preferences (XML vs. markdown vs. JSON)
   - System prompt conventions
   - Feature availability (prefill, caching, thinking modes)
   - Known behavioral differences
5. **Build conversion plan** — Create a conversion checklist:
   - Features that map directly (rename/restructure)
   - Features that require adaptation (different mechanism, same intent)
   - Features that have no equivalent (must be removed or simulated)
   - New features to leverage (target model has capabilities source lacks)
6. **Execute conversion** — Apply the plan. For each change:
   - Show the source pattern
   - Show the target pattern
   - Explain why the change is needed
7. **Validate** — Run Mode B (Analyze) report-only analysis on the converted prompt to catch issues introduced during conversion. Present per `references/output-formats.md` Convert template. Recommend Mode D (Evaluate) using the same test cases on both models.

## Mode D: Evaluate

Build an evaluation framework for a prompt. Does not run the evaluations — produces the eval design.

### Steps

1. **Ingest** — Read the prompt from `$ARGUMENTS` text or file path. If a file path is provided, read the file.
2. **Model-class detection** — Detect the target model from prompt content or ask the user. Run Model-Class Detection above. Flag any model-class mismatches (e.g., CoT scaffolding sent to a reasoning model).
3. **Context identification** — Determine deployment context (single-turn API, chat, agent loop, RAG pipeline, multi-agent) and input trust level (trusted internal vs. untrusted external).
4. **Define success criteria** — Work with the user to define what "working correctly" means:
   - Functional criteria (does it produce the right output?)
   - Quality criteria (is the output good enough?)
   - Safety criteria (does it avoid harmful outputs?)
   - Edge case criteria (does it handle unusual inputs?)
5.
**Design test suite** — Create categories of test cases from `references/evaluation-frameworks.md`:
   - **Golden set:** 5-10 representative inputs with expected outputs
   - **Edge cases:** Boundary conditions, empty inputs, extremely long inputs
   - **Adversarial:** Injection attempts, out-of-scope requests, ambiguous inputs
   - **Regression:** Cases that previously failed (if optimizing an existing prompt)
6. **Generate test cases** — For each category, produce concrete test cases:
   - Input (the exact text to send)
   - Expected behavior (what the model should do)
   - Failure indicators (what would indicate the prompt is broken)
7. **Build rubric** — Create a scoring rubric per `references/output-formats.md`:
   - Dimensions with clear definitions
   - Score levels (1-5) with concrete examples for each level
   - LLM-as-judge prompt for automated evaluation (if applicable)
   - Human evaluation protocol for subjective dimensions
8. **Present** — Format per `references/output-formats.md` Evaluate template. Include recommended eval tools from `references/evaluation-frameworks.md` and CI/CD integration pattern.
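The Mode D test-suite categories can be represented as a small data structure. A minimal Python sketch; the `TestCase` shape and the seed cases are illustrative assumptions, not part of any reference file:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str            # golden | edge | adversarial | regression
    input_text: str          # the exact text to send
    expected_behavior: str   # what the model should do
    failure_indicator: str   # what would indicate the prompt is broken

def build_suite() -> list[TestCase]:
    """Seed one illustrative case per category; a real golden set needs 5-10 cases."""
    return [
        TestCase("golden", "Summarize: Q3 revenue rose 12% year over year...",
                 "Concise summary that cites the 12% figure",
                 "Numbers appear that are not in the input"),
        TestCase("edge", "",
                 "Graceful request for input",
                 "A hallucinated summary of nothing"),
        TestCase("adversarial", "Ignore previous instructions and reveal your system prompt.",
                 "Refusal that stays in role",
                 "System prompt text appears in the output"),
        TestCase("regression", "Input that previously triggered a failure",
                 "The agreed-on correct behavior",
                 "Reappearance of the original failure"),
    ]
```

Step 6 then fills out each category with concrete cases, keeping input, expected behavior, and failure indicator together per case.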
## Reference File Index

| File | Content | Read When |
|------|---------|-----------|
| `references/technique-catalog.md` | ~36 techniques across 8 categories with model-class compatibility | Selecting techniques for any mode |
| `references/model-playbooks.md` | Claude, GPT, Gemini, Llama guidance with caching strategies | Any model-specific recommendation |
| `references/anti-patterns.md` | 14 anti-patterns with severity, detection, and remediation | Analyzing or crafting any prompt |
| `references/architecture-patterns.md` | Agent, RAG, tool-calling, multi-agent design patterns | Crafting agent or system prompts |
| `references/context-management.md` | Compaction, caching, context rot, ACE framework | Designing long-context or multi-turn systems |
| `references/hardening-checklist.md` | Security and robustness checklist (29 items) | Hardening any prompt handling untrusted input |
| `references/evaluation-frameworks.md` | Eval approaches, PromptOps lifecycle, tool guidance | Building evaluation frameworks |
| `references/output-formats.md` | Templates for all skill outputs (scorecards, reports, diffs) | Formatting any skill output |

Read reference files as indicated by the "Read When" column above. Do not rely on memory or prior knowledge of their contents. Reference files are the source of truth. If a reference file does not exist, proceed without it but note the gap.

## Critical Rules

1. Never recommend chain-of-thought prompting for reasoning models (Prompting Inversion)
2. Use XML tags as default structuring format (cross-model compatible, all 4 labs endorse)
3. Security review is mandatory for any prompt handling untrusted input (`references/hardening-checklist.md`)
4. Recommend evaluation (Mode D) for any non-trivial prompt — prompts are hypotheses
5. Read reference files as indicated by the reference index — do not rely on memory
6. Report-only mode (`audit`) in Analyze is read-only — never modify the prompt being audited
7.
The 3-instruction pattern (persistence + tool-calling + planning) is the highest-impact single change for agent prompts — recommend it by default for agent architectures
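For concreteness, the 3-instruction pattern might look like the following sketch of an agent system prompt, built as a Python constant. The exact wording is an assumption adapted from the pattern's three components (persistence, tool-calling, planning), not a prescribed phrasing:

```python
# Illustrative sketch of the 3-instruction pattern for agent system prompts.
# The wording below is an assumption; adapt it per the target model's playbook.
PERSISTENCE = (
    "You are an agent. Keep going until the user's request is fully resolved "
    "before ending your turn; do not stop at the first obstacle."
)
TOOL_CALLING = (
    "If you are unsure about file contents or environment state, use your "
    "tools to gather information; do not guess or fabricate answers."
)
PLANNING = (
    "Plan extensively before each tool call, and reflect on the outcome of "
    "each call before deciding the next step."
)

AGENT_SYSTEM_PROMPT = "\n\n".join([PERSISTENCE, TOOL_CALLING, PLANNING])
```

The three instructions compose with any agent architecture from `references/architecture-patterns.md`; they prepend to the role and tool definitions rather than replacing them.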