---
name: harden
description: >
  Audits any substantial deliverable — documents, grant sections, strategies,
  protocols, skills, configs, roadmaps, plans — against domain-specific quality
  criteria before delivery. Searches the web for best practices, runs a dual-pass
  thematic and structural audit with severity classification, fixes all issues,
  and iterates until convergence. Activates when producing deliverables exceeding
  4 paragraphs or 20 lines, or when the user says "harden", "audit", "stress test",
  or "review before delivering", or when re-auditing a modified deliverable for
  regressions. Does not activate for conversational responses, brainstorming,
  web search summaries, or rough drafts.
compatibility: >
  claude.ai, Claude Code, Codex CLI, Gemini CLI (requires web search for the
  research step; degrades gracefully without it)
license: MIT
---

# Harden

> **MANDATORY:** Re-read this entire file before every audit pass. Do not rely on earlier readings within the same conversation. Context window drift causes skipped steps.

## Purpose

Ensures substantial deliverables are structurally sound, internally consistent, and aligned with domain best practices before delivery. Replaces ad hoc quality checks with a repeatable methodology.

## How It Works

1. **Research** — Searches the web for domain-specific best practices before auditing (not just internal assumptions)
2. **Dual-pass audit** — Checks both ideas (are they sound?) and structure (do paths/references/formatting work?)
3. **Severity classification** — 🔴 Critical / 🟡 Significant / 🟢 Minor with specific locations
4. **Fix** — Implements all critical and significant fixes (not just reports)
5. **Converge** — Re-audits until only minor issues remain (max 3 passes)

## The Process

### Step 1: Research

Search for current best practices specific to the deliverable's domain.

**Scope guidance:**

- Minimum 2 searches for focused domains (e.g., a skill file → Anthropic skill docs)
- 3-5 searches for broad or unfamiliar domains (e.g., a grant aims page → NIH guidelines, successful examples, common reviewer complaints)
- Always check for official/canonical sources first, then practitioner insights
- **Fetch the canonical source.** If search results include the single most authoritative document for this domain (official guide, spec, standard), fetch and read it — do not rely on search snippets. Snippets capture ~5% of a document's content and miss testing, edge cases, and patterns. Limit to the first ~5000 tokens if the source is very long; prioritize sections on requirements, validation, and common mistakes.
- Synthesize findings into actionable bullets that inform the audit — not a reading list

**If web search is unavailable:** State this limitation explicitly. Proceed using training knowledge but flag that the audit lacks external validation.

### Step 2: Comparative Audit

Run TWO distinct passes within this step. Both are mandatory. Pass C applies only when re-hardening a modified deliverable.

**Pass A — Thematic audit (ideas level):**

- Does the deliverable achieve its stated purpose?
- Are the ideas sound, complete, and non-contradictory?
- Does it align with research findings from Step 1?
- Are there gaps, unstated assumptions, or logical errors?
- Would the target audience (reviewer, agent, collaborator) find it clear?

**Pass B — Structural audit (artifact level):**

- Do all paths, filenames, and references resolve?
- Are names, terms, and conventions consistent throughout?
- Do cross-references point to real sections/files?
- Is formatting correct for the target format (markdown, YAML, code)?
- Are there orphaned sections, redundant content, or dangling references?
- If executable (a plan, a config): would a literal reader succeed?

**Pass C — Regression check (only when re-hardening a modified deliverable):**

Trigger detection: Apply Pass C when the deliverable was previously completed or delivered and is now being revised — indicated by the user referencing a prior version, requesting modifications to existing work, or re-submitting after changes. Examples: "update the roadmap I wrote earlier", "fix the issues you found", or any edit to an artifact that was previously declared done.

- What previously worked that this modification could break?
- Are existing cross-references still valid after the change?
- Do acceptance criteria from the original still hold?
- Has scope crept beyond the intended modification?

**Severity classification for every issue found:**

- 🔴 **Critical** — Will cause failure, incorrect behavior, or fundamental misunderstanding. Must fix.
- 🟡 **Significant** — Degrades quality, violates best practices, causes confusion, or wastes resources (tokens, time). Must fix.
- 🟢 **Minor** — Cosmetic, low-impact, stylistic. Fix if easy, skip if not.

**Required output structure:**

1. **What passes** — Sections/aspects reviewed and confirmed correct. This proves the audit actually examined these areas rather than skipping them.
2. **What's missing** — Gaps, absent sections, unstated assumptions.
3. **What's wrong** — Errors, inconsistencies, violations. Each with severity and specific location (line number, section name, or quote).
4. **Prioritized fix list** — All issues ordered 🔴 → 🟡 → 🟢 with specific remediation for each.

### Step 3: Update

Execute all 🔴 and 🟡 fixes. Do not cherry-pick. Do not defer significant issues. Fix 🟢 issues if they're adjacent to other changes.

After fixes, run a mechanical verification:

- If the deliverable contains paths: grep/check all paths resolve
- If it contains names/terms: verify consistency
- If it references sections: confirm they exist
- If it has acceptance criteria: verify the artifact satisfies them
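A minimal sketch of the path and cross-reference checks above, assuming the deliverable is a single markdown file that uses standard `[text](target)` links; the link syntax, the anchor slug rule, and the relative-path layout are assumptions about the artifact, not requirements of this skill. For other formats, the equivalent greps run by hand serve the same purpose.

```python
# Sketch only: markdown deliverable, [text](target) links, GitHub-style anchors.
# Adapt the patterns to the artifact's actual conventions before trusting the output.
import re
import sys
from pathlib import Path

def mechanical_check(deliverable: Path) -> list[str]:
    text = deliverable.read_text(encoding="utf-8")
    problems = []

    # Build the set of heading anchors for internal cross-reference checks.
    headings = re.findall(r"^#{1,6}\s+(.+)$", text, flags=re.MULTILINE)
    anchors = {
        re.sub(r"[^a-z0-9 -]", "", h.lower()).strip().replace(" ", "-")
        for h in headings
    }

    # Every link target must be either a real heading anchor or a path that
    # exists relative to the deliverable. External URLs are skipped here.
    for target in re.findall(r"\]\(([^)]+)\)", text):
        if target.startswith(("http://", "https://", "mailto:")):
            continue
        if target.startswith("#"):
            if target[1:] not in anchors:
                problems.append(f"dangling cross-reference: {target}")
        elif not (deliverable.parent / target.split("#")[0]).exists():
            problems.append(f"unresolved path: {target}")

    return problems

if __name__ == "__main__":
    issues = mechanical_check(Path(sys.argv[1]))
    print("\n".join(issues) or "mechanical checks passed")
```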
### Step 4: Diminishing Returns Check

Evaluate what this pass found:

- Any 🔴 critical issues? → Run another pass (return to Step 2)
- Any structural 🟡 issues? → Run another pass (return to Step 2)
- Only 🟢 minor issues? → Stop. Deliver.

**Limits:** Two passes are typical; three is the maximum. If critical issues are still surfacing after 3 passes, flag to the user that the deliverable may need fundamental restructuring rather than incremental fixes.

**Re-hardening trigger:** When external standards change (new Anthropic docs, updated NIH guidelines, framework version bumps), previously hardened deliverables should be re-audited against the new standards. The user triggers this manually; the skill treats it as a fresh hardening with Pass C enabled.

## Domain Quality Criteria

When auditing, identify the relevant quality criteria for the deliverable type. For domains not listed below, generate criteria by asking: "What would an expert reviewer check before approving this?"

### Skills (Claude Code / Claude Chat)

- Frontmatter: name (lowercase, hyphens, <=64 chars; gerund form is conventional but not required), description (third person, non-empty, all triggers included; Anthropic Help Center states <=200 chars, ecosystem convention <=1024 chars — test auto-invocation if exceeding 200)
- If the skill has environment dependencies (web search, filesystem, MCP, specific platform), declare them in the `compatibility` frontmatter field AND mention the primary dependency in the description
- Body under 500 lines, context-efficient (no content Claude already knows)
- Progressive disclosure where applicable
- Trigger validation: ask Claude "when would you use [skill name]?" and compare its answer against intended triggers. Check for under-triggering (missing expected phrases) and over-triggering (matching unrelated queries)
- Instructions are specific and actionable, not buried or ambiguous; critical steps appear early in the document

### Setup/Configuration Documents

- All paths consistent and valid, instructions imperative
- Acceptance criteria explicit and mechanically verifiable
- Self-documenting with graceful fallback for missing prerequisites

### Grant Sections

- Significance (clear gap), innovation (distinct from prior art), approach (feasible, risks mitigated), alignment (budget matches scope)
- Reviewer-friendly: scannable, no undefined jargon

### Strategic Plans / Roadmaps

- Outcomes measurable, dependencies explicit, timeline realistic
- Risks identified with mitigations, prioritization rationale stated

### Protocols / SOPs

- Steps unambiguous and executable by stated audience
- Materials specified with versions, decision points identified
- Expected results stated for verification

### Implementation Plans / Execution Runbooks

- Every summary list (e.g., "Bootstrapping Steps") must have 1:1 correspondence with its detailed execution section. No step in one without a match in the other (see the sketch after this list).
- For each step, verify preconditions are satisfied by prior steps. Flag: config created before its target exists, files referenced before creation, tools invoked before dependencies are available.
- Every CLI command, slash command, skill name, and tool referenced must actually exist. Verify against the known environment — not assumptions or recall. Flag speculative or hallucinated commands.
- Hook/automation rules must be achievable with their mechanism type (command hooks = grep/regex, prompt hooks = LLM yes/no, agent hooks = multi-turn). Flag rules requiring capabilities beyond their type.
- Every task must include a concrete verification method: a command with expected output, a file/state existence check, or an observable result. "Verify it works" and "run tests" without specifics are insufficient — a literal executor must be able to confirm success without follow-up.
- Verification methods must be falsifiable — they must be able to fail. "Check that the file exists" is falsifiable. "Review the output" is not.
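A minimal sketch of the summary-to-detail correspondence check from the first bullet. The heading name ("Bootstrapping Steps") and the "Step N: title" detail pattern are hypothetical examples rather than conventions this skill requires; substitute the plan's real structure.

```python
# Sketch only: assumes a markdown plan with a "## Bootstrapping Steps" summary
# (a numbered list) and "### Step N: <title>" detail sections. Both patterns are
# hypothetical; substitute the plan's actual conventions.
import re
from pathlib import Path

def correspondence_gaps(plan: Path) -> list[str]:
    text = plan.read_text(encoding="utf-8")

    # Titles listed in the summary section (numbered list items under its heading).
    block = re.search(r"^## Bootstrapping Steps\n(.*?)(?=^## |\Z)", text,
                      flags=re.MULTILINE | re.DOTALL)
    summary = re.findall(r"^\d+\.\s+(.+)$", block.group(1), flags=re.MULTILINE) if block else []

    # Titles of the detailed execution sections.
    details = re.findall(r"^### Step \d+:\s+(.+)$", text, flags=re.MULTILINE)

    summary_set = {s.strip() for s in summary}
    detail_set = {d.strip() for d in details}
    gaps = [f"summary step has no detail section: {s}" for s in summary if s.strip() not in detail_set]
    gaps += [f"detail section missing from summary: {d}" for d in details if d.strip() not in summary_set]
    return gaps
```

The match is deliberately exact: a summary step whose wording drifts from its detail heading is reported as a gap, because the criterion above demands 1:1 correspondence, not approximate overlap.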
## Anti-Patterns

**Performative skepticism:** Adding self-critical commentary ("you might want to strengthen this") without running the actual audit process. Sounds rigorous, catches nothing structural.

**Thematic-only auditing:** Checking that ideas are sound while ignoring paths, names, formatting, and consistency. How 14 issues survived a "validated" deliverable in prior experience.

**Structural-only auditing:** The inverse — checking paths and formatting while missing that the ideas are wrong or incomplete.

**Severity inflation:** Marking everything 🔴 to appear thorough. Reserve critical for actual failure modes.

**Audit without research:** Skipping Step 1 and auditing against internal assumptions rather than external standards. The research step imports criteria that wouldn't be generated otherwise.

**Snippet-depth research:** Finding the canonical source but reading only search result excerpts instead of fetching the full document. Captures technical requirements but misses testing methodology, edge cases, and behavioral patterns. How 4 relevant quality criteria were missed despite the source appearing twice in search results.

**Skipping "what passes":** Only reporting failures without confirming what was reviewed and found correct. Hides gaps in audit coverage.

**Static-only plan auditing:** Checking that references resolve and formatting is correct while treating a sequential plan as a static document. Plans are executable — each step has preconditions. Auditing a plan without simulating execution order misses ordering violations, summary↔detail drift, and nonexistent tool references. How 7 issues (4 moderate) survived 4 audit passes on an implementation plan.

**Verification-free planning:** Approving an implementation plan where tasks lack concrete verification methods. Every task produces a result; every result can be checked. Plans without per-task verification shift the burden to the executor to invent checks — which means they often don't.