---
version: "1.0.0"
evaluation: programmatic
agent: claude-code
model: claude-sonnet-4-6
snapshot: python312-uv
origin:
  url: "https://skills.sh/anthropics/skills/skill-creator"
  source_host: "skills.sh"
  source_title: "Skill Creator"
  imported_at: "2026-05-01T02:40:55Z"
  imported_by: "skill-to-runbook-converter@1.0.0"
attribution:
  collection_or_org: "anthropics"
  skill_name: "skill-creator"
  confidence: "high"
secrets: {}
---

# Skill Creator — Agent Runbook

## Objective

This runbook guides an AI agent through the complete lifecycle of creating, testing, iteratively improving, and packaging AI agent skills. Starting from intent capture and research, the agent drafts a SKILL.md, writes test cases, runs parallel evaluations with and without the skill, reviews quantitative benchmark results with the user, and iterates until the skill meets quality standards. After the skill body is finalized, the runbook covers description optimization to improve triggering accuracy, blind A/B comparison for rigorous quality verification, and environment-specific adaptations for Claude.ai, Cowork, and headless deployments.

---

## REQUIRED OUTPUT FILES (MANDATORY)

**You MUST write all of the following files before the task is complete. The task is NOT complete until every file exists and is non-empty.**

| File | Description |
|------|-------------|
| `<skill-name>-workspace/iteration-N/benchmark.json` | Aggregated benchmark results for the final iteration |
| `<skill-name>-workspace/iteration-N/benchmark.md` | Human-readable benchmark summary |
| `<skill-path>/SKILL.md` | Final skill file with optimized description in YAML frontmatter |
| `evals/evals.json` | Test case definitions with assertions |
| `/app/results/summary.md` | Executive summary with skill name, final benchmark scores, and any issues |
| `/app/results/validation_report.json` | Structured validation results with stages and `overall_passed` |
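A minimal sketch of the `/app/results/validation_report.json` shape implied by the final checklist (`stages`, `results`, `overall_passed`); the stage names and evidence strings below are illustrative assumptions, not a required schema:

```json
{
  "stages": ["skill_drafted", "evals_written", "benchmark_generated", "description_optimized"],
  "results": {
    "skill_drafted": {"passed": true, "evidence": "my-skill/SKILL.md exists with valid frontmatter"},
    "evals_written": {"passed": true, "evidence": "evals/evals.json contains 3 test cases with assertions"},
    "benchmark_generated": {"passed": true, "evidence": "my-skill-workspace/iteration-2/benchmark.json"},
    "description_optimized": {"passed": true, "evidence": "best_description applied from run_loop output"}
  },
  "overall_passed": true
}
```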
---

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| Results directory | `/app/results` | Output directory for validation and summary files |
| Skill intent | *(required)* | What the skill should enable Claude to do |
| Skill name | *(required)* | Slug identifier for the skill (e.g., `my-skill`) |
| Workspace dir | `<skill-name>-workspace/` | Sibling to the skill directory; holds iteration outputs |
| Skill path | *(required)* | Path to the skill directory containing SKILL.md |
| Model ID | `claude-sonnet-4-6` | Model powering the current session (used in description optimization) |
| Max optimization iterations | `5` | Maximum rounds for the description optimizer |
| Run test cases | `true` | Whether to spawn subagent test runs |
| Dry run | `false` | If `true`, generate the skill and evals but skip subagent test runs |

---

## Dependencies

| Dependency | Type | Required | Description |
|------------|------|----------|-------------|
| `python3` | CLI | Yes | Run aggregation and viewer scripts |
| `claude` CLI | CLI | Yes (description opt) | Used by `scripts/run_loop.py` for description optimization |
| `eval-viewer/generate_review.py` | Script | Yes | Generates the browser-based review UI |
| `scripts/aggregate_benchmark` | Script | Yes | Aggregates per-eval grading into benchmark.json |
| `scripts/run_loop` | Script | Yes (description opt) | Runs the description optimization loop |
| `scripts/package_skill` | Script | Optional | Packages the skill into a `.skill` bundle |
| `agents/grader.md` | Reference | Yes | Instructions for the grader subagent |
| `agents/comparator.md` | Reference | Optional | Instructions for blind A/B comparison |
| `agents/analyzer.md` | Reference | Optional | Instructions for benchmark analysis |
| `references/schemas.md` | Reference | Yes | JSON schemas for evals.json, grading.json, benchmark.json |
| `assets/eval_review.html` | Asset | Yes (description opt) | HTML template for trigger eval review |
| Subagent support | Runtime | Recommended | Enables parallel test execution and baseline runs |

---

## Step 1: Environment Setup

Verify the environment has all required tools and that the skill path is valid.

```bash
# Check Python is available
python3 --version || { echo "ERROR: python3 not found"; exit 1; }

# Check claude CLI (needed for description optimization)
command -v claude >/dev/null 2>&1 && echo "claude CLI present" || echo "WARN: claude CLI not found (description optimization will be unavailable)"

# Verify skill path exists or create it
SKILL_NAME="${SKILL_NAME:-my-skill}"
SKILL_PATH="${SKILL_PATH:-./$SKILL_NAME}"
mkdir -p "$SKILL_PATH"
mkdir -p "$SKILL_PATH/evals"
mkdir -p "${SKILL_NAME}-workspace"
mkdir -p /app/results

echo "Setup complete. Skill path: $SKILL_PATH"
```

---

## Step 2: Capture Intent

Understand the user's intent before writing anything. If the current conversation already contains a workflow the user wants to capture (e.g., they say "turn this into a skill"), extract answers from history first.

Ask (or infer from context) the following four questions:

1. **What should this skill enable Claude to do?**
2. **When should this skill trigger?** (specific user phrases and contexts)
3. **What is the expected output format?**
4. **Should we set up test cases?**
   - Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases.
   - Skills with subjective outputs (writing style, art) often do not.
   - Suggest the appropriate default, but let the user decide.

If the user already has a draft skill, skip ahead to Step 6 (Run & Evaluate).

---

## Step 3: Interview and Research

Before writing the SKILL.md, proactively gather:

- Edge cases and boundary conditions
- Input/output formats and example files
- Success criteria and acceptance tests
- Dependencies (MCPs, external APIs, CLI tools)

Check available MCPs. If any are useful for research (doc search, similar skills, best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to minimize the burden on the user.

---

## Step 4: Write the SKILL.md

Based on the user interview, produce `<skill-path>/SKILL.md` with this structure:

```markdown
---
name: <skill-name>
description: <what the skill does and when to use it>
---

# <Skill Title>

## <Sections as needed>
...
```
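For illustration only, a hypothetical frontmatter for a spreadsheet-cleanup skill (the name and wording are invented, not part of this runbook) that covers both the *what* and the *when*:

```yaml
---
name: xlsx-cleanup
description: Cleans and normalizes Excel/CSV spreadsheets (deduplicates rows, fixes headers, standardizes dates). Use when the user shares an .xlsx or .csv file and asks to tidy, merge, or reformat tabular data, even if they don't say "clean".
---
```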
**Key writing guidelines:**

- **Description field** is the primary triggering mechanism. Include both *what* the skill does and *when* to use it. Avoid vague triggers; be specific.
- **Progressive disclosure**: keep SKILL.md under 500 lines. Offload large reference content to `references/` files with clear pointers.
- **Bundled resources**: place executable scripts in `scripts/`, reference docs in `references/`, templates/assets in `assets/`.
- **Writing style**: explain the *why* behind instructions. Prefer the imperative form. Avoid heavy-handed ALWAYS/NEVER; explain reasoning so the model understands the purpose, not just the rule.
- **Anatomy**:

  ```
  skill-name/
  ├── SKILL.md            (required — name + description frontmatter, markdown body)
  └── Bundled Resources   (optional)
      ├── scripts/        - Executable code for deterministic/repetitive tasks
      ├── references/     - Docs loaded into context as needed
      └── assets/         - Templates, icons, fonts
  ```

Write a draft, then review it with fresh eyes before proceeding.

---

## Step 5: Write Test Cases

After drafting the skill, create 2–3 realistic test prompts — the kind of message a real user would actually send. Share them with the user for confirmation before running.

Save to `evals/evals.json` (assertions can be empty at this stage):

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See `references/schemas.md` for the full schema, including the `assertions` field (added in Step 6).

---

## Step 6: Run & Evaluate (max 5 iterations)

This is one continuous sequence — do not stop partway through.

### 6.1 Spawn all runs in the same turn

For each test case, spawn two subagents in the **same turn** — one with the skill, one as a baseline. Do not run with-skill first and then come back for baselines.

**With-skill run:**

```
Execute this task:
- Skill path: <skill-path>
- Task: <the eval prompt>
- Input files: <paths, if any>
- Save outputs to: <workspace>/iteration-<N>/eval-<id>/with_skill/outputs/
- Outputs to save: <expected artifacts>
```

**Baseline run** (same prompt):

- *New skill*: no skill at all — save to `without_skill/outputs/`.
- *Improving an existing skill*: snapshot the old version first (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline at the snapshot, and save to `old_skill/outputs/`.

Write `eval_metadata.json` in each eval directory:

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```

### 6.2 Draft assertions while runs are in progress

Do not just wait — draft quantitative assertions now and explain them to the user. Update `eval_metadata.json` and `evals/evals.json` with the finalized assertions. Good assertions are objectively verifiable and descriptively named.

Required grading output fields (exact names — the viewer depends on these): `text`, `passed`, `evidence`.
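Where an assertion is programmatically checkable, a small script beats manual inspection (this is what step 6.4 means by "write and run a script"). A minimal sketch, assuming a hypothetical run directory and a single header-row assertion; the wrapper structure around the assertion list is an assumption, so check `references/schemas.md` for the real grading.json schema:

```python
import json
from pathlib import Path

# Hypothetical paths: substitute the real run directory for this eval.
run_dir = Path("my-skill-workspace/iteration-1/eval-0/with_skill")
output_csv = run_dir / "outputs" / "result.csv"

# Check one assumed assertion: the output CSV exists and starts with a header row ("id,...").
exists = output_csv.exists()
first_line = (output_csv.read_text().splitlines() or [""])[0] if exists else ""
passed = exists and first_line.lower().startswith("id,")

grading = {
    "assertions": [
        {
            # Field names must be exactly text / passed / evidence; the viewer depends on them.
            "text": "Output CSV exists and begins with a header row",
            "passed": passed,
            "evidence": f"first line: {first_line!r}" if exists else "outputs/result.csv not found",
        }
    ]
}

# Save alongside the run so the aggregation step can pick it up.
(run_dir / "grading.json").write_text(json.dumps(grading, indent=2))
print(json.dumps(grading, indent=2))
```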
### 6.3 Capture timing data as each run completes

When a subagent task completes, you receive `total_tokens` and `duration_ms`. Save them immediately to `timing.json` in each run directory:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

This is the only opportunity to capture this data — do not batch it.

### 6.4 Grade, aggregate, and launch the viewer

Once all runs finish:

1. **Grade** — spawn a grader subagent reading `agents/grader.md`, and save `grading.json` per run. For assertions checkable programmatically, write and run a script (faster, more reliable, reusable; see the sketch after 6.2).
2. **Aggregate** — run from the skill-creator directory:

   ```bash
   python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <skill-name>
   ```

   This produces `benchmark.json` and `benchmark.md`. Put each with_skill version before its baseline counterpart.
3. **Analyst pass** — read `agents/analyzer.md` ("Analyzing Benchmark Results") and surface non-discriminating assertions, high-variance evals, and time/token tradeoffs.
4. **Launch the viewer** — **IMPORTANT: do this BEFORE evaluating results yourself. Get outputs in front of the user ASAP.**

   ```bash
   nohup python <skill-creator-dir>/eval-viewer/generate_review.py \
     <workspace>/iteration-N \
     --skill-name "my-skill" \
     --benchmark <workspace>/iteration-N/benchmark.json \
     > /dev/null 2>&1 &
   VIEWER_PID=$!
   ```

   For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.

   **Headless/Cowork:** use `--static <output-path>` to write a standalone HTML file. The "Submit All Reviews" button downloads `feedback.json`.
5. Tell the user: *"I've opened the results in your browser. 'Outputs' lets you click through test cases and leave feedback; 'Benchmark' shows the quantitative comparison. Come back when you're done."*

### 6.5 Read the feedback

When the user is done, read `feedback.json`:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the user thought that output was fine. Focus improvements on evals with specific complaints.
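A minimal sketch for summarizing that feedback, assuming `feedback.json` has been copied into the current iteration directory (the path is illustrative):

```python
import json
from pathlib import Path

# Illustrative path: feedback.json lands wherever the browser download was copied to.
feedback_path = Path("my-skill-workspace/iteration-1/feedback.json")
feedback = json.loads(feedback_path.read_text())

# Empty feedback strings mean the user was satisfied with that run.
complaints = [r for r in feedback["reviews"] if r["feedback"].strip()]

print(f"Status: {feedback.get('status')}")
print(f"{len(complaints)} of {len(feedback['reviews'])} runs have actionable feedback:")
for review in complaints:
    print(f"  - {review['run_id']}: {review['feedback']}")
```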
Kill the viewer server when done:

```bash
kill $VIEWER_PID 2>/dev/null
```

---

## Step 7: Iterate on Errors (max 3 rounds)

After reading feedback, improve the skill and re-run:

1. **Generalize from feedback** — create skills that work across many prompts, not just the test examples. Avoid overfitting or rigid MUST/NEVER rules. Explain the *why* rather than enforcing mechanics blindly.
2. **Keep the prompt lean** — remove instructions that don't pull their weight. Read transcripts (not just final outputs) to spot wasted work.
3. **Bundle repeated work** — if all three test cases resulted in subagents writing the same helper script, put it in `scripts/` and reference it from SKILL.md.

After improving, repeat the iteration loop (Step 6) into `iteration-<N+1>/`. Keep going until:

- The user says they're happy
- All feedback is empty
- No meaningful progress in the last round

**Common Fixes:**

| Issue | Fix |
|-------|-----|
| Output missing required structure | Add a template section with an explicit format example |
| Skill over-triggers | Narrow the description with more specific context phrases |
| Skill under-triggers | Make the description "pushier" per the description-writing guide |
| Subagents reinvent the same helper | Bundle the helper into `scripts/` and reference it from SKILL.md |
| grading.json field names wrong | Ensure fields are `text`, `passed`, `evidence` (not `name`/`met`/`details`) |
| Viewer not generating | Explicitly run `generate_review.py` before self-evaluating |

---

## Step 8: Advanced — Blind Comparison (Optional)

For a rigorous comparison between two skill versions, use the blind comparison system in `agents/comparator.md` and `agents/analyzer.md`. An independent agent evaluates both outputs without knowing which is which.

This is optional, and most users won't need it — the human review loop is usually sufficient.

---

## Step 9: Description Optimization

After finalizing the skill body, offer to optimize the description for better triggering accuracy.

### 9.1 Generate 20 trigger eval queries

Create a mix of should-trigger (8–10) and should-not-trigger (8–10) queries. Make them realistic and specific — include file paths, personal context, company names, typos, abbreviations, and varying lengths. Focus on **edge cases** and **near-misses** rather than clear-cut examples. Save them as JSON:

```json
[
  {"query": "ok so my boss just sent me this xlsx file ...", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```

### 9.2 Review with the user

1. Read `assets/eval_review.html`
2. Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array, `__SKILL_NAME_PLACEHOLDER__` with the skill name, and `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description
3. Write the result to `/tmp/eval_review_<skill-name>.html` and open it
4. The user edits queries, toggles should-trigger flags, then clicks "Export Eval Set"
5. Read the downloaded `~/Downloads/eval_set.json` (check for `eval_set (1).json`)

### 9.3 Run the optimization loop

```bash
python -m scripts.run_loop \
  --eval-set <path-to-eval_set.json> \
  --skill-path <skill-path> \
  --model claude-sonnet-4-6 \
  --max-iterations 5 \
  --verbose
```

Use the model ID from your system prompt. The loop splits queries 60% train / 40% held-out test, evaluates the current description (3 runs per query), proposes improvements based on failures, iterates up to 5 times, and selects `best_description` by test score (not train score) to avoid overfitting. Periodically tail the output to give the user progress updates.

### 9.4 Apply the result

Take `best_description` from the JSON output, update the SKILL.md frontmatter, and show the user a before/after diff with the final scores.
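A minimal sketch of applying the result, assuming the loop's JSON output exposes a top-level `best_description` key and the frontmatter keeps `description:` on a single line (verify both against the actual `run_loop` output before relying on this):

```python
import json
import re
from pathlib import Path

# Illustrative paths: substitute the real run_loop output file and skill location.
result = json.loads(Path("optimization_result.json").read_text())
best_description = result["best_description"]  # assumed key name

skill_md = Path("my-skill/SKILL.md")
text = skill_md.read_text()

# Swap the description value inside the YAML frontmatter, leaving the body untouched.
# Naive single-line replacement; quote the value if it contains YAML-special characters.
updated, count = re.subn(
    r"(?m)^description:.*$",
    lambda m: f"description: {best_description}",
    text,
    count=1,
)
if count != 1:
    raise SystemExit("No description field found in frontmatter; update it manually.")

skill_md.write_text(updated)
print(f"Updated description to: {best_description[:80]}...")
```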
---

## Step 10: Package and Present

If the `present_files` tool is available:

```bash
python -m scripts.package_skill
```

Direct the user to the resulting `.skill` file path for installation.

---

## Step 11: Environment-Specific Adaptations

### Claude.ai

- **No subagents**: run test cases sequentially, executing SKILL.md instructions inline. Skip baseline runs.
- **No browser**: skip `generate_review.py`. Present results directly in the conversation; ask for feedback inline.
- **No benchmarking**: skip quantitative benchmarks. Focus on qualitative user feedback.
- **Description optimization**: requires the `claude -p` CLI — skip if unavailable.
- **Blind comparison**: requires subagents — skip.
- **Updating an existing skill**: preserve the original `name` frontmatter. Copy to `/tmp/skill-name/` before editing (the installed path may be read-only).

### Cowork

- Subagents are available — full parallel test execution works.
- No display: use `--static <output-path>` for the eval viewer to write standalone HTML. The user clicks to open it in their browser.
- Feedback: "Submit All Reviews" downloads `feedback.json`. Copy it into the workspace directory for the next iteration.
- Packaging works — `package_skill.py` just needs Python and a filesystem.
- Description optimization (`run_loop.py`) works fine in Cowork.
- **Updating an existing skill**: follow the Claude.ai update guidance above.

---

## Final Checklist (MANDATORY — do not skip)

### Verification Script

```bash
echo "=== FINAL OUTPUT VERIFICATION ==="
SKILL_NAME="${SKILL_NAME:-my-skill}"
WORKSPACE="${SKILL_NAME}-workspace"

# Check SKILL.md exists and has frontmatter
if [ -f "${SKILL_NAME}/SKILL.md" ]; then
  echo "PASS: ${SKILL_NAME}/SKILL.md exists"
  python3 -c "import yaml,re,pathlib; t=pathlib.Path('${SKILL_NAME}/SKILL.md').read_text(); m=re.match(r'^---\n(.*?)\n---',t,re.DOTALL); fm=yaml.safe_load(m.group(1)); assert 'name' in fm and 'description' in fm; print('PASS: SKILL.md frontmatter valid')"
else
  echo "FAIL: ${SKILL_NAME}/SKILL.md missing"
fi

# Check evals
[ -f "evals/evals.json" ] && echo "PASS: evals/evals.json" || echo "FAIL: evals/evals.json missing"

# Check benchmark for latest iteration
LATEST=$(ls -d "${WORKSPACE}"/iteration-* 2>/dev/null | sort -V | tail -1)
if [ -n "$LATEST" ]; then
  [ -f "${LATEST}/benchmark.json" ] && echo "PASS: benchmark.json" || echo "FAIL: benchmark.json missing in ${LATEST}"
  [ -f "${LATEST}/benchmark.md" ] && echo "PASS: benchmark.md" || echo "FAIL: benchmark.md missing in ${LATEST}"
fi

# Check required output files
for f in /app/results/summary.md /app/results/validation_report.json; do
  [ -s "$f" ] && echo "PASS: $f ($(wc -c < "$f") bytes)" || echo "FAIL: $f missing or empty"
done
```

### Checklist

- [ ] `SKILL.md` exists with valid YAML frontmatter (`name`, `description`)
- [ ] `evals/evals.json` written with at least 2 test cases and assertions
- [ ] At least one complete iteration directory with `benchmark.json` and `benchmark.md`
- [ ] User has reviewed results via the eval viewer (or inline on Claude.ai)
- [ ] Description optimization completed or explicitly skipped (with user consent)
- [ ] `summary.md` written with skill name, final benchmark scores, and issues
- [ ] `validation_report.json` written with `stages`, `results`, and `overall_passed`
- [ ] Verification script printed PASS for every line

**If ANY item fails, go back and fix it before stopping.**

---

## Tips

- **Prefer bundling over re-invention.** If subagents consistently write the same helper script, that script belongs in `scripts/`. Future invocations should not pay the reinvention cost.
- **Never skip the eval viewer.** Even if benchmark numbers look great, the human review step catches qualitative issues that metrics miss. Run `generate_review.py` BEFORE self-evaluating.
- **Progressive disclosure prevents bloat.** If SKILL.md approaches 500 lines, introduce a `references/` file with a table of contents and link to it from SKILL.md. A lean SKILL.md means fast context loading on every trigger.
- **"Pushy" descriptions beat vague ones.** Claude under-triggers by default. Explicitly naming contexts in the description ("even if they don't mention X") materially improves trigger rates.
- **Explain the why, not just the what.** Instructions backed by reasoning generalize better than rigid rules. Avoid ALWAYS/NEVER in all-caps unless truly necessary.
- **The management dir is append-only.** Never delete or rewrite prior iteration records — duplicate detection and variance analysis depend on the full history.
- **Low-confidence attribution isn't a blocker.** It surfaces in the PR body so a human reviewer can rename the directory before merging.