---
name: skill-evaluator
description: Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.
metadata:
  skillport:
    category: skills
    tags:
      - agent-skills
---

# Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

## Quick Start

1. Read the skill's SKILL.md and understand its purpose
2. Run automated validation: `scripts/validate_skill.py`
3. Perform manual evaluation against the criteria below
4. Generate an evaluation report with scores and recommendations

## Evaluation Workflow

### Step 1: Automated Validation

Run the validation script first:

```bash
scripts/validate_skill.py
```

This checks:

- SKILL.md exists with valid YAML frontmatter
- Name follows conventions (lowercase, hyphens, max 64 chars)
- Description is present and under 1024 chars
- Body is under 500 lines
- File references are one level deep

### Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

#### A. Naming (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |

**Rules**: Max 64 chars; lowercase letters, numbers, and hyphens only; no reserved words (anthropic, claude); no XML tags.

**Good**: `processing-pdfs`, `analyzing-spreadsheets`, `building-dashboards`
**Bad**: `pdf`, `my-skill`, `ClaudeHelper`, `anthropic-tools`

#### B. Description (Weight: 20%)

| Score | Criteria |
|-------|----------|
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |

**Must include**: What the skill does AND when to use it.

**Good**: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."
**Bad**: "A skill for PDFs." or "Helps with documents."

#### C. Content Quality (Weight: 30%)

| Score | Criteria |
|-------|----------|
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |

**Ask**: "Does Claude really need this explanation?" Remove anything Claude already knows.

#### D. Structure & Organization (Weight: 25%)

| Score | Criteria |
|-------|----------|
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |

**Check**:

- SKILL.md under 500 lines
- References are one level deep (no nested chains)
- Long reference files (>100 lines) have a table of contents
- Uses forward slashes in all paths

#### E. Degrees of Freedom (Weight: 10%)

| Score | Criteria |
|-------|----------|
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |

**Guideline**:

- High freedom (text): Multiple valid approaches, context-dependent
- Medium freedom (parameterized): Preferred pattern exists, some variation OK
- Low freedom (specific scripts): Fragile operations, exact sequence required

#### F. Anti-Pattern Check (Weight: 5%)

Start from 5 and deduct points for each anti-pattern found:

- [ ] Too many options without clear recommendation (-1)
- [ ] Time-sensitive information with date conditionals (-1)
- [ ] Inconsistent terminology (-1)
- [ ] Windows-style paths (backslashes) (-1)
- [ ] Deeply nested references (more than one level) (-2)
- [ ] Scripts that punt error handling to Claude (-1)
- [ ] Magic numbers without justification (-1)

### Step 3: Generate Report

Use this template:

```markdown
# Skill Evaluation Report: [skill-name]

## Summary

- **Overall Score**: X.X/5.0
- **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]

## Dimension Scores

| Dimension | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| **Total** | | 100% | **X.XX** |

## Strengths

- [List 2-3 things done well]

## Areas for Improvement

- [List specific issues with actionable fixes]

## Anti-Patterns Found

- [List any anti-patterns detected]

## Recommendations

1. [Priority 1 fix]
2. [Priority 2 fix]
3. [Priority 3 fix]

## Pre-Publication Checklist

- [ ] Description is specific with activation triggers
- [ ] SKILL.md under 500 lines
- [ ] One-level-deep file references
- [ ] Forward slashes in all paths
- [ ] No time-sensitive information
- [ ] Consistent terminology
- [ ] Concrete examples provided
- [ ] Scripts handle errors explicitly
- [ ] All configuration values justified
- [ ] Required packages listed
- [ ] Tested with Haiku, Sonnet, Opus
```

## Score Interpretation

| Score Range | Rating | Action |
|-------------|--------|--------|
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |

## References

- [references/evaluation-criteria.md](references/evaluation-criteria.md) - Detailed evaluation criteria with examples
- [references/scoring-rubric.md](references/scoring-rubric.md) - Complete scoring rubric and edge cases

## Examples

See [evaluations/](evaluations/) for example evaluation scenarios.
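The weighted total and the score-interpretation bands above can be computed mechanically. A minimal Python sketch, assuming the six dimensions and weights from Step 2 (the dictionary keys and function names here are illustrative, not part of the skill's interface):

```python
# Weights from the Step 2 dimensions; they must sum to 1.0.
WEIGHTS = {
    "naming": 0.10,
    "description": 0.20,
    "content_quality": 0.30,
    "structure": 0.25,
    "degrees_of_freedom": 0.10,
    "anti_patterns": 0.05,
}

def overall_score(scores: dict) -> float:
    """Weighted average of per-dimension scores (each 1-5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six dimensions")
    return round(sum(WEIGHTS[d] * s for d, s in scores.items()), 2)

def rating(score: float) -> str:
    """Map an overall score to the Score Interpretation bands."""
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Acceptable"
    if score >= 2.0:
        return "Needs Work"
    return "Poor"
```

For example, scores of 4/5/3/4/5/4 across the six dimensions give an overall score of 4.0, which falls in the "Good" band.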