# PluginEval: Quality Evaluation Framework PluginEval is a three-layer quality evaluation framework for Claude Code plugins and skills. It combines deterministic static analysis, LLM-based semantic judging, and Monte Carlo simulation to produce calibrated quality scores with confidence intervals. ## Overview PluginEval answers the question: **"How good is this plugin or skill?"** It evaluates across 10 quality dimensions, detects anti-patterns, assigns letter grades, and awards quality badges (Bronze through Platinum). ### Architecture ``` ┌─────────────────────────────────────────────────┐ │ CLI / Commands │ │ score · certify · compare · init │ ├─────────────────────────────────────────────────┤ │ Eval Engine │ │ Composite scoring, layer blending │ ├────────────┬────────────────┬───────────────────┤ │ Layer 1 │ Layer 2 │ Layer 3 │ │ Static │ LLM Judge │ Monte Carlo │ │ Analysis │ (Semantic) │ (Statistical) │ │ <2s, free │ ~30s, 4 calls │ ~2min, 50 calls │ ├────────────┴────────────────┴───────────────────┤ │ Parser Layer │ │ SKILL.md, agents/*.md, plugin.json │ ├─────────────────────────────────────────────────┤ │ Statistical Methods │ │ Wilson CI · Bootstrap CI · Clopper-Pearson │ │ Cohen's κ · Coefficient of Variation │ ├─────────────────────────────────────────────────┤ │ Corpus & Elo Ranking │ │ Gold standard index · Pairwise comparison │ └─────────────────────────────────────────────────┘ ``` ## Installation & Setup PluginEval lives in `plugins/plugin-eval/` and uses [uv](https://docs.astral.sh/uv/) for dependency management. ```bash cd plugins/plugin-eval # Install core dependencies (static analysis only) uv sync # Install with LLM support (Layers 2 & 3) uv sync --extra llm # Install with direct API support uv sync --extra api # Install dev dependencies (tests, linting) uv sync --extra dev ``` ### Requirements - Python ≥ 3.12 - Core: `pydantic`, `typer`, `rich`, `pyyaml` - LLM layers: `claude-agent-sdk` (uses Claude Code Max plan by default) - API alternative: `anthropic` SDK (requires `ANTHROPIC_API_KEY`) ## CLI Commands ### `score` — Evaluate a plugin or skill ```bash # Quick evaluation (static only, instant) uv run plugin-eval score path/to/skill --depth quick # Standard evaluation (static + LLM judge) uv run plugin-eval score path/to/skill --depth standard # Deep evaluation (all three layers) uv run plugin-eval score path/to/skill --depth deep # Output formats uv run plugin-eval score path/to/skill --output json uv run plugin-eval score path/to/skill --output markdown uv run plugin-eval score path/to/skill --output html # CI gate: exit code 1 if below threshold uv run plugin-eval score path/to/skill --threshold 70 ``` **Options:** | Option | Default | Description | |--------|---------|-------------| | `--depth` | `standard` | `quick`, `standard`, `deep`, `thorough` | | `--output` | `markdown` | `json`, `markdown`, `html` | | `--verbose` | `false` | Show detailed output | | `--concurrency` | `4` | Max concurrent LLM calls (1–20) | | `--auth` | `max` | Auth mode: `max` (Claude Code Max plan) or `api-key` | | `--threshold` | none | Minimum score; exit 1 if below | ### `certify` — Full certification with badge Runs at `deep` depth (all three layers). Takes 15–20 minutes. ```bash uv run plugin-eval certify path/to/skill --output markdown ``` ### `compare` — Head-to-head comparison Compare two skills side-by-side across all dimensions. ```bash uv run plugin-eval compare path/to/skill-a path/to/skill-b ``` ### `init` — Initialize corpus Build a gold-standard corpus index from a plugins directory for Elo ranking. ```bash uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus ``` ## Claude Code Integration PluginEval is also a Claude Code plugin with agents and commands. ### Slash Commands | Command | Description | | ------------------ | -------------------------------------------------------- | | `/eval ` | Evaluate a plugin or skill (orchestrates static + judge) | | `/certify ` | Full certification pipeline with badge | | `/compare ` | Head-to-head skill comparison | ### Agents | Agent | Model | Role | | ------------------- | ------ | ---------------------------------------------------------------------- | | `eval-orchestrator` | Opus | Coordinates evaluation: runs CLI, dispatches judge, computes composite | | `eval-judge` | Sonnet | LLM judge: scores 4 semantic dimensions with anchored rubrics | ### Skill The `evaluation-methodology` skill provides the full scoring methodology reference, including dimension definitions, rubric anchors, blend weights, and improvement guidance. ## The Three Evaluation Layers ### Layer 1: Static Analysis **Speed:** < 2 seconds. **Cost:** Free (no LLM calls). **Deterministic.** Runs seven structural sub-checks against the parsed SKILL.md: | Sub-check | Weight | What it measures | | ------------------------- | ------ | --------------------------------------------------------------------------------- | | `frontmatter_quality` | 32% | Name, description length, trigger-phrase quality ("Use when…", "Use PROACTIVELY") | | `orchestration_wiring` | 23% | Output/input documentation, code examples, orchestrator anti-pattern | | `progressive_disclosure` | 14% | Line count vs. sweet spot (200–600 lines), references/ and assets/ directories | | `structural_completeness` | 10% | Heading density, code blocks, examples section, troubleshooting section | | `token_efficiency` | 9% | MUST/NEVER/ALWAYS density, duplicate-line detection | | `ecosystem_coherence` | 6% | Cross-references to other skills/agents, "related"/"see also" mentions | | `harness_portability` | 6% | Codex/Cursor/OpenCode/Gemini portability — body cap, tool refs, model aliases, name collisions | Also detects anti-patterns (see below) and applies a multiplicative penalty. ### Layer 2: LLM Judge **Speed:** ~30 seconds. **Cost:** 4 LLM calls (Haiku + Sonnet). **Requires `claude-agent-sdk`.** Uses Claude as a semantic evaluator across 4 dimensions with anchored rubrics: | Dimension | Model | Method | | ----------------------- | ------ | ---------------------------------------------------------------------------- | | `triggering_accuracy` | Haiku | Generates 10 synthetic prompts (5 should-trigger, 5 should-not), computes F1 | | `orchestration_fitness` | Sonnet | Rates worker-vs-orchestrator role using 5-point anchored rubric | | `output_quality` | Sonnet | Simulates 3 realistic tasks, evaluates expected output quality | | `scope_calibration` | Sonnet | Rates scope appropriateness using 5-point anchored rubric | All 4 assessments run concurrently with semaphore-based throttling. ### Layer 3: Monte Carlo Simulation **Speed:** ~2 minutes (50 runs) to ~5 minutes (100 runs). **Cost:** 50–100 LLM calls. **Requires `claude-agent-sdk`.** Generates 15 varied prompts via Haiku, then runs N simulations to compute statistical reliability: | Metric | Measure | Statistical Method | | ------------------ | --------------------------------------- | --------------------------------- | | Activation rate | % of runs where skill activated | Wilson score CI | | Output consistency | Mean quality + coefficient of variation | Bootstrap CI (1000 resamples) | | Failure rate | % of runs that errored | Clopper-Pearson exact CI | | Token efficiency | Median tokens, IQR, outlier detection | Normalized against 8000-token cap | ## Evaluation Depths | Depth | Layers | Confidence Label | Time | Cost | | ---------- | --------------------------------------- | ---------------- | ------ | -------------- | | `quick` | Static only | Estimated | < 2s | Free | | `standard` | Static + Judge | Assessed | ~30s | 4 LLM calls | | `deep` | Static + Judge + Monte Carlo (50 runs) | Certified | ~3 min | ~54 LLM calls | | `thorough` | Static + Judge + Monte Carlo (100 runs) | Certified+ | ~6 min | ~104 LLM calls | ## The 10 Quality Dimensions Each dimension has a weight and receives scores from different layers, blended using per-dimension weights: | Dimension | Weight | Static | Judge | Monte Carlo | What it measures | | ------------------------- | ------ | ------ | ----- | ----------- | ------------------------------------------------ | | `triggering_accuracy` | 25% | 0.15 | 0.25 | 0.60 | Does the description fire for the right prompts? | | `orchestration_fitness` | 20% | 0.10 | 0.70 | 0.20 | Is it a composable worker, not an orchestrator? | | `output_quality` | 15% | 0.00 | 0.40 | 0.60 | Would it produce correct, useful output? | | `scope_calibration` | 12% | 0.30 | 0.55 | 0.15 | Is the scope well-sized for its domain? | | `progressive_disclosure` | 10% | 0.80 | 0.20 | 0.00 | Does it use references/ for large content? | | `token_efficiency` | 6% | 0.40 | 0.10 | 0.50 | Is it concise without repetition? | | `robustness` | 5% | 0.00 | 0.20 | 0.80 | Does it handle varied inputs reliably? | | `structural_completeness` | 3% | 0.90 | 0.10 | 0.00 | Does it have headings, code, examples? | | `code_template_quality` | 2% | 0.30 | 0.70 | 0.00 | Are code examples production-ready? | | `ecosystem_coherence` | 2% | 0.85 | 0.15 | 0.00 | Does it link to related skills/agents? | ### Composite Score Formula ``` Final = Σ(dimension_weight × blended_score) × 100 × anti_pattern_penalty ``` Where `blended_score` for each dimension is a weighted combination of available layer scores, renormalized to the layers actually present. ## Quality Badges | Badge | Score | Elo | Stars | Meaning | | -------- | ----- | ------ | ----- | ------------------------ | | Platinum | ≥ 90 | ≥ 1600 | ★★★★★ | Reference quality | | Gold | ≥ 80 | ≥ 1500 | ★★★★ | Production ready | | Silver | ≥ 70 | ≥ 1400 | ★★★ | Functional, needs polish | | Bronze | ≥ 60 | ≥ 1300 | ★★ | Minimum viable | Badges require both score AND Elo thresholds when Elo data is available. ## Letter Grades Scores are also converted to letter grades: | Grade | Score Range | | ----- | ----------- | | A+ | ≥ 97 | | A | ≥ 93 | | A- | ≥ 90 | | B+ | ≥ 87 | | B | ≥ 83 | | B- | ≥ 80 | | C+ | ≥ 77 | | C | ≥ 73 | | C- | ≥ 70 | | D+ | ≥ 67 | | D | ≥ 63 | | D- | ≥ 60 | | F | < 60 | ## Anti-Pattern Detection The static analyzer detects these anti-patterns, each with a severity that contributes to a multiplicative penalty: | Flag | Severity | Trigger | | ------------------- | -------- | --------------------------------------------- | | `OVER_CONSTRAINED` | 10% | > 15 MUST/ALWAYS/NEVER directives | | `EMPTY_DESCRIPTION` | 10% | Description < 20 characters | | `MISSING_TRIGGER` | 15% | No "Use when…" trigger phrase in description | | `BLOATED_SKILL` | 10% | > 800 lines without a references/ directory | | `ORPHAN_REFERENCE` | 5% | Dead link to a file in references/ | | `DEAD_CROSS_REF` | 5% | Cross-reference to a non-existent skill/agent | | `SKILL_OVER_CODEX_CAP` | 15% | Skill body > 8 KB without references/ (Codex hard-truncates) | | `CLAUDE_TOOL_REFS` | 2–10% | Backticked CamelCase tool names (`` `Read` ``, `` `Bash` ``) | | `CLAUDE_TOOL_PROSE` | 5% | Prose like "use the Read tool" (Codex prefers action verbs) | | `AGENT_NAME_COLLISION` | 10% | Agent named `default`/`worker`/`explorer` (Codex built-ins) | | `BARE_MODEL_ALIAS` | 3% | Bare `opus`/`sonnet`/`haiku` (use `inherit` for portability) | Each `harness_portability` finding carries a `remediation` string surfaced via the AntiPattern description, so the fix is in-context when the lint fires. **Penalty formula:** `penalty = max(0.5, 1.0 − 0.05 × count)` — each anti-pattern reduces the score by 5%, flooring at 50%. ## Elo Ranking System For relative quality comparison against a corpus of known skills: - **Initial rating:** 1500 - **K-factor:** 32 - **Confidence intervals:** Bootstrap resampling (500 resamples) - **Corpus management:** `init` command indexes all skills from a plugins directory - **Reference selection:** Matches by category and similar line count The Elo system uses the standard formula: `E(A) = 1 / (1 + 10^((Rb - Ra) / 400))`. ## Corpus Management The corpus is a JSON index of all skills used for Elo comparisons: ```bash # Build corpus from your plugins directory uv run plugin-eval init plugins/ --corpus-dir ~/.plugineval/corpus # The corpus stores: # - Skill name, path, category, line count # - Current Elo rating (updated after each comparison) ``` Reference skills are selected by matching category and approximate line count. ## Statistical Methods PluginEval uses rigorous statistical methods throughout: | Method | Used For | Details | | ------------------------ | -------------------------- | ----------------------------------------- | | Wilson score CI | Activation rate confidence | Handles small-sample binomial proportions | | Bootstrap CI | Output quality confidence | 1000 resamples, percentile method | | Clopper-Pearson | Failure rate confidence | Exact CI for small failure counts | | Coefficient of variation | Output consistency | std/mean ratio; lower = more consistent | | Cohen's kappa | Inter-rater agreement | For multi-judge scenarios | All statistical functions are pure Python with no external dependencies (no scipy/numpy required). ## Parser The parser extracts structured data from Claude Code plugin files: - **Skills:** Parses SKILL.md frontmatter (name, description), counts headings, code blocks, languages, MUST/NEVER/ALWAYS directives, cross-references, and detects references/ and assets/ directories - **Agents:** Parses agent .md frontmatter (name, description, model, tools), detects proactive triggers and skill references - **Plugins:** Aggregates all skills and agents from a plugin directory ## Project Structure ``` plugins/plugin-eval/ ├── .claude-plugin/ │ └── plugin.json # Claude Code plugin manifest ├── agents/ │ ├── eval-orchestrator.md # Orchestrates evaluation (Opus) │ └── eval-judge.md # LLM judge agent (Sonnet) ├── commands/ │ ├── eval.md # /eval slash command │ ├── certify.md # /certify slash command │ └── compare.md # /compare slash command ├── skills/ │ └── evaluation-methodology/ │ ├── SKILL.md # Full methodology reference │ └── references/ │ └── rubrics.md # Detailed rubric anchors ├── src/plugin_eval/ │ ├── __init__.py │ ├── cli.py # Typer CLI (score, certify, compare, init) │ ├── engine.py # Eval engine (layer coordination, composite scoring) │ ├── models.py # Pydantic models (Depth, Badge, EvalConfig, results) │ ├── parser.py # Plugin/skill/agent parser │ ├── reporter.py # JSON/Markdown/HTML output │ ├── corpus.py # Gold standard corpus for Elo ranking │ ├── elo.py # Elo rating calculator with bootstrap CI │ ├── stats.py # Statistical methods (Wilson, bootstrap, Clopper-Pearson) │ └── layers/ │ ├── __init__.py │ ├── static.py # Layer 1: deterministic structural analysis │ ├── judge.py # Layer 2: LLM semantic evaluation │ └── monte_carlo.py # Layer 3: statistical reliability simulation ├── tests/ # Comprehensive test suite │ ├── conftest.py │ ├── test_cli.py │ ├── test_engine.py │ ├── test_static.py │ ├── test_judge.py │ ├── test_monte_carlo.py │ ├── test_models.py │ ├── test_parser.py │ ├── test_reporter.py │ ├── test_corpus.py │ ├── test_elo.py │ ├── test_stats.py │ └── test_e2e.py # End-to-end tests against real plugins ├── pyproject.toml # uv/hatch project config └── uv.lock ``` ## Running Tests ```bash cd plugins/plugin-eval # Run all tests uv run pytest # Run with coverage uv run pytest --cov=plugin_eval # Run specific test file uv run pytest tests/test_static.py # Run e2e tests (requires real plugin corpus) uv run pytest tests/test_e2e.py ``` ## Example Output ### Markdown Report ``` # PluginEval Report **Path:** `plugins/python-development/skills/async-python-patterns` **Timestamp:** 2025-03-26T12:00:00+00:00 **Depth:** standard ## Overall Score | Metric | Value | |--------|-------| | Score | **78.3/100** | | Confidence | Assessed | | Badge | Silver | ## Layer Breakdown | Layer | Score | Anti-Patterns | |-------|-------|---------------| | static | 0.742 | 0 | | judge | 0.811 | 0 | ## Dimension Scores | Dimension | Weight | Score | Grade | |-----------|--------|-------|-------| | Triggering Accuracy | 25% | 0.850 | B | | Orchestration Fitness | 20% | 0.780 | C+ | | Output Quality | 15% | 0.820 | B- | | Scope Calibration | 12% | 0.750 | C | | Progressive Disclosure | 10% | 0.600 | D- | | Token Efficiency | 6% | 0.910 | A- | | ... ``` ## Tooling - **Package manager:** [uv](https://docs.astral.sh/uv/) - **Linter/formatter:** [ruff](https://docs.astral.sh/ruff/) (target Python 3.12, line length 100) - **Type checker:** [ty](https://docs.astral.sh/ty/) - **Test framework:** pytest with pytest-asyncio - **Build system:** hatchling