# 📊 Agent Skill Benchmark Report
> Generated: 2026-03-30T04:14:33.036Z
> Token counting: `ceil(characters / 4)` — cl100k_base approximation.
> Baselines: derived from **real, measured example prompts** (see Methodology).
> Quality: structural rubric (0–10), no live LLM calls required.
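
The token heuristic in the header can be sketched in a few lines. This is a minimal illustration of `ceil(characters / 4)`, not the benchmark's own code; the function name `estimate_tokens` is hypothetical, and counting characters with Python's `len` is an assumption (a JavaScript implementation would count UTF-16 code units instead, which can differ for emoji).

```python
import math


def estimate_tokens(text: str) -> int:
    """Approximate a token count as ceil(characters / 4).

    Hypothetical helper: it mirrors the heuristic stated in the report
    header and is not a real tokenizer.
    """
    return math.ceil(len(text) / 4)


if __name__ == "__main__":
    sample = "Use constructor injection."  # 26 characters
    print(estimate_tokens(sample))
```

Because this is an approximation, absolute counts may drift from a real tokenizer, but the same heuristic is applied to both the skill and the baselines, so the ratios stay comparable.
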
## ❓ How to Read This Report
This benchmark answers: **"How many tokens and dollars does an agent skill save compared to a developer writing the same guidance inline?"**
**WITHOUT a skill**: A developer writes domain knowledge directly into the prompt every time (Baseline).
**WITH a skill**: The agent loads the SKILL.md file (527 tokens on average in this run), which is structured, reusable, and cached.
**Eval Alignment**: % of eval assertion values that appear in SKILL.md. High alignment means the skill actually teaches what the evals test — the static proxy for "with skill > without skill" behavioral improvement.
## 🔢 Executive Summary
| Metric | Value |
| --------------------------------- | --------------------------------------------- |
| Total Skills Benchmarked | **238** |
| Avg. Tokens WITH Skill (SKILL.md) | **527 tokens** |
| Baseline: Light prompt (no skill) | **1449 tokens** ↓ see Methodology |
| Baseline: Heavy prompt (no skill) | **3656 tokens** ↓ see Methodology |
| Avg. Token Savings vs Light | **64%** (922 tokens/call) |
| Avg. Token Savings vs Heavy | **86%** (3129 tokens/call) |
| Avg. Quality Score | **9.8/10** |
| Skills with Evals | **236 / 238** |
| Avg. Eval Alignment | **91%** (eval assertions covered by SKILL.md) |
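
The two savings rows follow directly from the averages in the table. The sketch below reproduces that arithmetic; the function name `token_savings` is illustrative and rounding to the nearest whole percent is an assumption that happens to match the reported figures.

```python
def token_savings(skill_tokens: int, baseline_tokens: int) -> tuple:
    """Return (tokens saved per call, percent saved) against a baseline."""
    saved = baseline_tokens - skill_tokens
    percent = round(100 * saved / baseline_tokens)
    return saved, percent


# Averages taken from the Executive Summary table.
AVG_SKILL_TOKENS = 527
LIGHT_BASELINE_TOKENS = 1449
HEAVY_BASELINE_TOKENS = 3656

if __name__ == "__main__":
    print(token_savings(AVG_SKILL_TOKENS, LIGHT_BASELINE_TOKENS))  # (922, 64)
    print(token_savings(AVG_SKILL_TOKENS, HEAVY_BASELINE_TOKENS))  # (3129, 86)
```
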
## 📜 History
| Version | Date | Skills | Avg Tokens | Savings vs Heavy | Quality | Report |
| ------- | ---------- | ------ | ---------- | ----------- | ------- | -------------------------------------------- |
| v2.0.1 | 2026-03-30 | 238 | 527 | 86% | 9.8/10 | [Full Report](benchmarks/archive/v2.0.1.md) |
| v2.0.0 | 2026-03-25 | 235 | 523 | 86% | 9.9/10 | [Full Report](benchmarks/archive/v2.0.0.md) |
| v1.10.3 | 2026-03-21 | 234 | 505 | 86% | 9.8/10 | [Full Report](benchmarks/archive/v1.10.3.md) |
| v1.10.1 | 2026-03-16 | 229 | 428 | 88% | 9.9/10 | [Full Report](benchmarks/archive/v1.10.1.md) |
| v1.10.0 | 2026-03-16 | 229 | 434 | 88% | 7/10 | [Full Report](benchmarks/archive/v1.10.0.md) |
| v1.9.3 | 2026-03-15 | 229 | 460 | 87% | 8.9/10 | [Full Report](benchmarks/archive/v1.9.3.md) |
| v1.9.2 | 2026-03-07 | 228 | 458 | 87% | 8.9/10 | [Full Report](benchmarks/archive/v1.9.2.md) |
| v1.9.1 | 2026-03-07 | 228 | 458 | 87% | 8.9/10 | [Full Report](benchmarks/archive/v1.9.1.md) |
| v1.9.0 | 2026-03-05 | 228 | 457 | 88% | 8.9/10 | [Full Report](benchmarks/archive/v1.9.0.md) |
| v1.8.0 | 2026-03-02 | 228 | 443 | 88% | 8.9/10 | [Full Report](benchmarks/archive/v1.8.0.md) |
| v1.7.3 | 2026-02-25 | 222 | 418 | 89% | 8.9/10 | [Full Report](benchmarks/archive/v1.7.3.md) |
| v1.7.2 | 2026-02-25 | 220 | 413 | 89% | 8.9/10 | [Full Report](benchmarks/archive/v1.7.2.md) |
### 💰 Cost Comparison — Per Single Call (Average Skill)
> Each row prices the **Heavy Baseline** (3656 input tokens) against the average SKILL.md (527 input tokens) at the per-model input rates listed under Pricing.
| Model | Original Cost | Skill Cost | Net Savings | % Saved |
| ----------------- | ------------- | ---------- | -------------- | ------- |
| Gemini 3 Flash | $0.0018280 | $0.0002635 | **$0.0015645** | 86% |
| GPT-5 | $0.0045700 | $0.0006588 | **$0.0039112** | 86% |
| Gemini 3.1 Pro | $0.0073120 | $0.0010540 | **$0.0062580** | 86% |
| Claude Sonnet 4.5 | $0.0109680 | $0.0015810 | **$0.0093870** | 86% |
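
The per-call figures can be reproduced from the token averages and the input prices listed under Pricing. This is a sketch of the arithmetic only; `call_cost` and `cost_row` are illustrative names, and output tokens are ignored because the report prices input tokens only.

```python
# Input prices in USD per 1M tokens, copied from the Pricing section.
PRICE_PER_MILLION = {
    "Gemini 3 Flash": 0.50,
    "GPT-5": 1.25,
    "Gemini 3.1 Pro": 2.00,
    "Claude Sonnet 4.5": 3.00,
}

HEAVY_BASELINE_TOKENS = 3656
AVG_SKILL_TOKENS = 527


def call_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens once."""
    return tokens * price_per_million / 1_000_000


def cost_row(model: str) -> dict:
    """Build one row of the comparison for a given model."""
    price = PRICE_PER_MILLION[model]
    original = call_cost(HEAVY_BASELINE_TOKENS, price)
    skill = call_cost(AVG_SKILL_TOKENS, price)
    savings = original - skill
    return {
        "model": model,
        "original": original,
        "skill": skill,
        "savings": savings,
        "percent": round(100 * savings / original),
    }


if __name__ == "__main__":
    for name in PRICE_PER_MILLION:
        row = cost_row(name)
        print(f"{row['model']}: ${row['original']:.7f} -> "
              f"${row['skill']:.7f} ({row['percent']}% saved)")
```

The percentage is identical for every model because price cancels out of the ratio; only the absolute dollar savings change.
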
### 📈 Monthly Savings at Scale (Avg Skill vs Heavy Prompt)
> Figures assume a 30-day month. The 50-skill column multiplies the single-skill savings by 50.

| Daily Calls | Original Cost/mo | Monthly Savings (1 skill) | Monthly Savings (50 skills) | Model |
| ----------- | ---------------- | ------------------------- | --------------------------- | ----------------- |
| 1,000 | $137.10 | $117.34 | $5,866.88 | GPT-5 |
| 1,000 | $329.04 | $281.61 | $14,080.50 | Claude Sonnet 4.5 |
| 1,000 | $219.36 | $187.74 | $9,387.00 | Gemini 3.1 Pro |
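
The monthly numbers scale the per-call savings by call volume. The sketch below assumes a 30-day month, which is what the table's figures imply, and treats the 50-skill column as 50 skills each invoked at the same daily rate; both are readings of the table rather than documented behavior, and `monthly_savings` is a hypothetical name.

```python
def monthly_savings(
    daily_calls: int,
    price_per_million: float,
    baseline_tokens: int = 3656,  # Heavy baseline from the summary
    skill_tokens: int = 527,      # average SKILL.md size
    days: int = 30,               # assumption: 30-day month
    skills: int = 1,
) -> float:
    """USD saved per month by sending a skill instead of the baseline prompt."""
    saved_tokens = baseline_tokens - skill_tokens
    per_call = saved_tokens * price_per_million / 1_000_000
    return per_call * daily_calls * days * skills


if __name__ == "__main__":
    # GPT-5 input price from the Pricing section: $1.25 per 1M tokens.
    print(f"${monthly_savings(1000, 1.25):,.2f}")
    print(f"${monthly_savings(1000, 1.25, skills=50):,.2f}")
```
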
## 📦 Per-Category Summary
📦 android (22 skills | avg 350 tokens | quality 10.0/10 | eval alignment 92%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ------------------------------ | ------ | ------------------ | ------- | ----- | ------- |
| `android-architecture` | 505 | █████████░ 86% | 10/10 | 3 | ✅ 88% |
| `android-background-work` | 305 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `android-compose` | 450 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `android-concurrency` | 315 | █████████░ 91% | 10/10 | 3 | ✅ 89% |
| `android-deployment` | 328 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `android-design-system` | 300 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `android-di` | 311 | █████████░ 91% | 10/10 | 3 | ✅ 88% |
| `android-legacy-navigation` | 311 | █████████░ 91% | 10/10 | 3 | ✅ 86% |
| `android-legacy-security` | 446 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `android-legacy-state` | 258 | █████████░ 93% | 10/10 | 3 | ✅ 100% |
| `android-navigation` | 277 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `android-navigation-type-safe` | 267 | █████████░ 93% | 10/10 | 3 | ✅ 83% |
| `android-networking` | 415 | █████████░ 89% | 10/10 | 3 | ⚠️ 67% |
| `android-notifications` | 436 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `android-performance` | 385 | █████████░ 89% | 10/10 | 3 | ✅ 88% |
| `android-persistence` | 298 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `android-resources` | 412 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `android-security` | 398 | █████████░ 89% | 10/10 | 3 | ✅ 86% |
| `android-state` | 363 | █████████░ 90% | 10/10 | 3 | ✅ 88% |
| `android-testing` | 318 | █████████░ 91% | 10/10 | 3 | ✅ 88% |
| `android-tooling` | 296 | █████████░ 92% | 10/10 | 3 | ✅ 88% |
| `android-xml-views` | 297 | █████████░ 92% | 10/10 | 3 | ✅ 88% |
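
The bar in the Savings column appears to be a ten-segment gauge with one filled block per ten percentage points, rounded to the nearest segment. The sketch below is inferred from the rendered rows and is not taken from the benchmark source; `savings_bar` is a hypothetical name.

```python
def savings_bar(percent: float, width: int = 10) -> str:
    """Render a percentage as a fixed-width block gauge plus a label.

    Assumption: segments are rounded half up, which matches rows such as
    86% (nine blocks) and 84% (eight blocks) in the tables.
    """
    filled = int(percent * width / 100 + 0.5)
    filled = max(0, min(width, filled))
    return "█" * filled + "░" * (width - filled) + f" {round(percent)}%"


if __name__ == "__main__":
    for value in (66, 84, 86, 93):
        print(savings_bar(value))
```
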
📦 angular (16 skills | avg 502 tokens | quality 9.9/10 | eval alignment 84%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ------------------------------ | ------ | ------------------ | ------- | ----- | ------- |
| `angular-architecture` | 620 | ████████░░ 83% | 10/10 | 6 | ✅ 95% |
| `angular-component-patterns` | 553 | █████████░ 85% | 10/10 | 6 | ✅ 91% |
| `angular-components` | 568 | ████████░░ 84% | 10/10 | 6 | ✅ 87% |
| `angular-dependency-injection` | 525 | █████████░ 86% | 10/10 | 6 | ✅ 89% |
| `angular-directives-pipes` | 496 | █████████░ 86% | 10/10 | 6 | ✅ 95% |
| `angular-forms` | 346 | █████████░ 91% | 10/10 | 6 | ⚠️ 58% |
| `angular-http-client` | 560 | █████████░ 85% | 10/10 | 6 | ✅ 96% |
| `angular-performance` | 476 | █████████░ 87% | 10/10 | 6 | ✅ 82% |
| `angular-routing` | 381 | █████████░ 90% | 10/10 | 6 | ⚠️ 43% |
| `angular-rxjs-interop` | 507 | █████████░ 86% | 10/10 | 6 | ✅ 100% |
| `angular-security` | 500 | █████████░ 86% | 10/10 | 6 | ✅ 89% |
| `angular-ssr` | 473 | █████████░ 87% | 10/10 | 6 | ✅ 90% |
| `angular-state-management` | 407 | █████████░ 89% | 10/10 | 6 | ✅ 81% |
| `angular-style-guide` | 521 | █████████░ 86% | 10/10 | 6 | ✅ 81% |
| `angular-testing` | 425 | █████████░ 88% | 10/10 | 6 | ✅ 70% |
| `angular-tooling` | 675 | ████████░░ 82% | 8/10 | 6 | ✅ 100% |
📦 common (31 skills | avg 622 tokens | quality 9.7/10 | eval alignment 91%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------------------- | ------ | ------------------ | ------- | ------- | ------- |
| `common-architecture-audit` | 623 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `common-architecture-diagramming` | 453 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `common-best-practices` | 391 | █████████░ 89% | 10/10 | 3 | ✅ 91% |
| `common-code-review` | 383 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `common-context-optimization` | 574 | ████████░░ 84% | 10/10 | 3 | ✅ 100% |
| `common-debugging` | 396 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `common-documentation` | 483 | █████████░ 87% | 10/10 | 3 | ✅ 89% |
| `common-error-handling` | 395 | █████████░ 89% | 10/10 | 3 | ✅ 78% |
| `common-feedback-reporter` | 670 | ████████░░ 82% | 10/10 | 3 | ✅ 100% |
| `common-git-collaboration` | 507 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `common-mobile-animation` | 542 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `common-mobile-ux-core` | 369 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `common-observability` | 380 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `common-performance-engineering` | 677 | ████████░░ 81% | 10/10 | 3 | ✅ 78% |
| `common-product-requirements` | 431 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `common-protocol-enforcement` | 466 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `common-security-audit` | 883 | ████████░░ 76% | 10/10 | 3 | ✅ 100% |
| `common-security-standards` | 709 | ████████░░ 81% | 10/10 | 3 | ✅ 100% |
| `common-session-retrospective` | 659 | ████████░░ 82% | 10/10 | 3 | ✅ 100% |
| `common-skill-creator` | 1238 | ███████░░░ 66% | 10/10 | 3 | ✅ 75% |
| `common-store-changelog` | 715 | ████████░░ 80% | 10/10 | 4 | ⚠️ 43% |
| `common-system-design` | 713 | ████████░░ 80% | 10/10 | 3 | ✅ 100% |
| `common-tdd` | 647 | ████████░░ 82% | 10/10 | 3 | ⚠️ 0% |
| `common-ui-design` | 784 | ████████░░ 79% | 10/10 | 6 | ✅ 100% |
| `common-workflow-writing` | 563 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `common-llm-security` | 688 | ████████░░ 81% | 9/10 | 5 | ✅ 100% |
| `common-owasp` | 901 | ████████░░ 75% | 9/10 | 5 | ✅ 91% |
| `common-accessibility` | 1009 | ███████░░░ 72% | 8/10 | 3 | ✅ 100% |
| `common-api-design` | 844 | ████████░░ 77% | 8/10 | 3 | ✅ 100% |
| `common-dast-tooling` | 607 | ████████░░ 83% | 8/10 | ❌ none | n/a |
| `common-learning-log` | 571 | ████████░░ 84% | 8/10 | ❌ none | n/a |
📦 dart (3 skills | avg 560 tokens | quality 10.0/10 | eval alignment 100%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `dart-best-practices` | 526 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `dart-language` | 637 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `dart-tooling` | 518 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
📦 database (3 skills | avg 566 tokens | quality 10.0/10 | eval alignment 95%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `database-mongodb` | 624 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `database-postgresql` | 471 | █████████░ 87% | 10/10 | 3 | ✅ 86% |
| `database-redis` | 602 | ████████░░ 84% | 10/10 | 3 | ✅ 100% |
📦 flutter (21 skills | avg 542 tokens | quality 9.5/10 | eval alignment 91%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ------------------------------------------ | ------ | ------------------ | ------- | ----- | ------- |
| `flutter-cicd` | 575 | ████████░░ 84% | 10/10 | 3 | ✅ 100% |
| `flutter-design-system` | 525 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `flutter-error-handling` | 616 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `flutter-feature-based-clean-architecture` | 713 | ████████░░ 80% | 10/10 | 3 | ⚠️ 30% |
| `flutter-getx-navigation` | 382 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `flutter-idiomatic-flutter` | 370 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `flutter-layer-based-clean-architecture` | 677 | ████████░░ 81% | 10/10 | 3 | ✅ 100% |
| `flutter-performance` | 466 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `flutter-retrofit-networking` | 565 | █████████░ 85% | 10/10 | 3 | ⚠️ 67% |
| `flutter-riverpod-state-management` | 557 | █████████░ 85% | 10/10 | 3 | ⚠️ 50% |
| `flutter-testing` | 750 | ████████░░ 79% | 10/10 | 3 | ✅ 100% |
| `flutter-widgets` | 499 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `flutter-auto-route-navigation` | 511 | █████████░ 86% | 9/10 | 3 | ✅ 100% |
| `flutter-bloc-state-management` | 666 | ████████░░ 82% | 9/10 | 3 | ✅ 100% |
| `flutter-dependency-injection` | 526 | █████████░ 86% | 9/10 | 3 | ✅ 80% |
| `flutter-localization` | 500 | █████████░ 86% | 9/10 | 3 | ✅ 100% |
| `flutter-navigation` | 400 | █████████░ 89% | 9/10 | 3 | ✅ 100% |
| `flutter-notifications` | 415 | █████████░ 89% | 9/10 | 3 | ✅ 100% |
| `flutter-security` | 502 | █████████░ 86% | 9/10 | 3 | ✅ 75% |
| `flutter-getx-state-management` | 506 | █████████░ 86% | 8/10 | 3 | ✅ 100% |
| `flutter-go-router-navigation` | 655 | ████████░░ 82% | 8/10 | 3 | ✅ 100% |
📦 golang (11 skills | avg 449 tokens | quality 9.9/10 | eval alignment 93%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ----------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `golang-api-server` | 445 | █████████░ 88% | 10/10 | 3 | ✅ 80% |
| `golang-architecture` | 500 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `golang-concurrency` | 427 | █████████░ 88% | 10/10 | 3 | ⚠️ 67% |
| `golang-configuration` | 434 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `golang-database` | 448 | █████████░ 88% | 10/10 | 3 | ✅ 71% |
| `golang-error-handling` | 345 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `golang-language` | 499 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `golang-logging` | 390 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `golang-security` | 511 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `golang-testing` | 418 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `golang-tooling` | 518 | █████████░ 86% | 9/10 | 4 | ✅ 100% |
📦 ios (15 skills | avg 368 tokens | quality 10.0/10 | eval alignment 87%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| -------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `ios-app-lifecycle` | 345 | █████████░ 91% | 10/10 | 3 | ✅ 89% |
| `ios-architecture` | 663 | ████████░░ 82% | 10/10 | 3 | ✅ 100% |
| `ios-dependency-injection` | 307 | █████████░ 92% | 10/10 | 3 | ✅ 89% |
| `ios-deployment` | 337 | █████████░ 91% | 10/10 | 3 | ✅ 89% |
| `ios-design-system` | 240 | █████████░ 93% | 10/10 | 3 | ✅ 100% |
| `ios-localization` | 372 | █████████░ 90% | 10/10 | 3 | ✅ 78% |
| `ios-navigation` | 295 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `ios-networking` | 371 | █████████░ 90% | 10/10 | 3 | ⚠️ 56% |
| `ios-notifications` | 310 | █████████░ 92% | 10/10 | 3 | ✅ 100% |
| `ios-performance` | 363 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `ios-persistence` | 343 | █████████░ 91% | 10/10 | 3 | ⚠️ 67% |
| `ios-security` | 379 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `ios-state-management` | 350 | █████████░ 90% | 10/10 | 3 | ⚠️ 56% |
| `ios-swiftui` | 429 | █████████░ 88% | 10/10 | 3 | ✅ 88% |
| `ios-ui-navigation` | 417 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
📦 java (5 skills | avg 494 tokens | quality 10.0/10 | eval alignment 98%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `java-best-practices` | 479 | █████████░ 87% | 10/10 | 3 | ✅ 89% |
| `java-concurrency` | 448 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `java-language` | 535 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `java-testing` | 534 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `java-tooling` | 473 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
📦 javascript (3 skills | avg 419 tokens | quality 10.0/10 | eval alignment 100%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `javascript-best-practices` | 403 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `javascript-language` | 511 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `javascript-tooling` | 342 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
📦 kotlin (4 skills | avg 410 tokens | quality 10.0/10 | eval alignment 95%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ----------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `kotlin-best-practices` | 466 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `kotlin-coroutines` | 384 | █████████░ 89% | 10/10 | 3 | ✅ 89% |
| `kotlin-language` | 445 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `kotlin-tooling` | 346 | █████████░ 91% | 10/10 | 3 | ✅ 89% |
📦 laravel (10 skills | avg 650 tokens | quality 10.0/10 | eval alignment 82%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ------------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `laravel-api` | 707 | ████████░░ 81% | 10/10 | 6 | ✅ 88% |
| `laravel-architecture` | 395 | █████████░ 89% | 10/10 | 6 | ⚠️ 17% |
| `laravel-background-processing` | 621 | ████████░░ 83% | 10/10 | 6 | ✅ 91% |
| `laravel-clean-architecture` | 662 | ████████░░ 82% | 10/10 | 6 | ✅ 76% |
| `laravel-database-expert` | 703 | ████████░░ 81% | 10/10 | 6 | ✅ 84% |
| `laravel-eloquent` | 623 | ████████░░ 83% | 10/10 | 6 | ✅ 85% |
| `laravel-security` | 727 | ████████░░ 80% | 10/10 | 6 | ✅ 95% |
| `laravel-sessions-middleware` | 675 | ████████░░ 82% | 10/10 | 6 | ✅ 90% |
| `laravel-testing` | 708 | ████████░░ 81% | 10/10 | 6 | ✅ 95% |
| `laravel-tooling` | 675 | ████████░░ 82% | 10/10 | 6 | ✅ 100% |
📦 nestjs (21 skills | avg 632 tokens | quality 9.9/10 | eval alignment 98%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ----------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `nestjs-api-standards` | 628 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `nestjs-architecture` | 551 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `nestjs-bullmq` | 900 | ████████░░ 75% | 10/10 | 3 | ✅ 100% |
| `nestjs-caching` | 616 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `nestjs-configuration` | 611 | ████████░░ 83% | 10/10 | 3 | ✅ 83% |
| `nestjs-database` | 681 | ████████░░ 81% | 10/10 | 3 | ✅ 100% |
| `nestjs-deployment` | 717 | ████████░░ 80% | 10/10 | 3 | ✅ 100% |
| `nestjs-documentation` | 542 | █████████░ 85% | 10/10 | 3 | ✅ 83% |
| `nestjs-error-handling` | 587 | ████████░░ 84% | 10/10 | 3 | ✅ 100% |
| `nestjs-file-uploads` | 431 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `nestjs-notification` | 511 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `nestjs-observability` | 463 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `nestjs-performance` | 974 | ███████░░░ 73% | 10/10 | 3 | ✅ 100% |
| `nestjs-real-time` | 905 | ████████░░ 75% | 10/10 | 3 | ✅ 100% |
| `nestjs-scheduling` | 577 | ████████░░ 84% | 10/10 | 3 | ✅ 100% |
| `nestjs-search` | 533 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `nestjs-security` | 759 | ████████░░ 79% | 10/10 | 3 | ✅ 100% |
| `nestjs-security-isolation` | 536 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `nestjs-testing` | 556 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `nestjs-transport` | 451 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `nestjs-controllers-services` | 747 | ████████░░ 80% | 8/10 | 3 | ✅ 100% |
📦 nextjs (18 skills | avg 642 tokens | quality 9.6/10 | eval alignment 85%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| -------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `nextjs-app-router` | 987 | ███████░░░ 73% | 10/10 | 6 | ✅ 100% |
| `nextjs-architecture` | 1065 | ███████░░░ 71% | 10/10 | 6 | ✅ 93% |
| `nextjs-authentication` | 492 | █████████░ 87% | 10/10 | 6 | ⚠️ 61% |
| `nextjs-caching` | 813 | ████████░░ 78% | 10/10 | 6 | ✅ 100% |
| `nextjs-data-access-layer` | 523 | █████████░ 86% | 10/10 | 6 | ✅ 88% |
| `nextjs-data-fetching` | 467 | █████████░ 87% | 10/10 | 6 | ✅ 100% |
| `nextjs-optimization` | 512 | █████████░ 86% | 10/10 | 6 | ✅ 87% |
| `nextjs-rendering` | 736 | ████████░░ 80% | 10/10 | 6 | ✅ 100% |
| `nextjs-server-actions` | 735 | ████████░░ 80% | 10/10 | 6 | ✅ 85% |
| `nextjs-server-components` | 635 | ████████░░ 83% | 10/10 | 6 | ✅ 78% |
| `nextjs-upgrade` | 559 | █████████░ 85% | 10/10 | 6 | ⚠️ 64% |
| `nextjs-i18n` | 594 | ████████░░ 84% | 9/10 | 6 | ✅ 100% |
| `nextjs-pages-router` | 654 | ████████░░ 82% | 9/10 | 6 | ✅ 100% |
| `nextjs-security` | 678 | ████████░░ 81% | 9/10 | 6 | ✅ 100% |
| `nextjs-state-management` | 442 | █████████░ 88% | 9/10 | 6 | ⚠️ 0% |
| `nextjs-styling` | 654 | ████████░░ 82% | 9/10 | 6 | ✅ 100% |
| `nextjs-testing` | 622 | ████████░░ 83% | 9/10 | 6 | ✅ 100% |
| `nextjs-tooling` | 392 | █████████░ 89% | 9/10 | 6 | ✅ 73% |
📦 php (7 skills | avg 513 tokens | quality 9.6/10 | eval alignment 83%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| -------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `php-best-practices` | 522 | █████████░ 86% | 10/10 | 6 | ✅ 92% |
| `php-security` | 538 | █████████░ 85% | 10/10 | 6 | ✅ 100% |
| `php-testing` | 533 | █████████░ 85% | 10/10 | 6 | ⚠️ 13% |
| `php-tooling` | 546 | █████████░ 85% | 10/10 | 6 | ✅ 86% |
| `php-concurrency` | 525 | █████████░ 86% | 9/10 | 6 | ✅ 100% |
| `php-error-handling` | 472 | █████████░ 87% | 9/10 | 6 | ✅ 100% |
| `php-language` | 456 | █████████░ 88% | 9/10 | 6 | ✅ 91% |
📦 quality-engineering (5 skills | avg 777 tokens | quality 10.0/10 | eval alignment 82%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ---------------------------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `quality-engineering-business-analysis` | 1042 | ███████░░░ 71% | 10/10 | 6 | ✅ 95% |
| `quality-engineering-jira-integration` | 562 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `quality-engineering-quality-assurance` | 467 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `quality-engineering-zephyr-coverage-analysis` | 694 | ████████░░ 81% | 10/10 | 3 | ✅ 83% |
| `quality-engineering-zephyr-test-generation` | 1121 | ███████░░░ 69% | 10/10 | 3 | ⚠️ 33% |
📦 react (8 skills | avg 535 tokens | quality 10.0/10 | eval alignment 95%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| -------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `react-component-patterns` | 475 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `react-hooks` | 634 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `react-performance` | 734 | ████████░░ 80% | 10/10 | 3 | ✅ 100% |
| `react-security` | 508 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `react-state-management` | 532 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `react-testing` | 530 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `react-tooling` | 419 | █████████░ 89% | 10/10 | 3 | ⚠️ 57% |
| `react-typescript` | 447 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
📦 react-native (13 skills | avg 431 tokens | quality 10.0/10 | eval alignment 97%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| -------------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `react-native-architecture` | 553 | █████████░ 85% | 10/10 | 3 | ✅ 83% |
| `react-native-components` | 377 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `react-native-deployment` | 526 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `react-native-dls` | 257 | █████████░ 93% | 10/10 | 3 | ✅ 100% |
| `react-native-navigation` | 339 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `react-native-navigation-v6` | 499 | █████████░ 86% | 10/10 | 3 | ✅ 86% |
| `react-native-notifications` | 357 | █████████░ 90% | 10/10 | 3 | ✅ 100% |
| `react-native-performance` | 566 | █████████░ 85% | 10/10 | 3 | ✅ 89% |
| `react-native-platform-specific` | 397 | █████████░ 89% | 10/10 | 3 | ✅ 100% |
| `react-native-security` | 566 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `react-native-state-management` | 425 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
| `react-native-styling` | 317 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `react-native-testing` | 427 | █████████░ 88% | 10/10 | 3 | ✅ 100% |
📦 spring-boot (10 skills | avg 466 tokens | quality 10.0/10 | eval alignment 95%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ---------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `spring-boot-api-design` | 320 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `spring-boot-architecture` | 621 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `spring-boot-best-practices` | 564 | █████████░ 85% | 10/10 | 3 | ✅ 100% |
| `spring-boot-data-access` | 514 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `spring-boot-deployment` | 498 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `spring-boot-microservices` | 484 | █████████░ 87% | 10/10 | 3 | ⚠️ 67% |
| `spring-boot-observability` | 483 | █████████░ 87% | 10/10 | 3 | ✅ 100% |
| `spring-boot-scheduling` | 343 | █████████░ 91% | 10/10 | 3 | ✅ 100% |
| `spring-boot-security` | 513 | █████████░ 86% | 10/10 | 3 | ✅ 100% |
| `spring-boot-testing` | 319 | █████████░ 91% | 10/10 | 3 | ✅ 83% |
📦 swift (8 skills | avg 479 tokens | quality 10.0/10 | eval alignment 92%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| ------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `swift-best-practices` | 659 | ████████░░ 82% | 10/10 | 4 | ✅ 92% |
| `swift-concurrency` | 521 | █████████░ 86% | 10/10 | 5 | ✅ 93% |
| `swift-error-handling` | 513 | █████████░ 86% | 10/10 | 4 | ⚠️ 67% |
| `swift-language` | 465 | █████████░ 87% | 10/10 | 5 | ✅ 94% |
| `swift-memory-management` | 381 | █████████░ 90% | 10/10 | 4 | ✅ 89% |
| `swift-swiftui` | 427 | █████████░ 88% | 10/10 | 4 | ✅ 100% |
| `swift-testing` | 451 | █████████░ 88% | 10/10 | 4 | ✅ 100% |
| `swift-tooling` | 414 | █████████░ 89% | 10/10 | 4 | ✅ 100% |
📦 typescript (4 skills | avg 663 tokens | quality 9.5/10 | eval alignment 98%)
| Skill | Tokens | Savings (vs Heavy) | Quality | Evals | Aligned |
| --------------------------- | ------ | ------------------ | ------- | ----- | ------- |
| `typescript-best-practices` | 594 | ████████░░ 84% | 10/10 | 3 | ✅ 90% |
| `typescript-language` | 650 | ████████░░ 82% | 10/10 | 4 | ✅ 100% |
| `typescript-security` | 608 | ████████░░ 83% | 10/10 | 3 | ✅ 100% |
| `typescript-tooling` | 799 | ████████░░ 78% | 8/10 | 3 | ✅ 100% |
## ⚠️ Low Eval Alignment — Skills to Review
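> These skills have evals, but their SKILL.md content covers less than 70% of what the evals test, so the skill may not actually improve agent behavior for its target scenarios. The table lists the 15 lowest-scoring skills; six more skills at 67% are flagged ⚠️ in the per-category tables.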
| Skill | Category | Alignment | Evals | Action |
| -------------------------------------------- | ------------------- | --------- | ----- | -------------------------------------------------- |
| `common-tdd` | common | ⚠️ 0% | 3 | Add missing terms from eval assertions to SKILL.md |
| `nextjs-state-management` | nextjs | ⚠️ 0% | 6 | Add missing terms from eval assertions to SKILL.md |
| `php-testing` | php | ⚠️ 13% | 6 | Add missing terms from eval assertions to SKILL.md |
| `laravel-architecture` | laravel | ⚠️ 17% | 6 | Add missing terms from eval assertions to SKILL.md |
| `flutter-feature-based-clean-architecture` | flutter | ⚠️ 30% | 3 | Add missing terms from eval assertions to SKILL.md |
| `quality-engineering-zephyr-test-generation` | quality-engineering | ⚠️ 33% | 3 | Add missing terms from eval assertions to SKILL.md |
| `angular-routing` | angular | ⚠️ 43% | 6 | Add missing terms from eval assertions to SKILL.md |
| `common-store-changelog` | common | ⚠️ 43% | 4 | Add missing terms from eval assertions to SKILL.md |
| `flutter-riverpod-state-management` | flutter | ⚠️ 50% | 3 | Add missing terms from eval assertions to SKILL.md |
| `ios-networking` | ios | ⚠️ 56% | 3 | Add missing terms from eval assertions to SKILL.md |
| `ios-state-management` | ios | ⚠️ 56% | 3 | Add missing terms from eval assertions to SKILL.md |
| `react-tooling` | react | ⚠️ 57% | 3 | Add missing terms from eval assertions to SKILL.md |
| `angular-forms` | angular | ⚠️ 58% | 6 | Add missing terms from eval assertions to SKILL.md |
| `nextjs-authentication` | nextjs | ⚠️ 61% | 6 | Add missing terms from eval assertions to SKILL.md |
| `nextjs-upgrade` | nextjs | ⚠️ 64% | 6 | Add missing terms from eval assertions to SKILL.md |
## 🏆 Quality Leaders
| Rank | Skill | Category | Quality | Tokens | Evals | Aligned |
| ---- | --------------------------- | -------- | ------- | ------ | ----- | ------- |
| 1 | `android-architecture` | android | 10/10 | 505 | 3 | ✅ 88% |
| 2 | `android-background-work` | android | 10/10 | 305 | 3 | ✅ 100% |
| 3 | `android-compose` | android | 10/10 | 450 | 3 | ✅ 100% |
| 4 | `android-concurrency` | android | 10/10 | 315 | 3 | ✅ 89% |
| 5 | `android-deployment` | android | 10/10 | 328 | 3 | ✅ 100% |
| 6 | `android-design-system` | android | 10/10 | 300 | 3 | ✅ 100% |
| 7 | `android-di` | android | 10/10 | 311 | 3 | ✅ 88% |
| 8 | `android-legacy-navigation` | android | 10/10 | 311 | 3 | ✅ 86% |
| 9 | `android-legacy-security` | android | 10/10 | 446 | 3 | ✅ 100% |
| 10 | `android-legacy-state` | android | 10/10 | 258 | 3 | ✅ 100% |
## 📐 Methodology & Baseline Justification
### Why These Baselines?
The baselines are derived from **real, token-counted example prompts** that represent what a developer actually writes when there is no structured skill available.
NestJS serves as the **Reference Unit**: because the benchmark measures the volume of instructions a skill replaces, the same high-density reference prompts are applied to every tech stack, which keeps the savings comparable across categories.
#### 🟡 Reference Technical Prompt — Light — 1449 tokens
> **Reference Technical Prompt — Light (e.g., NestJS)**
> A compact inline system prompt used as a reference for token count calibration. Representative of focused developer instructions without a structured skill.
#### 🔴 Reference Technical Prompt — Heavy — 3656 tokens
> **Reference Technical Prompt — Heavy (e.g., NestJS Architecture)**
> A comprehensive architect-level inline prompt used as a reference for complex tasks. Includes deep patterns and rules sent by developers when no skill is present.
### 🏆 Detailed Quality Rubric (0–10)
To ensure skills are not just "short" but actually **high quality**, every skill is scored against this structural rubric:
| Points | Criterion | Requirement |
| ------ | ------------------------- | ------------------------------------------------------------------- |
| **+2** | **Structured Guidelines** | At least 3 specific instructions/bullet points. |
| **+2** | **Anti-Patterns** | `## Anti-Patterns` section or `**No X**` inline lines. |
| **+2** | **Reference Examples** | Presence of a verified `references/` folder with code. |
| **+2** | **Token Optimality** | Entire `SKILL.md` is ≤100 lines (forces brevity). |
| **+2** | **Eval Coverage** | ≥3 evals with `should_not_trigger`, ≥2 assertions each. Partial coverage earns +1. |
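
The rubric above can be read as a simple additive score. The sketch below is one possible reading, not the benchmark's implementation: the `SkillFacts` fields are hypothetical inputs, and the rule for partial eval credit (any evals present but the full requirement not met) is an assumption, since the table does not define it.

```python
from dataclasses import dataclass


@dataclass
class SkillFacts:
    """Hypothetical structural facts gathered from one skill folder."""
    guideline_count: int          # specific instructions / bullet points
    has_anti_patterns: bool       # `## Anti-Patterns` or `**No X**` lines
    has_references: bool          # `references/` folder with code
    line_count: int               # total lines in SKILL.md
    eval_count: int               # number of evals
    has_should_not_trigger: bool  # evals include a should_not_trigger case
    min_assertions: int           # fewest assertions in any eval


def quality_score(facts: SkillFacts) -> int:
    """Sum the five rubric criteria into a 0-10 score."""
    score = 0
    if facts.guideline_count >= 3:
        score += 2
    if facts.has_anti_patterns:
        score += 2
    if facts.has_references:
        score += 2
    if facts.line_count <= 100:
        score += 2
    full_evals = (
        facts.eval_count >= 3
        and facts.has_should_not_trigger
        and facts.min_assertions >= 2
    )
    if full_evals:
        score += 2
    elif facts.eval_count > 0:
        score += 1  # assumption: any evals at all count as partial coverage
    return score


if __name__ == "__main__":
    complete = SkillFacts(5, True, True, 80, 3, True, 2)
    print(quality_score(complete))
```

Under this reading, a skill with no evals can score at most 8/10, which is consistent with the two eval-less skills in the `common` category.
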
> **Eval Alignment** (reported separately, not scored): % of eval `contains` assertion values that appear in SKILL.md content. Measures whether the skill actually teaches what its evals test — the closest static proxy for **with-skill vs without-skill** behavioral improvement.
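
Eval Alignment as described above amounts to a substring coverage check. The following is a minimal sketch under stated assumptions: matching is case-insensitive, skills without assertions report no value, and anything under 70% is flagged for review. The names `eval_alignment` and `needs_review` are illustrative, and the sample text and assertion values are made up for the demonstration.

```python
from typing import List, Optional

REVIEW_THRESHOLD = 70.0  # percent; skills below this are flagged in the report


def eval_alignment(skill_md: str, assertion_values: List[str]) -> Optional[float]:
    """Percent of `contains` assertion values found in the SKILL.md text.

    Returns None when there are no assertions (shown as n/a in the tables).
    Case-insensitive matching is an assumption of this sketch.
    """
    if not assertion_values:
        return None
    haystack = skill_md.lower()
    hits = sum(1 for value in assertion_values if value.lower() in haystack)
    return 100 * hits / len(assertion_values)


def needs_review(alignment: Optional[float]) -> bool:
    """True when a measured alignment falls below the review threshold."""
    return alignment is not None and alignment < REVIEW_THRESHOLD


if __name__ == "__main__":
    skill_text = "Use constructor injection. **No field injection**."
    values = ["constructor injection", "field injection", "@Autowired"]
    score = eval_alignment(skill_text, values)
    print(round(score), needs_review(score))
```

A check like this only shows that the expected terms are present in the skill, so a high score is a necessary signal rather than proof of better agent behavior.
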
### 🛡️ How to Verify This Report
Trust but verify. You can audit the raw data and run the benchmark yourself:
1. **Clone the repo** and install dependencies (`pnpm install`).
2. **Inspect Source**: The benchmark logic is open in [cli/src/scripts/benchmark/](./cli/src/scripts/benchmark/).
### Pricing (per 1M input tokens, Feb 2026)
- **Gemini 3 Flash**: $0.50
- **GPT-5**: $1.25
- **Gemini 3.1 Pro**: $2.00
- **Claude Sonnet 4.5**: $3.00