--- name: cost-benchmark description: Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table argument-hint: "[--llm] [--anthropic]" allowed-tools: Bash --- # Cost Benchmark Runs `scripts/bench.mjs` against the structural+adversarial corpus and writes per-case + summary results to `docs/benchmarks/runs/`. This is the verification gate that backs every measurable claim in `cost-booster-edit` / `cost-booster-route`. ## When to use - Before publishing a release — verify booster win rate didn't regress. - After expanding `bench/booster-corpus.json` — confirm new cases route correctly. - When auditing a "claimed upstream" tag — flip it to "verified" once the bench supports it. - On a cost question ("is Sonnet 4.6 cheaper than Opus 4.7 for these tasks?") — re-run with `BENCH_ANTHROPIC=1`. ## Steps 1. **Run the bench from `v3/`** (where `agent-booster` resolves): ```bash ( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # booster only — free, ~85 ms ( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Gemini 2.0 Flash (cheap) ( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \ node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Sonnet 4.6 + Opus 4.7 ``` 2. **Inspect the markdown summary** printed to stdout. The gate metric is `winRate` (Tier 1 cases). Adversarial cases are tracked separately as `escalationRate`. 3. **Persisted output** lands at: - `docs/benchmarks/runs/latest.json` — pointer to the most recent run - `docs/benchmarks/runs/.json` — historical record 4. **Read it back** in subsequent skills (e.g. `cost-report` step 2 reads `latest.json` for live tier-spend numbers). ## Smoke gates - `winRate ≥ 0.80` on Tier 1 cases (smoke step 23). Lower the threshold by editing `scripts/smoke.sh`. - `escalationRate` is reported but ungated — adversarial cases are diagnostic. ## Env overrides | Env var | Default | Purpose | |---|---|---| | `BENCH_LLM_BASELINE` | unset | `=1` runs the OpenAI-compat baseline | | `BENCH_LLM_MODEL` | `models/gemini-2.0-flash` | Override the OpenAI-compat model | | `BENCH_LLM_BASE_URL` | Gemini OpenAI shim | Override endpoint | | `BENCH_ANTHROPIC` | unset | `=1` runs Anthropic baseline (Sonnet 4.6 + Opus 4.7) | | `BENCH_ANTHROPIC_MODELS` | `claude-sonnet-4-6,claude-opus-4-7` | Comma-separated Claude IDs | | `BENCH_OUT` | timestamped file | Override output path | | `BENCH_QUIET=1` | unset | Suppress markdown summary | API keys auto-pulled from `gcloud secrets` (`GOOGLE_AI_API_KEY`, `ANTHROPIC_API_KEY`); override with `BENCH_LLM_API_KEY` / `BENCH_ANTHROPIC_API_KEY`. ## Cross-references ADR-0002 §"Decision 1" / §"Riskiest assumption" · `cost-booster-edit/SKILL.md` (verification table consumes this skill's output) · `cost-report/SKILL.md` step 2 (reads `runs/latest.json`).