---
sidebar_position: 6
title: Public Scoreboard
description: Auto-updated weekly scoreboard of AiSOC eval runs, with date, agent version, MITRE accuracy, MTC p50/p95, total USD cost, and total tokens. Substrate rows are deterministic CI gates (clearly labelled); wet-eval rows are appended by the T5.5 weekly job. Reproducible, append-only, public.
---

import Scoreboard from '@site/src/components/Scoreboard';

# Public benchmark scoreboard

Append-only weekly history of every published AiSOC eval run. Each row is one end-to-end run of [`scripts/run_evals.py`](https://github.com/beenuar/AiSOC/blob/main/scripts/run_evals.py) against the [200-incident corpus](./benchmark-methodology.md#2-dataset), labelled with the agent version, the commit SHA, and whether it was a deterministic substrate run or a real wet-eval against a live LLM.

:::warning Substrate rows ≠ live agent performance

Rows tagged `substrate` are the deterministic CI gate: they execute in microseconds, with no LLM call and no spend. Their token and USD figures are **budget projections** computed from the [4-chars/token estimator and the illustrative public rate card](./benchmark-methodology.md#rate-card), not real bills. Wet-eval rows (real agent, real LLM) start arriving the moment the [T5.5 weekly job](./benchmark-methodology.md#3-substrate-vs-wet-eval) lands in CI. The two row types share the same columns so the table reads uniformly, but **never quote a substrate row as live agent performance**.

:::

## How this page is updated

The scoreboard is sourced from a single checked-in JSON file at [`apps/docs/static/data/scoreboard.json`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/static/data/scoreboard.json), validated against [`scoreboard.schema.json`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/static/data/scoreboard.schema.json) on every docs build via `pnpm --filter @aisoc/docs scoreboard:check`. There are two ways a row reaches that file:

1. **Substrate rows (per-PR CI gate).** A row is appended whenever the substrate snapshot drifts enough to publish. It is captured during release tagging and committed by hand under `feat(eval): scoreboard substrate row for v`.
2. **Wet-eval rows (T5.5 weekly job).** The [`wet-eval-weekly.yml`](./benchmark-methodology.md#3-substrate-vs-wet-eval) GitHub Action runs the live agent against the same 200-incident corpus every Sunday, captures real latency / token / USD telemetry, and opens an auto-PR appending one row to `scoreboard.json`. Wet-eval rows show up at the top of the table and on the trend chart.

This append-only contract is deliberate: the scoreboard becomes more informative the longer it runs. We never silently rewrite history; if a historic row turns out to be wrong, we add a follow-up row with the correction in `notes` and link the issue.

## Reproducing any single row

Every row in the table can be reproduced from a fresh clone:

```bash
git clone https://github.com/beenuar/AiSOC.git
cd AiSOC
git checkout <commit-sha>   # the value in the "Commit" column
pnpm install
pnpm eval:public            # writes eval_report.json + eval/charts/
```

For wet-eval rows you additionally need an `OPENAI_API_KEY` (or another provider exposed via the `--telemetry-model` flag) and access to the weekly workflow inputs documented in [benchmark-methodology.md → How to reproduce](./benchmark-methodology.md#how-to-reproduce).

## Schema and column reference

| Column | JSON field | Notes |
|--------|------------|-------|
| Date | `date` | ISO date of the eval run. |
| Agent | `agent_version` | Tagged release of `services/agents` (e.g. `v1.4.1`). |
| Commit | `commit_sha` | Short or full git SHA the run was produced against. |
| Mode | `eval_mode` + `substrate` | `substrate-only` (no LLM) or `wet-eval-*` (live agent). The badge colour repeats this distinction. |
| MITRE acc. | `mitre_accuracy` | Per-case accuracy on the 200-incident corpus. |
| MTC p50 | `mtc_p50_seconds` | Mean time to closure, p50, end-to-end. `n/a` on substrate rows because the substrate runs in microseconds, which is not a meaningful end-to-end timing. |
| MTC p95 | `mtc_p95_seconds` | Same, p95. |
| USD total | `usd_total` | On wet-eval rows, real spend. On substrate rows, a `budget` projection from the rate card. |
| Tokens total | `tokens_total` | Total tokens across the 200 investigations. |

The full per-suite breakdown (`alert_reduction`, `investigation_completeness`, `response_quality`, `playbook_completion_rate`, per-template macros) lives in [`scoreboard.json`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/static/data/scoreboard.json) and is rendered on the [main benchmark page](./benchmark.md) for the latest run; the scoreboard table keeps a tight five-number summary so the trend remains scannable.

## Comparing your own runs

If you reproduce one of the rows on your own laptop and the numbers move, that's a signal worth filing: either the harness is non-deterministic on your platform (a bug we want to know about) or your fork has diverged. Open an issue on [github.com/beenuar/AiSOC/issues](https://github.com/beenuar/AiSOC/issues) with the JSON output of `pnpm eval:public` attached and the AiSOC team will investigate.

If you reproduce against a different model or rate card and want your row on the public scoreboard, see [community submissions](./benchmark-methodology.md#how-to-compare).
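
Before filing an issue, it can help to state exactly which summary numbers moved. The following is a minimal sketch, not part of the AiSOC harness: `diff_rows` and its `rel_tol` parameter are hypothetical names, the field names come from the column reference on this page, and it assumes you have already extracted both the published row and your reproduced figures into plain dictionaries, with `n/a` timings represented as `None`. The sample values are made up for illustration.

```python
import math

# Numeric summary fields, as named in the column reference on this page.
NUMERIC_FIELDS = (
    "mitre_accuracy",
    "mtc_p50_seconds",
    "mtc_p95_seconds",
    "usd_total",
    "tokens_total",
)


def diff_rows(published, reproduced, rel_tol=0.0):
    """Return {field: (published, reproduced)} for every field that moved.

    rel_tol is a hypothetical knob; the default 0.0 demands exact equality,
    which is what a deterministic substrate run should satisfy.
    """
    drift = {}
    for field in NUMERIC_FIELDS:
        old, new = published.get(field), reproduced.get(field)
        if old is None or new is None:
            # Assumption: "n/a" timings are stored as None; flag only a mismatch.
            if old != new:
                drift[field] = (old, new)
        elif not math.isclose(old, new, rel_tol=rel_tol, abs_tol=0.0):
            drift[field] = (old, new)
    return drift


# Illustrative values only, not taken from the real scoreboard.
published = {
    "mitre_accuracy": 0.9,
    "mtc_p50_seconds": None,
    "mtc_p95_seconds": None,
    "usd_total": 1.25,
    "tokens_total": 1000,
}
reproduced = dict(published, usd_total=1.30)

print(diff_rows(published, reproduced))  # → {'usd_total': (1.25, 1.3)}
```

Exact comparison suits substrate rows, which are meant to be deterministic. For wet-eval rows, where latency and spend vary between runs, a non-zero tolerance is probably more useful.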
## Provenance

- Data file: [`apps/docs/static/data/scoreboard.json`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/static/data/scoreboard.json)
- JSON Schema: [`apps/docs/static/data/scoreboard.schema.json`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/static/data/scoreboard.schema.json)
- Renderer: [`apps/docs/src/components/Scoreboard/index.tsx`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/src/components/Scoreboard/index.tsx)
- Validator: `pnpm --filter @aisoc/docs scoreboard:check` → [`apps/docs/scripts/validate-scoreboard.mjs`](https://github.com/beenuar/AiSOC/blob/main/apps/docs/scripts/validate-scoreboard.mjs)
- Methodology: [Benchmark methodology](./benchmark-methodology.md)
- Latest snapshot tables: [Benchmark](./benchmark.md)
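
Because the data file is append-only, a reviewer could check a proposed change mechanically by confirming that every previously published row survives unchanged. The sketch below is an illustration under stated assumptions, not the project's validator: `is_append_only` is a hypothetical helper, and it assumes the rows from the old and new versions of `scoreboard.json` have already been loaded as lists of dictionaries (the file's actual top-level layout is not described on this page). The sample rows are placeholders.

```python
def is_append_only(old_rows, new_rows):
    """True if every old row appears unchanged in new_rows, in the same relative order.

    New rows may be added anywhere (top or bottom); edits, deletions, and
    reorderings of existing rows all make this return False.
    """
    remaining = iter(new_rows)
    # `row in remaining` consumes the iterator up to the first equal row,
    # so this tests that old_rows is an in-order subsequence of new_rows.
    return all(row in remaining for row in old_rows)


# Placeholder rows for illustration only.
row_a = {"date": "2000-01-02", "commit_sha": "aaaaaaa", "mitre_accuracy": 0.5}
row_b = {"date": "2000-01-09", "commit_sha": "bbbbbbb", "mitre_accuracy": 0.6}
row_a_edited = dict(row_a, mitre_accuracy=0.7)

print(is_append_only([row_a], [row_a, row_b]))         # → True
print(is_append_only([row_a], [row_a_edited, row_b]))  # → False
```

A correction to a historic row would still pass this check as long as it arrives as a new follow-up row rather than an in-place edit, which matches the contract described under "How this page is updated".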