---
name: eval-leaderboard-updater
description: Use when implementing or operating the component that records benchmark run scores to the internal quality leaderboard and weekly AI quality trend report. Maintains the historical score series, computes week-over-week deltas, and surfaces the trend data to the engineering and product teams.
license: MIT
metadata:
  id: eval.leaderboard-updater
  category: eval
  jurisdictions: [__multi__]
  priority: P2
  intent: [__eval__, leaderboard, quality-trend, reporting, ci]
  related: [eval-benchmark-runner, eval-regression-detector, eval-llm-as-judge-system-prompt, eval-rubric-legal-soundness]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Leaderboard Updater

## When to use this

The leaderboard updater runs automatically at the end of every [[eval-benchmark-runner]] run. It can also be triggered manually, either to backfill historical scores or to recalibrate after a change in scoring methodology.

## Inputs / signals

| Input | Source | Notes |
|---|---|---|
| `runId` | eval-benchmark-runner | UUID of the completed benchmark run |
| `runAt` | eval-benchmark-runner | ISO 8601 timestamp |
| `model` | eval-benchmark-runner | Model slug under test |
| `scores` | eval-benchmark-runner | Per-dataset and per-rubric scores |
| `aggregateScore` | eval-benchmark-runner | Weighted aggregate |
| `hallucinationRate` | eval-benchmark-runner | Fraction 0–1 |
| `latencyP95Ms` | eval-benchmark-runner | Infrastructure quality signal |
| `costPerMessageUsd` | eval-benchmark-runner | Economics signal |
| `regressionDetected` | eval-regression-detector | Boolean |
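
The inputs above can be modelled as a single typed payload. The following is a minimal sketch, not the project's actual type: the class name `RunResult` and the snake_case field names are assumptions chosen for illustration, and the only validation shown is the 0–1 range that the table states for `hallucinationRate`.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RunResult:
    """Hypothetical payload handed to the updater; fields mirror the inputs table."""
    run_id: str
    run_at: datetime
    model: str
    scores: dict
    aggregate_score: float
    hallucination_rate: float
    latency_p95_ms: int
    cost_per_message_usd: float
    regression_detected: bool

    def __post_init__(self):
        # The inputs table defines hallucinationRate as a fraction between 0 and 1.
        if not 0.0 <= self.hallucination_rate <= 1.0:
            raise ValueError("hallucination_rate must be a fraction in [0, 1]")


example = RunResult(
    run_id="00000000-0000-0000-0000-000000000001",  # placeholder UUID
    run_at=datetime.fromisoformat("2026-05-14T12:00:00+00:00"),
    model="claude-sonnet-4-5",
    scores={},
    aggregate_score=4.2,
    hallucination_rate=0.0,
    latency_p95_ms=900,          # illustrative value
    cost_per_message_usd=0.01,   # illustrative value
    regression_detected=False,
)
```

Rejecting an out-of-range rate at construction time keeps a malformed run from reaching the leaderboard table.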

## Logic

### Step 1 — Persist to leaderboard table

```sql
INSERT INTO eval_leaderboard (
  run_id, run_at, model, aggregate_score, hallucination_rate,
  latency_p95_ms, cost_per_message_usd, regression_detected,
  dataset_scores, rubric_scores, created_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, NOW())
ON CONFLICT (run_id) DO NOTHING;
```

The `dataset_scores` and `rubric_scores` columns are JSONB, preserving the full per-dataset and per-rubric breakdown.
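
The `ON CONFLICT (run_id) DO NOTHING` clause makes the write idempotent, so a retried or replayed run does not create a duplicate row. The sketch below illustrates that behaviour under stated assumptions: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a trimmed-down table, `TEXT` in place of `JSONB`, and `?` placeholders instead of `$1`. The helper name `persist_run` is hypothetical.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL here; the table is reduced to a few columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE eval_leaderboard (
        run_id TEXT UNIQUE NOT NULL,
        model TEXT NOT NULL,
        aggregate_score REAL,
        dataset_scores TEXT
    )
    """
)


def persist_run(conn, run_id, model, aggregate_score, dataset_scores):
    """Insert one run; return True if a row was written, False if it already existed."""
    cur = conn.execute(
        """
        INSERT INTO eval_leaderboard (run_id, model, aggregate_score, dataset_scores)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (run_id) DO NOTHING
        """,
        (run_id, model, aggregate_score, json.dumps(dataset_scores)),
    )
    # rowcount is 0 when the conflict clause skipped the insert.
    return cur.rowcount == 1


first = persist_run(conn, "run-1", "model-a", 4.2, {"contracts": 4.3})
second = persist_run(conn, "run-1", "model-a", 1.0, {})  # same run_id, ignored
```

Because the second call is a no-op, the originally stored score survives a replay. A caller that needs the row id for the output payload would have to look it up separately in that case.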

### Step 2 — Compute trend deltas

```sql
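-- Get the most recent earlier run for the same model.
-- $3 is the current run's run_at; without this bound, a backfilled run
-- would be compared against a later run instead of its predecessor.
SELECT aggregate_score AS prev_score, hallucination_rate AS prev_halluc
FROM eval_leaderboard
WHERE model = $1 AND run_id <> $2 AND run_at < $3
ORDER BY run_at DESC LIMIT 1;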
```

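Compute:
- `score_delta` = current `aggregate_score` - previous `aggregate_score`
- `hallucination_delta` = current `hallucination_rate` - previous `hallucination_rate` (a negative value is an improvement)
- `trend` = `improving` | `stable` | `declining`, derived from the 3-run moving average of `aggregate_score`; this needs the most recent runs for the model, not only the single previous row fetched above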
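
The delta and trend computation can be sketched as follows. The source does not say how the moving average maps to a label, so this sketch makes two assumptions: it compares the current 3-run average with the 3-run average one run earlier, and it treats differences within `STABLE_BAND` (an invented tolerance of 0.05) as `stable`. The function names are hypothetical.

```python
from statistics import mean

STABLE_BAND = 0.05  # assumed tolerance; the source does not specify a threshold


def compute_deltas(current, previous):
    """Difference between the current run and the previous run for the same model."""
    return {
        "score_delta": round(current["aggregate_score"] - previous["aggregate_score"], 4),
        "hallucination_delta": round(
            current["hallucination_rate"] - previous["hallucination_rate"], 4
        ),
    }


def compute_trend(history, window=3, band=STABLE_BAND):
    """history: aggregate scores for one model, oldest first, current run last."""
    if len(history) < window + 1:
        # Assumption: with too few runs to form two windows, report "stable".
        return "stable"
    current_avg = mean(history[-window:])
    previous_avg = mean(history[-window - 1:-1])
    diff = current_avg - previous_avg
    if diff > band:
        return "improving"
    if diff < -band:
        return "declining"
    return "stable"


deltas = compute_deltas(
    {"aggregate_score": 4.2, "hallucination_rate": 0.0},
    {"aggregate_score": 4.1, "hallucination_rate": 0.02},
)
```

Comparing two overlapping windows smooths out a single noisy run, which is the reason for using a moving average rather than the raw `score_delta` to label the trend.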

### Step 3 — Update the weekly AI quality trend report

Aggregate all runs in the current week and update the `weekly_quality_summary` table:
```json
{
  "week": "2026-W20",
  "avg_aggregate_score": 4.2,
  "best_run_score": 4.4,
  "worst_run_score": 3.9,
  "hallucination_incidents": 0,
  "regressions_detected": 1,
  "regressions_resolved": 1
}
```

This data feeds the internal dashboard and the `report.weekly-AI-quality-trend` report.
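
A minimal sketch of the weekly aggregation, assuming each run is a dict with `run_at`, `aggregate_score`, and `regression_detected` keys and that all runs passed in belong to the same ISO week. The helper names are hypothetical. `hallucination_incidents` and `regressions_resolved` are left out because the source does not define how they are counted.

```python
from datetime import datetime, timezone
from statistics import mean


def iso_week_key(run_at):
    """Format a timestamp as an ISO week label such as '2026-W20'."""
    iso = run_at.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def summarize_week(runs):
    """Aggregate the runs of one ISO week into a summary row."""
    scores = [r["aggregate_score"] for r in runs]
    return {
        "week": iso_week_key(runs[0]["run_at"]),
        "avg_aggregate_score": round(mean(scores), 2),
        "best_run_score": max(scores),
        "worst_run_score": min(scores),
        "regressions_detected": sum(1 for r in runs if r["regression_detected"]),
    }


# Illustrative runs, not real benchmark data.
runs = [
    {"run_at": datetime(2026, 5, 12, 12, 0, tzinfo=timezone.utc),
     "aggregate_score": 4.4, "regression_detected": False},
    {"run_at": datetime(2026, 5, 13, 12, 0, tzinfo=timezone.utc),
     "aggregate_score": 3.9, "regression_detected": True},
    {"run_at": datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc),
     "aggregate_score": 4.3, "regression_detected": False},
]
summary = summarize_week(runs)
```

Using `isocalendar()` rather than the calendar year avoids mislabelling runs in the first or last days of January and December, where the ISO year can differ from the calendar year.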

### Step 4 — Emit leaderboard update notification

Post to Slack `#eng-quality` with a summary card:
```
Model quality run: claude-sonnet-4-5 @ 2026-05-14 12:00 UTC
Aggregate: 4.2 / 5.0 (+0.1 vs prev) ✓
Hallucinations: 0 ✓
Regression: None ✓
[View full report → Langfuse link]
```

If a regression is detected, post to both `#eng-quality` and `#eng-on-call`.
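
The routing rule and card text can be sketched as two small pure functions. This is an illustration only: the function names are hypothetical, the card is reduced to the aggregate and regression lines, the `Detected` wording is an assumption, and the actual Slack call is omitted so that no client API is implied.

```python
def channels_for(regression_detected):
    """Every run goes to #eng-quality; regressions also page #eng-on-call."""
    channels = ["#eng-quality"]
    if regression_detected:
        channels.append("#eng-on-call")
    return channels


def summary_card(model, run_at_label, aggregate_score, score_delta, regression_detected):
    """Build a plain-text card modelled on the example above."""
    regression = "Detected" if regression_detected else "None"
    return "\n".join([
        f"Model quality run: {model} @ {run_at_label}",
        # The '+' format flag prints an explicit sign for the delta.
        f"Aggregate: {aggregate_score:.1f} / 5.0 ({score_delta:+.1f} vs prev)",
        f"Regression: {regression}",
    ])


card = summary_card("claude-sonnet-4-5", "2026-05-14 12:00 UTC", 4.2, 0.1, False)
```

Keeping formatting and routing separate from the network call makes both easy to unit test.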

## Output

```json
{
  "leaderboardRowId": "uuid",
  "scoreDelta": 0.1,
  "trend": "improving",
  "weekSummaryUpdated": true,
  "slackNotified": true
}
```

## Leaderboard schema

```sql
CREATE TABLE eval_leaderboard (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_id UUID UNIQUE NOT NULL,
  run_at TIMESTAMPTZ NOT NULL,
  model TEXT NOT NULL,
  aggregate_score NUMERIC(3,2),
  hallucination_rate NUMERIC(5,4),
  latency_p95_ms INT,
  cost_per_message_usd NUMERIC(8,6),
  regression_detected BOOLEAN NOT NULL DEFAULT FALSE,
  dataset_scores JSONB,
  rubric_scores JSONB,
  notes TEXT, -- free-text annotations, e.g. rubric-weight changes (see Caveats & currency)
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON eval_leaderboard (model, run_at DESC);
```

## Why this matters

A single aggregate score per run is not enough information to improve the product. The leaderboard preserves the full historical series so that:
- Engineers can see whether a prompt-engineering change improved one rubric while degrading another.
- Product can report "model quality improved 12% over the past quarter."
- Teams can detect and reverse regressions promptly rather than discovering them in user complaints.
- Trends can be read from the moving average, which is more meaningful than any single run's absolute score.

## Caveats & currency

Recalibrate rubric weights in [[eval-benchmark-runner]] when the product's practice-area mix changes significantly (e.g., if real-estate usage grows to 40% of queries, its dataset weight should increase). Scores computed under different rubric weights are not directly comparable, so when weights change, record the change in the leaderboard `notes` column and restart the moving average from the first run scored with the new weights.
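
One way to restart the moving average is to truncate the history at the most recent run whose `notes` value records a weight change. The sketch below assumes a marker string, `rubric-weights-changed`, which is invented for illustration (the source does not prescribe a format for `notes`), and assumes the marked run is the first one scored under the new weights.

```python
def comparable_history(rows, marker="rubric-weights-changed"):
    """Return the runs from the most recent marked run onward (oldest first)."""
    start = 0
    for i, row in enumerate(rows):
        # notes may be NULL/None for ordinary runs.
        if marker in (row.get("notes") or ""):
            start = i
    return rows[start:]


# Illustrative rows, not real benchmark data.
rows = [
    {"aggregate_score": 4.0, "notes": None},
    {"aggregate_score": 4.1, "notes": None},
    {"aggregate_score": 3.8, "notes": "rubric-weights-changed: real-estate weight raised"},
    {"aggregate_score": 3.9, "notes": None},
]
recent = comparable_history(rows)
```

Feeding only `recent` into the trend calculation keeps scores from the old weighting out of the new moving average.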

## Related skills

- [[eval-benchmark-runner]] — the upstream process that calls this updater
- [[eval-regression-detector]] — provides the `regressionDetected` signal
- [[eval-llm-as-judge-system-prompt]] — the scoring engine whose output feeds into scores
- [[eval-rubric-legal-soundness]] — primary rubric whose trend is most closely tracked