---
name: eval-benchmark-runner
description: Use when running the automated daily evaluation suite that measures the legal AI system's output quality across all benchmark datasets. Orchestrates the full eval pipeline — loading datasets, calling the production model, scoring with LLM-as-judge rubrics, detecting regressions, and publishing results to the leaderboard and observability dashboards.
license: MIT
metadata:
  id: eval.benchmark-runner
  category: eval
  priority: P0
  intent: [__eval__, benchmark, quality, regression, ci]
  related: [eval-llm-as-judge-system-prompt, eval-regression-detector, eval-leaderboard-updater, eval-dataset-nda-prompts-30, eval-dataset-employment-prompts-30, eval-dataset-adversarial-prompts, eval-dataset-multilingual-prompts]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# Benchmark Runner

## When to use this

The benchmark runner is the automated quality gate for the legal AI system. It runs:

- **Daily at 12:00 UTC** — tracks quality trend over time.
- **On every staging deployment** — catches regressions before production.
- **On-demand via Slack `/eval-run`** — for ad-hoc quality checks after prompt-engineering changes.

Do not run it on production mid-traffic; use a staging endpoint or a dedicated eval tenant.

## Inputs

| Input | Required | Notes |
|---|---|---|
| `model` | Yes | Production model slug (e.g., `claude-sonnet-4-5`) or experimental |
| `datasets` | Yes | Array of dataset IDs to include; defaults to all |
| `judgeModels` | Yes | Array of judge model slugs for ensemble scoring |
| `costBudget` | Yes | Max USD to spend on this run; aborts if exceeded |
| `runId` | Auto | UUID generated per run |
| `baselineRunId` | Optional | Previous run to compare against for regression detection |

## Review methodology

### Step 1 — Load datasets

Load all configured `eval.dataset.*` files from `eval/datasets/*.jsonl`.
Each JSONL file contains records with:

```json
{
  "id": "nda-001",
  "prompt": "...",
  "category": "draft",
  "expected_signals": ["mutual", "confidential_info_defined", "governing_law"]
}
```

Supported datasets: [[eval-dataset-nda-prompts-30]], [[eval-dataset-employment-prompts-30]], [[eval-dataset-real-estate-prompts-30]], [[eval-dataset-research-prompts-30]], [[eval-dataset-adversarial-prompts]], [[eval-dataset-multilingual-prompts]], [[eval-dataset-competitor-comparison-set]].

### Step 2 — Run prompts against production model

For each prompt:

1. Call the production `/chat` endpoint (not the LLM API directly — test the full stack).
2. Record: `response_text`, `latency_ms`, `tokens_input`, `tokens_output`, `skills_routed`.
3. Enforce a per-prompt timeout of 120 seconds; log any that exceed it.
4. Track cumulative cost; abort if `costBudget` is exceeded.

Run prompts concurrently (max 5 at a time) to keep wall-clock time under 30 minutes.

### Step 3 — Score each response

For each response, invoke the [[eval-llm-as-judge-system-prompt]] with the following rubrics active:

- [[eval-rubric-legal-soundness]] (weight: 0.35)
- [[eval-rubric-citation-quality]] (weight: 0.20)
- [[eval-rubric-jurisdiction-awareness]] (weight: 0.20)
- [[eval-rubric-completeness]] (weight: 0.15)
- [[eval-rubric-hallucination-detection]] (weight: 0.10 — binary, auto-fail)

Use an **ensemble of judges** (e.g., GPT-4o + Claude Sonnet from a different provider + Gemini Pro). Average their scores. Flag any prompt where judge disagreement > 1.5 points for manual review.

**Critical rule**: never use the same model family as both the system under test and the judge. If testing Claude Sonnet, judges must include at least one non-Claude model.
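
The ensemble scoring in Step 3 could look roughly like the sketch below. It is illustrative only and not the runner's actual implementation: the rubric keys, the function names, and the choice to measure "judge disagreement" as the spread (max minus min) between the judges' weighted scores are all assumptions. It also assumes every rubric, including hallucination detection, is reported on the same numeric scale, and it leaves the auto-fail handling out.

```python
from statistics import mean

# Rubric weights as listed in Step 3 (they sum to 1.0).
# The dictionary keys are hypothetical short names for the rubric skills.
RUBRIC_WEIGHTS = {
    "legal_soundness": 0.35,
    "citation_quality": 0.20,
    "jurisdiction_awareness": 0.20,
    "completeness": 0.15,
    "hallucination_detection": 0.10,
}

# Disagreement threshold in points, taken from Step 3.
DISAGREEMENT_THRESHOLD = 1.5


def weighted_score(rubric_scores):
    """Combine one judge's per-rubric scores into a single weighted score."""
    return sum(RUBRIC_WEIGHTS[name] * rubric_scores[name] for name in RUBRIC_WEIGHTS)


def ensemble_score(judge_scores):
    """Average the judges' weighted scores and flag large disagreement.

    judge_scores maps a judge name to that judge's {rubric: score} dict.
    Assumption: disagreement is the spread (max - min) of the weighted scores.
    """
    per_judge = {judge: weighted_score(scores) for judge, scores in judge_scores.items()}
    spread = max(per_judge.values()) - min(per_judge.values())
    return {
        "score": mean(per_judge.values()),
        "per_judge": per_judge,
        "needs_manual_review": spread > DISAGREEMENT_THRESHOLD,
    }
```

Calling `ensemble_score` with one entry per judge returns the averaged score plus a `needs_manual_review` flag, which a runner could use to route the prompt to a human reviewer.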
### Step 4 — Compute aggregate scores

Per dataset, sum the weighted rubric scores over every prompt and divide by the number of prompts:

```
dataset_score = Σ(rubric_score × rubric_weight) / n_prompts
```

Global aggregate, normalized by the sum of the weights of the datasets included in the run (1.0 when all six weighted datasets run):

```
aggregate_score = Σ(dataset_score × dataset_weight) / Σ(dataset_weight)
```

Dataset weights (by business priority): adversarial=0.25, NDA=0.20, employment=0.20, multilingual=0.15, real-estate=0.10, research=0.10.

Also compute:

- `hallucination_rate` = count(hallucinated) / total_prompts
- `latency_p50`, `latency_p95`
- `cost_per_message` = total_cost / n_prompts

### Step 5 — Detect regressions

Call [[eval-regression-detector]] with current and previous run scores. Regression triggers:

- Any rubric score drops > 5% vs previous run.
- Hallucination rate increases > 0.5%.
- Latency p95 increases > 20%.
- Cost-per-message increases > 15%.

On regression: Slack alert to `#eng-quality`, auto-create Linear ticket, block deployment promotion if a P0 rubric regressed.

### Step 6 — Publish results

- Update [[eval-leaderboard-updater]] with scores.
- Write structured run report to Langfuse.
- Emit PostHog event `eval_run_completed` with aggregate score and regression flag.

## Output format

```json
{
  "runId": "uuid",
  "model": "claude-sonnet-4-5",
  "runAt": "2026-05-14T12:00:00Z",
  "datasets": {
    "nda-prompts-30": { "score": 4.1, "n": 30, "hallucinations": 0 },
    "adversarial-prompts": { "score": 4.7, "n": 30, "refusal_rate": 0.97 }
  },
  "aggregate_score": 4.2,
  "hallucination_rate": 0.003,
  "latency_p50_ms": 3200,
  "latency_p95_ms": 8100,
  "cost_per_message_usd": 0.0023,
  "regression": false,
  "top_failing_prompts": [
    { "id": "research-027", "score": 1.8, "issue": "fabricated statute number" }
  ]
}
```

## Limits & escalation

- If `hallucination_rate > 1%`, halt deployment automatically regardless of other scores.
- Recalibrate judge prompts against a human gold-standard label set **quarterly**.
- Track *trend* over absolute score — a 4.1 improving steadily from 3.5 is healthier than a 4.5 declining from 4.8.
- Do not over-optimize for benchmark scores — ensure at least 10% of prompts in each dataset are novel and not visible to the model team.

## Related skills

- [[eval-llm-as-judge-system-prompt]] — the system prompt driving rubric scoring
- [[eval-regression-detector]] — detects quality drops across runs
- [[eval-leaderboard-updater]] — records scores to the trend dashboard
- [[eval-dataset-nda-prompts-30]] — NDA benchmark dataset
- [[eval-dataset-adversarial-prompts]] — safety and robustness dataset