---
name: openclaw-eval-harness-shared
description: Use when evaluating the quality, accuracy, or safety of a legal AI skill against a standardized benchmark. The shared eval harness provides community-maintained datasets (NDA, employment, real estate, research, corporate), rubrics for legal soundness and hallucination detection, multi-judge scoring to reduce bias, and a public leaderboard for comparing skill quality across providers and versions.
license: MIT
metadata:
  id: openclaw.eval-harness-shared
  category: openclaw
  priority: P2
  intent: [openclaw, eval, benchmark, legal-ai-quality, rubric, hallucination]
  related: [openclaw-public-skill-registry, openclaw-contrib-template, openclaw-skill-portability-claude-codex-gemini]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
---

# OpenClaw — Shared Eval Harness

## Purpose

Legal AI skills need to be evaluated against a consistent, reproducible quality bar before they are trusted in professional practice. The OpenClaw Shared Eval Harness is a community-maintained open-source framework that lets skill authors, AI vendors, and legal professionals benchmark legal AI quality across four dimensions: legal soundness, citation quality, hallucination rate, and jurisdictional accuracy.

The harness is intentionally open: anyone can contribute datasets, rubrics, or judge prompts. Vendors may run their models against the harness to produce publicly comparable scores.

## Components

### 1. Datasets

Community-contributed prompt/answer pairs organized by practice area:

| Dataset | Description | Coverage |
|---------|-------------|----------|
| NDA dataset | Drafting, review, red-flag analysis | UAE (onshore + DIFC), LB, UK, US |
| Employment dataset | Contracts, non-competes, termination analysis | UAE, KSA, LB, UK, US |
| Real estate dataset | Lease review, SPA analysis, title issues | UAE, LB, EG, UK |
| Research dataset | Regulatory questions, statute lookup, comparative law | Multi-jurisdiction |
| Corporate dataset | SHA, MOU, acquisition terms | DIFC, ADGM, GCC |

Each entry in a dataset contains:
- `prompt`: the user input (in the most realistic form possible)
- `reference_answer`: the expected correct response, authored or reviewed by an admitted lawyer
- `jurisdiction`: the applicable legal system
- `practice_area`: the skill category being tested
- `difficulty`: easy, medium, hard, or trap (a trap entry is one where the correct answer is counterintuitive)
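
As an illustration, the five fields above could be modelled as a small validated record. This is a minimal sketch, not the harness's actual schema: the class name `DatasetEntry`, the validation rules, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass

# Difficulty labels taken from the field list above.
DIFFICULTIES = {"easy", "medium", "hard", "trap"}


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical in-memory form of one dataset entry."""

    prompt: str
    reference_answer: str
    jurisdiction: str
    practice_area: str
    difficulty: str

    def __post_init__(self):
        # Reject unknown difficulty labels early.
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        # Every text field must carry some content.
        for name in ("prompt", "reference_answer", "jurisdiction", "practice_area"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")


# Placeholder values only; a real entry needs a lawyer-reviewed reference answer.
entry = DatasetEntry(
    prompt="Review this mutual NDA for red flags.",
    reference_answer="(lawyer-reviewed reference answer goes here)",
    jurisdiction="UK",
    practice_area="nda",
    difficulty="medium",
)
```

Validating at construction time means a malformed entry fails when the dataset is loaded rather than partway through an evaluation run.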

### 2. Rubrics

Evaluation rubrics define what a correct response looks like across dimensions:

**Legal soundness** (0–5 scale)
- 5: Fully accurate, complete, and actionable; no material omission
- 4: Accurate with minor gaps that would not mislead a practitioner
- 3: Mostly accurate but missing at least one material point
- 2: Partially accurate; some claims are wrong or misleading
- 1: Mostly wrong; could lead a practitioner to a harmful conclusion
- 0: Completely wrong or hallucinates legal authority

**Citation quality** (0–3 scale)
- 3: All citations accurate and correctly formatted for the jurisdiction
- 2: Citations present, with only minor formatting or pin-cite errors
- 1: Citations present but at least one is fabricated or wrong
- 0: No citations where they were required, or all citations fabricated

**Hallucination rate** (binary per claim)
- A claim is a hallucination if it asserts a specific legal fact (statute number, article, case name, threshold) that is factually wrong or non-existent.
- Report hallucination rate as: (number of hallucinated claims) / (total verifiable claims).
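
The ratio above can be sketched in a few lines. The function name and the choice to return `None` when a response contains no verifiable claims are assumptions; the rubric does not say how that case should be reported.

```python
def hallucination_rate(claim_flags):
    """Return hallucinated claims divided by total verifiable claims.

    claim_flags holds one boolean per verifiable claim: True if the claim
    was judged a hallucination, False otherwise.
    """
    total = len(claim_flags)
    if total == 0:
        # Assumption: the rate is undefined when nothing is verifiable.
        return None
    return sum(claim_flags) / total


# One hallucinated claim out of four verifiable claims.
rate = hallucination_rate([True, False, False, False])
```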

**Jurisdictional accuracy** (0–2 scale)
- 2: Response correctly identifies and applies the applicable jurisdiction
- 1: Response applies the wrong jurisdiction, but the advice is technically correct under the jurisdiction it applied
- 0: Response conflates jurisdictions or applies a clearly wrong legal system
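
The three ordinal scales in this section (0–5, 0–3, 0–2) can be expressed as data so that out-of-range judge scores are caught before aggregation. This is only a sketch; the key names and the `validate_scores` helper are hypothetical and not part of the harness.

```python
# Maximum score per ordinal rubric, as defined in this section.
RUBRIC_MAX = {
    "legal_soundness": 5,
    "citation_quality": 3,
    "jurisdictional_accuracy": 2,
}


def validate_scores(scores):
    """Raise ValueError unless every rubric has an integer score in range."""
    for rubric, max_score in RUBRIC_MAX.items():
        value = scores.get(rubric)
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{rubric}: expected an integer, got {value!r}")
        if not 0 <= value <= max_score:
            raise ValueError(f"{rubric}: {value} is outside 0-{max_score}")
    return True
```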

### 3. Judge models

Single-judge evaluation introduces model-specific bias. The harness uses a panel of at least three judge models (e.g., Claude, GPT-4, Gemini) to score each response independently. The final score is the median across judges after excluding outliers.
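
One way to read "median after excluding outliers" is sketched below. The outlier rule here is an assumption, since this document does not define one: a score is dropped if it lies more than `max_deviation` points from the panel median. Both `panel_score` and `max_deviation` are hypothetical names.

```python
from statistics import median


def panel_score(judge_scores, max_deviation=2.0):
    """Median of the judge scores after dropping far-off scores.

    Assumed outlier rule: a score is an outlier if it differs from the
    panel median by more than max_deviation.
    """
    if len(judge_scores) < 3:
        raise ValueError("the panel needs at least three judge scores")
    center = median(judge_scores)
    kept = [s for s in judge_scores if abs(s - center) <= max_deviation]
    # With an even panel every score can sit far from the midpoint;
    # fall back to the plain median in that case.
    if not kept:
        return center
    return median(kept)


# Two judges agree on 4; the third judge's 1 is dropped as an outlier.
score = panel_score([4, 4, 1])
```

With exactly three judges the plain median is already robust to one extreme score, so the exclusion step matters mainly for larger panels.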

Judge prompts are templated and version-controlled in the repository. They include:
- The rubric being applied
- The reference answer (for grounded evaluation)
- Instructions to score independently without knowing which vendor produced the candidate answer
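
A judge prompt with these three parts might be assembled as follows. The template wording and the `build_judge_prompt` helper are illustrative assumptions; the version-controlled templates in the repository are the authoritative source.

```python
from string import Template

# Hypothetical template text; the vendor identity is never interpolated.
JUDGE_TEMPLATE = Template(
    "You are one member of an independent judging panel.\n"
    "Score the candidate answer using only the rubric below.\n\n"
    "Rubric:\n$rubric\n\n"
    "Reference answer:\n$reference_answer\n\n"
    "Candidate answer (source withheld):\n$candidate_answer\n\n"
    "Reply with a single integer score."
)


def build_judge_prompt(rubric, reference_answer, candidate_answer):
    """Fill the template; raises KeyError if a placeholder is left unfilled."""
    return JUDGE_TEMPLATE.substitute(
        rubric=rubric,
        reference_answer=reference_answer,
        candidate_answer=candidate_answer,
    )


judge_prompt = build_judge_prompt(
    rubric="Citation quality (0-3 scale)",
    reference_answer="(lawyer-reviewed reference answer)",
    candidate_answer="(candidate answer text)",
)
```

Keeping the vendor name out of the function signature entirely makes it harder to leak provenance into the prompt by accident.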

### 4. Leaderboard

Aggregate scores per vendor/model/skill version are published to the OpenClaw public leaderboard. The leaderboard shows:
- Overall score per dataset
- Per-rubric breakdown
- Version history (so regressions are visible)
- Whether the run was community-verified or vendor-self-reported

Leaderboard entries that are vendor-self-reported are labelled as such; community-verified runs (where an independent reviewer replicated the evaluation) receive a verified badge.

## Running the harness

```bash
# Clone the repository and move into the eval directory
git clone https://github.com/sboghossian/mini-claude-for-legal
cd mini-claude-for-legal/eval

# Run against a specific skill + dataset
python run_eval.py \
  --skill draft-nda-unilateral \
  --dataset nda \
  --model claude-sonnet-4-5 \
  --judges claude,gpt-4o,gemini-1.5-pro \
  --output results/my-run.json
```

Results are written to a JSON file containing per-prompt scores and an aggregated summary. To submit to the leaderboard, open a PR against `eval/results/` with your run file.
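
Before opening a PR it can be useful to sanity-check a run file by recomputing per-rubric averages. The sketch below assumes a results layout that this document does not specify: the keys `per_prompt`, `prompt_id`, and `scores` are hypothetical, so adapt them to the file that `run_eval.py` actually writes.

```python
import json


def mean_scores(run):
    """Average each rubric score across the per-prompt records.

    Assumes run["per_prompt"] is a list of records that each carry a
    "scores" mapping of rubric name to number (hypothetical layout).
    """
    collected = {}
    for record in run["per_prompt"]:
        for rubric, value in record["scores"].items():
            collected.setdefault(rubric, []).append(value)
    return {rubric: sum(vals) / len(vals) for rubric, vals in collected.items()}


# Inline sample standing in for the contents of a run file.
sample_run = json.loads(
    """
    {
      "per_prompt": [
        {"prompt_id": "nda-001", "scores": {"legal_soundness": 4, "citation_quality": 3}},
        {"prompt_id": "nda-002", "scores": {"legal_soundness": 5, "citation_quality": 2}}
      ]
    }
    """
)

averages = mean_scores(sample_run)
```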

## Contributing datasets and rubrics

Contributors should:
1. Author reference answers in consultation with an admitted lawyer in the relevant jurisdiction.
2. Label each entry with accurate jurisdiction and difficulty tags.
3. Flag "trap" entries — cases where the obvious (but wrong) answer is a common AI failure mode.
4. Submit via a PR with a brief description of the gap being filled.

Do not submit synthetic reference answers generated purely by AI without practitioner review — the harness is only as good as its ground truth.

## Caveats

- Eval scores measure skill output quality at a point in time against a fixed dataset. They are not a guarantee of performance in production.
- Legal standards change. Datasets should be reviewed for currency at least annually. Outdated reference answers are labelled with a staleness warning.
- The harness tests the AI output, not the underlying law. Verify any regulatory claims against primary sources before acting on them in practice.
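
The annual currency review mentioned above could be enforced with a simple date check. This sketch assumes each entry records a `last_reviewed` date, which is a hypothetical field not listed in the dataset schema, and it treats "annually" as 365 days.

```python
from datetime import date, timedelta

# Assumption: "at least annually" is read as a 365-day window.
REVIEW_INTERVAL = timedelta(days=365)


def is_stale(last_reviewed, today=None):
    """Return True if the entry has gone longer than the review interval."""
    if today is None:
        today = date.today()
    return today - last_reviewed > REVIEW_INTERVAL


# Reviewed 17 months before the check date, so a warning would apply.
needs_warning = is_stale(date(2023, 1, 1), today=date(2024, 6, 1))
```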

## Related skills

- [[openclaw-public-skill-registry]] — the registry of skills being evaluated
- [[openclaw-contrib-template]] — how to contribute skills and datasets
- [[openclaw-skill-portability-claude-codex-gemini]] — test portability across providers