---
name: metric-validation-harness
description: Empirically validates a software metric before trusting or optimizing it — point it at any candidate metric (a command that takes a path and prints one number) plus a corpus, and it runs experiments that try to falsify each property a good metric must have. Checks determinism (same input, same number across runs and hash seeds), invariance to cosmetic edits (also an anti-gaming probe), monotonicity under construct-increasing edits, discrimination, robustness on edge inputs, near-linear tractability, and construct validity (convergent, discriminant vs LOC, predictive AUC, lift over a baseline). Trigger whenever someone proposes, reviews, tunes, or ships a metric, score, or index, asks "is this metric any good", suspects a score tracks LOC or jumps between runs, or builds a deterministic optimization target. It is the empirical companion to the deterministic-metric-design skill and is read-only.
---
# Metric Validation Harness

Point this harness at a candidate metric and a corpus, and it runs experiments that try to **falsify** each property a trustworthy, optimizable metric must have. It is the empirical companion to `deterministic-metric-design`: that skill tells you to *prove* monotonicity, invariance, determinism, and construct validity; this skill *runs the experiment* and reports PASS/FAIL, each result mapped to the design-skill category it checks.

**Read-only.** It computes and reports; it never modifies your metric, the corpus, or any external state. Safe to run unsupervised.

## When to Apply

- Someone proposes, reviews, tunes, or ships a metric / score / index and you need evidence it is sound
- A score "feels off" — you suspect it tracks LOC, jumps between runs, or saturates
- You are about to let an agent **optimize** a metric and need to know it can't be gamed by cosmetic edits
- You built a candidate per `deterministic-metric-design` and want to empirically confirm the properties you argued for
- You are choosing between two metrics and need to know which actually predicts the outcome (and beats a trivial baseline)

## Workflow Overview

```
config.json / env  →  resolve metric_cmd, corpus, thresholds (env > config > bundled default)
        │
        ▼
   verify.sh ──► determinism ─ invariance ─ monotonicity ─ robustness ─ tractability ─ validity
        │            (each property check maps to a deterministic-metric-design category)
        ▼
   PASS / FAIL per property  →  exit 0 (all pass) or 1 (any check failed)
```

## The Adapter Contract

Your metric is **any command that takes a path as its last argument and prints exactly one number to stdout**:

```bash
$ python3 mymetric.py path/to/file.py
42
```

Language-agnostic — Python, a shell one-liner, a compiled binary, anything. Diagnostics go to stderr; stdout is the number only. A bundled example metric (`scripts/examples/metric_ast_nodes.py`, AST-node count) ships so the harness runs out of the box.
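As a rough illustration of the contract, here is a minimal sketch of an adapter. The metric itself (a count of function definitions) and the names `count_functions` and `main` are hypothetical and chosen only for the example; the bundled example metric may be structured differently.

```python
#!/usr/bin/env python3
"""Hypothetical adapter sketch: counts function definitions in a Python file."""
import ast
import sys


def count_functions(source):
    """Return the number of sync and async function definitions in source."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


def main(path):
    """Score one file; return a process exit code."""
    try:
        with open(path, encoding="utf-8") as handle:
            score = count_functions(handle.read())
    except (OSError, SyntaxError, ValueError) as exc:
        # Diagnostics go to stderr so stdout stays clean.
        print(f"mymetric: {exc}", file=sys.stderr)
        return 1
    print(score)  # stdout carries exactly one number
    return 0


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: mymetric.py PATH", file=sys.stderr)
    else:
        # The contract: the path is the last argument.
        sys.exit(main(sys.argv[-1]))
```

Saved as `mymetric.py`, a script along these lines could be passed to the harness through `METRIC_CMD`.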

## How to Run

```bash
# 1. Validate the bundled example metric (works with zero setup):
bash scripts/verify.sh

# 2. Validate YOUR metric — set metric_cmd in config.json, or override per-run:
METRIC_CMD="python3 /abs/path/mymetric.py" bash scripts/verify.sh

# 3. Prove the harness itself works (positive + negative cases):
bash scripts/selftest.sh

# 4. Sanity-check your adapter prints one number:
bash scripts/run-metric.sh path/to/file.py
```

`verify.sh` runs every check and prints a final PASS/FAIL. Each check is also runnable on its own (e.g. `bash scripts/check-determinism.sh`).

## What It Checks

| Check | Maps to (design skill) | What it does | PASS condition |
|-------|------------------------|--------------|----------------|
| `check-determinism.sh` | `det-` | Runs the metric twice + under `PYTHONHASHSEED` 0/1 | identical number every time |
| `check-invariance.sh` | `prop-` / `game-` | Adds comments/blank lines/whitespace (cosmetic) | score unchanged (else it's gameable) |
| `check-monotonicity.sh` | `prop-` | Appends a code block (construct-increasing) + checks spread | score non-decreasing; not saturated |
| `check-robustness.sh` | `prop-` | Empty + single-statement edge inputs | finite, in declared range, no crash |
| `check-tractability.py` | `comp-` | Times the metric on growing inputs | within budget, sub-quadratic growth |
| `check-validity.py` | `valid-` | Spearman vs the `accepted` reference and vs LOC; AUC vs `outcome` | convergent high, discriminant not ~LOC, predictive beats baseline |
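To make the determinism row concrete, the sketch below shows one way such a probe could be written. It is an assumption-laden illustration, not the contents of `check-determinism.sh`: the helper names `run_metric` and `is_deterministic` are hypothetical, it repeats each run per hash seed, and it compares raw stdout text, which is stricter than comparing parsed numbers.

```python
import os
import shlex
import subprocess
import sys


def run_metric(metric_cmd, path, hash_seed):
    """Run the metric once with PYTHONHASHSEED pinned; return stripped stdout."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        shlex.split(metric_cmd) + [path],  # the path is the last argument
        env=env,
        capture_output=True,
        text=True,
        check=True,  # a crashing metric raises CalledProcessError
    )
    return result.stdout.strip()


def is_deterministic(metric_cmd, path, hash_seeds=("0", "1"), repeats=2):
    """True if every run, across seeds and repeats, prints identical text."""
    outputs = {
        run_metric(metric_cmd, path, seed)
        for seed in hash_seeds
        for _ in range(repeats)
    }
    return len(outputs) == 1


# Usage with a hypothetical constant metric written as an inline one-liner.
demo_cmd = shlex.join([sys.executable, "-c", "print(42)"])
print(is_deterministic(demo_cmd, "any/path"))
```

A metric that prints anything derived from `hash()` of a string, or from set iteration order, would typically fail this probe because its output changes with the seed.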

Statistics (Spearman, AUC/Mann–Whitney) are pure Python stdlib — no numpy/scipy.
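For readers who want to see what stdlib-only versions of these statistics can look like, here is a minimal sketch. It is not the code in `check-validity.py`; the function names are hypothetical, ties receive average ranks, and AUC is derived from the Mann-Whitney U statistic.

```python
import math


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 1-based average position of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def auc(scores, outcomes):
    """AUC from Mann-Whitney U; outcomes are truthy for the positive class."""
    ranks = average_ranks(scores)
    positive_ranks = [r for r, y in zip(ranks, outcomes) if y]
    n_pos = len(positive_ranks)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    u_stat = sum(positive_ranks) - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)
```

Under this formulation an AUC of 0.5 means the metric orders positives and negatives no better than chance, which is why a predictive check compares the metric's AUC against a baseline rather than against zero.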

## Setup & Configuration

The harness runs with **zero config** against the bundled example. To validate your own metric, set fields in [`config.json`](config.json) (or override any of them with the matching `UPPER_CASE` environment variable per run):

| config.json | Env override | Meaning |
|-------------|--------------|---------|
| `metric_cmd` | `METRIC_CMD` | your metric command (takes a path as its last argument, prints one number) |
| `baseline_cmd` | `BASELINE_CMD` | trivial baseline (default: bundled LOC) |
| `corpus_dir` | `CORPUS_DIR` | artifacts the property checks iterate over |
| `labels_csv` | `LABELS_CSV` | CSV rows of `path[,outcome][,accepted]` for the validity check |
| `declared_min` / `declared_max` | `DECLARED_MIN` / `DECLARED_MAX` | range the robustness check enforces |

Validity thresholds are env-tunable: `CONVERGENT_MIN`, `DISCRIMINANT_MAX`, `PREDICTIVE_MIN` (defaults are lenient — tighten for a real run; see `gotchas.md`).

Empty config fields fall back to the bundled demo, so the skill never crashes on missing setup — it runs the example instead.
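The precedence described in this section (environment variable, then `config.json`, then the bundled default) can be sketched roughly as follows. This is an illustration, not the harness's own resolver: `resolve_setting` and `load_config` are hypothetical names, and treating an empty environment variable or a missing config file as "unset" is an assumption.

```python
import json
import os


def load_config(path):
    """Read config.json; a missing file is treated as an empty config (assumption)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def resolve_setting(key, config, defaults, environ=None):
    """Resolve one setting: non-empty env var > non-empty config field > default."""
    environ = os.environ if environ is None else environ
    env_value = environ.get(key.upper(), "")
    if env_value.strip():
        return env_value
    config_value = config.get(key)
    if config_value not in (None, ""):
        return config_value
    return defaults[key]


# Usage: the default below is the bundled example metric named in this document;
# running it through "python3" is an assumption.
defaults = {"metric_cmd": "python3 scripts/examples/metric_ast_nodes.py"}
config = load_config("config.json")
print(resolve_setting("metric_cmd", config, defaults))
```

Passing `environ` explicitly keeps the resolver easy to test without mutating the real process environment.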

## Tool Requirements

- `python3` (3.8+) — runs the metric, the transforms, and the stats
- `bash` and `awk` — the orchestrator and numeric comparisons (scripts are macOS bash 3.2-safe)

No network, no external packages.

## Interpreting Results

A FAIL names the property and the design-skill rule to consult. Examples:
- *cosmetic noise moved the score* → the metric reads surface text; see `prop-prove-invariance-under-irrelevant-transforms` and `game-make-cheapest-improvement-the-right-one`.
- *score DROPPED after adding code* → non-monotonic; optimizing it can reward worse code (`prop-prove-monotonicity`).
- *|Spearman(metric, LOC)| too high* → it's LOC relabeled (`valid-discriminant-not-just-loc`).
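The first failure above can be reproduced in miniature. The sketch below compares a surface-text metric (line count) with a structural one (AST node count) under a cosmetic edit. The function names are hypothetical, and the transform is a simplified stand-in for whatever `check-invariance.sh` actually applies.

```python
import ast


def loc_metric(source):
    """Surface-text metric: number of lines, comments and blanks included."""
    return len(source.splitlines())


def ast_node_metric(source):
    """Structural metric: number of nodes in the parsed syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def add_cosmetic_noise(source):
    """Wrap the source in comments and blank lines; program behavior is unchanged."""
    return "# cosmetic comment\n\n" + source + "\n\n# trailing comment\n"


def is_invariant(metric, source):
    """True if the cosmetic edit leaves the score unchanged."""
    return metric(source) == metric(add_cosmetic_noise(source))


sample = "x = 1\nprint(x)\n"
print("ast nodes invariant:", is_invariant(ast_node_metric, sample))
print("line count invariant:", is_invariant(loc_metric, sample))
```

The line count moves because it reads the text, while the node count does not because the parser discards comments and blank lines. An agent optimizing the line-count metric could change its score without changing the program.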

## Related Skills

- `deterministic-metric-design` — the design half. Use it to *construct* the metric (define the construct, choose a computable proxy, pick the scale, argue the properties); use this harness to *empirically verify* what you argued.
- `same-results-less-code`, `complexity-optimizer`, `knip-deadcode` — prescriptive code-reduction skills; validate any reduction metric you build to drive them with this harness before letting an agent optimize against it.

See [`references/workflow.md`](references/workflow.md) for per-check details, how to wire up your own metric and corpus, and troubleshooting.