# Paired Benchmark Findings — When Does UAP Actually Help? Results from the controlled paired harness (`uap bench paired`, see [PAIRED_HARNESS.md](PAIRED_HARNESS.md)). These supersede earlier uncontrolled numbers and complete the story the [TBench Investigation](TBENCH_INVESTIGATION.md) started: every "UAP lift" measured without a controlled baseline turned out to be a confound. This is the rigorous version — same model, same tasks, same seeds, toggling **only** the UAP layer, with bootstrap confidence intervals on every delta. ## TL;DR > **UAP's accuracy lift depends entirely on whether the base agent already > self-verifies.** Against a strong agentic harness (opencode, which runs the > tests itself), UAP adds overhead with **no measurable accuracy gain**. Against > a non-agentic single-shot model, UAP's gate loop delivers **+20pp accuracy** > (78% → 98%, 95% CI [+8, +32], p=0.008) by catching and repairing edge-case > bugs the model ships in one shot. | Baseline type | UAP accuracy lift | What this means | |---|---|---| | **Agentic** (opencode self-tests) | **~0pp** (CI spans 0) | gate is redundant; cost is pure overhead | | **Non-agentic** (raw single-shot) | **+20pp** [CI +8, +32] | gate loop rescues failures the model can't self-catch | The headline implication: **UAP's gate value is real and large, but only over a baseline that doesn't already loop** — which is exactly the `uap deliver` (model-wrapping) case. When you are already driving a capable agentic harness, UAP's contribution shifts from *accuracy* to *efficiency, coordination, and never-regress*. ## The experiments Five controlled runs against `ik-llama/qwen36-35b-a3b-iq4xs` (a 35B-A3B MoE). All paired, all with deterministic ground-truth `verifyCmd` scorers (no LLM judge). Metrics reported as a vector — correctness **and** tokens/turns/latency — each as a paired delta with a 95% bootstrap CI and a sign-flip permutation p-value. Correctness deltas additionally get a McNemar 2×2 ("fixed" vs "regressed"). ### 1–3. Agentic baseline (opencode adapter) | Run | Suite | Design | baseline → uap-full | Token Δ | |---|---|---|---|---| | 1 | real-gate (easy) | 2 tasks × 2 × 6 | 100% → 100% | +34.2k | | 2 | real-gate-hard | 6 tasks × 8-arm ablation × 4 | 100% → 100% | +48.7k [13.9k, 93.3k] | | 3 | real-gate-gated | 5 edge-case tasks × 8-arm ablation × 4 | 100% → 100% | +25.2k (n.s.) | In every agentic run the **baseline already scored 100%** — there was no headroom for UAP to improve accuracy. The per-component ablation (turning off gates / worktree / memory / experts / skills / patterns one at a time) found **no component with a Δsuccess confidence interval above zero**; the token deltas only rank overhead. **Root cause:** opencode is itself an agentic harness that runs the in-repo tests and self-corrects by default. The UAP gate instruction is redundant on top of an agent that already self-verifies — and adding in-repo tests to give the "gate" something to run handed the same self-verification to the *baseline*. ### 4. Non-agentic baseline (raw single-shot vs gate loop) — the decisive run To isolate **gate value**, the `raw` adapter calls the model's completion endpoint directly: - **baseline** = one completion, no self-check (whatever it writes is final) - **uap-full** = execute → run the visible in-repo gate → feed the failure back → regenerate, looping until the gate passes (the `uap deliver` mechanism in miniature) 5 edge-case tasks × {single-shot, gate-loop} × 10 epochs, temperature 0.6 (so the first shot sometimes misses edges): | Condition | Success | Errors | Tokens | Turns | |---|--:|--:|--:|--:| | baseline (single-shot) | **78.0%** | 22% | 1,170 | 1.0 | | uap-full (gate loop) | **98.0%** | 2% | 2,242 | 1.3 | - **Correctness: +20.0pp** — 95% CI **[+8.0, +32.0]**, p=0.008 ✅ - **Gate value (McNemar):** 11 fixed, 1 regressed, **net +10** (p=0.006) - **Cost:** +1,072 tokens, +0.3 turns per task — cheap for a 20-point gain When the baseline can't self-verify, it ships edge-case bugs 22% of the time. The gate loop catches and repairs 11 of them, cutting the error rate to 2%. The one "regressed" case is the honest caveat — once, the loop's rewrite broke something single-shot happened to get right. ## What the tasks look like Findings only hold if the tasks have real headroom. The `real-gate-gated` suite was built so the **obvious solution passes the happy path but fails hidden edge cases**, and validated before any run: for every task the stub fails, a correct reference passes, **and a naive happy-path solution fails the hidden verify**. | Task | Happy-path trap | Hidden edge the gate must catch | |---|---|---| | `py-roman-strict` | sum the symbols | reject invalid numerals (`IIII`, `VV`, `IL`) | | `js-csv-parse` | `split(',')` | quoted commas, escaped `""`, CRLF, trailing newline | | `js-deep-equal` | `JSON.stringify` compare | `NaN`, key order, `{a:undefined}` vs `{}` | | `py-parse-duration` | single-unit only | combined `1h30m15s`, raise on invalid | | `py-merge-intervals` | assume sorted, strict overlap | unsorted input, touching `[1,2],[2,3]→[1,3]` | Each ships an in-repo test suite (the *visible* gate, `task.gateCmd`) distinct from the hidden superset `verifyCmd` that remains the authoritative scorer. ## Methodology notes (and honest caveats) - **Paired design** removes between-task variance — the dominant noise source. Deltas are bootstrap CIs (10k resamples, seeded for reproducibility) + a sign-flip permutation test. Single-arm point estimates are never claims. - **Small N + a single shared GPU.** Token/latency CIs are wide and latency is confounded by request queueing on one GPU; read the *correctness* and *token* deltas, not absolute latency. The +20pp result holds with N=50/arm and a CI that excludes zero, but it is one model on five tasks — directional, not a universal constant. - **Ceiling effect** is why the agentic runs show 0pp: a 35B model in an agentic harness simply solves these self-contained algorithmic tasks. A weaker model, or genuinely harder multi-file tasks, would re-open headroom. - **Harness reliability:** an early ablation wedged for ~50 min because `spawnSync`'s timeout only SIGTERMs the immediate child while opencode forks a detached tree that keeps the pipe open. Fixed with a detached process-group spawn + group SIGKILL; subsequent runs completed 158–190/192 cleanly. ## Reproduce ```bash # The decisive gate-value experiment (raw single-shot vs gate loop): UAP_RAW_TEMPERATURE=0.6 uap bench paired \ --adapter raw --model "" \ --suite benchmarks/suites/real-gate-gated \ --epochs 10 --concurrency 3 # Agentic baseline + per-component ablation: uap bench paired --adapter opencode --model \ --suite benchmarks/suites/real-gate-hard --ablation --epochs 4 ``` Artifacts (`records.jsonl`, `report.json`, `report.md`) land in `benchmark-results/paired-/`. See [PAIRED_HARNESS.md](PAIRED_HARNESS.md) for the harness internals, adapters, and how to author a task. ## Bottom line The accuracy-lift hypothesis is **falsified for the agentic-vs-agentic case** and **confirmed for the non-agentic case**. Use UAP's gate loop (`uap deliver`) when wrapping a model that won't self-verify — that is where the measurable accuracy win lives. When you already run a strong agentic harness, value UAP for its efficiency (token savings), coordination (multi-agent file safety), and never-regress guarantees, not for first-pass accuracy.