# Official QA benchmark This directory contains the contributor-facing benchmark harness for `pi-computer-use`. Use it before and after changes that affect semantic targeting, image fallback policy, AX execution, browser handling, or native helper behavior. For general local setup and helper build instructions, see [../docs/development.md](../docs/development.md). ## What it measures The benchmark answers four questions: 1. **AX-only efficacy** - navigation efficacy: can `screenshot`/`wait` return semantic AX state without vision fallback? - targeting efficacy: can `click(ref=@eN)` succeed through AX without fallback? 2. **Overall efficiency** - AX-only ratio - vision-fallback ratio - AX execution ratio 3. **Latency** - overall average latency - navigation latency - targeting latency 4. **Coverage** - baseline/frontmost - native apps (`TextEdit`, `Finder`, `Reminders`) - browsers (`Safari`, `Chrome`, `Firefox`, `Helium`, etc. when available) - expanded TextEdit action surface (`set_text`, raw keyboard/pointer primitives, and `computer_actions`) ## Commands Run the interactive but non-intrusive default: ```bash npx -y tsx benchmarks/qa.ts --allow-foreground-qa ``` Allow the harness to open apps for wider coverage: ```bash npx -y tsx benchmarks/qa.ts --allow-foreground-qa --allow-screen-takeover ``` Save a result file: ```bash npx -y tsx benchmarks/qa.ts --allow-foreground-qa --output benchmarks/results/latest.json ``` Compare against a saved baseline and fail on regression: ```bash npx -y tsx benchmarks/qa.ts \ --allow-foreground-qa \ --allow-screen-takeover \ --baseline benchmarks/results/baseline.json \ --output benchmarks/results/current.json ``` ## Result format The benchmark prints a JSON report containing: - environment metadata - aggregate metrics - optional baseline comparison - per-case records Important metrics: - `axOnlyRatio` - `coreAxOnlyRatio` - `visionFallbackRatio` - `coreVisionFallbackRatio` - `axExecutionRatio` - `stealthCompatibleRatio` - `navigationAxOnlyRatio` - `targetingAxOnlyRatio` - `primitivePassRatio` - `batchPassRatio` - `capabilityPassRatio` - `capabilityStealthRatio` - `avgLatencyMs` - `avgNavigationLatencyMs` - `avgTargetingLatencyMs` - `avgPrimitiveLatencyMs` - `avgBatchLatencyMs` `core*` metrics exclude frontier capability probes so experimental coverage does not hide regressions in the main user path. `axExecutionRatio` and `targetingAxOnlyRatio` intentionally track AX-first targeting actions (`click` and `set_text`). Raw primitives such as `keypress`, `drag`, and `scroll` are measured separately so improved primitive coverage does not hide AX-first targeting regressions. Current benchmark goals are defined in `benchmarks/config.json`: - `coreAxOnlyRatio >= 0.8` - `avgLatencyMs <= 7500` - `avgTargetingLatencyMs <= 4000` ## Regression policy Regression tolerances live in: ```text benchmarks/config.json ``` When `--baseline` is provided, the benchmark compares current results against the baseline and exits non-zero if any configured metric regresses beyond tolerance. ## Results directory Store committed or local benchmark artifacts under: ```text benchmarks/results/ ``` The repository includes `benchmarks/results/.gitkeep` so contributors have a stable location for baselines and comparison outputs. Suggested local workflow: ```bash npx -y tsx benchmarks/qa.ts \ --allow-foreground-qa \ --output benchmarks/results/baseline.local.json npx -y tsx benchmarks/qa.ts \ --allow-foreground-qa \ --baseline benchmarks/results/baseline.local.json \ --output benchmarks/results/current.local.json ``` ## Contributor workflow 1. Run the benchmark and save a baseline. 2. Make your change. 3. Re-run the benchmark with `--baseline`. 4. Only claim improvement if the benchmark shows it. This benchmark should be treated as the official gate for semantic-targeting changes, fallback-policy changes, and AX-vs-vision efficiency claims. For documentation-only changes, running this benchmark is usually not necessary.