# Agent Benchmark Suite

The benchmark suite should measure the business value of Geometra's agent-native model against browser automation baselines. The target claim is narrow: for stateful operational software, a trusted geometry + action-contract protocol should reduce context size, tool calls, and security failures while increasing replayability.

## Modes

- `geometra-native`: tree + layout + agent contracts + trace.
- `geometra-mcp`: current Geometra MCP/proxy extraction.
- `playwright-mcp`: DOM/accessibility snapshot automation.
- `vision-computer-use`: screenshot-driven browser use.

## Metrics

- Task success rate.
- Context bytes and approximate token budget.
- Tool call count.
- Median latency.
- Human approval count.
- Security failure count.
- Replay determinism: whether the same trace can be inspected and rerun.
- Postcondition verification rate.

## Scenario Families

- Claims review: approve payout, request evidence, escalate, export audit packet.
- Financial operations: reconcile exceptions, approve transfer, flag suspicious counterparty.
- Internal admin: rotate access, suspend account, export records.
- Compliance queue: classify evidence, attach reason code, produce audit summary.
- Dense data work: sort/filter/select rows where visual order matters.

## Deterministic Harness

The repo includes `benchmarks/agent-native-scenarios.json` and `scripts/benchmark-agent-native-value.mjs`. The harness is intentionally deterministic: it validates scenario shape, prints comparison tables, and asserts that the native mode has better or equal context/tool-call budgets than the baselines in every scenario. Live browser benchmarks can be layered on top later, but the deterministic harness gives CI a cheap guardrail for the concept pitch.
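
As a rough illustration of the three harness steps described above (shape validation, comparison table, budget assertion), the following is a minimal sketch and not the contents of `scripts/benchmark-agent-native-value.mjs`. The scenario shape (`id`, `family`, `budgets`, `contextBytes`, `toolCalls`), the function names, and the numbers are all assumptions made up for the example; the real `benchmarks/agent-native-scenarios.json` may be structured differently.

```javascript
// Mode names come from the "Modes" list; everything else here is hypothetical.
const MODES = ["geometra-native", "geometra-mcp", "playwright-mcp", "vision-computer-use"];
const NATIVE = "geometra-native";
const BUDGET_KEYS = ["contextBytes", "toolCalls"];

// Placeholder numbers for illustration only, not measured results.
const exampleScenarios = [
  {
    id: "claims-review-approve-payout",
    family: "claims-review",
    budgets: {
      "geometra-native": { contextBytes: 4000, toolCalls: 3 },
      "geometra-mcp": { contextBytes: 9000, toolCalls: 5 },
      "playwright-mcp": { contextBytes: 30000, toolCalls: 8 },
      "vision-computer-use": { contextBytes: 120000, toolCalls: 12 },
    },
  },
];

// Step 1: validate scenario shape. Returns a list of error strings.
function validateScenario(scenario) {
  const errors = [];
  if (typeof scenario.id !== "string" || scenario.id.length === 0) {
    errors.push("missing id");
  }
  for (const mode of MODES) {
    const budget = scenario.budgets?.[mode];
    if (!budget) {
      errors.push(`missing budgets for ${mode}`);
      continue;
    }
    for (const key of BUDGET_KEYS) {
      if (!Number.isFinite(budget[key]) || budget[key] < 0) {
        errors.push(`${mode}.${key} must be a non-negative number`);
      }
    }
  }
  return errors;
}

// Step 3: native mode must be better than or equal to every baseline.
function checkNativeBudgets(scenario) {
  const violations = [];
  const native = scenario.budgets[NATIVE];
  for (const mode of MODES) {
    if (mode === NATIVE) continue;
    const baseline = scenario.budgets[mode];
    for (const key of BUDGET_KEYS) {
      if (native[key] > baseline[key]) {
        violations.push(`${scenario.id}: native ${key} ${native[key]} > ${mode} ${baseline[key]}`);
      }
    }
  }
  return violations;
}

// Runs both checks; budgets are only compared once the shape is valid.
function runChecks(scenarios) {
  const failures = [];
  for (const scenario of scenarios) {
    const shapeErrors = validateScenario(scenario);
    if (shapeErrors.length > 0) {
      failures.push(...shapeErrors.map((e) => `${scenario.id ?? "<unknown>"}: ${e}`));
      continue;
    }
    failures.push(...checkNativeBudgets(scenario));
  }
  return failures;
}

// Step 2: print one comparison row per scenario and mode.
console.table(
  exampleScenarios.flatMap((s) =>
    MODES.map((mode) => ({ scenario: s.id, mode, ...s.budgets?.[mode] }))
  )
);

const failures = runChecks(exampleScenarios);
if (failures.length > 0) {
  console.error(failures.join("\n"));
  process.exitCode = 1; // non-zero exit so a CI job would fail
}
```

Because the inputs are static JSON-like data and no browser is launched, a check of this kind runs in milliseconds and produces the same result on every run, which is what makes it usable as a CI guardrail.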