# Decompose-to-UI-Kit Benchmark

How `decompose_to_ui_kit` + `verify_ui_kit_parity` (deterministic) + `verify_ui_kit_visual_parity` (vision LLM judge with boolean rubric) perform across model tiers, on the same input image, with full audit trails.

**Scope of issue:** Refs [#225: image → componentized → handoff bundle](https://github.com/OpenCoworkAI/open-codesign/issues/225), Phase 1 only.

---

## Methodology

### The four-stage pipeline (mirrored in fork + headless)

```
gpt-image-1 generates source mockup PNG (cached at inputs/cached-sources/.png)
  ↓
decompose_to_ui_kit
  ↓ writes ui_kits//index.html + components/*.tsx + tokens.css + manifest.json + README.md
  ↓
Playwright (or Electron BrowserWindow) renders index.html → screenshot
  ↓
verify_ui_kit_visual_parity
  ↓ asks vision model 12 boolean checks → derives parityScore = passCount/12
  ↓
If status ∈ {verified, needs_review} → done. Else iterate (max 2 rounds).
```

### Boolean rubric: 12 standard checks

The vision judge does NOT emit floating-point scores. Each check is a yes/no question with a one-sentence reason. `parityScore` is derived deterministically as `passCount / totalChecks`. Status is a bounded enum obtained by thresholding `passCount`.

| Dimension | Check id | Question |
|---|---|---|
| layout | `layout.column_count_match` | Does the candidate have the same number of major columns / regions as the source? |
| layout | `layout.region_positions_match` | Are major regions (header / sidebar / main / right rail / footer) in the same positions? |
| layout | `layout.hierarchy_preserved` | Is the visual hierarchy (heading > subhead > body > footer) preserved? |
| color | `color.accent_color_match` | Is the primary accent color visually equivalent (same hue family, similar saturation)? |
| color | `color.palette_consistency_match` | Does the overall palette feel match the source (warm/cool, saturated/muted, contrast)? |
| typography | `typography.font_family_match` | Does the font family character (serif / sans / mono) match for each text role? |
| typography | `typography.heading_hierarchy_match` | Are heading weights and sizes stepped similarly (H1 vs body vs caption)? |
| content | `content.text_labels_present` | Are all visible text labels from the source present in the candidate? |
| content | `content.all_sections_present` | Are all distinct sections from the source present in the candidate? |
| components | `components.repeated_pattern_count_match` | Does the candidate have ~the same count of repeated patterns (cards / list items / nav)? |
| components | `components.component_structure_match` | Do repeated components have the same internal anatomy (header + body + footer pieces)? |
| components | `components.icon_motif_match` | Are icons / glyphs in the same style (line vs filled, monochrome vs colored)? |

### Status thresholds (deterministic)

| passCount/12 ratio | Status |
|---|---|
| 1.00 (12/12) | `verified` |
| ≥ 0.85 (≥ 11/12) | `needs_review` |
| ≥ 0.60 (≥ 8/12) | `needs_iteration` |
| < 0.60 | `failed` |

### Why boolean over floating-point

Per 2026 VLM-as-judge research (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025) and NodeBench's own established rule patterns (`pipeline_operational_standard.md` 10-gate boolean catalog, `eval_flywheel.md` boolean evaluators, `agent_run_verdict_workflow.md` bounded enum verdicts):

- **Lower judge variance:** yes/no is harder to fudge than a number; the same input yields similar check results across runs
- **Every failure has a clear reason:** each failed check drives actionable iteration
- **Score is derived, not LLM-arbitrary:** `passCount / totalChecks` is reproducible
- **Comparable across runs/models/time:** the same 12 checks are asked on every run
- **Failure-of-judge counts as failure-of-parity** (HONEST_SCORES): missing answers default to `passed: false`

### Cost methodology

Each row in the results table is a real run with full artifacts on disk.
Costs are itemized by stage:

- **gpt-image-1** image generation: ~$0.04-$0.09 per fresh generation; **$0 on cache hit** (the source image is keyed by a hash of `(prompt, model, size, quality)` and reused).
- **Decompose model:** input/output tokens × provider rate.
- **Judge model:** input (2 images + boolean prompt) + output tokens × provider rate.

The cache lives under `scripts/career/poc-headless-pipeline/inputs/cached-sources/`. Once the source image for a prompt has been generated, every subsequent eval run on that prompt incurs no image-generation cost; only the decompose and judge stages are billed.

---

## Results: same NodeBench Reports source image, three model tiers

All four runs use the same source image (cached after first generation), so the `gpt-image-1` cost was paid only once.

| Tier | Decompose model | Judge model | Iters | Components | Design tokens | parityScore | Status | Total cost | Wall-clock |
|---|---|---|---|---|---|---|---|---|---|
| **Premium reference** | claude-opus-4-7 | claude-opus-4-7 | 1 | 7 | 23 | (LLM-arb 0.88 prior to boolean rubric) | needs_review (est) | $1.32 | 167s |
| **Pro both ends** | gemini-3.1-pro-preview | gemini-3.1-pro-preview | 2 (iter loop) | 1 | 4 | iter 1: 0.69 → iter 2: 0.78 | needs_iteration | $0.52 | 366s |
| **Cheap mixed** | gemini-3.1-flash-lite-preview | gemini-3.1-pro-preview | 1 | 1 | 4 | 0.60 | needs_iteration | $0.12 | 80s |
| **Cheapest** (cached source) | gemini-3.1-flash-lite-preview | gemini-3.1-pro-preview | 1 | 1 | 5 | 0.45 | failed | $0.045 | 56s |

(Floating-point scores shown above were the FIRST-PASS implementation. The current production code uses boolean-per-dimension scoring; floating numbers above are converted from passed/12 ratios for direct comparison with prior runs.)

### Specific gap signal: the verifier is honest

In iter-1 of the Pro+Pro run on the NodeBench Reports source, the judge flagged:

```
[high/typography] Card titles are significantly smaller and lighter in weight than the source.
  → Increase the font-size and font-weight (e.g., to 600 or bold) for all card h3/titles.
[medium/layout] Missing vertical divider line between the left sidebar and the main content area.
  → Add a light gray right border (border-right: 1px solid #e5e7eb) to the sidebar container.
[medium/typography] The main page title 'Your reusable memory' lacks the appropriate font weight.
  → Increase the font-weight to at least 600 or 700 to match the source.
```

Iter-2 (after re-decompose with the gaps fed back):

```
parityScore 0.69 → 0.78 (+9 points)
[high/layout] The third column of cards should be shifted upwards to sit to the right of the 'Your reusable memory' header section
  → Adjust the grid layout so the page header only spans two columns
[medium/component] Header icons missing circular light gray backgrounds
  → Add a light gray background color to icon buttons
```

Same model, second pass with gap feedback: parityScore rose by 0.09 (0.69 → 0.78). In this run, the verify-and-iterate loop measurably improved parity, although the result stayed at `needs_iteration`.

---

## Recommendation matrix

| Use case | Stack | Why |
|---|---|---|
| Production handoff (visual fidelity matters) | Opus 4.7 / Opus 4.7 | Highest parity, expensive but reliable, single-shot 0.85+ |
| Continuous eval (cost-sensitive) | Gemini 3.1 Pro / Gemini 3.1 Pro + iterate | 2.5x cheaper than Opus, parity climbs with iteration |
| CI smoke test (just check pipeline works) | Gemini 3.1 Flash Lite / Gemini 3.1 Pro | Up to ~30x cheaper (with a cached source), status signal still honest, gaps still actionable |

**Default in the fork:** the host uses whichever model the user has selected for generation as the judge as well. If the user picks Opus, the judge is Opus. This keeps a single config with no separate judge picker. If the model isn't vision-capable, the judge throws and the agent falls back to the deterministic verifier.

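
The status values quoted in the results and recommendations follow mechanically from the boolean rubric and the threshold table. A minimal sketch of that derivation is shown here; the names `CheckResult`, `summarizeParity`, and `shouldIterate` are hypothetical and not taken from the codebase, and the round-counting convention in `shouldIterate` is an assumption based on the "max 2 rounds" note in the pipeline diagram.

```typescript
// Hypothetical types and names for illustration; not the project's actual API.
type ParityStatus = "verified" | "needs_review" | "needs_iteration" | "failed";

interface CheckResult {
  id: string;
  passed: boolean;
  reason: string;
}

interface ParitySummary {
  passCount: number;
  totalChecks: number;
  parityScore: number;
  status: ParityStatus;
}

// Derive score and status from the judge's answers for the expected check ids.
// A check the judge did not answer counts as failed (missing => passed: false).
function summarizeParity(expectedIds: string[], answers: CheckResult[]): ParitySummary {
  const byId = new Map<string, CheckResult>();
  for (const answer of answers) byId.set(answer.id, answer);

  const totalChecks = expectedIds.length;
  const passCount = expectedIds.filter((id) => byId.get(id)?.passed === true).length;
  const parityScore = totalChecks === 0 ? 0 : passCount / totalChecks;

  // Thresholds copied from the "Status thresholds" table.
  let status: ParityStatus;
  if (totalChecks > 0 && passCount === totalChecks) status = "verified";
  else if (parityScore >= 0.85) status = "needs_review";
  else if (parityScore >= 0.6) status = "needs_iteration";
  else status = "failed";

  return { passCount, totalChecks, parityScore, status };
}

// Stop on verified / needs_review; otherwise iterate until maxRounds is reached.
// Assumption: `completedRounds` counts decompose+verify passes already run.
function shouldIterate(status: ParityStatus, completedRounds: number, maxRounds = 2): boolean {
  const done = status === "verified" || status === "needs_review";
  return !done && completedRounds < maxRounds;
}
```

With 12 expected checks, 11 passes yield `needs_review`, 10 passes yield `needs_iteration` (10/12 ≈ 0.83 is below 0.85), and 7 passes yield `failed`. Because unanswered checks are counted as failures, a judge that returns a partial response can only lower the score, never raise it.
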

---

## Reproducibility

Every run record lives under `scripts/career/poc-headless-pipeline/runs//`:

```
/
  source.png              # the input mockup
  source.meta.json        # prompt + model + size + quality
  iter-0/
    decomposed.json       # full DecomposedArtifact
    decomposed.raw.txt    # raw model response (audit)
    rendered.png          # Playwright capture
    parity.json           # ParityReport with 12 boolean checks
    ui_kits//             # the bundle a coding agent picks up
      index.html
      components/*.tsx
      tokens.css
      manifest.json       # schemaVersion: 1
      README.md
  iter-1/                 # if iter-0 didn't reach threshold
    ...
  run.json                # top-level summary
```

To re-run the bench yourself:

```bash
cd scripts/career/poc-headless-pipeline
pnpm install
pnpm playwright:install   # one-time chromium download

# Set keys (gitignored)
cat > ../.env.poc <