{ "cells": [ { "cell_type": "markdown", "id": "grounded-spatial-00", "metadata": {}, "source": [ "# Evaluating Grounded Spatial Reasoning with GPT-5.5" ] }, { "cell_type": "markdown", "id": "grounded-spatial-01", "metadata": {}, "source": [ "Computer vision systems have traditionally handled spatial understanding by decomposing it into narrower perception tasks: detecting objects, segmenting regions, classifying scenes, or grounding language to boxes and masks. That decomposition works well when the goal is localization. It is less direct when the task requires a model to interpret a visual scene, reason over what the structure means, apply constraints, and produce a plan that remains valid.\n", "\n", "Frontier multimodal models make a different workflow possible. Instead of training a task-specific model on many paired examples for one layout domain, we can test whether a reasoning-capable model like GPT-5.5 can solve more of the task directly when given enough scaffolding: clear visual context, an operating procedure, constrained objects, a structured output format, and evals that identify where the result breaks.\n", "\n", "The use case is office layout generation from an empty floorplan. Given an empty floorplan image, an office-layout SOP, and a furniture catalog, GPT-5.5 produces a structured layout spec that identifies spaces, infers room roles, applies numeric rules, places furniture, and records assumptions about the source drawing. The idea also came from a real office problem. Headcount in our New York office was growing fast enough that “could we fit a few more desks somewhere?” started to sound like a reasonable thing to ask a model.\n", "\n", "We use an office floorplan as the test case because the failures are easy to inspect. GPT-5.5 receives an empty floorplan, a reusable layout SOP, and a small furniture catalog, then produces a structured layout spec. 
\n", "\n", "The experiment is built around evaluating that spec, not judging a rendered image alone. Each run can be checked for mechanical validity, semantic grounding, and similarity to a human-authored reference when one exists. This lets us separate failure modes that a visual mockup would collapse together: invalid geometry, wrong room interpretation, incorrect furniture program, and weak source grounding." ] }, { "cell_type": "markdown", "id": "grounded-spatial-02", "metadata": {}, "source": [ "## Task setup: from floorplan to layout spec\n", "\n", "The generation step separates the problem into three inputs: 1) visual evidence from the floorplan, 2) planning policy from the SOP, and 3) object dimensions from the furniture catalog.\n", "\n", "The visible evidence is the empty floorplan image, `empty.png`. The model infers walls, doors, stairs, restrooms, partitions, openings, and usable room shapes from the image itself. It does not see the filled reference image or the gold JSON during generation; those artifacts are held back for evaluation.\n", "\n", "

\n", " \"Figure\n", "

\n", "

Figure 1: The model sees only the empty floorplan during generation. The filled reference is reserved for evaluation and qualitative comparison.

\n", "\n", "The reusable SOP in `constraints.json` defines the planning policy. It describes the functional spaces the layout should include, the anchor furniture required for those spaces, and the spatial rules the plan should try to satisfy.\n", "\n", "```json\n", "{\n", " \"profile_id\": \"office_furniture_sop_v1\",\n", " \"client_brief\": {\n", " \"required_spaces\": [\n", " { \"semantic_type\": \"open_office\", \"count\": 2 },\n", " { \"semantic_type\": \"reception\", \"count\": 1 },\n", " { \"semantic_type\": \"executive_suite\", \"count\": 1 }\n", " ],\n", " \"required_objects\": [\n", " { \"object_type\": \"reception_desk\", \"count\": 1, \"preferred_semantic_type\": \"reception\" },\n", " { \"object_type\": \"executive_desk\", \"count\": 1, \"preferred_semantic_type\": \"executive_suite\" },\n", " { \"object_type\": \"guest_chair\", \"count\": 3, \"preferred_semantic_type\": \"executive_suite\" }\n", " ]\n", " },\n", " \"spatial_constraints\": [\n", " {\n", " \"id\": \"primary_circulation_clearance\",\n", " \"enforcement\": \"hard\",\n", " \"primary_path_min_clearance_in\": 48,\n", " \"cluster_spacing_min_in\": 42\n", " },\n", " {\n", " \"id\": \"workstation_cluster_size\",\n", " \"enforcement\": \"soft\",\n", " \"allowed_cluster_sizes\": [4, 6]\n", " },\n", " {\n", " \"id\": \"storage_ratio\",\n", " \"enforcement\": \"soft\",\n", " \"rule\": \"Prefer roughly 1 storage cabinet for every 8 open-office workstations, rounded up.\"\n", " }\n", " ]\n", "}\n", "```\n", "The furniture catalog supplies object dimensions. 
It turns labels like `workstation`, `reception_desk`, or `storage_cabinet` into objects with dimensions and simple placement priors, so the model has to reason about whether objects actually fit.\n", "\n", "```json\n", "{\n", " \"workstation\": {\n", " \"dimensions_in\": [60, 30],\n", " \"rules\": [\"keep grouped\", \"orient for circulation\"]\n", " },\n", " \"reception_desk\": {\n", " \"dimensions_in\": [72, 30],\n", " \"rules\": [\"place near entrance\", \"keep visitor path clear\"]\n", " },\n", " \"executive_desk\": {\n", " \"dimensions_in\": [84, 36],\n", " \"rules\": [\"prefer private enclosed room\", \"keep guest-facing approach clear\"]\n", " },\n", " \"guest_chair\": {\n", " \"dimensions_in\": [28, 28],\n", " \"rules\": [\"use for visitor seating\", \"keep balanced around executive or meeting furniture\"]\n", " },\n", " \"storage_cabinet\": {\n", " \"dimensions_in\": [48, 20],\n", " \"rules\": [\"prefer wall adjacency\", \"avoid circulation\"]\n", " }\n", "}\n", "```\n", "This separation keeps the setup reusable across cases. A new floorplan can change the geometry while the SOP and catalog stay fixed, which makes the task more about spatial reasoning than case-specific prompting.\n", "\n", "In the current implementation, generation has two reasoning stages. First, GPT-5.5 grounds the room semantics from the image. Then it generates the final layout conditioned on that semantic room plan and the fixed anchor requirements. This split came from the hillclimb: recognizing the room program and placing furniture are related, but they are not identical reasoning problems.\n", "\n", "The main output is `raw_layout.json`. It is a machine-checkable spatial specification that can be validated, repaired, compiled, and inspected before rendering. The spec includes a feasibility judgment, assumptions, unsatisfied constraints, semantic zones, furniture placements, and circulation paths. 
Placements include object type, center coordinates, dimensions, orthogonal rotation, zone membership, optional grouping, rationale, and confidence.\n", "\n", "All coordinates and dimensions are in inches, and rotations are discrete: `0`, `90`, `180`, or `270`. That keeps the output close to furniture-planning practice and makes it easier to validate with geometry checks.\n", "\n", "Rendering is downstream. GPT Image 2 can use the original floorplan plus instructions derived from the structured spec to produce a visual inspection artifact. The render makes the plan easier for humans to review, but the source of truth is the spec. Evaluation stays spec-first because validity is geometric." ] }, { "cell_type": "markdown", "id": "grounded-spatial-03", "metadata": {}, "source": [ "## Eval setup\n", "\n", "Generation and evaluation are handled as separate stages. GPT-5.5 first produces a candidate layout spec, `raw_layout.json`. Promptfoo then runs the eval pass over that saved spec. The config uses an echo provider because the layout has already been generated; Promptfoo is not creating the plan, only coordinating checks over the candidate JSON.\n", "\n", "Each candidate is checked in three ways:\n", "\n", "1. **validity — deterministic hard gate.**\n", "This is the only pass/fail gate. It checks whether the generated layout is mechanically valid against `constraints.json` and the source floorplan. The check covers hard spatial constraints such as source-wall collisions, object overlap, placements staying inside declared zones, and other geometry errors that would make the plan unusable regardless of how plausible the render looks.\n", "\n", "2. **grounded_program — semantic grounding score.**\n", "This score combines deterministic room-coverage checks with a GPT-5.5 judgment about whether the layout is faithful to the visible floorplan and the brief. 
It looks at whether the right kinds of rooms were recognized, whether major spaces were furnished appropriately, and whether the zoning interpretation is grounded in the drawing rather than invented. This is a score, not a hard reject rule, because semantic interpretation is harder to reduce to a single deterministic check.\n", "\n", "3. **reference_similarity — deterministic gold-spec score.**\n", "When a golden JSON spec exists, this check compares the generated layout to the reference using zone agreement, furniture-program agreement, and tolerant placement proximity. This gives us a benchmark-style signal without making the human reference the only acceptable layout. It is useful for comparing candidates, but it is not the hard gate.\n", "\n", "The outputs are written to `eval_results.json` for each run and surfaced in the trace viewer. The viewer is intentionally simple, but it makes the evaluation loop concrete: each run shows the case, model, reasoning setting, render status, validity result, semantic grounding score, and reference-similarity score in one place.\n", "\n", "
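The deterministic side of the `validity` gate needs only simple geometry. Here is a minimal sketch of the overlap and zone-bounds checks, assuming axis-aligned footprints after orthogonal rotation; field names such as `center_in` and `zone_id` follow the spec fields described above but are illustrative rather than the project's actual schema:

```python
def footprint(p):
    """Axis-aligned box (x0, y0, x1, y1) of a placement, in inches."""
    w, d = p["dimensions_in"]
    if p["rotation"] in (90, 270):  # orthogonal rotation swaps width and depth
        w, d = d, w
    cx, cy = p["center_in"]
    return (cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)

def overlaps(a, b):
    """True when two axis-aligned boxes intersect with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def inside(box, zone):
    """True when a box lies entirely within a zone's bounding box."""
    return box[0] >= zone[0] and box[1] >= zone[1] and box[2] <= zone[2] and box[3] <= zone[3]

def validity_errors(placements, zones):
    """Collect mechanical errors: pairwise object overlap and zone-bounds violations."""
    boxes = [footprint(p) for p in placements]
    errors = []
    for i, box in enumerate(boxes):
        for j in range(i + 1, len(boxes)):
            if overlaps(box, boxes[j]):
                errors.append(("overlap", i, j))
        zone = zones.get(placements[i]["zone_id"])
        if zone is not None and not inside(box, zone):
            errors.append(("out_of_zone", i, None))
    return errors
```

Source-wall collisions can reuse the same `overlaps` test once wall linework is extracted into thin boxes or segments.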

\n", " \"Example\n", "

\n", "

An example failed run surfaced in the viewer.

\n", "\n", "This candidate failed the hard `validity` gate because the executive desk and guest chair overlapped source wall or partition linework. It still received separate soft scores for `grounded_program` and `reference_similarity`, which makes the failure easier to diagnose than a single pass/fail label.\n", "\n", "This separation is a key design choice, since a run can be mechanically invalid but semantically informative. It can satisfy the hard geometry checks while still choosing the wrong room program. It can resemble the gold spec in furniture counts while drifting in placement or zoning. Keeping these signals separate lets us hillclimb the system: improve the prompt, revise the spec, add a validator, repair a placement issue, or collect a better gold reference depending on which layer failed." ] }, { "cell_type": "markdown", "id": "grounded-spatial-04", "metadata": {}, "source": [ "## Results: selected hillclimb checkpoints\n", "\n", "The most useful runs were not a clean sequence of incremental improvements. They were checkpoints that exposed different problems in the workflow. The pattern was: wrong metric, wrong decomposition, wrong contract, then a better division of labor between model reasoning and deterministic geometry." ] }, { "cell_type": "markdown", "id": "grounded-spatial-05", "metadata": {}, "source": [ "### Run 1 - A valid layout exposed the wrong metric\n", "\n", "One early case passed the hard checks and produced a coherent office plan. GPT-5.5 selected the large east room as the open-office zone, treated the central lobby as reception, and used a smaller north room as the executive suite. Under that interpretation, the layout was internally consistent.\n", "\n", "\n", "

\n", " \"Run\n", "

\n", "

The generated plan looked plausible, but visual similarity to the human reference was the wrong primary metric.

\n", "\n", "The issue was the evaluation target. The run received a reference score of only `0.0644` because we were still comparing against the human image with furniture-mask IoU. That made the score too sensitive to pixel-level overlap, even when the generated plan was a coherent alternate layout.\n", "\n", "The JSON made the model’s interpretation easier to inspect than the render did:\n", "\n", "```json\n", "{\n", " \"assumptions\": [\n", " \"The broad double doors at the south edge of the central lobby are treated as the primary public entrance.\",\n", " \"The large east room is assigned as the usable open-office zone.\",\n", " \"Other unlabeled large rooms are left unprogrammed rather than classified as open-office because the source image provides no room labels.\"\n", " ],\n", " \"zones\": [\n", " {\n", " \"id\": \"zone_open_office_east\",\n", " \"label\": \"East open-office room used for workstation density calculation\"\n", " }\n", " ]\n", "}\n", "```\n", "This pushed the project toward spec-first evaluation. The render was still useful for inspection, but the primary object of evaluation had to be the structured plan: what zones the model selected, what program it assigned, and whether the layout was mechanically valid." ] }, { "cell_type": "markdown", "id": "grounded-spatial-06", "metadata": {}, "source": [ "### Run 2 - A plausible layout exposed the wrong decomposition\n", "\n", "On a more complex office floorplan, the generated furniture plan looked plausible, but the room interpretation was wrong. Reception moved to the wrong area, the south private-office region was not preserved, and the candidate produced a materially different room decomposition from the reference program.\n", "\n", "\n", "

\n", " \"Run\n", "

\n", "

The visible furniture plan was plausible, but the room-program interpretation diverged before placement.

\n", "\n", "This changed the workflow. We split the task into two stages, and thus two LLM calls: first ground the room semantics from the image, then generate the furniture layout conditioned on that semantic room plan and the fixed anchor requirements. We also added an explicit client brief so the model did not have to infer the full business program from geometry alone." ] }, { "cell_type": "markdown", "id": "grounded-spatial-07", "metadata": {}, "source": [ "### Run 8 - An overfilled layout exposed the wrong contract\n", "\n", "After the split, the workflow became easier to inspect but briefly worse in a useful way. Deterministic furnishing logic produced a 38-workstation plan because requirements propagated too broadly across ambiguous `other` rooms.\n", "\n", "

\n", " \"Run\n", "

\n", "

An over-specified furnishing contract propagated workstation requirements too broadly, leading to an unrealistic 38-workstation plan.

\n", "\n", "This exposed an over-specification problem. The model was no longer only solving a spatial task; the benchmark was starting to reward reproducing hidden allocation logic. The fix was to separate semantic room grounding from the furnishing contract, add brief-space bindings, and reduce reliance on broad labels like `other`.\n", "\n", "The goal became narrower: keep fixed anchors deterministic, but preserve enough design freedom for valid alternative layouts." ] }, { "cell_type": "markdown", "id": "grounded-spatial-08", "metadata": {}, "source": [ "### Run 15 - A valid end-to-end case clarified the division of labor\n", "\n", "In the strongest later case, the run reached physical validity, though some semantic disagreement with the reference room program still remained. The candidate reached `completed_valid` after one model repair for a zone-bounds issue, while deterministic geometry cleared residual desk and storage collisions through repair passes, orthogonal-rotation search, local repair, and a small joint-repack fallback.\n", "\n", "

\n", " \"Run\n", "

\n", "

The first physically valid result on the harder office-building case. GPT-5.5 produced the grounded layout plan, and deterministic repair cleared the last geometry errors.

\n", "\n", "This clarified the system boundary. GPT-5.5 was most useful for grounded semantic decisions: identifying room roles, assigning the program, and proposing a structured plan. Deterministic code was better for cheap geometric cleanup: bounds checks, object overlap, source-wall collisions, rotations, and repacking.\n", "\n", "That became the hillclimb. The model proposes a grounded layout spec. The eval stack identifies whether the failure is semantic, mechanical, or reference-level. Deterministic code repairs the parts that should not require another model call. The render remains useful for inspection, but progress is measured at the spec level." ] }, { "cell_type": "markdown", "id": "grounded-spatial-09", "metadata": {}, "source": [ "## Takeaways\n", "\n", "Frontier models meaningfully change what is practical in a grounded spatial-planning workflow. They can read a sparse floorplan, infer a room program, apply numeric layout rules, emit a structured spec, expose assumptions, and respond to eval feedback without task-specific fine-tuning for this exact layout problem.\n", "\n", "By requiring GPT-5.5 to emit a machine-checkable layout spec, the system turns spatial reasoning into an inspectable artifact: room roles, assumptions, constraints, placements, and repair points are made explicit before rendering.\n", "\n", "That representation also clarifies the right division of labor. GPT-5.5 contributes high-level interpretation: reading ambiguous geometry, assigning a room program, and proposing a coherent layout. Deterministic components enforce the parts of the task that should be exact, including boundary checks, collision detection, clearance rules, and local repair. Model reasoning is therefore not treated as self-validating; it is constrained, checked, and improved through the surrounding system.\n", "\n", "The remaining challenge is semantic grounding. The hardest floorplans are not those with no valid solution, but those with several plausible ones. 
In those cases, success depends on whether the system identifies the program intended by the drawing, not merely whether it can produce a physically valid arrangement. Future progress will depend on broader eval coverage, stronger source-geometry extraction, and more precise matching between visual zones and semantic roles. The broader conclusion is that GPT-5.5 can move more spatial-planning work into general multimodal reasoning." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }