# Grothendieck's Vanishing Theorem: An AI-Driven Formalization Case Study **Project:** Formalization of Hartshorne III, Theorem 2.7 in Lean 4 + Mathlib. **Window:** 2026-03-27 → 2026-05-01 (35 elapsed days, 27 active). **Repo:** `github.com/Vilin97/Clawristotle`, branch `wip/grothendieck-vanishing` (merged via PR auto-merge into protected `grothendieck-vanishing`). **Authors:** see §13. **Status:** *draft technical report*. Awaiting Brian Nugent's review-2 to populate §11. --- ## 0. Abstract This report is a clinical case study, not a triumph narrative. We formalized Grothendieck's vanishing theorem (for any Noetherian topological space `X` of dimension `n` and any sheaf `F` of abelian groups on `X`, `Hⁱ(X, F) = 0` for all `i > n`) in Lean 4 + Mathlib in 9 active days, using a minimal-agent toolkit: Claude Code (interactive + autonomous loops) + `lean-lsp-mcp` + the Aristotle external prover (Harmonic AI). The proof closed sorry-free with no axioms beyond `propext` / `Classical.choice` / `Quot.sound` on 2026-04-04. After that, a human code review by Brian Nugent (Apr 8–15) identified five "Large changes" plus dozens of per-file items; the response ran for two weeks (Codex CLI refactor loop + Codex compress loop + a mathlib-style cleanup PR), reducing normalized LOC by 42% (7,016 → 4,087) and crossing a user-set raw <5,000 target on the very last cycle. Headline numbers: | Metric | Value | |---|---| | Theorem | Hartshorne III.2.7 — `H^n(X, F) = 0` for `n > dim X`, `X` Noetherian | | Sorry / non-default axioms at HEAD | 0 / 0 | | Lean files at HEAD | 15, raw 5,061 LOC, normalized 4,087 | | Git commits since Mar 27 | 904 | | Active days (commit-touched) | 27 of 35 elapsed | | Claude Code | 31,529 turns / 270 sessions / 13.9M output tokens / 6.76B cache reads | | Codex CLI | 5 persisted GV sessions (~26.4M tokens) + 564 ephemeral cycles (compress 86 + refactor 478) | | Aristotle (Harmonic) | 94 proving-phase + 7 aristotle-loop submissions; 22 / 66 / 5 / 1 outcome split (`COMPLETE` / `…WITH_ERRORS` / `CANCELED` / `FAILED`) | | Human reviews | Brian Nugent ×2 (Apr 8–15 = R1, May 1+ = R2 in progress) | The thesis: *with a minimal-agent toolkit, a third-year algebraic-geometry theorem can be formalized in five weeks, but the resulting code requires substantial human review to reach idiomatic quality, and even after a review-driven refactor, recognisable failure modes persist.* This document catalogues those failure modes and quantifies, for one project, how much code-quality improvement a single human review actually drove. The intended audience is anyone building or evaluating an AI-driven formalization workflow. We did not build a novel agent. We used an off-the-shelf one and tried to honestly catalogue what broke. --- ## 1. The Mathematics The theorem (Hartshorne III, Theorem 2.7): > Let `X` be a Noetherian topological space of dimension `n`. For any sheaf > `F` of abelian groups on `X` and any integer `i > n`, the sheaf cohomology > group `Hⁱ(X, F)` vanishes. In Lean (full statement extracted in `artifacts/lean-theorem-statement_gv.lean`): ```lean theorem GrothendieckVanishing (X : TopCat.{u}) [TopologicalSpace.NoetherianSpace X] (n : ℕ) (h : n > topologicalKrullDim X) (F : TopCat.Sheaf AddCommGrpCat.{u} X) : Subsingleton (Sheaf.H F n) ``` `Sheaf.H` is defined via `Ext` in Mathlib's derived-category machinery; the statement uses `Subsingleton` rather than `IsZero` because the derived category constructions naturally produce typeclass-style emptiness rather than a zero-object identification. (This choice is at the boundary of idiomatic Mathlib style and is one of the things review-2 may flag.) ### Proof architecture (4 pillars) The proof proceeds by well-founded induction on the topological Krull dimension of `X`, with four architectural pillars: 1. **Reducible → irreducible reduction.** Any Noetherian space decomposes into finitely many irreducible components; cohomology of the whole splits along Mayer-Vietoris into cohomology of the components and their intersections (which have strictly smaller dimension), so the inductive hypothesis closes the reducible case. 2. **Closed immersion machinery.** For an irreducible space, we pick a closed subset `Z` and the open complement `U = X \ Z`. The associated short exact sequence `0 → j_! F|_U → F → i_* F|_Z → 0` (with `j_!` extension by zero from the open set, `i_*` pushforward from the closed set) plus the long exact sequence in cohomology reduces vanishing to vanishing on `Z` and `U`. 3. **Flasque vanishing.** The constant sheaf `Z_X` on an irreducible space is flasque (every restriction map is surjective). Flasque sheaves have vanishing higher cohomology. This handles the dim-0 base case and parts of higher dimensions. 4. **Finite-generator reduction via filtered colimits.** Any sheaf of abelian groups on a Noetherian space is the filtered colimit of its finitely-generated subsheaves; cohomology commutes with filtered colimits on Noetherian spaces. This reduces the general statement to vanishing on finitely-generated sheaves, which decompose further until we reach pieces handled by pillars 2 and 3. The proof's headline architectural surprise is the **last sorry**: it was closed not by proving the dreaded "filtered colimits of injectives are injective," but by reframing — injectives are flasque, filtered colimits of flasques are flasque, flasques have vanishing higher cohomology. The hardest-looking sublemma was avoided. We return to this in §4.4. The proof is conventional in mathematical content; what makes this project worth reporting is *how* it was assembled, not what was proved. ![Lean dependency graph](artifacts/dep_graph_gv.png) **Figure: Lean import graph for `Aristotle/GrothendieckVanishing/main/`.** All 15 files at HEAD, with arrows pointing from importers to imports. The graph reads bottom-up in dependency order: `TopologicalKrullDim.lean` and the extension-by-zero / closed-immersion infrastructure feed into the `FlasqueVanishing` and `ClosedImmersionCohomology` blocks; those plus `PresheafFilteredColimit` and `FiniteGeneratorReduction` feed into `IrreducibleStep` (the dim≥1 irreducible case); finally `GrothendieckVanishing.lean` assembles the dim-0, irreducible-positive, and reducible cases by well-founded induction on Krull dimension. The narrow neck at `IrreducibleStep` reflects the four-pillar architecture — every non-trivial cohomology fact funnels through it. --- ## 2. The Setup — minimal-agent toolkit We deliberately used an off-the-shelf agent stack with no custom plumbing beyond a thin MCP wrapper for Aristotle and the loop drivers (`scripts/codex_*_loop.py`). ### 2.1 Tools - **Claude Code (the CLI agent).** Anthropic's coding agent, run on the laptop and on Hyak. Models used in this project: `claude-opus-4-7`, `claude-opus-4-6`, `claude-haiku-4-5`. Mostly opus-4-6 (24,431 of 31,529 assistant turns). - **`lean-lsp-mcp`.** MCP server exposing Lean LSP operations (`lean_goal`, `lean_diagnostic_messages`, `lean_multi_attempt`, `lean_run_code`, plus a family of search tools — `lean_leansearch`, `lean_loogle`, `lean_local_search`, `lean_state_search`, `lean_hammer_premise`). 2,256 calls across the project. - **Codex CLI.** OpenAI's coding agent, run on Hyak as the worker for the compress and refactor loops via `codex exec --ephemeral --dangerously-bypass-approvals-and-sandbox`. - **Aristotle (Harmonic AI).** External automated theorem prover. Submit a Lean file with a `sorry`; receive (eventually) a `tar.gz` with a proof or a partial proof. Accessed via `aristotlelib` and a custom MCP wrapper `tools/aristotle-mcp-server.py`. ### 2.2 Loop modes Five distinct loop modes ran across the five weeks (`automation-suite_gv.md` has the full slash-command catalogue): | Loop | Tool | When | Cycles | Persistence | |---|---|---|---|---| | `/babysit` | Claude Code | Proving | many (interactive) | full session jsonl | | `autonomous_loop` | Claude Code | Proving + docs | scripted | full session jsonl | | `aristotle-loop` | Aristotle | Apr 27 (infra burst) | ~10 | API records | | `codex_compress_loop` | Codex worker + Claude evaluator | Apr 27–28 | 86 | per-cycle JSONL + markdowns | | `codex_refactor_loop` | Codex worker + Claude evaluator | Apr 19–27 | 478 | per-cycle JSONL only | The two Codex loops have the same architecture: Codex (worker) edits Lean files in `--ephemeral` mode, `lake env lean` gates the result, and Claude (evaluator, with `--json-schema EVALUATOR_SCHEMA`) scores the cycle on a [-2, +2] scale and writes structured JSON state. A cycle is committed and pushed if score ≥ 1 with a non-empty diff. The Codex `--ephemeral` flag means **Codex sessions are not persisted to `~/.codex/sessions/`** — only the loop's own per-cycle JSONL records the work. Token cost for 564 of 569 GV-attributed Codex turns is therefore **not directly recoverable** from any local file; the report's Codex token totals reflect only the 5 persisted interactive sessions. ### 2.3 Two machines - **Laptop** (`vasil@MacBook-Pro`, `~/Github/Clawristotle`, was `~/Github/aristotle` before a partial rename — see §3 below). Primary during the proving phase. - **Hyak** (UW HPC cluster `klone.hyak.uw.edu`, `/mmfs1/gscratch/amath/vilin/Clawristotle`). Primary during the loop phases; the `.compress-state/` and `.refactor-state/` directories are gitignored and only on Hyak. Git was the only cross-machine coordination mechanism. We did not run distributed agents; the user typically sat at one machine at a time. ### 2.4 Branch protection workflow `grothendieck-vanishing` is protected. Agents push freely to `wip/grothendieck-vanishing`. A single PR (`wip → grothendieck-vanishing`) stays open with auto-merge enabled and merges automatically once the build CI check passes. 904 wip-branch commits flowed through with zero direct-push incidents over five weeks. This is one of the durable patterns in §9. --- ## 3. Timeline The project is bimodal in commit density and active in 27 of 35 elapsed days. | # | Phase | Dates | Driver | Headline outcome | |---|---|---|---|---| | 1 | **Proving** | Mar 27 – Apr 4 | Claude Code (laptop) + Aristotle | 0 sorries, but messy code → "A" | | 2 | **Docs / PR / first audit window** | Apr 5 – Apr 16 | Manual + sparse Claude | TECHNICAL_REPORT (old, now unreliable), PR #15; review-1 by Brian Apr 8–15 | | 3 | **Codex refactor** | Apr 19 – Apr 27 | Codex worker + Claude evaluator (Hyak) | 478 cycles, 35% completed assigned task | | 4 | **Codex compress** | Apr 27 – Apr 28 | Same architecture, LOC-decrease gate | 86 cycles, 7,016 → 4,087 normalized LOC | | 5 | **Mathlib polish + report** | Apr 29 – May 1 | Claude (interactive + agents) | PR #27 docstrings/lint, this report | | 6 | **Review 2** | May 1 + | Brian (in progress) | "C" — pending | ![Phase timeline](artifacts/phase_timeline_gv.png) **Figure: Commits per day across the project, classified by commit-message tag.** 904 commits over 35 elapsed days, of which 27 had at least one commit. The bimodality is visible: a 9-day proving spike (Mar 27 – Apr 4) dominated by `feat:` and `prove:` commits closes the sorries; an 8-day no-commit lull (Apr 8–15) is Brian's review-1 reading window; then the response phase (Apr 17 – May 1) is a wall of `refactor:` and `compress:` commits driven by the two Codex loops. The single largest commit-day is **Apr 23 (137 commits)** — almost all of which are checklist-exhaustion no-ops on the refactor loop (see §6.1). The final docstring/mathlib-style PR #27 is the small `style:` cluster on Apr 29 – May 1. ### 3.1 The two-review backbone The story has two human-review events, and almost everything else is reaction to them: - **Apr 8 – Apr 15 (review-1).** Eight calendar days, no GV commits. Brian read the sorry-free codebase and produced the audit reproduced in `review.md`. The audit identified five "Large changes" plus per-file items. - **Apr 17 onwards (review-1 response).** Both Codex loops and PR #27 exist because of review-1. Discussed in §6. - **May 1 + (review-2, in progress).** Will populate §11 ("C"). ### 3.2 The project-rename moment (factual aside) The repo was rebranded `aristotle → Clawristotle` in commit `f2acaf4` on 2026-03-17 — *before* GV started. The laptop filesystem directory, however, stayed `~/Github/aristotle` for several more weeks: Telegram alerts and internal task paths through Apr 5 still wrote to `/private/tmp/claude-501/-Users-vasil-Github-aristotle/`. The decisive "Clawristotle is correct" signal came in user prompt `[Apr 04 17:37]`. Hyak used the new name from inception. Net effect: laptop Claude session jsonls split across two project directories (53 in `~/.claude/projects/-Users-vasil-Github-aristotle/`, 7 in `~/.claude/projects/-Users-vasil-Github-Clawristotle/`), which made the session-data extraction non-trivial. ### 3.3 A loss to flag Mar 27–30 Claude session jsonls were auto-deleted by Claude Code's default `cleanupPeriodDays = 30` setting, which fires on the 30-day boundary (today is May 1; everything before Apr 1 is gone). This means our chronological human-prompt record (`artifacts/human_prompts_gv.txt`) starts at 2026-04-01. The Mar 27–31 prompt narrative is partially recoverable from `~/.claude/history.jsonl` (only 7 GV prompts survived there) and fully recoverable from the 175 commits in those days (`artifacts/proving_phase_commits_gv.txt`). For future projects we recommend setting `cleanupPeriodDays` to a year or more — a setting we have now bumped to 365 on this machine. ![Claude Code session activity](artifacts/session_activity_gv.png) **Figure: Claude Code activity across the laptop and Hyak, by day and hour-of-day.** Top: assistant turns per UTC day, split by machine (laptop = blue, Hyak = orange). The laptop dominates the proving phase (Mar 27 – Apr 4); Hyak takes over from Apr 19 onward as the home of the Codex compress and refactor loops with their Claude evaluator calls. Bottom: hour-of-day heatmap. Of **31,529 total assistant turns across 270 sessions** (29 laptop sessions + 241 Hyak sessions), the proving phase contributed roughly a third — the rest is automated loop traffic. The sparse band Apr 5–16 corresponds to the first review-1 lull (no GV work in flight); the dense Apr 19 – Apr 28 stretch is the two Codex loops running near-continuously on Hyak. The Mar 27–31 portion of the laptop data was reconstructed from `~/.claude/history.jsonl` plus commit timestamps after the default 30-day session-jsonl cleanup deleted the originals. --- ## 4. Phase 1: Proving (Mar 27 – Apr 4) 359 commits over nine days, ending at 0 sorry's on commit `08c3529d` (Apr 04 17:18). The sorry curve climbs from 4 → 35 (Mar 27–28 decomposition burst), drops to 2 by Mar 30 23:08, regresses to 24 on Apr 01 12:35 when heartbeat overrides were stripped, then walks 24 → 16 → 2 → 1 → 0 over the next 76 hours. This phase is documented in detail in `artifacts/findings/proving_phase.md`. The summary follows; the four sub-sections below pull out the most report-worthy threads. ![Sorry count over time](artifacts/sorry_history_gv.png) **Figure: `sorry`-count in `Aristotle/GrothendieckVanishing/main/` across all 904 commits, with the proving phase highlighted.** Classic sawtooth. Key events: (1) **Mar 27 — start at 4 sorry's** in the inherited skeleton. (2) **Mar 28 — peak 35** as the four pillars are stated and decomposed into named sub-lemmas (the "decomposition burst"). (3) **Mar 28–30** — foundational-layer Aristotle wins drop the count quickly (constant-sheaf flasqueness, dim inequality, closed-immersion stalk facts; see §4.3). (4) **Apr 01 12:35 — regression 3 → 24** when `set_option maxHeartbeats` overrides were stripped wholesale to satisfy the user's "default heartbeats only" demand (§4.2 Act III). (5) **Apr 01 13:53 — 16 → 2** in a six-hour cleanup. (6) **Apr 04 17:18 — final 0** when the last sorry closed by reframing the filtered-colimit-of-injectives goal through flasque (§4.4). The post-Apr-4 portion of the curve is flat at 0 across the 549 review- response commits. ![Daily git churn](artifacts/git_churn_gv.png) **Figure: Lines added (green) / deleted (red) / net (line) per UTC day.** The plot has two distinct shapes. **Mar 27 – Apr 4 (proving):** mostly net-positive, with two structural-decomposition spikes (Mar 30 and Apr 01) showing simultaneously large adds and deletes — the signature of splitting a long monolithic proof into named sub-lemmas. **Apr 17 – May 1 (response):** dominated by deletions from the compress loop. Apr 27 alone is `+1,300 / −2,400 net −1,100`, the day phase-1 sheaf-reversion deleted the unbundled `_presheaf` wrappers (cycles 1–10 of the compress loop). The 8-day flat band Apr 8–15 is the review-1 reading window. Project totals: ~+18.5K added / −13.5K deleted / +5K net at HEAD. ### 4.1 Pillars in dependency order The four architectural pillars landed roughly on schedule, in dependency order: - **Reducible / dim-induction skeleton.** First sorry-free piece is the strict-inequality lemma `[Mar 27 20:11] feat: prove dim strict inequality (Aristotle sorry-free proof!)`. - **Flasque core.** `[Mar 28 00:49] prove constantSheaf_flasque_of_irreducible (sorry-free)`. Then `[Mar 28 13:01] prove isFlasque_of_injective (Aristotle 8f42abaa), sorry count 7→5`. - **Closed-immersion SES.** The single hardest afternoon. In sequence: `[Mar 28 14:24] PROVE epi_app_of_shortExact_flasque! 7→4`, `[Mar 28 15:00] prove stalk=0 outside closed set`, `[Mar 28 17:47] fix: prove closedIncl_unit_stalk_isIso via triangle identity chain`, `[Mar 28 19:09] prove: i_* preserves ShortExact for closed immersions`, culminating in `[Mar 28 22:20] prove: ReducibleVanishing sorry-free`. - **Filtered colimits / finite-generator reduction.** Dominates Apr 2–3. `[Apr 02 13:23] prove: ELIMINATE IsSheaf sorry`. The hardest sub-goal is `hsep` separation: `[Apr 03 17:06] prove: CLOSE hsep separation proof (was the hardest sub-goal)`. Closes with `[Apr 03 20:36] prove: isSheaf_presheaf_filtered_colimit FULLY PROVED`. ### 4.2 The heartbeat saga (three acts) Mathlib's `synthInstance` budget for `HasDerivedCategory` / `HasExt` collided with proofs mentioning `Ext` or `Sheaf.H`. The arc spans Mar 28 – Apr 1, then collapses in a single afternoon. **Act I (Mar 28–29) — ratchet down, then blow up.** After `[Mar 28 02:29] reduce maxHeartbeats 1600000 → 800000` and `[Mar 28 03:27] remove 6 of 9 maxHeartbeats overrides`, the lemma `pushforwardH1Vanishing` exploded. A flurry of escalations: `fix: increase synthInstance.maxHeartbeats` (Mar 29 11:32) → `bump … back to 800K` (15:12) → `bump … to 1600000` (18:53) → `fix: add missing heartbeat options` (Mar 30 01:12). **Act II (Mar 30–31) — structural fixes.** Peak: `[Mar 30 10:51] perf: decompose 12.8M heartbeat proof into sub-lemmas`. A clean staircase `3.2M → 1.6M → 800K → 400K` (Mar 30 11:45 / 11:55 / 12:15). A `set_option` chase on Mar 31 16:33–17:09 tried `inferInstanceAs`, `hasDerivedCategory`, tactic-mode `infer_instance`, and `synthInstance` budgets of 800K / 1.6M / 4M — none held; commit `[Mar 31 17:09] cache instances via inferInstanceAs, revert broken budget reductions` accepted defeat. **Act III (Apr 01) — clean victory.** Triggered by user prompt: > `[Apr 01 11:10]` *"yesterday you spent 10 hours waiting for the profile > proof tool to complete on line 195 and line 212. they never completed. > now you are doing the same thing."* Followed by `[Apr 01 12:11]` *"keep optimizing until the whole project compiles with default heartbeats"* — sent **six times in 50 minutes**. What worked was **instance caching plus sub-lemma extraction**: `[Apr 01 11:26] perf: eliminate all heartbeat overrides in SetupCore.lean` → `[Apr 01 12:28] … across entire project`. User check at 12:53: *"i don't trust it. compiles in 5 minutes? seems too quick."* User again at 12:57: *"ok, that's amazing! good job!"* The cost: stripping `set_option`s blew the sorry count from 3 to 24 (`[Apr 01 12:35]`); `[Apr 01 13:53] prove: restore all 14 regressed sorry's (16 → 2)` cleaned it up six hours later. CLAUDE.md still carries the rule born here: *"Never increase `maxHeartbeats` above 200000."* ### 4.3 Aristotle's role: foundational layer + a counterexample 94 proving-phase Aristotle submissions: **22 COMPLETE, 66 COMPLETE_WITH_ERRORS, 5 CANCELED, 1 FAILED.** This split reproduces the PR #15 reported numbers exactly, and is recoverable from the Harmonic API by date filtering alone: the daily counts Mar 27 – Apr 4 sum to exactly 94. ![Aristotle outcomes](artifacts/aristotle_outcomes_gv.png) **Figure: Outcomes of the 94 Aristotle submissions during the proving phase (Mar 27 – Apr 4), filtered from the 1,350-job account dump by date window.** Only **22 (23%) returned `COMPLETE`** — a clean proof. **66 (70%) returned `COMPLETE_WITH_ERRORS`** — partial work that still leaves sorry's. 5 were canceled by the user (typically when a different architectural route obsoleted the submission, e.g. the three flasque- implies-injective successor jobs after the Mar 28 counterexample), 1 hard-failed. The stacked daily breakdown shows the mass on Mar 27–28 (the foundational-layer days, when 6 of the 22 COMPLETEs landed in sorry-elimination commits) and a thinning tail Apr 2–4 where 12 jobs covering filtered-colimit and Gabriel-theorem machinery returned **uniformly with errors or were canceled**. None of those Apr submissions corresponds to a commit that closed a sorry — the filtered-colimit and Gabriel sorries were closed by Claude directly. The 23% clean-COMPLETE rate is much lower than the VML project's 50%; the conceptual gap between Hartshorne III.2.7's harder steps and Aristotle's tactic library is one explanation. Of the 22 COMPLETE jobs, **only 6 map cleanly to integration commits that cut the sorry count**: | Aristotle job | Lemma | Integration commit | |---|---|---| | `daccfe4` | `DimStrictInequality` | `feat: prove dim strict inequality (Aristotle sorry-free proof!)` (Mar 27) | | `meq_const_of_irreducible` | constant-sheaf section | `e32c0ac integrate meq_const_of_irreducible from Aristotle (sorry-free!)` (Mar 28) | | `8f42abaa` | `isFlasque_of_injective` | `6f98b5f prove isFlasque_of_injective (Aristotle 8f42abaa)` (Mar 28) | | `887860fb` (resumed) | `epi_unit_of_closedImmersion` | (Mar 28) | | `819ad352` (resumed) | stalk surjectivity at `x ∈ Z` | (Mar 28) | | `17b9bce9` | `toPlus_surjective` | `2a38cf8 feat: Aristotle proves toPlus_surjective (0 sorry!)` (Mar 28) | All six are on **Mar 27–28**, the foundational-layer days. After Mar 28, COMPLETE jobs are either confirmations of already-solved sub-lemmas or silent `resumed-…` re-uploads folded into long-running proofs. The Apr 2–4 wave (12 jobs covering filtered-colimit and Gabriel machinery) is *uniformly* `COMPLETE_WITH_ERRORS` or `CANCELED`. None of those Apr submissions corresponds to a sorry-elimination commit; the filtered-colimit and Gabriel sorries were closed by Claude directly. Aristotle's most consequential single output was a **negative** result: > `[Mar 28 02:34] CRITICAL: flasque → injective is FALSE (Aristotle > counterexample)` — commit `d0cdf92`. The user was actively pursuing `flasque → injective` as the architectural move; Aristotle returned a counterexample, and three downstream Aristotle jobs (`FlasqueComplete`, `PlusObjIsSheaf`, `FlasqueVanishing` via Baer) were CANCELED in the same week. The eventual route used the **converse** plus a direct LES vanishing argument. **Honest summary.** Aristotle was the path to sorry-free for the foundational layer (constant-sheaf flasqueness, dim inequality, the flasque/injective direction, closed-immersion stalk facts). It was **background noise** from Mar 29 onward. The hardest pieces — heartbeat purge, presheaf-colimit-is-sheaf, the final flasque-via-colimit insight — were all closed by Claude after Aristotle timed out or returned errors. **Aristotle was a tactic, not a strategy.** ![Aristotle turnaround](artifacts/aristotle_turnaround_gv.png) **Figure: Aristotle turnaround times for the 94 GV proving-phase submissions, colored by outcome.** Left: histogram of `created_at → last_updated_at` deltas. The COMPLETE bucket (green) clusters in the short bins; `COMPLETE_WITH_ERRORS` (orange) dominates the long tail. Right: median turnaround by outcome. Note the unusually long medians across all categories — several days for any non-CANCELED outcome. Two caveats temper this: (1) `last_updated_at` is suspected to be a dashboard-refresh artefact, not compute completion (see Appendix B question 3); (2) the user often left jobs running overnight or across weekends without polling. The qualitative pattern is robust: COMPLETE is fastest, COMPLETE_WITH_ERRORS slowest, and the long tails reflect Aristotle exhausting its time budget on hard goals before returning a partial proof. The Apr 2–4 wave of jobs covering filtered-colimit and Gabriel-theorem machinery sits in the long-tail bins. ### 4.4 Reframing: how the last sorry actually closed By Apr 03 the architecture had reduced everything except the dim-≥1 case via the four-pillar plan. The sticking point was, in the user's mental model: *"filtered colimits of injectives are injective."* This is true (Grothendieck's AB5 + AB-injectivity + locally-Noetherian schemes manoeuvre) but Mathlib doesn't have it in the form the proof needed, and direct attempts kept circling. The last sorry closed by **reframing the goal**. Instead of: > show that an arbitrary sheaf has vanishing higher cohomology by > exhibiting it as a filtered colimit of injectives whose colimit is > injective. route through *flasque*: > 1. Injective sheaves are flasque (`isFlasque_of_injective`, Aristotle, Mar 28). > 2. Filtered colimits of flasques are flasque > (`isFlasque_filtered_colimit`, FiniteGeneratorReduction). > 3. Flasques have vanishing higher cohomology (`FlasqueVanishing`). What looked like a major missing infrastructure piece dissolved. Commit: `[Apr 04 17:18] feat: eliminate last sorry via flasque vanishing (0 sorry's, 0 axioms)`. This is a pattern — *reframe to avoid the hard sublemma rather than prove it* — that would not have surfaced from any single tool. It came from the human in the loop noticing that we'd over-specified the problem. ### 4.5 Notable failures during the proving phase - **`flasque → injective` direction (Mar 28)** — counterexample by Aristotle; three follow-up Aristotle jobs CANCELED. - **`lean_profile_proof` on lines 195/212 of SetupCore (Mar 31)** — runs that "never completed", cited by user prompt Apr 01 11:10. - **`[Mar 29 11:55] revert: remove broken constantSheaf_kernel_vanishing + openHom_comp_unit_eq_zero`** — agent-introduced regression. - **`[Apr 02 23:32] docs: attempt IsFinitelyPresentable (fails), revert to clean sorry`** — abandoned approach. - **`[Apr 03 19:00] prove: close isSheaf_presheaf_filtered_colimit via PreservesFilteredColimits`** — reverted 36 minutes later: `fix: revert circular isSheaf_presheaf_filtered_colimit to sorry`. ### 4.6 End of phase 1: state "A" On Apr 04 17:18 the project was sorry-free, axiom-free (modulo `propext`, `Classical.choice`, `Quot.sound`), and **messy**. Files had names like `Auxiliary.lean`, `Setup.lean`, `SetupCore.lean`. Many lemmas were stated for `(F : TopCat.Sheaf …)` but proved by unfolding to the underlying presheaf. The cohomology API leaked Ext throughout. Several lemmas of the form "if A is zero then B is zero" should have been "A and B are isomorphic". This is what review-1 was about to find. --- ## 5. Review 1: Brian Nugent's audit (Apr 8 – Apr 15) The full review is in `review.md`. Brian read the Apr 7 sorry-free state of the codebase over eight calendar days and produced a structured audit: five "Large changes" applying across the project, plus per-file detailed items. ### 5.1 The five "Large changes" **LC1. File structure.** *"There are files with indescriptive names like Auxiliary.lean, Setup.lean and SetupCore.lean. There are also files that are incorrectly labeled like ClosedOpenDecomposition.lean. … Some files should be combined, there only needs to be one file on flasque sheaves. Other files should be split, FiniteGeneratorReduction.lean should be split. … After this refactor, documentation should be added at the top of every file explaining what is done in that file as well as what the main definitions/theorems are in that file."* **LC2. Sheaf-vs-presheaf statement style.** *"Many lemmas/theorems are stated for `(F : TopCat.Sheaf AddCommGrpCat.{u} X)` but then the theorem proves something for either `F.val` or `(sheafToPresheaf _ _).obj F`. … The correct way of doing this is to state the results for `{F : Presheaf AddCommGrpCat X} (hF : IsSheaf F)` instead."* That is, **prefer unbundled** for results whose proof body uses the underlying presheaf; keep bundled `(F : Sheaf)` only where every step actually uses sheaf structure. **LC3. ShortComplex generalisation.** *"There are results proven with the assumptions `{G : TopCat.Sheaf …} (ip : InjectivePresentation G)` that only refer to `ip.shortComplex` in the theorem statement. These should all be generalized to work for `{S : ShortComplex (TopCat.Sheaf …)}` instead. If the assumption that the middle term is injective is necessary for a particular theorem then it can be added as an additional assumption `[S.X₂.Injective]`."* That is, **prefer bundled** for short-exact-sequence arguments. **LC4. Sheaf cohomology API.** *"Provide API for sheaf cohomology. Currently, anytime it needs something proven about H, it proves it for Ext then uses the fact that H is defined in terms of Ext. This is messy, it is better to have a separate file proving all results it needs about H … so the definition of H never needs to be unfolded in terms of Ext (except in the one file providing the API)."* **LC5. "If A=0 then B=0" should be "A ≅ B".** *"sheafH_preserves_filtered_colimits and PushforwardHVanishing both prove "if A=0 then B=0" when they should really be proving that "A is isomorphic to B". In both cases, the proofs are essentially the same … and the more general result is significantly more useful."* LC2 and LC3 are not contradictory: they apply to different shapes of argument. LC2 is about a single sheaf appearing as an argument; LC3 is about a triple (F₁, F₂, F₃, h₁, h₂, h₃, plus a composition-zero hypothesis) packaged as the components of a short exact sequence. Brian wants the first kind unbundled and the second kind bundled. We will see in §6.4 that the response correctly addressed LC3 but partially inverted LC2. ### 5.2 Cross-cutting per-file themes Brian's per-file items include: many *"this is a special case of [Mathlib lemma X]"* (e.g. `CategoryTheory.Limits.IsZero.epi`, `TopCat.Sheaf.isTerminalOfEqEmpty`, `Equiv.subsingleton.symm`, `Int.subgroup_cyclic`, `CategoryTheory.isIso_comp_right_iff`); many *"doesn't need to be its own theorem/lemma/definition/abbrev"*; many *"prove this in more generality, then derive the special case"* (most prominent: pushforward along closed immersion is exact, dim-bot characterisation by emptiness, stalk functor is exact). Three notable per-file judgments: - *"The proof of `TopCat.closedIncl_counit_isIso` is horrendous. It should be able to notice proofs like these and see that they are unacceptable and need to be simplified."* (ClosedImmersion.lean) - *"This is a zorn's lemma argument that is pretty technical and ugly. It is done much better in my PR, it should just copy it."* (FlasqueVanishing.lean lines 220–356) - *"Many of the theorems in this file are private but there's no reason for this. Many of them are even useful theorems so they definitely should not be private."* (SetupCore.lean) ### 5.3 What review-1 was telling us Restated as a property list, review-1's diagnosis of state "A" was: 1. **Naming and structural debt.** Dump-files (`Auxiliary`, `Setup`, `SetupCore`), miscategorised files (`ClosedOpenDecomposition`), single-lemma files (`Setup.lean`, `ReducibleVanishing.lean`). 2. **Statement-style debt.** Bundled `(F : Sheaf)` everywhere, even where the proof body is purely presheaf-side. 3. **Missing API.** Sheaf cohomology, topological Krull dimension, pushforward exactness, stalk functor exactness, dim-bot characterisation. 4. **Local hyperspecificity.** Many lemmas that should be inferred `inferInstance` calls or one-liners; many `abbrev`s that should not exist; several whole-file constructions that should be `def + API` instead of an ad-hoc abbrev. 5. **Style issues that propagate.** `let`-in-statement, `ConcreteCategory.hom` noise, deeply-nested AI-generated definitions. 6. **Algorithmic non-strengthening.** `if A=0 then B=0` instead of `A ≅ B`. (LC5.) Five buckets, all of which the response would need to address. --- ## 6. Phase 2: Review-1 response (Apr 17 – May 1) The response ran in three sub-phases: | Sub-phase | Dates | Driver | Cycles / commits | |---|---|---|---| | Codex refactor loop | Apr 19 – Apr 27 | `scripts/codex_refactor_loop.py` | 478 cycles, ~166 productive | | Codex compress loop | Apr 27 – Apr 28 | `scripts/codex_compress_loop.py` | 86 cycles, normalized LOC −2,929 | | Mathlib-style PR #27 | Apr 29 – May 1 | Claude (interactive + agents) | docstrings, lint, naming | We document each sub-phase honestly; both Codex loops have substantial failure-mode content that goes into §7. Then §6.4 maps the response back onto the five "Large changes." ### 6.1 The Codex refactor loop (Apr 19–27, 478 cycles) Architecture: `scripts/codex_refactor_loop.py` runs four stages on `wip/grothendieck-vanishing`. **Planner**: Codex (`gpt-5.5`, reasoning effort `xhigh`, `--ephemeral`) picks one `- [ ]` line from `.refactor-state/review_tasks.md` and writes `codex_strategy.md`; the chosen line is promoted to `- [>]` (WIP). If no `- [ ]` lines exist, an **auditor** Claude call inspects `review.md` and either appends fresh `- [ ]` lines or returns `loop_done: true`. **Worker**: Codex (3 hr timeout) edits Lean files. **Gate**: sorry count must not increase; changed `Aristotle/*.lean` files must compile under `lake env lean`; on add/delete/rename, `lake build` runs. On failure a **gate-repair** Codex call retries; second failure triggers `git checkout -- .` and a `-2` record. **Evaluator**: Claude (`claude-opus-4-7`, `bypassPermissions`, `--add-dir `, `--json-schema EVALUATOR_SCHEMA`) reads diff + last 8 cycles and emits `progress_score ∈ [-2, +2]`, `task_addressed`, `task_complete`, `stuck_on`, `strategy_recommendation`, `completed_task_lines`. **Quality breakdown.** This is where the report has to be honest. n = 478 cycles, cycles 1–520 with 42 missing in numbering. Score distribution: schema is `[-2, +2]` but only `{0, 1, 2}` ever fired: **`{0: 220, 1: 154, 2: 104}`** — 46% no-progress, 32% modest, 22% strong. **Zero records carry a `gate_failure` block**, **zero have `gate_repair_attempted: true`**. Worker exit codes were `{0: 417, None: 61}` — Codex never returned non-zero, and the 61 nulls are auditor-only cycles. The defensive gate-repair branch (`codex_refactor_loop.py:944–965`) was therefore exercised **zero times** across 478 cycles. Either `--ephemeral` always exits 0 even on partial work, or the 3-hour worker timeout plus pre-promotion to `- [>]` let Codex always converge to *something*. **The most defensive machinery in the design was dead code in production.** `task_complete: true` rate is **35% (166/478)**. **Recurring `stuck_on` themes.** Of 339 non-empty `stuck_on` fields, 139 are "umbrella scope too big" (the cycle completed one declaration-anchored slice of a broader umbrella task and couldn't tick the umbrella `- [ ]` line); 123 are "checklist exhaustion" (no unchecked entries; loop is idling); 6 are "no blocker"; 3 are lake-build failures; 2 are file-size targets missed; 65 other. Naming or namespace collisions never appeared as a Codex-side blocker. `principle_violations` fired 6 times across 478 cycles — mostly off-strategy edits (cycle 12: `.gitignore`; cycle 391: bookkeeping file) or layering (cycle 102: `WithBot` helper in a topology file). **Two pathological runs** dominate the failure narrative: - **Cycles 268–331: 64 consecutive checklist-exhaustion no-ops.** Same sentence in every record, cycle number swapped: *"Cycle 270 correctly preserved the intended no-op state by re-verifying that `.refactor-state/review_tasks.md` has no unchecked checklist entries."* - **Cycles 463–498: 36 consecutive returns of the literal default fallback string** *"Pick the next still-unchecked checklist item and keep scope tighter."* Treating score-0 no-op as acceptable rather than as a backoff signal let the loop spin. The auditor's `IDLE_COOLDOWN = 1 hr` did not trip because the auditor was returning empty `added_lines` — appearing to unblock the loop without actually populating it. **Real wins.** Almost all clean wins were **micro-mechanical refactors with a strict `COMPLETE_IF` gate** — rename a public theorem, demote to `private`, delete a sheaf-level wrapper whose only consumer is internal, extract a long proof body. Cycles 410–462 are textbook: **cycle 410** deleted 24 sheaf wrappers (`CohomologyAPI.lean` 2,125 → 1,644 lines); cycles 422–445 each deleted one wrapper or demoted one `extClass`-named lemma, gated by `rg -n '' Aristotle/`. Other landmarks: **cycle 22** (first Ext-leak step), **cycle 80** (ad-hoc stalk-lifting replaced by a reusable `CohomologyAPI` lemma), **cycle 396** (`FlasqueCohomology.lean` merged into `FlasqueVanishing.lean`), **cycle 500** (Zorn machinery extracted into 4 helpers, 226 → 115 lines). **Cycle rate.** Median inter-cycle gap 920 s (~15 min). Per UTC day: 04-19 (23), 04-20 (64), 04-21 (66), 04-22 (63), **04-23 (137)**, 04-24 (23), 04-25 (55), 04-26 (23), 04-27 (24). The 04-23 spike is artificial — each of those 137 cycles was a ~90-second checklist-exhaustion no-op. The rate did not slow sustainedly; what slowed was the average value per cycle. ![Cycle effectiveness](artifacts/cycle_effectiveness_gv.png) **Figure: Per-cycle effectiveness of the two Codex loops.** Top: refactor-loop `progress_score` distribution across 478 cycles. The schema is `[-2, +2]` but only `{0, 1, 2}` ever fired — **220 zeros (46% no-progress)**, 154 ones (32% modest), 104 twos (22% strong). The negative half of the schema was never exercised. Middle: refactor-loop `task_complete: true` rate by cycle index, with the two pathological runs visible — cycles 268–331 (64 consecutive checklist-exhaustion no-ops) and cycles 463–498 (36 consecutive default-fallback strategy returns). Bottom: compress-loop per-cycle LOC delta across 86 cycles. All deltas are negative by construction (the gate rejects equal-LOC commits); range is **[−501, −1], mean −33.9, median −14**. The top 5 single-cycle wins (cycles 1, 2, 3, 7, 10) account for **49% of total LOC reduction** — phase-1 sheaf-reversion. Cycles 26–86 average −12/cycle (long-proof golf), with the floor visible as the mass of small-magnitude bars at the right of the strip. (Detail: `artifacts/findings/refactor_loop.md`.) ### 6.2 The Codex compress loop (Apr 27–28, 86 cycles) Same three-process architecture (worker / evaluator / auditor on `wip/grothendieck-vanishing`), but the gate function differs in two material ways: - **LOC-decrease gate.** `codex_compress_loop.py:714–720` adds a hard mechanical gate: `loc_after >= loc_before` is treated like a sorry — cycle is rejected and gate-repair fires. Refactor allowed equal-LOC commits. - **Raw-LOC stop.** `RAW_LOC_STOP_TARGET = 5000` (line 63). When `wc -l Aristotle/GrothendieckVanishing/main/*.lean < 5000` the loop self-terminates. The 5,000 number was chosen by the user, somewhat arbitrarily. Anti-gaming evaluator rules in `principles.md` codify "comments and blanks don't count," "tactic-stacking via `;` / `<;>` doesn't count," and explicitly reward sheaf-bundling plus unused-decl deletion. Markdown reports preserved (3,711 lines across 86 files; the refactor loop discarded these). **Per-cycle delta distribution.** 86 cycles, all negative. **Mean = −33.9, median = −14, std = 74.3, range = [−501, −1].** Top 5 single-cycle wins: | Rank | Cycle | Δ | What | |---|---|---|---| | 1 | 2 | −501 | sheaf-reverted successor-map / H¹ cokernel-iso / H¹ vanishing clusters in `CohomologyAPI.lean` (1609 → 1193 raw) | | 2 | 1 | −350 | bundled the successor connecting iso onto `ShortComplex (Sheaf …)` | | 3 | 10 | −337 | sheafified the entire filtered-colimit boundary, deleted ~10 `_presheaf` shims | | 4 | 7 | −148 | sheaf-reverted CohomologyAPI dimension-shift cluster | | 5 | 3 | −103 | sheaf-reverted FlasqueVanishing's `sheafShortComplexOfPresheaf` bridge plus 5 `_presheaf` wrappers | **Top 5 alone account for −1,439 LOC = 49%** of the total. The bottom decile (cycles 30, 31, 73, 75, 77, 78) hovered at −1 to −4 LOC. **Three discernible phases:** - **Phase 1 — Sheaf-reversion (cycles 1–10).** Every cycle deletes a `_presheaf` wrapper and bundles the `(F : Presheaf, h : F.IsSheaf)` triple to a `Sheaf` argument. Mean Δ ≈ −180. Phase-1 gate (`raw < 6500`) met at cycle 10. - **Phase 2 — Unused-decl deletion (cycles 11–25).** Driven by `scripts/unused_decls.lean`. Cycles 17–18 wipe eleven `TopCat.Sheaf.{family,finset}*` wrappers from `GeneratedSubsheaf.lean`. Cycles 22–26 are a Codex-CLI 401 outage — five evaluator failures returned `authentication_error`, but the worker still produced real progress (−12, −19, −5, −22, −10) because the LOC gate is purely mechanical. - **Phase 3 — Long-proof golf (cycles 26–86).** `_presheaf` reversions exhausted (cycle 31: *"last `_presheaf` declaration in CohomologyAPI is gone"*). Strategy pivots to highest body-to-signature outliers via `compress_audit.py`. Mean Δ drops to ~−12. **Per-file effectiveness** (cumulative Δ attributable to summary mentions): | File | Δ | Cycles | |---|---|---| | `CohomologyAPI.lean` | −1,228 | 10 | | `PresheafFilteredColimit.lean / Core` | −910 | 13 | | `IrreducibleStep.lean` | −166 | 11 (mostly `exists_section_generating_stalks`) | | `FlasqueVanishing.lean` | −103 | 1 (cycle 3, then idle) | | `PresheafFilteredColimitGeneral.lean` | −10 | 1 | | `ConstantSheafFlasque.lean` | −10 | 1 (final cycle) | `exists_section_generating_stalks` (11 visits) and `sheafH_filtered_colimit_comparison_one_iso_hom` (10 visits) are the two classic over-mining cases — late visits yielded sub-10 LOC each, but the loop kept returning to them. **Outcome trajectory** (frontloaded, then linear-decay): | Cycle | Norm LOC | Raw LOC | Notes | |---|---|---|---| | 0 | 7,016 | 7,763 | start | | 10 | 5,295 | 6,275 | end of sheaf-reversion | | 20 | 5,027 | 5,939 | mid Phase-2 | | 42 | 4,606 | 5,496 | first below 5,500 raw | | 60 | 4,350 | 5,251 | mid golf phase | | 86 | 4,087 | **4,998** | crossed 5,000 raw → loop terminates | Cycles 1–10 averaged **−172/cycle**; 11–60 averaged **−24**; 61–86 averaged **−10**. The 4,500-raw level was never reached. The 5,000-raw target was crossed on the **last possible cycle** (cycle 86 trimmed `toPlus_surjective_of_firstPlus` for −10 to land at 4,998), and that's not a near-miss — the loop's stop condition is exactly raw < 5,000, so it simply self-terminated as designed. The interesting fact is the *trajectory*, not the landing. ![LOC history](artifacts/loc_history_gv.png) **Figure: Lean lines of code in `Aristotle/GrothendieckVanishing/main/` across the project.** Two curves: raw LOC across all 904 commits (top panel) and normalized LOC across the 86 compress cycles (bottom panel). **Phase 1 (Mar 27 – Apr 4):** the proving phase grows the codebase from ~1,000 to ~7,000 raw LOC as the four pillars are stated and proved; the end-of-phase plateau is the messy state-A. **Phase 2 (Apr 5–16):** flat, including the 8-day review-1 reading window. **Phase 3 (Apr 17–27):** the refactor loop hovers around 6,500–7,000 raw LOC — its job is restructuring and renaming, not shrinking, so net LOC moves little. Real extracts (cycles 410, 422–445, 500) are visible as small downward steps. **Phase 4 (Apr 27–28):** the compress loop's signature is the sharp ~30% drop from 7,016 → 4,087 normalized LOC over 86 cycles. The trajectory is frontloaded (cycles 1–10 drop from 7,016 to 5,295 — the sheaf-reversion phase, mean −172/cycle) and then linear-decay (cycles 11–86 each shave single-to-low-double-digit LOC, mean ~−16). The 5,000 raw target is crossed on the very last cycle (4,998). (Detail: `artifacts/findings/compress_loop.md`.) ### 6.3 Mathlib-style PR #27 (Apr 29 – May 1) After the loops finished, a final cleanup pass: docstrings on every file, namespace fixes, lint clean-up, lemma renames in line with mathlib4 conventions, removal of trivial private declarations, removal of stale comments. Captured in commit `6bce95f` ("mathlib-style refactor: docstrings, naming, lint-clean") and subsequent. Driver: Claude Code interactive + targeted agent calls (no loop). 15 files, ~5,061 raw LOC at HEAD. This is also when the report scaffolding (this document, the `artifacts/findings/` directory, the structured-data extracts) was built. ### 6.4 Did the response address Brian's review-1? Maps the response onto the five "Large changes". This is the A → B delta evaluation. **LC1 — file structure & names.** *Partially addressed.* The mathlib PR added top-of-file docstrings to most files. The compress loop merged `FlasqueCohomology.lean` into `FlasqueVanishing.lean` (cycle 396). `Auxiliary.lean`, `Setup.lean`, `SetupCore.lean`, `ClosedOpenDecomposition.lean`, and `ReducibleVanishing.lean` were *not* renamed; current `Aristotle/GrothendieckVanishing/main/` still contains some of these indescriptive names. `FiniteGeneratorReduction.lean` was *not* split. **Score: ~40% addressed.** Review-2 will likely re-flag this. **LC2 — sheaf-vs-presheaf statement style.** *Inverted.* This is the clearest case where the response went the *opposite* direction from review-1. Brian wanted single-sheaf arguments restated as `{F : Presheaf} (hF : IsSheaf F)`. The compress loop's "sheaf-reversion" (phase 1, cycles 1–10) deleted the `_presheaf`-suffixed unbundled versions and kept only the `(F : Sheaf)` form. Mean Δ on those cycles was the largest of the project (−180/cycle). The mechanical justification was sound — `_presheaf` wrappers were redundant with their bundled counterparts — but the resulting state actively contradicts LC2. **Score: net regression on LC2.** Review-2 will almost certainly re-raise this. **LC3 — ShortComplex generalisation.** *Addressed.* The compress loop's signature pattern (cycles 1, 2, 3, 6, 7, 8, 9, 10, 27, 29, 85) bundles the six-argument `(F₁ F₂ F₃ : Presheaf, h₁ h₂ h₃ : IsSheaf, f ≫ g = 0)` triple into `(S : ShortComplex (TopCat.Sheaf …))`. This is exactly what Brian asked for. **Score: addressed.** **LC4 — sheaf cohomology API.** *Partially addressed.* `CohomologyAPI.lean` exists at HEAD and is the largest single file. Cycle 410 of the refactor loop deleted 24 wrappers; cycles 422–445 demoted or removed `Ext`-named declarations. However, `artifacts/review_outstanding_2026-04-24.md` (an audit done mid-response, before the compress loop) still flagged "Ext leaks outside `CohomologyAPI.lean`" with several specific sites in `PresheafFilteredColimit.lean`, `ClosedImmersionCohomology.lean`, and `FlasqueVanishing.lean`. The compress loop addressed some of these but probably not all. **Score: partially addressed.** **LC5 — `if A=0 then B=0` ⟶ `A ≅ B`.** *Unaddressed.* The two specific lemmas Brian named (`sheafH_preserves_filtered_colimits`, `PushforwardHVanishing`) still exist in the implication form at HEAD. The compress loop did not touch them; the refactor loop's evaluator schema does not include "strengthen vanishing to iso" as a primitive. **Score: not addressed.** **Overall A → B delta (LC summary):** | Item | Status | Notes | |---|---|---| | LC1 file structure | Partial | Docstrings yes, renames mostly no | | LC2 sheaf-vs-presheaf | **Inverted** | Compress went the wrong way | | LC3 ShortComplex | Done | Compress addressed cleanly | | LC4 cohomology API | Partial | API created, leaks remain | | LC5 iso-not-implication | **Not done** | Two specific lemmas untouched | So the literal answer to "did the review-1 response close review-1's issues?" is *no*: 1 of 5 large changes addressed cleanly, 2 partially, 1 inverted, 1 untouched. Many per-file items were addressed (particularly the simple "doesn't need to be its own theorem" items that the refactor loop's `COMPLETE_IF` gates handled well), and the mathlib PR landed real readability gains. But the deepest structural recommendations were not. **This is the report's most important quantitative claim: A → B is real but partial.** The numerical headline (−42% normalized LOC) overstates the qualitative change because the LOC-reduction work and the LC-fix work were only loosely correlated. --- ## 7. Failure mode catalogue The report's centerpiece. Each entry: what failed, evidence, and the generalisable lesson. ### 7.1 Override juggling vs decomposition **What failed.** When Lean's `synthInstance` budget was exceeded, the agent's first instinct was to bump the budget. Mar 28 – Mar 31 saw twenty-plus `set_option maxHeartbeats N` commits oscillating between 200K and 12.8M, none of which durably fixed the underlying problem. **Evidence.** `[Mar 30 10:51] perf: decompose 12.8M heartbeat proof into sub-lemmas` is the breakthrough; everything before it is override juggling. Acts I and II of §4.2. **Lesson.** *In any system with a budget knob, decomposition is durable; override juggling is fragile.* Generalises to API rate limits, memory budgets, time budgets — the right move is almost always to break the work into smaller pieces, not to ask for more budget. ### 7.2 "Blocked" / "genuine mathlib gap" drift **What failed.** Long autonomous loops drift toward declaring tasks "blocked" or pinning failures on infrastructure ("genuine mathlib gap") unless the prompt explicitly forbids it. **Evidence.** User escalations: > `[Apr 02 09:36]` *"what do you mean 'not possible'??? if there is no > infrastructure, build what you need!"* > > `[Apr 02 23:44]` *"you keep saying 'genuine mathlib gap'. YOU ARE NOT > ALLOWED TO SAY THIS. THIS IS LAZY AND IRRESPONSIBLE."* > > `[Apr 07 12:04]` *"You are NOT allowed to say that you are blocked. > You MUST close the sorrys yourself!"* The slash-command suite all carries the rule *"A no-op cycle is never acceptable!"* **Lesson.** *Loops drift unless the prompt forbids "blocked" explicitly.* The drift is fast and consistent enough that it should be treated as a default agent failure mode, not as occasional noise. ### 7.3 Loop dead branches **What failed.** The Codex refactor loop's gate-repair branch (`codex_refactor_loop.py:944–965`) was never exercised. 478 cycles, 0 failures. The most defensive machinery in the design ran zero times. **Evidence.** `gate_repair_attempted: true` count = 0; `worker_exit_code != 0` count = 0 (61 nulls are auditor-only cycles). **Lesson.** *A loop's defensive branches need their own test invocations or they're dead code at the moment they're needed.* If `--ephemeral` always exits 0, the gate-repair logic relies on a signal that never fires; better designs gate on diff-presence or compile-status directly. ### 7.4 Loop drift to no-op **What failed.** Treating a `score: 0, task_complete: false` checklist-exhaustion outcome as acceptable (rather than as a backoff signal) lets the loop spin productively-looking but accomplishing nothing. **Evidence.** Cycles 268–331: **64 consecutive checklist-exhaustion no-ops** on the refactor loop. Cycles 463–498: 36 consecutive identical default-fallback strategy-recommendations. The auditor's `IDLE_COOLDOWN = 1 hr` did not trip because the auditor was returning empty `added_lines` — appearing to unblock the loop without actually populating it. **Lesson.** *A productive-looking score-0 cycle is a worse signal than a clear failure.* The loop's exit conditions need to recognise "I was asked to do nothing, I did nothing" as a stop condition, not a success. ### 7.5 Local-tactic golf hits a wall **What failed.** Two days of `/golf` (interactive Claude proof-body shortening) moved single-digit LOC. The compress loop's late-cycle phase (cycles 61–86) averaged −10/cycle for the same reason. **Evidence.** User prompt that broke through: > `[Apr 05 19:59]` *"You can do much much better… Local changes can > only get you so far. … if two objects are isomorphic… you should > prove that they are isomorphic, not just the vanishing implication."* **Lesson.** *Compression hits a wall at the boundary of theorem statements; only API-surface refactoring (LC2, LC3, LC4, LC5) unlocks the next order of magnitude.* This is also the root cause of LC5 going unaddressed — the LC-fix work and the LOC-reduction work are different levels of the codebase. ### 7.6 Profile-proof black holes **What failed.** `lean_profile_proof` runs that "never completed" on two specific lines (195, 212) of `SetupCore.lean`, eating ~10 hours of agent wall-clock with no progress. **Evidence.** User prompt `[Apr 01 11:10]`. No commit references a `lean_profile_proof` finding from those days; the entire branch was abandoned. **Lesson.** *A profiling tool that doesn't return is a signal that the proof itself shouldn't exist in its current form.* When a tool times out hard, escalate to decomposition rather than retrying the same inputs. ### 7.7 Aristotle: tactic, not strategy **What failed.** Aristotle was effective on foundational, well-scoped lemmas (Mar 27–28: 6 of 22 COMPLETEs landed). It failed uniformly on conceptual hard steps (Apr 2–4: 12 jobs, all `COMPLETE_WITH_ERRORS` or `CANCELED`). The Apr 27 `aristotle-loop` infra burst (7 whole-project "golf" submissions) produced **zero** landed commits — the parallel `codex_compress_loop` was the de-facto golf winner. **Evidence.** Outcome split 22/66/5/1 reproducible from `projects.json`. Commit-mapping: 6 of 22 COMPLETEs found in sorry-elimination commits, all on Mar 27–28. Apr 27's actual landed commits are all `compress: codex cycle N`, none are `aristotle-loop:`-tagged content commits. **Most consequential output was negative**: the flasque → injective counterexample (`d0cdf92`, Mar 28). Aristotle doubled as a falsification engine, redirecting the user away from a wrong target. **Lesson.** *External theorem provers are tactics, not strategies.* They work on bounded, named lemmas in a domain they have machinery for. Whole-project "golf this" submissions are the wrong shape of problem. And: provers can be valuable as falsifiers, not just as solvers. ### 7.8 Sheaf-reversion went opposite to review's recommendation **What failed.** The compress loop's largest single class of edits (phase 1, cycles 1–10) deleted the unbundled `(F : Presheaf, h : F.IsSheaf)` forms and kept only the bundled `(F : Sheaf)` versions. Brian's LC2 asked for the opposite. The compress loop's optimisation target (LOC) and the review's quality target (idiomaticity) were not aligned. **Evidence.** `[cycle 1] Renamed sheafH_succ_iso_of_subsingleton_middle_presheaf → sheafH_succ_iso_of_subsingleton_middle. Deleted same-cluster wrappers sheafH_succ_map_presheaf_eq_succ_iso_hom, sheafH_exists_preimage_of_subsingleton_middle_presheaf.` The unbundled forms are *gone*, not deferred. **Lesson.** *An optimisation loop can do real work that actively violates the human reviewer's intent, if the loop's gate doesn't encode the reviewer's preferences.* The compress loop's gate was "LOC must decrease" + "no new sorries"; nothing in the gate said "prefer the unbundled form." So the loop optimised one and ignored the other. This is the strongest argument in the report for human-in-the-loop review *after* every major automated phase. --- ## 8. What worked (durable patterns) Symmetric to §7. The patterns we'd carry forward to a future project. ### 8.1 Decomposition into named sub-lemmas Almost every proving-phase breakthrough is preceded by `refactor: decompose …`: - `[Mar 28 14:31] decompose epi_unit_of_closedImmersion into stalk sub-goals` - `[Mar 29 20:57] decompose IrreduciblePosVanishing into 3 sorry sub-lemmas` - `[Apr 03 16:02] decompose isSheaf sorry into separation+existence` The slash-command `/prove` codifies this: *"decompose 3-5 sub-lemmas, prove at least one."* Same pattern killed the heartbeat saga (§7.1) and unstuck the late-Apr 3 final push. ### 8.2 Cycle skills with hard "make progress" gates `/babysit`, `/prove`, `/golf`, `/simplify`, `/critique` all carry *"A no-op cycle is never acceptable!"* The user enforced this verbally when the agent softened (§7.2). The practical effect was that even when the agent didn't know what to do, it at least decomposed, renamed, or reformulated something — keeping the codebase moving. ### 8.3 `wip/` + protected target + auto-merge 904 wip-branch commits over five weeks, zero direct-push incidents on `grothendieck-vanishing`. The PR-with-auto-merge pattern is trivial to set up and decisively reduces the blast radius of agent errors. ### 8.4 Reframing instead of grinding The last sorry closed not by proving the dreaded "filtered colimits of injectives are injective" but by routing through flasque (§4.4). **This was a human noticing the proof was over-specified.** No tool volunteered the reframing. ### 8.5 API-surface refactoring (when the loop's gate cooperated) The compress loop's phase 1 (cycles 1–10) bundled six-argument presheaf-triple signatures into `ShortComplex (Sheaf)` via the ShortComplex pattern. Mean −180 LOC/cycle, the project's largest sustained refactor. **Worked because the gate (LOC reduction) and the goal (idiomaticity per LC3) happened to align.** When they don't (LC2), this pattern fails (§7.8). ### 8.6 Per-cycle JSONL state for resumable runs Both Codex loops persisted full per-cycle records to `.compress-state/codex_history.jsonl` (86 rows) and `.refactor-state/codex_history.jsonl` (478 rows). Combined with markdown reports for the compress loop, this made forensics possible weeks later — every claim in §6 above is grounded in those records. ### 8.7 Aristotle as falsifier Two of Aristotle's most consequential outputs were *negative* results (flasque → injective FALSE; Gabriel's-theorem-citation incorrect). External provers are valuable not just as oracles but as counterexample-generators when the human is uncertain about a target. --- ## 9. Quantitative summary ### 9.1 Headline numbers (Source of truth: `artifacts/report-data/metrics.json`.) | Metric | Value | |---|---| | Lean files at HEAD | 15 | | Raw LOC | 5,061 | | Normalized LOC | 4,087 | | Sorry / non-default axioms | 0 / 0 | | Active days | 27 of 35 | | Git commits | 904 since 2026-03-27 | | Claude Code turns | 31,529 | | Claude Code sessions | 270 (laptop 29, Hyak 241) | | Claude Code output tokens | 13.9 M | | Claude Code cache reads | 6.76 B | | Claude Code cache creation | 93 M | | Claude Code fresh input | 188 K | | Codex GV interactive sessions | 5 | | Codex captured tokens | ~26.4 M (input mostly cached) | | Codex ephemeral cycles | 564 (compress 86 + refactor 478) — **token cost not captured** | | Aristotle proving submissions | 94 | | Aristotle outcomes | 22 / 66 / 5 / 1 | | Aristotle "wins landed" | 6 of 22 COMPLETE | ### 9.2 Tokens and cost **Claude is the dominant cost** by a large margin. The 6.76 billion cache-read tokens and 93 million cache-creation tokens dwarf everything else. Output is 13.9 M, fresh input 188 K. Pricing varies by model; we use approximate flat rates here as illustration: - Output (Opus tier): ~$75/M → ~$1,040 - Cache creation: ~$18.75/M → ~$1,750 - Cache read: ~$1.50/M → ~$10,140 - Fresh input: ~$15/M → ~$3 Total Claude estimate: **~$13K**, dominated by cache reads. This is sensitive to pricing; the actual bill shape requires log-walking. We flag this as approximate; the dominance pattern (cache-read heavy) is robust. **Codex 5 captured sessions: ~26.4 M tokens** (mostly cached input; output ~129 K). Pricing for `gpt-5.5` reasoning-output combines a base rate plus reasoning-token surcharge. Approximate: low single digits of dollars. **The 564 ephemeral cycles**: token cost not in any local file. To estimate, we'd need to either re-run a representative cycle with logging on, or query OpenAI's billing API for the date range. We have done neither. **Aristotle**: pricing model is per-job; total bill is upstream and we do not have it on hand. A serious cost analysis would require the missing logs above; the qualitative claim is that **the marginal cost per landed lemma was dominated by the proving phase, not the loops.** The proving phase ran ~30K Claude turns to close 35 sorries; the loops ran 564 ephemeral cycles to remove 2,929 normalized LOC. Per-LOC costs of the loops are much lower than per-sorry costs of the proving phase, but absolute spend is still concentrated in proving. ![Token usage](artifacts/token_usage_gv.png) **Figure: Cumulative Claude Code token consumption across the project.** Top: cumulative output tokens (red, ~13.9 M) and cache-read tokens (blue, ~6.76 B). The cache-read curve is two orders of magnitude larger than output — the read-heavy structure of formal-verification work, where each Claude turn loads thousands of lines of Lean and Mathlib context to emit a few dozen lines of tactic. The fresh-input total (188 K) is negligible against cache reads, meaning **prompt caching saved roughly $30K** at Opus pricing. Bottom: per-day cumulative output tokens, with the proving phase (Mar 27 – Apr 4) and the response phase (Apr 19+) visible as the two main slopes. The Apr 27 inflection is the start of the compress loop with its high-frequency Claude evaluator calls. The Codex token total (~26.4 M from 5 persisted sessions) is omitted from the figure because the 564 ephemeral cycles' tokens were never written to local disk and remain unrecoverable from the filesystem. ![Tool usage](artifacts/tool_use_gv.png) **Figure: Claude Code tool invocations across the project.** Left: top-20 tools by total call count. The core editing loop dominates — **Bash (7,816), Read (3,118), Edit (2,454), Grep, Write** — followed by the `lean-lsp-mcp` family (2,256 combined: `lean_diagnostic_messages`, `lean_goal`, `lean_multi_attempt`, `lean_local_search`, `lean_leansearch`, `lean_loogle`, `lean_state_search`, `lean_run_code`, `lean_hammer_premise`). Aristotle MCP submissions show as a smaller band. Right: daily breakdown by tool category. The proving phase (Mar 27 – Apr 4) shows the highest lean-lsp density per day (the agent uses LSP feedback intensively while writing tactics); the response phase (Apr 19+) is dominated by Bash and Edit (file-level refactoring, `rg`-based usage searches, file moves). Lean LSP calls drop sharply after Apr 4 because the Codex worker on Hyak is the one editing Lean files in the loops, not Claude. ### 9.3 Source data In `artifacts/report-data/`: - `metrics.json` — headline numbers in machine-readable form. - `data_manifest.json` — provenance of every input file. - `raw/aristotle/projects.json` — full 1,350-job API dump. - `extracted/claude_turns_*.jsonl` — 31,529 rows. - `extracted/codex_turns_*.jsonl` — 288 rows from 5 sessions. - `extracted/cycle_history_compress.jsonl` (86), `…_refactor.jsonl` (478). - `extracted/commits.jsonl` — 904 rows. - `extracted/per_commit_loc_sorry.jsonl` — sorry + LOC at every commit. - `extracted/tool_use.jsonl` — 19,393 rows. - `README.md` — schema and caveats. Plus the chronological-record companion artifacts: - `artifacts/human_prompts_gv.txt` — 859 cleaned prompts (Apr 1 onward). - `artifacts/proving_phase_commits_gv.txt` — 410 commits Mar 27 – Apr 4 (covers the auto-deleted Mar 27–31 prompts gap). --- ## 10. The state at HEAD ("B") For comparison with "A" (§4.6): on 2026-05-01 the project is sorry-free, axiom-free, with 15 files in `Aristotle/GrothendieckVanishing/main/`, 5,061 raw / 4,087 normalized LOC, top-of-file docstrings on every file, the `ShortComplex (Sheaf)` pattern in place for short-exact-sequence arguments, a separate `CohomologyAPI.lean` providing wrapper API for `Sheaf.H` (with some Ext leaks remaining), and the two LC5 lemmas (`sheafH_preserves_filtered_colimits`, `PushforwardHVanishing`) still in implication form. ![LOC breakdown by file](artifacts/loc_breakdown_gv.png) **Figure: Raw LOC per file at HEAD (2026-05-01), all 15 files in `Aristotle/GrothendieckVanishing/main/`, sorted by size.** The largest file is `CohomologyAPI.lean` (~870 LOC) — the wrapper API for `Sheaf.H` that LC4 asked for. Next come the four-pillar workhorses `PresheafFilteredColimit.lean`, `FlasqueVanishing.lean`, `IrreducibleStep.lean`, `ClosedImmersionCohomology.lean`, `FiniteGeneratorReduction.lean` — each in the 300–600 LOC band. `GrothendieckVanishing.lean` itself (the assembly file) is small (~150 LOC), as expected for a final-assembly file. The indescriptive file names that LC1 flagged — `SetupCore.lean`, `Auxiliary.lean`, `ClosedOpenDecomposition.lean` — are all still present, occupying the mid-bracket; the renames the review asked for did not happen. `PresheafFilteredColimit.lean` was *not* split as LC1 suggested; it remains the second-largest file. Total 5,061 raw / 4,087 normalized LOC across 15 files. This is "B". Review-2 evaluates B against A. --- ## 11. Review 2 — STUB *Brian Nugent's review-2 is in progress as of 2026-05-01. This section is a placeholder; it will be populated once review-2 lands.* ### 11.1 Methodology (planned) Same as review-1: reading the codebase from the top, flagging Large changes and per-file items. ### 11.2 Likely findings (predicted from review-1 unaddressed items) Predictions for what review-2 may flag again, ranked by §6.4: 1. **LC2 (sheaf-vs-presheaf)** — likely re-flagged. The compress loop went the opposite direction; we expect Brian to ask why the `_presheaf` versions were deleted rather than promoted to primary. 2. **LC5 (iso-not-implication)** — likely re-flagged. The two specific lemmas Brian named are still in implication form. 3. **LC1 (file structure)** — partially re-flagged. Renames mostly didn't happen; the indescriptive file names are still there. 4. **LC4 (cohomology API)** — partial re-flag. The API exists but leaks remain in 2–3 files (per the Apr 24 outstanding-items audit in `artifacts/review_outstanding_2026-04-24.md`). 5. **LC3 (ShortComplex)** — unlikely to be re-flagged. Compress addressed it cleanly. ### 11.3 Failure modes "C" — TBD The paper's third concept — *failure modes that persist even after review-driven refactor* — is what review-2 will populate. We expect: - API-surface issues that the LOC-gated loop systematically missed (LC2, LC5). - Mathlib-vocabulary issues that Brian will spot but the agent didn't (e.g. the per-file *"this is just `[Mathlib lemma]`"* items he flagged in review-1 — some closed, but new ones likely surface in B). - Generality issues (Brian's recurring *"prove this in more generality, then derive the special case"* — likely still uneven). This section will be filled in once review-2 lands. --- ## 12. Lessons for future AI-formalization projects These are the things we'd carry forward, distilled from §7 and §8 with the framing the user supplied: *"if you build a minimal agent using Claude Code with lean-lsp-mcp and with Aristotle's help, you can formalize X in Y days, encounter obstacles Z, and if you do a review, you can expect the code quality to go from A to B, while failure modes C still remain even after the refactor after the review."* ### 12.1 What the toolkit can do A minimal off-the-shelf agent (Claude Code + `lean-lsp-mcp` + Aristotle) can drive an end-to-end formalization of a third-year algebraic-geometry theorem in **9 active proving days**, ending sorry-free with no non-default axioms. The dominant cost is Claude tokens (90%+ via cache reads), and the dominant value of Aristotle is on **bounded, well-scoped foundational lemmas** plus its role as a **falsifier**, not as a strategic prover for the whole theorem. ### 12.2 What the toolkit cannot do without a human - **Recognise that a proof is over-specified.** The "reframe through flasque" insight (§4.4) was a human contribution. No loop or prover was going to volunteer that the dreaded sublemma was avoidable. - **Decline override-juggling and decompose instead** (§7.1). The agent's default is to bump budgets; the durable fix needs prompting. - **Stop spinning** (§7.4). 64 consecutive no-ops happened because the loop's stop conditions did not include "I am asked to do nothing." - **Distinguish optimisation targets from quality targets** (§7.8). The compress loop optimised LOC and inadvertently regressed on idiomaticity (LC2). ### 12.3 What a single human review buys you For our project, review-1 closed roughly **1 of 5** of its own "Large changes" cleanly when the response loops ran (LC3), partially closed 2 (LC1, LC4), inverted 1 (LC2), and left 1 untouched (LC5). Per-file items fared substantially better — many *"doesn't need to be its own theorem"* and *"this is just [Mathlib lemma]"* items were closed by the refactor loop's `COMPLETE_IF`-gated wrapper-deletions. The numerical headline (−42% normalized LOC, +docstrings, +file mergers) overstates the qualitative change, because the LOC-reducing work and the recommendation-fixing work were only loosely correlated. We expect review-2 will show the **delta from one human review is finite and uneven**: a single review surfaces issues, but only the ones the response loops are gated to address actually move. The remaining ones come back. ### 12.4 What we'd change for the next project 1. **Keep `cleanupPeriodDays ≥ 365`** from day one. We lost the Mar 27–31 Claude session jsonls to the default-30-day cleanup and had to reconstruct from commits. (Already changed for this machine.) 2. **Cache the initial compile.** Half the proving-phase frustration was slow CI; precaching mathlib was a Day-1 task that paid for itself. 3. **Encode review-target preferences in the loop's gate, not just the prompt.** Phase-1 of the compress loop would not have inverted LC2 if the gate had refused commits that *deleted* unbundled forms. 4. **Auditor on a real timer.** The refactor loop's `IDLE_COOLDOWN` didn't trip because the auditor returned empty `added_lines`. A hard "cycles since last actual completion" counter with a short-circuit at, say, 10 would have saved 60+ no-op cycles. 5. **Two reviews before publishing**, not one. The asymmetry between "review-1 surfaces issues" and "review-2 measures persistence of issues" is the actual scientific design here. 6. **Treat ephemeral cycles as a token-billing risk, not a free win.** `codex exec --ephemeral` skips session writes, which means we have no per-cycle token cost. A future loop should at minimum write the final usage block to its own state directory. ### 12.5 What this isn't This isn't a benchmark. We did one theorem with one toolkit. The absolute numbers (35 days, 904 commits, $13K Claude estimate) tell you the order of magnitude, not the bound. If your tooling, model generation, mathematical area, or human reviewer is different, your numbers will be different. What we hope generalises is **the failure modes (§7) and the durable patterns (§8) — and the observation that human review delta is real but partial.** --- ## 13. Authors, credits, license Per `README.md` lines 37–43 and updates from PR #15 description. - **[Vasily Ilin](https://github.com/Vilin97) (Human).** Architect, reviewer, infrastructure, owns the repo. Drove all loops, set the user-facing direction, wrote and edited prompts. The user voice in the human-prompt corpus is his. - **[Brian Nugent](https://github.com/brian-nugent) (Human).** Wrote the initial theorem statement, supplied the Hartshorne III.2.7 proof reference and the flasque-vanishing architecture, did review-1 (Apr 8–15, this report's `review.md`) and review-2 (in progress). The agent built on his Mathlib PR #35790 code throughout. - **[Claude Code](https://claude.com/claude-code) (Anthropic).** Engineer and prover. Drove the proving phase interactively and via `/babysit` and `autonomous_loop`; ran as evaluator in the Codex loops; wrote this report. - **[Codex CLI](https://openai.com/codex) (OpenAI).** Worker for the compress and refactor loops in the late-April phase (86 + 478 cycles). - **[Aristotle](https://aristotle.harmonic.fun/) (Harmonic AI).** Lemma specialist; ~94 proving-phase submissions and the Apr 27 aristotle-loop infra burst. License: see `LICENSE` file at repo root. --- ## Appendix A. Where the evidence lives For every claim in this report, the underlying evidence is in one of the following: | Source | Path | What's in it | |---|---|---| | Brian's review-1 | `review.md` (this repo) | Primary source for §5 and §6.4 | | Outstanding items as of Apr 24 | `artifacts/review_outstanding_2026-04-24.md` | Mid-response audit (derivative) | | Proving-phase narrative | `artifacts/findings/proving_phase.md` | Source for §4 | | Refactor loop deep-dive | `artifacts/findings/refactor_loop.md` | Source for §6.1 | | Compress loop deep-dive | `artifacts/findings/compress_loop.md` | Source for §6.2 | | Aristotle role analysis | `artifacts/findings/aristotle_role.md` | Source for §4.3 and §7.7 | | Cross-cutting lessons | `artifacts/findings/lessons.md` | Source for §7, §8, §12 | | Synthesis | `artifacts/findings/_synthesis.md` | Earlier draft of §3 | | Human prompts (Apr 1+) | `artifacts/human_prompts_gv.txt` | 859 prompts | | Proving-phase commits | `artifacts/proving_phase_commits_gv.txt` | 410 commits Mar 27 – Apr 4 | | Theorem statement | `artifacts/lean-theorem-statement_gv.lean` | Used in §1 | | Loop docs | `artifacts/automation-suite_gv.md` | Used in §2.2 | | Headline numbers | `artifacts/report-data/metrics.json` | Used in §0, §9.1 | | Full Aristotle dump | `artifacts/report-data/raw/aristotle/projects.json` | Used throughout §4.3 and §6 | | Per-commit LOC + sorry | `artifacts/report-data/extracted/per_commit_loc_sorry.jsonl` | Used in §4 and §6.2 | | Cycle history | `artifacts/report-data/extracted/cycle_history_*.jsonl` | Used in §6.1 and §6.2 | | Tool-use log | `artifacts/report-data/extracted/tool_use.jsonl` | Used in §9.2 | | Figures | `artifacts/*_gv.png` | Embedded throughout §1, §3, §4, §6, §9, §10 | ## Appendix B. Open questions carried forward Listed for §11 / future work. 1. Why did the Apr 2–4 Aristotle jobs uniformly fail/error? Was it file size, prompt quality, or a real difficulty step from constant-sheaf to colimit machinery? Per-job error logs are not in the API. 2. What was on `SetupCore.lean` lines 195 and 212 that ate ~10 hours of `lean_profile_proof` on Mar 31? File was split before Apr 1, so the original span isn't in HEAD's tree. 3. Is `last_updated_at` ever distinct from compute completion in Aristotle's API, or always a dashboard-refresh artefact? The 4–6-day medians are unbelievable as compute time. 4. Why did Codex `exec --ephemeral` always exit 0? Either the worker silently succeeds-with-no-diff on failure, or `--ephemeral` swallows non-zero exits. Gate-repair branch never ran in 478 cycles. 5. Score schema is `[-2, +2]` but only `{0, 1, 2}` ever fired. Was the negative path ever exercised, or did the evaluator prompt forbid it? 6. What was actually in the 7 `aristotle-loop` outputs, and why did none merge — qualitatively worse than Codex's golf, or simply unevaluated? 7. Apr 8–15 lull: this report attributes it to review-1, per user confirmation. The factual basis is the user statement; the underlying review.md was produced during/after that window. 8. Could a successor "structural" phase have hit raw 4,500 LOC, or is residual ~5,000 LOC genuinely irreducible without API redesign? Stuck-on themes from compress cycles 55–67 suggest the latter. 9. **(For review-2.)** How many of the per-file items from review-1 were actually closed by the refactor loop's `COMPLETE_IF` gates? The `completed_task_lines` in the JSONL records 86 closures — are those false positives, or a faithful tally? 10. **(For review-2.)** What new failure modes appear in B that didn't appear in A? This is the heart of "C".