---
name: vr-loop
description: IMO-style verification-and-refinement loop. Iteratively verifies a proof via the external LLM, extracts structured bug reports, applies fixes, and requires 5 consecutive passes for acceptance.
argument-hint: " — path to .tex file or section; defaults to most recently modified .tex"
compatibility: Requires LLM_API_KEY environment variable. Depends on external-llm skill.
license: Apache-2.0
metadata:
  version: "1.0"
  category: math
---

# Verification-and-Refinement Loop

Adapted from "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline" (Huang & Yang, arXiv:2507.15855). Iteratively verifies a mathematical proof by cycling between a **verifier** (an external reasoning model via `/external-llm`) and a **refiner** (the agent), requiring 5 consecutive clean passes for acceptance.

## Setup

Requires the `external-llm` skill to be installed. Set your API key:

```bash
export LLM_API_KEY="your-key-here"
```

## Invocation

```
/vr-loop proof.tex
/vr-loop        # defaults to most recently modified .tex in current directory
/vr-loop sec3   # verify a specific section (matched by \section title)
```

## Key Differences from external-review-loop

| Feature | external-review-loop | vr-loop |
|---------|----------------------|---------|
| Scope | Whole paper, section-by-section | Single proof/argument, tight loop |
| Acceptance | 1 pass | **5 consecutive passes** |
| Verifier output | ISSUE / ALL CLEAR | **Structured bug report** (critical errors + justification gaps) |
| Rejection | Oscillation detection | **10 consecutive fails** |
| Focus | Breadth (many sections) | **Depth (one proof, iterated)** |

## Algorithm

```
0.   Resolve target (file or section)
0.5. Interview: what to verify, what concerns exist
1.   VERIFY: send proof to the external LLM verifier (fresh session each time)
2.   EXTRACT VERDICT: parse structured Summary from response
3.   CLAUDE REVIEWS: check each finding (CONFIRMED / FALSE POSITIVE)

correct_count = 0, error_count = 0, iteration = 0

LOOP (max 30 iterations):
  Agent determines effective verdict (PASS/FAIL) after reviewing findings:
    - PASS = 0 critical, 0 confirmed-major (false positives don't count)
    - FAIL = >=1 confirmed critical or confirmed major gap

  If effective verdict = PASS:
    correct_count += 1, error_count = 0
    If correct_count >= 5 -> ACCEPT (done!)
    Else -> re-verify (fresh verifier session)

  If effective verdict = FAIL:
    error_count += 1, correct_count = 0
    If error_count >= 10 -> REJECT (stop, report issues)
    Else:
      a. Agent fixes confirmed issues directly (skip the external LLM
         correction for simple fixes; use it only for complex issues)
      b. Re-verify (back to VERIFY step)
```

## Lessons Learned

These lessons are **critical** — they override the theoretical algorithm above.

### 1. Parse the Summary directly — do NOT use a yes/no follow-up

**Problem:** the external LLM's yes/no extraction is unreliable. It often contradicts its own Summary.

**Solution:** Parse the structured Summary output directly:

- Extract `Final Verdict: PASS/FAIL`
- Extract `Critical errors: N`
- Extract `Justification gaps: N major, M minor`

The verdict is determined by the counts, NOT by a follow-up question.

### 2. The agent is the gatekeeper — false positives don't count

**Problem:** ~30% of the external LLM findings are false positives. Common hallucinations:

- "Result stated without reference" when the reference EXISTS
- Flagging standard algebraic identities as unproven
- Flagging results from prior work as unproven

**Solution:** After EVERY verification, the agent reviews each finding:

- Read the flagged location in the tex file
- Check the math
- Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR
- All findings FALSE POSITIVE -> effective PASS (even if raw FAIL)

### 3. Fix simple issues directly — skip the external LLM correction

Most fixes are 2-3 line edits.
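The dual-counter bookkeeping from the algorithm above, with the agent's false-positive filtering folded in, can be condensed into a short Python sketch. This is a minimal illustration, not the skill's actual implementation: `verify`, `review_findings`, and `apply_fixes` are hypothetical stand-ins for the external-LLM call, the agent review, and the fix step, passed in as callables.

```python
CONSECUTIVE_PASS = 5   # passes needed for acceptance
CONSECUTIVE_FAIL = 10  # fails before rejection
MAX_ITERATIONS = 30    # hard limit on total iterations

def vr_loop(proof, verify, review_findings, apply_fixes):
    """Run the verification-and-refinement loop on `proof`.

    verify(proof)            -> raw findings from a fresh verifier session
    review_findings(raw)     -> confirmed issues only (false positives dropped)
    apply_fixes(proof, bugs) -> revised proof
    """
    correct_count = error_count = 0
    for _ in range(MAX_ITERATIONS):
        findings = verify(proof)               # fresh session each iteration
        confirmed = review_findings(findings)  # agent is the gatekeeper
        if not confirmed:                      # effective PASS
            correct_count += 1
            error_count = 0
            if correct_count >= CONSECUTIVE_PASS:
                return "ACCEPT"
        else:                                  # effective FAIL
            error_count += 1
            correct_count = 0
            if error_count >= CONSECUTIVE_FAIL:
                return "REJECT"
            proof = apply_fixes(proof, confirmed)
    return "MAX_ITERATIONS"
```

Note that a pass only increments the counter when *confirmed* issues are empty, so a verifier false positive never resets the streak.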
Only use the external LLM correction for genuinely complex mathematical issues.

### 4. Add "Do NOT flag results from prior work" to the verifier prompt

### 5. Run parallel verifications in the pass-accumulation phase

Once you have 2+ consecutive passes, launch 2 verification sessions in parallel.

## Prompts

All the external LLM interactions use the `/external-llm` skill.

### Prompt 1: Verifier System Prompt

```
You are an expert mathematician and meticulous referee. Rigorously verify the
proof below. A proof is correct ONLY if every step is justified.

### Instructions ###

**1. Core Task**

* Find and report ALL issues. Act as a **verifier**, NOT a solver.
* Perform a **step-by-step** check of the entire proof.

**2. Issue Categories**

* **Critical Error:** A logical fallacy or mathematical mistake that **breaks
  the logical chain** of the proof.
* **Justification Gap:**
  - **Major gap:** A nontrivial step is asserted without proof or reference.
  - **Minor gap:** A step most experts would accept but is not fully rigorous.

**3. Downstream Checking**

Even if a step has a gap, **assume its conclusion correct** for checking
downstream steps. Find ALL issues, not just the first.

**4. What NOT to Flag**

- Style issues, notation preferences, or formatting
- Well-known results (Sard, IFT, Borsuk-Ulam, etc.)
- Standard algebraic manipulations verifiable in 1-2 lines
- Results cited from prior work or with explicit references

**5. What TO Flag**

- "By genericity" without specifying bad locus and codimension
- Dimensional or rank claims stated without computation
- Topological arguments with unverified hypotheses

**6. Output Format**

Your response MUST contain these sections in order:

**a. Summary**
- **Final Verdict:** PASS or FAIL
- **Critical errors:** [count]
- **Justification gaps:** [count major] major, [count minor] minor
- **Findings list:** One-line summary of each issue with location

**b. Detailed Verification Log**
Step-by-step analysis.
For each key step:
- Quote the claim or equation
- Check the reasoning
- Mark verified or issue found

IMPORTANT: Your Final Verdict MUST be consistent with your counts. If
critical == 0 and major gaps == 0, the verdict MUST be PASS.
```

### Prompt 2: Verification Reminder

Appended after the proof content:

```
### Verification Task Reminder ###

Generate the **summary** (with verdict, counts, and findings list) and the
**step-by-step detailed verification log** for the proof above.

IMPORTANT: Ensure your Final Verdict is consistent with your error counts.
```

### Prompt 3: Correction Prompt (for complex issues only)

```
Below is the bug report from the verifier. If you agree with an item, improve
the proof so that it is complete and rigorous. If you disagree, add detailed
explanations to avoid such misunderstanding. Address every item.
```

## Step-by-Step Instructions

### Step 0: Resolve Target

If the argument is a file path, use it. Otherwise find the most recently modified `.tex` in the current directory:

```bash
ls -t *.tex | head -1
```

Read the target content. Store as `$TEX_FILE`, `$BASE`, and `$TARGET_CONTENT`.

### Step 0.5: Interview

Ask the user:

1. **What is the proof trying to show?** (one sentence — context for verifier)
2. **What are you least confident about?** (specific steps, estimates, edge cases)

### Step 1: Verify

#### 1a. Create fresh verification session

```
/external-llm /new vr-{BASE}-verify-{iteration}
```

#### 1b. Send verifier prompt + proof

Compose a single message: Verifier System Prompt + proof context/concerns + proof content + Verification Reminder.

#### 1c. Parse verdict from structured Summary

**Do NOT send a follow-up yes/no question.** Parse the Summary directly:

```python
import re

text = llm_response

# Extract from [OUTPUT] section if present
output_idx = text.find("[OUTPUT]")
if output_idx >= 0:
    text = text[output_idx:]

# Parse counts; if the Summary is malformed, default to FAIL-safe values
# so a garbled response never produces a spurious PASS.
crit_m = re.search(r'Critical errors\D*?(\d+)', text)
gaps_m = re.search(r'Justification gaps\D*?(\d+)\s*major', text)
critical = int(crit_m.group(1)) if crit_m else 1
major = int(gaps_m.group(1)) if gaps_m else 1

raw_verdict = "PASS" if (critical == 0 and major == 0) else "FAIL"
```

#### 1d. Agent reviews each finding

For EVERY iteration (pass or fail), review each finding:

1. Read the flagged location in the tex file
2. Check the math
3. Classify: CONFIRMED / FALSE POSITIVE / UNCLEAR

Determine the **effective verdict**:

- All findings FALSE POSITIVE -> effective PASS
- Any finding CONFIRMED critical or major -> effective FAIL
- UNCLEAR -> treat as FAIL (conservative)

### Step 2: Handle PASS

Increment `correct_count`. Reset `error_count = 0`.

If `correct_count >= 3` and no fixes pending: consider launching 2 parallel verification sessions.

If `correct_count >= 5`: **ACCEPT**. Go to Delivery.

### Step 3: Handle FAIL

Increment `error_count`. Reset `correct_count = 0`.

If `error_count >= 10`: **REJECT**. Go to Delivery.

**Simple fixes:** Apply edits directly.

**Complex fixes:** Create a correction session and send Prompt 3 + bug report to the external LLM.

Then re-verify (back to Step 1).

## Delivery

After ACCEPT or REJECT:

### Compile

```bash
pdflatex -interaction=nonstopmode $BASE.tex && pdflatex -interaction=nonstopmode $BASE.tex
```

### Git Commit (only if fixes were applied)

Ask the user before committing:

> "VR loop complete with fixes applied. Want me to commit?"
If yes:

```bash
git add $TEX_FILE && git commit -m "VR loop: [status] after [N] iterations"
```

## Final Summary

Always print at the end:

```
===========================================
VR LOOP -- Final Summary
===========================================
File: $BASE.tex
Result: ACCEPTED / REJECTED / MAX ITERATIONS
Iterations: N
Passes: X (consecutive)
Fails: Y (consecutive)
Issues Fixed:
- [list]
False Positives Filtered:
- [list]
Remaining Issues (if rejected):
- [list]
===========================================
```

## Configuration

| Parameter | Default | Description |
|------------------|---------|------------------------------------|
| CONSECUTIVE_PASS | 5 | Passes needed for acceptance |
| CONSECUTIVE_FAIL | 10 | Fails before rejection |
| MAX_ITERATIONS | 30 | Hard limit on total iterations |
| LLM_TIMEOUT | 600s | Per-verification timeout |

## Important Notes

- **All the external LLM calls go through `/external-llm`.** Session persistence + transcripts.
- **Each verification uses a FRESH session.** Prevents bias from previous rounds.
- **The agent is the gatekeeper.** It reviews EVERY finding before acting. False positives never reset the pass counter.
- **Parse structured output, don't ask follow-up questions.** The yes/no extraction prompt is unreliable.
- **Fix simple issues directly.** Only use the external LLM correction for genuinely complex problems.

## Reference

Huang & Yang, "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline," arXiv:2507.15855, 2025.