---
name: external-review-loop
description: Multi-turn verification loop for LaTeX papers via an external reasoning model. Each section agent conducts a back-and-forth dialogue, pushing back on weak points, and produces a structured report.
argument-hint: " — path to .tex file; defaults to most recently modified .tex"
compatibility: Requires LLM_API_KEY environment variable. Depends on external-llm skill.
license: Apache-2.0
metadata:
  version: "1.0"
  category: math
---

# External Review Loop

Multi-turn paper verification in which each section agent holds an extended conversation with an external reasoning model, challenging weak answers and pushing for explicit computations.

## Setup

Requires the `external-llm` skill to be installed. Set your API key:

```bash
export LLM_API_KEY="your-key-here"
```

## Invocation

```
/external-review-loop paper.tex
/external-review-loop            # defaults to most recently modified .tex
```

## Algorithm

```
0.   Resolve target file
0.5  INTERVIEW: ask user which sections are critical and what to verify
0.6  SETUP: create a session for each unit
for round = 1 to MAX_ROUNDS:
  1. Split paper into units
  2. Launch agents (one per unit) that:
     - Send verification prompt
     - Analyze response, push back on weak points
     - Repeat 3-6 times
     - Produce structured report
  3. Collect reports
  4. If all units passed -> done
  5. Fix issues
  6. Compile, commit
  7. Loop (re-check only failed units)
```

## Step 0: Resolve Target File

If no argument is given, find the most recently modified `.tex` file in the current directory:

```bash
ls -t *.tex | head -1
```

## Step 0.5: Interview

Ask the user:

1. **Which sections contain the key arguments?**
2. **What specific steps are you least confident about?**
3. **What computations need explicit verification?** (ranks, dimensions, etc.)

Mark sections as **deep** (core arguments) or **light** (setup/notation).

## Step 1: Split into Units

Split on `\section` boundaries.
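The split itself fits in a few lines of awk. This is a minimal sketch; the function and file names (`split_units`, `demo_unit_NNN.tex`, `demo_paper.tex`) are illustrative, not part of the skill.

```shell
# Sketch: split a .tex file into one numbered file per \section.
# Anything before the first \section lands in <prefix>_000.tex.
split_units() {
  awk -v prefix="$1" '
    BEGIN { n = 0; out = sprintf("%s_%03d.tex", prefix, n) }
    /^\\section\{/ { n++; out = sprintf("%s_%03d.tex", prefix, n) }
    { print > out }
  ' "$2"
}

# Demo on a throwaway file:
printf '%s\n' '\section{Intro}' 'Hello.' '\section{Main}' 'World.' > demo_paper.tex
split_units demo_unit demo_paper.tex
# demo_unit_001.tex now holds the Intro section, demo_unit_002.tex the Main section.
```

This assumes `\section{` starts at the beginning of a line, which is the common case in hand-written LaTeX; a robuster splitter would also handle `\section*` and leading whitespace.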
For each unit record:

- `unit_id`, `unit_title`, `unit_content`
- `depth`: `light` or `deep`
- `focus_points`: specific concerns from the interview
- `session_name`: `review-{BASE}-{unit_id}`

## Step 2: Launch Verification Agents

For each unit, launch a subagent with this prompt:

---

You are a verification agent for section **{unit_id}** of a mathematical paper.

## Your Section

```
{unit_content}
```

## Focus Points (from author)

{focus_points}

## Your Task

1. Create an `/external-llm` session named `{session_name}`
2. Run a **multi-turn conversation** with the external LLM (up to 15 turns)
3. After each LLM response, **analyze it and push back** if:
   - The reasoning is vague or hand-wavy
   - A computation is claimed but not shown
   - "By genericity" or similar is asserted without justification
   - "ALL CLEAR" is given without specific verification details
4. Produce a final report

## Turn 1: Initial Prompt

Send this to the external LLM:

```
You are doing a CRITICAL verification of a mathematical proof.
Be SKEPTICAL. Do not give vague assurances.

## Section to Verify

{unit_content}

## Specific Concerns

{focus_points}

## Your Task

For each concern, either:
- VERIFIED: [claim] -- [how you checked it]
- ISSUE [CRITICAL/MODERATE]: [description] -- [suggested fix]

Do NOT say "it's clear" or "obviously works". Show work.
```

## Turns 2+: Pushback

After each LLM response, check:

**If the response contains vague phrases** ("clearly", "obviously", "by inspection", "straightforward"):

```
You said "{vague_claim}" but didn't show the work.
Please compute this explicitly:
- [specific computation needed]
```

**If the response says "ALL CLEAR" briefly:**

```
You said ALL CLEAR but I need more detail. Walk through each concern:
1. What exactly did you verify?
2. What computation did you do?
3. What edge cases did you check?
```

**If a rank/dimension claim is made:**

```
You claimed the rank is N. Show me the matrix dimensions and the row-by-row computation.
```

**If "by genericity" is used:**

```
You argued "by genericity". What's the bad locus? What's its codimension?
Why is it >= 2?
```

Continue until the LLM provides convincing explicit verification, or you've done 15 turns.

## Final Report

After the conversation, output:

```
## VERIFICATION REPORT: {unit_id}

### Status
[ALL CLEAR / ISSUES FOUND]

### Issues
[List each issue with severity, or "None"]

### Verified Claims
For each verified claim:
- [claim]: verified by [method]

### Conversation Summary
- Turns: [N]
- Main topics: [list]

### Confidence
[High/Medium/Low] -- [reason]
```

## Rules

- **Don't accept vague answers** — push for explicit work
- **Ask for computations** — ranks, dimensions, etc.
- **Challenge "genericity"** — ask for the bad locus
- **Be especially skeptical of "ALL CLEAR"** without details
- **Max 15 turns** — then produce the report even if not fully satisfied

---

## Step 3: Collect Reports

Parse each agent's report and print a summary:

```
Round N:
  sec2 (Title) -- ALL CLEAR (4 turns)
  sec3 (Title) -- ISSUES: 1 CRITICAL (5 turns)
  sec6 (Title) -- ALL CLEAR (6 turns)
```

## Step 4: Fix Issues

For each CRITICAL/MODERATE issue:

1. Verify the issue is real (check the math yourself)
2. If real, apply a minimal fix
3. Continue the session with fix-and-verify:

   ```
   I applied this fix: [description]

   Updated section:
   [content]

   Does this resolve the issue?
   ```

4. If the issue is still flagged after 2 attempts, mark it "UNRESOLVED -- needs manual review"

## Step 5: Compile

```bash
pdflatex -interaction=nonstopmode $BASE.tex && pdflatex -interaction=nonstopmode $BASE.tex
```

## Step 6: Commit

Ask the user before committing:

> "Round N complete with X fixes. Want me to commit these changes?"

If yes:

```bash
git add $TEX_FILE
git commit -m "review round N: X fixes"
```

## Step 7: Loop

Increment `ROUND`. Re-check only the failed units, reusing their existing sessions.
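The "re-check only failed units" bookkeeping can be sketched with `grep -L`, assuming each agent writes its report to `report_<unit_id>.md` (a hypothetical layout, not mandated by the skill) containing the literal text `ALL CLEAR` on success:

```shell
# Fabricate two reports so the sketch is self-contained:
printf '## VERIFICATION REPORT: sec2\nStatus: ALL CLEAR\n' > report_sec2.md
printf '## VERIFICATION REPORT: sec3\nStatus: ISSUES FOUND\n' > report_sec3.md

# grep -L lists the files that do NOT contain the pattern,
# i.e. exactly the units that failed and need another round.
failed=$(grep -L 'ALL CLEAR' report_*.md | sed 's/^report_//; s/\.md$//')
echo "re-check next round: $failed"   # -> re-check next round: sec3
```

Using `grep -L` keeps the pass/fail decision in the report text itself, so the driver never has to parse the full report structure.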
## Final Summary

```
===========================================
REVIEW LOOP COMPLETE
===========================================
File: $BASE.tex
Rounds: N

Unit Reports:
  sec2 (Title) -- PASSED (4 turns)
    - Verified: [key claims]
  sec6 (Title) -- PASSED (6 turns)
    - Verified: [key claims]
===========================================
```

## Configuration

| Parameter   | Default | Description              |
|-------------|---------|--------------------------|
| MAX_ROUNDS  | 10      | Iteration limit          |
| MAX_TURNS   | 15      | Turns per unit per round |
| LLM_TIMEOUT | 600s    | Per-query timeout        |
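A common way to make these limits overridable per run is an environment variable with the documented default; whether the skill actually reads them from the environment is an assumption of this sketch, not something it guarantees.

```shell
# Defaults mirror the configuration table; export any of these
# before invoking the skill to override (an assumed convention).
MAX_ROUNDS="${MAX_ROUNDS:-10}"
MAX_TURNS="${MAX_TURNS:-15}"
LLM_TIMEOUT="${LLM_TIMEOUT:-600}"   # seconds

echo "config: rounds=$MAX_ROUNDS turns=$MAX_TURNS timeout=${LLM_TIMEOUT}s"
```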