---
name: auto-paper-improvement-loop
description: "Autonomously improve a generated paper via GPT-5.4 xhigh review → implement fixes → recompile, for 2 rounds. Use when user says \"改论文\", \"improve paper\", \"论文润色循环\", \"auto improve\", or wants to iteratively polish a generated paper."
argument-hint: [paper-directory]
allowed-tools: Bash(*), Read, Write, Edit, Grep, Glob, Agent, mcp__codex__codex, mcp__codex__codex-reply
---

# Auto Paper Improvement Loop: Review → Fix → Recompile

Autonomously improve the paper at: **$ARGUMENTS**

## Context

This skill is designed to run **after** Workflow 3 (`/paper-plan` → `/paper-figure` → `/paper-write` → `/paper-compile`). It takes a compiled paper and iteratively improves it through external LLM review.

Unlike `/auto-review-loop` (which iterates on **research** — running experiments, collecting data, rewriting narrative), this skill iterates on **paper writing quality** — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.

## Constants

- **MAX_ROUNDS = 2** — Two rounds of review→fix→recompile. Empirically, Round 1 catches structural issues (4→6/10) and Round 2 catches remaining presentation issues (6→7/10). Returns diminish beyond 2 rounds for writing-only improvements.
- **REVIEWER_MODEL = `gpt-5.4`** — Model used via Codex MCP for paper review.
- **REVIEW_LOG = `PAPER_IMPROVEMENT_LOG.md`** — Cumulative log of all rounds, stored in the paper directory.
- **HUMAN_CHECKPOINT = false** — When `true`, pause after each round's review and present the score and weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When `false` (default), the loop runs fully autonomously.

> 💡 Override: `/auto-paper-improvement-loop "paper/" — human checkpoint: true`

## Inputs

1. **Compiled paper** — `paper/main.pdf` + LaTeX source files
2. **All section `.tex` files** — concatenated for the review prompt

## State Persistence (Compact Recovery)

If the context window fills up mid-loop, Claude Code auto-compacts. To recover, this skill writes `PAPER_IMPROVEMENT_STATE.json` after each round:

```json
{
  "current_round": 1,
  "threadId": "019ce736-...",
  "last_score": 6,
  "status": "in_progress",
  "timestamp": "2026-03-13T21:00:00"
}
```

**On startup**: if `PAPER_IMPROVEMENT_STATE.json` exists with `"status": "in_progress"` AND `timestamp` is within 24 hours, read it and `PAPER_IMPROVEMENT_LOG.md` to recover context, then resume from the next round. Otherwise (file absent, `"status": "completed"`, or older than 24 hours), start fresh.

**After each round**: overwrite the state file.
**On completion**: set `"status": "completed"`.

## Workflow

### Step 0: Preserve Original

```bash
cp paper/main.pdf paper/main_round0_original.pdf
```

### Step 1: Collect Paper Text

Concatenate all section files into a single text block for the review prompt:

```bash
# Collect all sections in order (glob expansion is alphabetical)
for f in paper/sections/*.tex; do
    # Quote "$f" so paths containing spaces survive
    echo "% === $(basename "$f") ==="
    cat "$f"
done > /tmp/paper_full_text.txt
```

### Step 2: Round 1 Review

Send the full paper text to GPT-5.4 xhigh:

```
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a [VENUE] paper. Please provide a detailed, structured review.

    ## Full Paper Text:
    [paste concatenated sections]

    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Verdict**: Ready for submission? Yes / Almost / No

    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency.
```

Save the threadId for Round 2.

### Step 2b: Human Checkpoint (if enabled)

**Skip if `HUMAN_CHECKPOINT = false`.**

Present the review results and wait for user input:

```
📋 Round 1 review complete.

Score: X/10 — [verdict]

Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...

Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.
```

Parse the user response the same way as `/auto-review-loop`: approve / custom instructions / skip / stop.

### Step 3: Implement Round 1 Fixes

Parse the review and implement fixes by severity:

**Priority order:**
1. CRITICAL fixes (assumption mismatches, internal contradictions)
2. MAJOR fixes (overclaims, missing content, notation issues)
3. MINOR fixes (if time permits)

**Common fix patterns:**

| Issue | Fix Pattern |
|-------|-------------|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to `references.bib`, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |

### Step 4: Recompile Round 1

```bash
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf
```

Verify: 0 undefined references, 0 undefined citations.
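The verification above could be scripted along these lines. This is a minimal sketch rather than part of the skill: the helper name `count_undefined` is hypothetical, and the patterns assume the standard LaTeX and natbib warning wording (`Warning: Reference ...` and `Warning: Citation ...`), which other packages may phrase differently.

```bash
# Hypothetical helper (an assumption, not defined by this skill):
# count unresolved cross-references and citations in a LaTeX log.
count_undefined() {
  local log="$1"
  local refs cites
  # A missing log means nothing was compiled; treat that as a failure.
  [ -f "$log" ] || { echo "missing log: $log" >&2; return 2; }
  # grep -c prints 0 and exits 1 when nothing matches, so ignore its status.
  refs=$(grep -c 'Warning: Reference ' "$log" || true)
  cites=$(grep -c 'Warning: Citation ' "$log" || true)
  echo "undefined_refs=$refs undefined_cites=$cites"
  # Succeed only when both counts are zero.
  [ "$refs" -eq 0 ] && [ "$cites" -eq 0 ]
}
```

From the project root, `count_undefined paper/main.log` would print both counts and return a non-zero status if either is above zero, which makes it usable as a gate before Round 2.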
### Step 5: Round 2 Review

Use `mcp__codex__codex-reply` with the saved threadId:

```
mcp__codex__codex-reply:
  threadId: [saved from Round 1]
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round 2 update]

    Since your last review, we have implemented:
    1. [Fix 1]: [description]
    2. [Fix 2]: [description]
    ...

    Please re-score and re-assess. Same format:
    Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.
```

### Step 5b: Human Checkpoint (if enabled)

**Skip if `HUMAN_CHECKPOINT = false`.** Same as Step 2b: present the Round 2 review and wait for user input.

### Step 6: Implement Round 2 Fixes

Same process as Step 3. Typical Round 2 fixes:
- Add controlled synthetic experiments validating theory
- Further soften any remaining overclaims
- Formalize informal arguments (e.g., truncation → formal proposition)
- Strengthen limitations section

### Step 7: Recompile Round 2

```bash
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf
```

### Step 8: Format Check

After the final recompilation, run a format compliance check:

```bash
# 1. Page count vs venue limit
PAGES=$(pdfinfo paper/main.pdf | grep Pages | awk '{print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"

# 2. Overfull hbox warnings (content exceeding margins)
# grep -c already prints 0 (and exits 1) when nothing matches, so use
# "|| true" and default only when the log itself is missing.
OVERFULL=$(grep -c "Overfull" paper/main.log 2>/dev/null || true)
echo "Overfull hbox warnings: ${OVERFULL:-0}"
grep "Overfull" paper/main.log 2>/dev/null | head -10

# 3. Underfull hbox warnings (loose spacing)
UNDERFULL=$(grep -c "Underfull" paper/main.log 2>/dev/null || true)
echo "Underfull hbox warnings: ${UNDERFULL:-0}"

# 4. Bad boxes summary
BADNESS=$(grep -c "badness" paper/main.log 2>/dev/null || true)
echo "Badness warnings: ${BADNESS:-0}"
```

**Auto-fix patterns:**

| Issue | Fix |
|-------|-----|
| Overfull hbox in equation | Wrap in `\resizebox` or split with the `split`/`aligned` environments |
| Overfull hbox in table | Reduce font (`\small`/`\footnotesize`) or use `\resizebox{\linewidth}{!}{...}` |
| Overfull hbox in text | Rephrase sentence or add `\allowbreak` / `\-` hints |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
| Underfull hbox (loose) | Rephrase for better line filling or add `\looseness=-1` |

If any overfull hbox exceeds 10pt, fix it and recompile before documenting.

### Step 9: Document Results

Create `PAPER_IMPROVEMENT_LOG.md` in the paper directory:

```markdown
# Paper Improvement Log

## Score Progression

| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |

## Round 1 Review & Fixes
### GPT-5.4 xhigh Review (Round 1)

[Full raw review text, verbatim]

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## Round 2 Review & Fixes
### GPT-5.4 xhigh Review (Round 2)

[Full raw review text, verbatim]

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes
```

### Step 10: Summary

Report to user:
- Score progression table
- Number of CRITICAL/MAJOR/MINOR issues fixed per round
- Final page count
- Remaining issues (if any)

### Feishu Notification (if configured)

After each round's review AND at final completion, check `~/.claude/feishu.json`:
- **After each round**: Send `review_scored` — "Round N: X/10 — [key changes]"
- **After final round**: Send `pipeline_done` — score progression table + final page count
- If config absent or mode `"off"`: skip entirely (no-op)

## Output

```
paper/
├── main_round0_original.pdf    # Original
├── main_round1.pdf             # After Round 1
├── main_round2.pdf             # After Round 2 (final)
├── main.pdf                    # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md    # Full review log with scores
```

## Key Rules

- **Large file handling**: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- **Preserve all PDF versions** — user needs to compare progression
- **Save FULL raw review text** — do not summarize or truncate GPT-5.4 responses
- **Use `mcp__codex__codex-reply`** for Round 2 to maintain conversation context
- **Always recompile after fixes** — verify 0 errors before proceeding
- **Do not fabricate experimental results** — synthetic validation must describe methodology, not invent numbers
- **Respect the paper's claims** — soften overclaims rather than adding unsupported new claims
- **Global consistency** — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)

## Typical Score Progression

Based on end-to-end testing on a 9-page ICLR 2026 theory paper:

| Round | Score | Key Improvements |
|-------|-------|-----------------|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |

**+4.5 points across 3 rounds** (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.
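
To confirm the zero-overfull-hbox end state, and to apply the 10pt threshold from Step 8, a check like the following may help. It is a sketch under stated assumptions: the function name `overfull_over` is hypothetical, and it assumes the usual TeX log wording `Overfull \hbox (12.3pt too wide) ...` at the start of a line.

```bash
# Hypothetical helper (an assumption, not defined by this skill):
# print Overfull \hbox warnings wider than a threshold in points.
overfull_over() {
  local log="$1" limit="${2:-10}"
  awk -v limit="$limit" '
    /^Overfull \\hbox \(/ {
      width = $3              # third field looks like "(12.3pt"
      sub(/^\(/, "", width)   # drop the opening parenthesis
      sub(/pt$/, "", width)   # drop the unit
      if (width + 0 > limit + 0) { print; n++ }
    }
    # Exit 1 when at least one warning exceeded the threshold.
    END { exit (n > 0 ? 1 : 0) }
  ' "$log"
}
```

Called as `overfull_over paper/main.log 10`, it lists only the warnings that need a fix and returns a non-zero status when any exist, so smaller overflows can be left alone.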