---
name: literature-review-agent
description: Step 3 of the PaperOrchestra pipeline (arXiv:2604.05018). Execute the literature search strategy from outline.json — discover candidate papers via web search, verify them through Semantic Scholar (Levenshtein > 70 fuzzy title match, temporal cutoff, dedup by paperId), build a BibTeX file, and draft Introduction + Related Work using ≥90% of the verified pool. Runs in parallel with the plotting-agent. TRIGGER when the orchestrator delegates Step 3 or when the user asks to "find citations for my paper", "draft the related work", or "build the bibliography".
---
# Literature Review Agent (Step 3)
Faithful implementation of the Hybrid Literature Agent from PaperOrchestra
(Song et al., 2026, arXiv:2604.05018, §4 Step 3, App. D.3, App. F.1 p.46).
**Cost: ~20–30 LLM calls.** This is one of the two longest steps (the other is
plotting). Wall-time floor is set by Semantic Scholar's 1 QPS verification
limit.
## Inputs
- `workspace/outline.json` — specifically `intro_related_work_plan` with the
Introduction search directions and the 2-4 Related Work methodology
clusters
- `workspace/inputs/conference_guidelines.md` — used to derive `cutoff_date`
- `workspace/inputs/idea.md`, `workspace/inputs/experimental_log.md` — for
framing the Intro and grounding the Related Work positioning
## Outputs
- `workspace/citation_pool.json` — verified Semantic Scholar metadata for
every paper that survived verification
- `workspace/refs.bib` — BibTeX file generated from the verified pool
- `workspace/drafts/intro_relwork.tex` — drafted Introduction and Related
Work sections, written into the template, with the rest of the template
preserved verbatim
## Two-phase pipeline (App. D.3)
```
PHASE 1: Parallel Candidate Discovery
  For each search direction in introduction_strategy.search_directions:
  For each limitation_search_query in each related_work cluster:
    - Use the host's web search tool to discover up to ~10 candidate papers.
    - Run up to 10 discovery queries in parallel (host-permitting).
    - Collect (title, snippet, url) tuples; no verification yet.
  → PRE-DEDUP before Phase 2 (see Step 1.5 below)

PHASE 2: Sequential Citation Verification (1 QPS, with cache)
  For each candidate (after pre-dedup), sequentially:
    0. Check s2_cache.json first (scripts/s2_cache.py --check).
       If HIT: use cached response, skip live S2 call. No throttle needed.
       If MISS: proceed with live request below.
    1. Query Semantic Scholar by title:
         GET https://api.semanticscholar.org/graph/v1/paper/search?query=
             &fields=title,abstract,year,authors,venue,externalIds&limit=5
       (Public endpoint, no key. Throttle to 1 QPS for live requests only.)
    2. Store the S2 response in cache: s2_cache.py --store.
    3. Pick the top hit. Check Levenshtein title ratio against the original
       candidate title. If ratio < 70: discard.
    4. Bonus: if year and venue exactly align with hints, add a +5 point
       match-quality bonus.
    5. Require: abstract is non-empty.
    6. Require: paper.year (or month if known) strictly predates cutoff_date.
       Months default to day-1: e.g., "October 2024" → 2024-10-01.
    7. If all checks pass, add to verified pool.
  After all candidates are verified, dedup by Semantic Scholar paperId.
```
The host agent does the LLM/web work; the deterministic helpers in `scripts/`
do the math.
## Step-by-step
### 0. Derive `cutoff_date`
Parse `conference_guidelines.md` for the submission deadline. The paper aligns
the research cutoff with the venue's submission deadline (App. D.1):
| Venue | Cutoff |
|---|---|
| CVPR 2025 | Nov 2024 |
| ICLR 2025 | Oct 2024 |
| Other | One month before the stated submission deadline |
Encode as `YYYY-MM-DD`. Months default to day-1 (e.g., `2024-10-01`).
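The day-1 encoding can be sketched as below. This is a minimal illustration, not part of the bundled scripts: `month_to_cutoff` is a hypothetical helper, and it assumes the cutoff has already been reduced to a "Month YYYY" string like those in the table.

```python
from datetime import date

# Explicit month lookup so parsing does not depend on the system locale.
_MONTHS = {
    name: index + 1
    for index, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"]
    )
}


def month_to_cutoff(text: str) -> str:
    """Turn 'Oct 2024' or 'October 2024' into '2024-10-01' (day defaults to 1)."""
    month_word, year_word = text.split()
    month = _MONTHS[month_word[:3].lower()]
    return date(int(year_word), month, 1).isoformat()


print(month_to_cutoff("Oct 2024"))       # 2024-10-01
print(month_to_cutoff("November 2024"))  # 2024-11-01
```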
### 1. Phase 1: Parallel Candidate Discovery
From `outline.json`:
- All `introduction_strategy.search_directions` (3-5 queries)
- For each cluster in `related_work_strategy.subsections`:
- The cluster's `sota_investigation_mission` becomes a search query
- All `limitation_search_queries` (1-3 each)
For each query, **use your host's web search tool** (e.g., `WebSearch` in
Claude Code, `@web` in Cursor, the search tool in Antigravity). Collect the
top ~10 candidates per query: title, abstract snippet, source URL.
If your host supports parallel sub-tasks, fire up to 10 concurrent search
queries. If not, run sequentially — slower but functionally equivalent.
#### Optional: Exa as a Phase 1 backend
If your host has no native web search, OR you want a research-paper-focused
backend with better signal-to-noise, you can use [Exa](https://exa.ai) via
the bundled `scripts/exa_search.py` helper. It is **opt-in** and reads
`EXA_API_KEY` from the environment — the repo never commits a key.
```bash
export EXA_API_KEY="your-key-here" # get one at https://dashboard.exa.ai/
python skills/literature-review-agent/scripts/exa_search.py \
--query "Sparse attention long context transformers" \
--num-results 15 \
--discovered-for "related_work[2.1]"
```
Output is a normalized candidate list ready to merge into
`raw_candidates.json`. Phase 2 verification (Semantic Scholar fuzzy match,
cutoff, dedup) is unchanged. See `references/exa-search-cookbook.md` for
the full recipe, query patterns, cost estimates, and security notes.
Combine all discovered candidates into a single working list. Tag each with
the originating query ID so you can later attribute it to "intro" vs
"related_work[i]".
### 1.5. Pre-dedup before Phase 2
**Always run this before starting Phase 2.** Multiple search queries routinely
return the same papers (e.g., "Attention Is All You Need" appears in almost
every NLP discovery query). Verifying duplicates wastes 30-40% of S2 quota
at 1 QPS.
```bash
python skills/literature-review-agent/scripts/pre_dedup_candidates.py \
--in workspace/raw_candidates.json \
--out workspace/deduped_candidates.json
# Prints: "150 candidates → 97 unique (53 duplicates removed)"
```
Use `workspace/deduped_candidates.json` as input to Phase 2.
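For intuition, title-based pre-dedup might look like the sketch below. It assumes candidates are dicts with a `title` field and that duplicates are detected by a normalized title; the bundled `pre_dedup_candidates.py` may use different rules. `normalize_title` and `pre_dedup` are hypothetical names.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation runs with a space, trim the ends."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def pre_dedup(candidates: list[dict]) -> list[dict]:
    """Keep the first candidate seen for each normalized title."""
    seen = set()
    unique = []
    for cand in candidates:
        key = normalize_title(cand["title"])
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


candidates = [
    {"title": "Attention Is All You Need"},
    {"title": "attention is all you need."},
    {"title": "A Different Paper"},
]
unique = pre_dedup(candidates)
print(f"{len(candidates)} candidates → {len(unique)} unique "
      f"({len(candidates) - len(unique)} duplicates removed)")
```

Keeping the first occurrence preserves the earliest query tag; merging tags from dropped duplicates would be a reasonable extension.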
### 2. Phase 2: Sequential Verification via Semantic Scholar (with cache)
For each candidate in `deduped_candidates.json`, in **sequential** order:
**Step A — check cache first** (no S2 call, no throttle needed):
```bash
python skills/literature-review-agent/scripts/s2_cache.py \
--cache workspace/cache/s2_cache.json \
--check ""
# exit 0 + prints JSON → use cached response, skip Step B
# exit 1 → proceed to Step B
```
**Step B — live S2 request** (cache MISS only, throttle to 1 QPS):
**Preferred:** use the bundled `scripts/s2_search.py` helper — it handles
auth, retries, and 429 back-off automatically:
```bash
python skills/literature-review-agent/scripts/s2_search.py \
--query "" --limit 5
# If SEMANTIC_SCHOLAR_API_KEY is set the key is forwarded automatically.
# If not, the public unauthenticated endpoint is used (≤1 QPS, still works).
```
Check whether the key is configured before starting Phase 2:
```bash
python skills/literature-review-agent/scripts/s2_search.py --check-key
```
**Fallback:** if you prefer your host's URL fetch tool, GET:
```
https://api.semanticscholar.org/graph/v1/paper/search?query=&limit=5&fields=title,abstract,year,authors,venue,externalIds
```
Add the header `x-api-key`, set to the value of `SEMANTIC_SCHOLAR_API_KEY`, if that env var is set.
Be polite: ≤1 request per second for live requests. Cache hits are free.
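If you take the fallback route, the URL construction and the 1 QPS limit can be sketched as follows. This is an illustration under assumptions, not the logic of `s2_search.py`: `build_search_url` and `Throttle` are hypothetical names, and the block performs no network call.

```python
import time
from urllib.parse import urlencode

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
S2_FIELDS = "title,abstract,year,authors,venue,externalIds"


def build_search_url(title: str, limit: int = 5) -> str:
    """Build the title-search URL with the query percent-encoded."""
    params = {"query": title, "limit": limit, "fields": S2_FIELDS}
    return f"{S2_SEARCH_URL}?{urlencode(params)}"


class Throttle:
    """Block so consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous live request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle()
url = build_search_url("Attention Is All You Need")
# Call throttle.wait() immediately before each live GET; skip it on cache hits.
```

The clock and sleep functions are injectable only so the throttle can be tested without real waiting.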
**Step C — store in cache** (after every successful live request):
```bash
python skills/literature-review-agent/scripts/s2_cache.py \
--cache workspace/cache/s2_cache.json \
--store "" \
--response ''
```
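Conceptually, Steps A and C amount to a keyed JSON lookup. The sketch below assumes a flat `{normalized query: response}` file layout, which may not match what `s2_cache.py` actually writes; `cache_check` and `cache_store` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Optional


def _key(query: str) -> str:
    """Normalize the query so case and spacing differences still hit."""
    return " ".join(query.lower().split())


def cache_check(cache_path: Path, query: str) -> Optional[dict]:
    """Return the cached response for `query`, or None on a miss."""
    if not cache_path.exists():
        return None
    cache = json.loads(cache_path.read_text(encoding="utf-8"))
    return cache.get(_key(query))


def cache_store(cache_path: Path, query: str, response: dict) -> None:
    """Insert or overwrite the response for `query` and rewrite the file."""
    cache = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    cache[_key(query)] = response
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```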
For the top hit:
```bash
python skills/literature-review-agent/scripts/levenshtein_match.py \
--candidate "Original candidate title" \
--found "S2 returned title"
# prints integer 0-100. Discard if < 70.
```
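A 0-100 title ratio can be derived from edit distance as sketched here. The exact formula inside `levenshtein_match.py` is not stated in this document, so the normalization below (one minus distance over the longer length, on lowercased titles) is an assumption, and scores from the bundled script may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def title_ratio(candidate: str, found: str) -> int:
    """Similarity on a 0-100 scale: 100 * (1 - distance / longer length)."""
    a, b = candidate.strip().lower(), found.strip().lower()
    if not a and not b:
        return 100
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))


ratio = title_ratio("Attention is All You Need", "Attention Is All You Need")
keep = ratio >= 70  # discard when the ratio is below 70
```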
Then check the temporal cutoff:
```bash
python skills/literature-review-agent/scripts/check_cutoff.py \
--paper-year 2024 \
--paper-month 9 \
--cutoff 2024-10-01
# exit 0 if strictly predates, exit 1 if not
```
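The strict comparison can be sketched as below. `predates_cutoff` is a hypothetical name, and treating a missing month as January is an assumption made here by analogy with the day-1 default; `check_cutoff.py` may handle year-only papers differently.

```python
from datetime import date
from typing import Optional


def predates_cutoff(year: int, month: Optional[int], cutoff: str) -> bool:
    """True only if the paper date is strictly earlier than the cutoff date."""
    # Assumption: a missing month is treated as January, mirroring the
    # "default to day 1" convention used for months.
    paper_date = date(year, month or 1, 1)
    return paper_date < date.fromisoformat(cutoff)


print(predates_cutoff(2024, 9, "2024-10-01"))   # True
print(predates_cutoff(2024, 10, "2024-10-01"))  # False: same day is not strictly earlier
```

A stricter policy for year-only records would be to require the year itself to be earlier than the cutoff year.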
If both checks pass AND the abstract is non-empty, append the paper's full
S2 metadata to the verified pool.
### 3. Dedup and assemble the pool
After all candidates are verified:
```bash
python skills/literature-review-agent/scripts/dedupe_by_id.py \
--in raw_pool.json \
--out workspace/citation_pool.json
```
The dedupe script keys on `paperId` (Semantic Scholar's internal unique ID),
falling back to `externalIds.DOI`, then `externalIds.ArXiv`, then a
normalized title.
The script also computes and writes `min_cite_paper_count` =
`floor(0.9 * len(papers))` — the minimum number of papers the writing step
must cite (the paper's ≥90% integration rule, App. D.3).
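The fallback chain and the 90% floor can be illustrated as follows. This is a sketch under assumptions rather than the code of `dedupe_by_id.py`: the key prefixes and the `dedup_key` / `dedupe` names are hypothetical, and the output shape only mirrors the `papers` and `min_cite_paper_count` fields mentioned in this document.

```python
import math
import re


def dedup_key(paper: dict) -> str:
    """First available identifier: paperId, then DOI, then arXiv ID, then title."""
    external = paper.get("externalIds") or {}
    if paper.get("paperId"):
        return f"s2:{paper['paperId']}"
    if external.get("DOI"):
        return f"doi:{external['DOI'].lower()}"
    if external.get("ArXiv"):
        return f"arxiv:{external['ArXiv']}"
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return f"title:{title}"


def dedupe(papers: list[dict]) -> dict:
    """Drop repeats and attach the minimum number of papers to cite."""
    unique = {}
    for paper in papers:
        unique.setdefault(dedup_key(paper), paper)  # keep first occurrence
    kept = list(unique.values())
    return {"papers": kept, "min_cite_paper_count": math.floor(0.9 * len(kept))}
```

For example, a pool of 97 papers yields `min_cite_paper_count = 87`. Note that this sketch does not merge a record that has a `paperId` with one that only shares its DOI.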
**Immediately after dedupe_by_id.py**, validate and auto-fix the pool schema:
```bash
python skills/literature-review-agent/scripts/validate_pool.py \
--pool workspace/citation_pool.json --fix
# Catches and fixes authors-as-strings, reports missing required fields.
# Must pass before proceeding to Step 4.
```
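The authors-as-strings fix can be pictured as below. The target shape (a list of `{"name": ...}` objects) and the list of required fields are assumptions for illustration; `validate_pool.py` defines the real schema.

```python
def normalize_authors(paper: dict) -> dict:
    """Return a copy whose `authors` is a list of {"name": ...} objects."""
    fixed = []
    for author in paper.get("authors") or []:
        if isinstance(author, str):
            fixed.append({"name": author.strip()})  # authors-as-strings case
        else:
            fixed.append(author)
    return {**paper, "authors": fixed}


# Assumed required fields; the bundled validator may check a different set.
REQUIRED_FIELDS = ("title", "abstract", "year", "authors")


def missing_fields(paper: dict) -> list[str]:
    """List required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not paper.get(name)]
```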
### 4. Build the BibTeX file
```bash
python skills/literature-review-agent/scripts/bibtex_format.py \
--pool workspace/citation_pool.json \
--out workspace/refs.bib
```
The script generates citation keys deterministically from `firstauthor + year
+ first significant word of title` (e.g., `vaswani2017attention`). It writes
out only `@article` / `@inproceedings` / `@misc` entries — never invents
fields. It also writes the canonical `bibtex_key` back into each paper record
in `citation_pool.json`.
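The key scheme can be sketched like this. The stopword list and the accent-stripping step are assumptions, so keys produced by `bibtex_format.py` may differ in edge cases; `bibtex_key` is a hypothetical name and expects a non-empty author string.

```python
import re
import unicodedata

# Assumed stopword list; the bundled script may skip a different set of words.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "with", "towards"}


def ascii_lower(text: str) -> str:
    """Strip accents and non-alphanumerics, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def bibtex_key(first_author: str, year: int, title: str) -> str:
    """Build `lastname + year + first significant title word`."""
    last_name = ascii_lower(first_author.split()[-1])
    words = [ascii_lower(w) for w in re.split(r"[\s\-:]+", title)]
    significant = next((w for w in words if w and w not in STOPWORDS), "")
    return f"{last_name}{year}{significant}"


print(bibtex_key("Ashish Vaswani", 2017, "Attention Is All You Need"))
# vaswani2017attention
```

Two papers by the same first author in the same year with the same leading word would collide under this scheme, so a real implementation needs a disambiguating suffix.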
**Immediately after bibtex_format.py**, sync keys in `intro_relwork.tex`:
```bash
python skills/literature-review-agent/scripts/sync_keys.py \
--pool workspace/citation_pool.json \
--tex workspace/drafts/intro_relwork.tex \
--inplace
# Replaces every \cite{agent_key} with \cite{canonical_bibtex_key}.
# Eliminates citation_coverage gate failures caused by key mismatch.
```
These two steps replace the manual Python snippets that were previously
required. The pipeline is now:
```
dedupe_by_id → validate_pool --fix → bibtex_format → sync_keys
```
### 5. Draft Introduction + Related Work
This is where you (the host agent) actually write text. Load the
**verbatim Literature Review Agent prompt** at `references/prompt.md`.
Substitute the template placeholders:
| Placeholder | Value |
|---|---|
| `intro_related_work_plan` | full JSON object from `outline.json` |
| `project_idea` | contents of `idea.md` |
| `project_experimental_log` | contents of `experimental_log.md` |
| `citation_checklist` | the BibTeX keys from `refs.bib` |
| `collected_papers` | list of `{key, title, abstract}` from `citation_pool.json` |
| `paper_count` | `len(citation_pool.papers)` |
| `min_cite_paper_count` | from `citation_pool.json` |
| `cutoff_date` | the date you derived in Step 0 |
**Also prepend the Anti-Leakage Prompt** from
`../paper-orchestra/references/anti-leakage-prompt.md`.
Run your LLM with the combined prompt against `template.tex`. The agent's
job is to fill in the empty Introduction and Related Work sections of the
template **and leave everything else untouched**. Output: the full
`template.tex` with those two sections filled. Save to
`workspace/drafts/intro_relwork.tex`.
### 6. Verify ≥90% citation coverage
```bash
python skills/literature-review-agent/scripts/citation_coverage.py \
--tex workspace/drafts/intro_relwork.tex \
--pool workspace/citation_pool.json
# exit 0 if ≥90% of pool is cited; exit 1 otherwise
```
If the gate fails, re-prompt the writing step explicitly listing the missing
keys and asking the agent to integrate them where contextually appropriate.
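To build the list of missing keys for the re-prompt, something like the following sketch may help. It assumes natbib-style `\cite`, `\citep`, and `\citet` commands and does not skip commented-out LaTeX lines; `citation_coverage.py` may parse the draft differently. `cited_keys` and `coverage_report` are hypothetical names.

```python
import re

# Matches \cite, \citep, \citet, \citep*, etc., with optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def cited_keys(tex: str) -> set[str]:
    """Collect every key that appears inside a cite-style command."""
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def coverage_report(tex: str, pool_keys: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of pool cited, pool keys still missing from the draft)."""
    cited = cited_keys(tex)
    missing = [k for k in pool_keys if k not in cited]
    fraction = 1 - len(missing) / len(pool_keys) if pool_keys else 1.0
    return fraction, missing


tex = r"Prior work \citep{a2020x, b2021y} and \cite[see][p.~3]{c2022z}."
fraction, missing = coverage_report(tex, ["a2020x", "b2021y", "c2022z", "d2023w"])
print(fraction, missing)  # 0.75 ['d2023w']
```

Here the draft would fail a 90% gate, and `missing` is the list to hand back to the writing step.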
## Critical rules from the prompt
These are excerpted from `references/prompt.md`. The host agent MUST honor
them on the writing call:
- **Cite ONLY from `collected_papers`.** Never invent BibTeX keys, never
reference papers not in the pool.
- **Cite at least `min_cite_paper_count` of them** in Intro + Related Work
combined.
- **TIMELINE RULE**: Do not treat any papers published after `cutoff_date`
as prior baselines to beat. They are concurrent work only.
- **EVALUATION RULE**: Do not claim our method beats / achieves SOTA over a
specific cited paper UNLESS that paper is explicitly evaluated against in
`experimental_log.md`. Frame other recent papers strictly as concurrent,
orthogonal, or conceptual work.
- **Output format**: return the full code for the updated `template.tex`,
with the two empty sections (Introduction and Related Work) filled in,
and **all the other code** (packages, styles, other sections) **identical
to the original** template.tex.
- Wrap output in ```` ```latex ... ``` ```` fences.
- Do not change `\usepackage[capitalize]{cleveref}` to `cleverref` (there is
no `cleverref.sty`).
## Degraded mode (no web search)
If your host has no web search tool, switch to degraded mode:
1. If the user has supplied a pre-built `workspace/inputs/refs.bib`, copy it
to `workspace/refs.bib` and skip Phase 1 and Phase 2.
2. Otherwise, emit `workspace/drafts/intro_relwork.tex` containing the
template with two TODO markers in the Intro and Related Work sections,
and tell the user the pipeline cannot complete Step 3 without web search.
## Resources
- `references/prompt.md` — verbatim Literature Review Agent prompt from App. F.1
- `references/discovery-pipeline.md` — Phase 1 + Phase 2 explained in detail
- `references/verification-rules.md` — Levenshtein cutoff, year alignment, dedup
- `references/citation-density-rule.md` — the ≥90% integration rule
- `references/s2-api-cookbook.md` — Semantic Scholar URLs, fields, rate limits
- `references/exa-search-cookbook.md` — optional Exa backend for Phase 1 (research-paper-focused web search)
- `scripts/pre_dedup_candidates.py` — **NEW** dedup Phase 1 candidates before Phase 2 (saves 30-40% S2 quota)
- `scripts/s2_cache.py` — **NEW** persistent S2 response cache (eliminates re-verification on re-runs)
- `scripts/validate_pool.py` — **NEW** validate & auto-fix citation_pool.json schema (authors format)
- `scripts/sync_keys.py` — **NEW** sync cite keys in .tex with canonical bibtex_keys after bibtex_format.py
- `scripts/levenshtein_match.py` — fuzzy title match (ratio > 70)
- `scripts/check_cutoff.py` — date comparison with month → day-1 default
- `scripts/dedupe_by_id.py` — dedup verified pool by S2 paperId
- `scripts/bibtex_format.py` — build refs.bib from JSON pool
- `scripts/citation_coverage.py` — ≥90% citation coverage gate
- `scripts/s2_search.py` — **NEW** Semantic Scholar title-search helper; reads `SEMANTIC_SCHOLAR_API_KEY` from env (optional — falls back to unauthenticated)
- `scripts/exa_search.py` — optional Exa Phase 1 backend (reads `EXA_API_KEY` from env)