---
name: arxiv-search
description: |
  Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`).
  **Trigger**: arXiv, arxiv, paper search, metadata retrieval, 文献检索, 论文检索, 拉取元数据, 离线导入.
  **Use when**: you need an initial paper set (Stage C1 of a survey/snapshot) sourced from arXiv, via online retrieval or offline import of an export.
  **Skip if**: a usable `papers/papers_raw.jsonl` already exists, or the data source is not arXiv.
  **Network**: online retrieval requires network access; offline `--input ` does not.
  **Guardrail**: metadata only; do not write long prose in `output/`.
---

# arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (`categories`, `arxiv_id`, `pdf_url`, `published`/`updated`, etc.).
When offline, accept an export and convert it cleanly.

## Input

- `queries.md` (keywords, excludes, time window)

## Outputs

- `papers/papers_raw.jsonl` (JSONL; 1 paper per line)
  - Each record includes at least: `title`, `authors`, `year`, `url`, `abstract`
  - In online mode (arXiv API), records also include additional metadata: `arxiv_id`, `pdf_url`, `categories`, `primary_category`, `published`, `updated`, `doi`, `journal_ref`, `comment`
- Convenience index (optional but generated by the script):
  - `papers/papers_raw.csv`

## Decision: online vs offline

- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline and gain network access later, you can **enrich missing fields** (abstract/authors/categories) via arXiv `id_list` using `--enrich-metadata` or `enrich_metadata: true` in `queries.md`.

## Workflow (heuristic)

1. Read `queries.md` and expand it into concrete query strings.
2. Retrieve results (online) or import an export (offline).
3. Normalize every record to include at least:
   - `title`, `authors` (array), `year`, `url`, `abstract`
4. Keep the set broad at this stage; dedupe/ranking comes next.
5. Apply the time window and `max_results` if specified.

## Quality checklist

- [ ] `papers/papers_raw.jsonl` exists.
- [ ] Each line is valid JSON and contains `title`, `authors`, `year`, `url`.

## Side effects

- Allowed: create/overwrite `papers/papers_raw.jsonl`; append notes to `STATUS.md`.
- Not allowed: write prose sections in `output/` before writing is approved.

## Script

### Quick Start

- `python .codex/skills/arxiv-search/scripts/run.py --help`
- Online: `python .codex/skills/arxiv-search/scripts/run.py --workspace --query "" --max-results 200`
- Offline import: `python .codex/skills/arxiv-search/scripts/run.py --workspace --input `

### All Options

- `--query `: repeatable; multiple queries are unioned
- `--exclude `: repeatable; excludes are applied after retrieval
- `--max-results `: cap on the total number of results retrieved
- `--input `: offline mode (CSV/JSON/JSONL)
- `--enrich-metadata`: best-effort enrichment via arXiv `id_list` (needs network)
- `queries.md` also supports: `keywords`, `exclude`, `time window`, `max_results`, `enrich_metadata`

### Examples

- Online (multi-query + excludes):
  - `python .codex/skills/arxiv-search/scripts/run.py --workspace --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300`
- Fetch a single paper by arXiv ID (direct `id_list` fetch):
  - `python .codex/skills/arxiv-search/scripts/run.py --workspace --query 2509.02547 --max-results 1`
- Offline auto-detect (no flags):
  - Place `papers/import.csv` (or `.json/.jsonl`) under the workspace, then run: `python .codex/skills/arxiv-search/scripts/run.py --workspace `
- Offline import + time window (via `queries.md`):
  - Set `- time window: { from: 2022, to: 2025 }` in `queries.md`, then run the offline import as usual.

## Troubleshooting

### Common Issues

#### Issue: `papers/papers_raw.jsonl` is empty

**Symptom**:

- The script exits with “No results returned …” or the output file is empty.

**Causes**:

- Network is blocked (online mode).
- Queries are too narrow or `queries.md` is empty.

**Solutions**:

- Use offline import: place `papers/import.csv|json|jsonl` in the workspace or pass `--input`.
- Broaden keywords and reduce excludes in `queries.md`.
- Run with an explicit `--query` to sanity-check the parser.

#### Issue: Offline import records are missing fields

**Symptom**:

- Downstream steps fail because records lack `authors/year/abstract/url`.

**Causes**:

- Export columns don’t match the expected fields, or the upstream export is incomplete.

**Solutions**:

- Ensure the export contains at least `title`, `authors`, `year`, `url`, `abstract`.
- If you have network access later, use `--enrich-metadata` to backfill missing fields (best effort).

### Recovery Checklist

- [ ] Confirm `queries.md` has non-empty `keywords` (or pass `--query`).
- [ ] If offline: confirm the workspace has `papers/import.*` and rerun.
- [ ] Spot-check 3–5 JSONL lines: valid JSON + required fields.
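
The spot-check in the last checklist item can be automated. The following is a minimal sketch, not part of the skill's `run.py`: the helper name `check_jsonl_lines` and the constant `REQUIRED_FIELDS` are hypothetical, and the required keys are assumed to be the five listed under Outputs (drop `abstract` if you only want the four keys in the quality checklist).

```python
import json
from pathlib import Path

# Assumption: required keys taken from the "Outputs" section of this document.
REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract")


def check_jsonl_lines(lines):
    """Return (line_number, problem) tuples for an iterable of JSONL text lines."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        # A field counts as missing when absent, null, empty string, or empty list.
        missing = [key for key in REQUIRED_FIELDS if record.get(key) in (None, "", [])]
        if missing:
            problems.append((lineno, "missing fields: " + ", ".join(missing)))
        elif not isinstance(record["authors"], list):
            problems.append((lineno, "authors is not an array"))
    return problems


if __name__ == "__main__":
    path = Path("papers/papers_raw.jsonl")  # relative to the workspace root
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            for lineno, problem in check_jsonl_lines(handle):
                print(f"line {lineno}: {problem}")
    else:
        print(f"{path} not found")
```

Run it from the workspace root. No output means every non-blank line parsed as a JSON object with all assumed fields present and `authors` stored as an array.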