--- name: "knowledge-paper-extractor" description: "Use when you need to extract metadata (title, authors, abstract) and references from scientific PDF papers or book chapters." metadata: stage: "alpha" source: "MIGRATED" requires: - "knowledge-research-contract" --- # Knowledge Paper Extractor Extracts structured metadata and bibliographic references from scientific PDF documents. ## Goal Given a scientific paper or book chapter in PDF format, produce structured output containing: - **Title**, **authors**, and **abstract** of the document - **All cited references** with structured fields (author, title, DOI, journal, year) ## Core Workflow 1. Receive or identify the target PDF file path. 2. Run the extraction script. 3. Review the generated output files in the output directory. ## Script Integration Extract metadata and references from any scientific PDF: ```bash uv run .agent/skills/knowledge-paper-extractor/scripts/extract_paper.py /path/to/paper.pdf --output-dir ./output ``` ### Output Files The script creates three files in the output directory: | File | Content | |---|---| | `metadata.json` | Title, authors, abstract, and raw PDF metadata | | `references.json` | Structured list of all cited references via `refextract` | | `references.csl.json` | CSL-JSON resolved references via Crossref API | | `summary.md` | Human-readable Markdown summary with DOIs | ### CLI Options ``` uv run scripts/extract_paper.py [--output-dir DIR] ``` - `PDF_PATH` — Path to the scientific PDF (required) - `--output-dir` — Output directory (default: `./output`) ## How It Works - **References**: Extracted via `refextract`, which parses the reference section of scholarly PDFs and returns structured bibliographic data. - **Metadata**: Extracted via `pdfminer.six` (first-page text with layout/font analysis) and `pypdf` (embedded PDF metadata). Title is identified by the largest font on page 1; abstract is located between "Abstract" and the next major section heading. ## Constraints - Input must be a PDF file (not DOCX, HTML, etc.). - Heuristic metadata extraction works best on standard academic paper layouts (two-column or single-column with clear title/abstract sections). - `refextract` requires that the PDF contains a recognizable reference/bibliography section.