---
name: bibliography
description: Classifies and organizes literature by theme, method, and conclusion; use when you need to batch-read a folder of PDF/MD/DOCX/TXT files and output a structured CSV for literature reviews and annotation management.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
# Bibliography
## When to Use
- You are conducting a literature review and need consistent summaries plus structured metadata (theme/method/conclusion) across many papers.
- You have a mixed-format reading folder (`.pdf`, `.md`, `.docx`, `.txt`) and want a single CSV for downstream analysis (e.g., Excel, R, Python).
- You need to organize annotations by **keywords (theme)**, **experimental methods (method)**, and **key conclusions (conclusion)**.
- You want a two-step pipeline: first generate a human-readable summary Markdown, then generate a machine-friendly CSV from that Markdown.
- You need robust handling of PDFs by converting them to Markdown first (via `pdf-extract`) and then using **only Markdown content** for extraction.
## Key Features
- Batch scans an input directory for `.pdf`, `.md`, `.docx`, and `.txt` literature files.
- Converts PDFs to Markdown via `pdf-extract`, then **ignores non-Markdown artifacts** (e.g., image folders).
- Extracts and normalizes, per document:
- Title
- Summary (prefer original abstract)
- Keywords (theme)
- Experimental Methods (method names only)
- Key Conclusions (single sentence)
- Commentary (one-sentence, tactful evaluation)
- Produces exactly **two outputs**:
1. A consolidated **Summary Markdown** saved under `outputs/`
2. A single **CSV** generated from that Summary Markdown
- Enforces UTF-8 output to prevent garbled characters; fills missing fields with `"Not recognized"` instead of leaving blanks.
- Uses the CSV field order and headers defined in `assets/bibliography_template.csv`.
## Dependencies
- `pdf-extract` (version: not specified; required when PDFs are present)
- Input formats supported (no external version constraints specified):
- PDF
- Markdown (`.md`)
- DOCX (`.docx`)
- Plain text (`.txt`)
## Example Usage
### Goal
Read all literature files in a folder, generate a consolidated summary Markdown, then generate a CSV following `assets/bibliography_template.csv`.
### Inputs
- Input directory (example): `./inputs/literature/`
- Output directory: `./outputs/`
- Output CSV path (example): `./outputs/bibliography.csv`
### Expected Outputs (exactly two files)
- `./outputs/bibliography_summary.md`
- `./outputs/bibliography.csv`
### Example Summary Markdown Structure (generated first)
```md
# Bibliography Summary
## Document 1
- Title:
- Summary:
- Keywords: keyword1 | keyword2 | keyword3
- Experimental Methods:
- Key Conclusions:
- Commentary:
## Document 2
...
```
### Example CSV (generated from the Summary Markdown)
The CSV must follow the header order defined in:
- `assets/bibliography_template.csv`
Rules:
- One row per document.
- No empty cells; use `Not recognized` when extraction fails.
- Save as UTF-8.
## Implementation Details
### 1) Input Reading and Normalization
- Traverse the input directory and process files with extensions:
- `.pdf`, `.md`, `.docx`, `.txt`
**PDF handling**
- If PDFs exist, convert them to Markdown using `pdf-extract`.
- Use only the generated `.md` content; ignore image directories or other byproducts.
- Locate `pdf-extract` as follows:
1. First, look for a sibling skill directory containing `SKILL.md` at the same level as this skill’s parent directory.
2. If not found, ask the user to confirm the actual `pdf-extract` path.
**DOCX handling**
- Extract body text while preserving title/paragraph order as much as possible.
**MD/TXT handling**
- Read text directly.
- If garbled characters appear or key fields cannot be recognized, attempt to detect and read using the original encoding (commonly `GB18030` / `GBK`) before extraction.
### 2) Generate the Summary Markdown First (Single Source of Truth)
Before producing the CSV, generate a consolidated Summary Markdown containing, for each document:
- **Title**
- **Summary**
- Prefer the original **Abstract**.
- If no “Abstract” exists, use the closest equivalent section (e.g., “Summary”, “Highlights”, or an “Objective–Method–Result–Conclusion” style segment).
- **Keywords**
- **Experimental Methods**
- **Key Conclusions**
- **Commentary**
- Exactly one sentence.
- Avoid harsh criticism; if the work has low value, use tactful phrasing.
This Summary Markdown must be saved with **UTF-8** encoding and stored under `outputs/`. The CSV must be generated **only** from this Markdown (not directly from raw files).
### 3) Field Extraction Rules (Theme / Method / Conclusion)
- **Keywords (theme)**
- Prefer the original keywords from the document.
- Separate multiple keywords with `|`.
- If no keywords are found, generate **3–5** keyword phrases based on the abstract and append:
- `(generated based on abstract)`
- **Experimental Methods (method)**
- Output method names only (no long descriptions).
- **Key Conclusions (conclusion)**
- One sentence that covers all main points.
### 4) CSV Output Constraints
- Output exactly **one** CSV file at the end.
- CSV field order and headers must match `assets/bibliography_template.csv`.
- Encoding must be **UTF-8** to avoid garbled characters.
- If any field cannot be extracted, write `Not recognized` (never leave empty).
- Only two files may be generated in total:
1. Summary Markdown
2. CSV
- No temporary/intermediate/auxiliary files may be left behind (including extracted text dumps, caches, logs, images, backups). If conversion/extraction requires intermediate artifacts, keep them in memory or ensure all non-target files are deleted before final output.
- Do not use PowerShell to directly write/manipulate CSV/Markdown to avoid encoding/newline issues; always generate and save using UTF-8.
### Reference
- Detailed rules and field descriptions: `references/guide.md`