---
name: pubmed-database
description: >-
  Programmatic PubMed access via NCBI E-utilities REST API. Covers
  Boolean/MeSH queries, field-tagged search, endpoints (ESearch,
  EFetch, ESummary, EPost, ELink), history server for batches,
  citation matching, systematic review strategies. Use for biomedical
  literature search or automated pipelines.
license: CC-BY-4.0
---

# PubMed Database

## Overview

PubMed is the U.S. National Library of Medicine's database providing free access to 36M+ biomedical citations from MEDLINE and life sciences journals. This skill covers programmatic access via the E-utilities REST API and advanced search query construction using Boolean operators, MeSH terms, and field tags.

## When to Use

- Searching biomedical literature with structured Boolean/MeSH queries
- Building automated literature monitoring or extraction pipelines
- Conducting systematic literature reviews or meta-analyses
- Retrieving article metadata, abstracts, or citation information by PMID/DOI
- Finding related articles or exploring citation networks programmatically
- Batch processing large sets of PubMed records
- For Python-native PubMed access, prefer BioPython (`Bio.Entrez`) — this skill covers direct REST API usage
- For broader academic search (non-biomedical), use OpenAlex or Semantic Scholar APIs

## Prerequisites

```bash
pip install requests  # HTTP client for E-utilities API
# Optional: pip install biopython  — Bio.Entrez wrapper (higher-level API)
```

**API Rate Limits**:
- Without API key: **3 requests/second**
- With API key: **10 requests/second** (register at https://www.ncbi.nlm.nih.gov/account/)
- Always include `User-Agent` header with contact email

## Quick Start

```python
import requests
import time

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
API_KEY = "YOUR_API_KEY"  # Optional but recommended

def pubmed_request(endpoint, params):
    """Reusable helper for E-utilities API calls with rate limiting."""
    params.setdefault("api_key", API_KEY)
    response = requests.get(f"{BASE_URL}{endpoint}", params=params)
    response.raise_for_status()
    time.sleep(0.1 if API_KEY != "YOUR_API_KEY" else 0.34)  # Rate limit
    return response

# Search → Fetch workflow
search_resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed", "term": "CRISPR[tiab] AND 2024[dp]",
    "retmax": 5, "retmode": "json"
})
pmids = search_resp.json()["esearchresult"]["idlist"]
print(f"Found {len(pmids)} articles: {pmids}")

fetch_resp = pubmed_request("efetch.fcgi", {
    "db": "pubmed", "id": ",".join(pmids),
    "rettype": "abstract", "retmode": "text"
})
print(fetch_resp.text[:500])
```

## Core API

### 1. Search Query Construction

Build PubMed queries using Boolean operators, field tags, and MeSH terms.

```python
# Boolean operators: AND, OR, NOT (must be uppercase)
queries = {
    "basic": "diabetes AND treatment AND 2024[dp]",
    "synonyms": "(metformin OR insulin) AND type 2 diabetes",
    "exclude": "cancer NOT review[pt]",
    "phrase": '"gene expression" AND RNA-seq',
    "field_tags": "smith ja[au] AND cancer[tiab] AND 2023[dp]",
}

# Common field tags:
# [tiab] = title/abstract    [au] = author       [mh] = MeSH term
# [pt]   = publication type   [dp] = date          [ta] = journal
# [1au]  = first author       [lastau] = last author
# [affil] = affiliation       [doi] = DOI          [pmid] = PubMed ID

# Date filtering
date_queries = {
    "single_year": "cancer AND 2024[dp]",
    "range": "cancer AND 2020:2024[dp]",
    "specific": "cancer AND 2024/03/15[dp]",
}
```

```python
# MeSH terms — controlled vocabulary for precise searching
mesh_queries = {
    # [mh] includes narrower terms automatically
    "broad": "diabetes mellitus[mh]",
    # [majr] limits to major topic focus
    "focused": "diabetes mellitus[majr]",
    # MeSH + subheading
    "therapy": "diabetes mellitus, type 2[mh]/drug therapy",
    # Substance name
    "drug": "metformin[nm] AND diabetes mellitus[mh]",
}

# Common MeSH subheadings:
# /diagnosis  /drug therapy  /epidemiology  /etiology
# /prevention & control  /therapy  /genetics
```

### 2. ESearch — Search and Retrieve PMIDs

```python
# Basic search
resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed",
    "term": "CRISPR[tiab] AND genome editing[tiab] AND 2024[dp]",
    "retmax": 100,
    "retmode": "json",
    "sort": "relevance",  # or "pub_date", "first_author"
})
result = resp.json()["esearchresult"]
pmids = result["idlist"]
total = result["count"]
print(f"Total hits: {total}, Retrieved: {len(pmids)}")

# With history server (for large result sets > 500)
resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed",
    "term": "cancer AND 2024[dp]",
    "usehistory": "y",
    "retmode": "json",
})
result = resp.json()["esearchresult"]
webenv = result["webenv"]
query_key = result["querykey"]
total = int(result["count"])
print(f"Stored {total} results on history server")
```

### 3. EFetch — Download Full Records

```python
# Fetch abstracts as text
resp = pubmed_request("efetch.fcgi", {
    "db": "pubmed",
    "id": ",".join(pmids[:10]),
    "rettype": "abstract",
    "retmode": "text",
})
print(resp.text[:500])

# Fetch XML for structured parsing
resp = pubmed_request("efetch.fcgi", {
    "db": "pubmed",
    "id": ",".join(pmids[:10]),
    "rettype": "xml",
    "retmode": "xml",
})

# Fetch from history server (batch processing)
batch_size = 500
for start in range(0, total, batch_size):
    resp = pubmed_request("efetch.fcgi", {
        "db": "pubmed",
        "query_key": query_key,
        "WebEnv": webenv,
        "retstart": start,
        "retmax": batch_size,
        "rettype": "xml",
        "retmode": "xml",
    })
    print(f"Fetched records {start}–{start + batch_size}")
    time.sleep(0.5)  # Extra delay for large batches
```

### 4. ESummary and ELink — Summaries and Related Articles

```python
# ESummary — lightweight document summaries
resp = pubmed_request("esummary.fcgi", {
    "db": "pubmed",
    "id": ",".join(pmids[:5]),
    "retmode": "json",
})
for uid, data in resp.json()["result"].items():
    if uid == "uids":
        continue
    print(f"PMID {uid}: {data.get('title', '')[:80]}")
    print(f"  Journal: {data.get('fulljournalname', '')}, "
          f"Date: {data.get('pubdate', '')}")

# ELink — find related articles
resp = pubmed_request("elink.fcgi", {
    "dbfrom": "pubmed",
    "db": "pubmed",
    "id": pmids[0],
    "cmd": "neighbor",
    "retmode": "json",
})
# Links to other NCBI databases
resp = pubmed_request("elink.fcgi", {
    "dbfrom": "pubmed",
    "db": "pmc",  # PubMed Central
    "id": pmids[0],
    "retmode": "json",
})
```

### 5. Citation Matching and Identifier Lookup

```python
# Search by identifiers
id_queries = {
    "pmid": "12345678[pmid]",
    "doi": "10.1056/NEJMoa123456[doi]",
    "pmc": "PMC123456[pmc]",
}

# ECitMatch — match partial citations to PMIDs
# Format: journal|year|volume|first_page|author_name|key|
citation = "Science|2008|320|5880|1185|key1|"
resp = pubmed_request("ecitmatch.cgi", {
    "db": "pubmed",
    "rettype": "xml",
    "bdata": citation,
})
print(f"Matched PMID: {resp.text.strip()}")

# Batch citation matching
citations = [
    "Nature|2020|580|7801|71|ref1|",
    "Science|2019|366|6463|347|ref2|",
]
resp = pubmed_request("ecitmatch.cgi", {
    "db": "pubmed",
    "rettype": "xml",
    "bdata": "\r".join(citations),
})
```

### 6. Publication Filtering

```python
# Filter by publication type
type_filters = {
    "rcts": "randomized controlled trial[pt]",
    "reviews": "systematic review[pt]",
    "meta": "meta-analysis[pt]",
    "guidelines": "guideline[pt]",
    "case_reports": "case reports[pt]",
}

# Filter by text availability
availability = {
    "free_text": "free full text[sb]",
    "has_abstract": "hasabstract[text]",
}

# Combine filters
query = (
    "diabetes mellitus[mh] AND "
    "randomized controlled trial[pt] AND "
    "2023:2024[dp] AND "
    "free full text[sb] AND "
    "english[la]"
)
resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed", "term": query, "retmax": 100, "retmode": "json"
})
print(f"Free RCTs on diabetes (2023-2024): {resp.json()['esearchresult']['count']}")
```

## Key Concepts

### E-utilities Endpoint Summary

| Endpoint | Purpose | Key Parameters |
|----------|---------|----------------|
| `esearch.fcgi` | Search, return PMIDs | `term`, `retmax`, `sort`, `usehistory` |
| `efetch.fcgi` | Download full records | `id`/`query_key`+`WebEnv`, `rettype`, `retmode` |
| `esummary.fcgi` | Lightweight summaries | `id`, `retmode` |
| `epost.fcgi` | Upload UIDs to server | `id` (comma-separated) |
| `elink.fcgi` | Related articles, cross-DB | `id`, `dbfrom`, `db`, `cmd` |
| `einfo.fcgi` | List databases/fields | `db` (optional) |
| `egquery.fcgi` | Count hits across DBs | `term` |
| `espell.fcgi` | Spelling suggestions | `term` |
| `ecitmatch.cgi` | Match citations to PMIDs | `bdata` |

### History Server Pattern

For result sets >500 articles, use the history server to avoid URL length limits:

1. **ESearch** with `usehistory=y` → returns `WebEnv` + `QueryKey`
2. **EFetch** in batches using `WebEnv` + `QueryKey` + `retstart`/`retmax`
3. **EPost** to upload additional PMIDs to the same `WebEnv`

### Automatic Term Mapping (ATM)

When no field tag is specified, PubMed maps terms through: MeSH Translation Table → Journals Translation Table → Author Index → Full Text. Bypass ATM with explicit field tags or quoted phrases.

### Common MeSH Subheadings

| Subheading | Abbreviation | Use |
|------------|-------------|-----|
| /diagnosis | /DI | Diagnostic methods |
| /drug therapy | /DT | Pharmaceutical treatment |
| /epidemiology | /EP | Disease patterns |
| /etiology | /ET | Disease causes |
| /genetics | /GE | Genetic aspects |
| /prevention & control | /PC | Preventive measures |
| /therapy | /TH | Treatment approaches |

## Common Workflows

### Workflow 1: Systematic Review Search

```python
import requests, time, json

# 1. Define PICO-structured query
query = (
    # Population
    "(diabetes mellitus, type 2[mh] OR type 2 diabetes[tiab]) AND "
    # Intervention + Comparison
    "(metformin[nm] OR lifestyle modification[tiab]) AND "
    # Outcome
    "(glycemic control[tiab] OR HbA1c[tiab]) AND "
    # Study design filter
    "(randomized controlled trial[pt] OR systematic review[pt]) AND "
    # Date range
    "2020:2024[dp]"
)

# 2. Search with history server
resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed", "term": query,
    "usehistory": "y", "retmode": "json"
})
result = resp.json()["esearchresult"]
total = int(result["count"])
print(f"Systematic review hits: {total}")

# 3. Batch fetch all results as XML
import xml.etree.ElementTree as ET
articles = []
for start in range(0, total, 200):
    resp = pubmed_request("efetch.fcgi", {
        "db": "pubmed", "query_key": result["querykey"],
        "WebEnv": result["webenv"],
        "retstart": start, "retmax": 200,
        "rettype": "xml", "retmode": "xml"
    })
    root = ET.fromstring(resp.text)
    for article in root.findall('.//PubmedArticle'):
        pmid = article.findtext('.//PMID')
        title = article.findtext('.//ArticleTitle')
        articles.append({"pmid": pmid, "title": title})
    time.sleep(0.5)

print(f"Retrieved {len(articles)} articles for screening")
```

### Workflow 2: Literature Monitoring Pipeline

```python
import json, datetime

# 1. Construct monitoring query
topic_query = (
    "(CRISPR[tiab] OR gene editing[tiab]) AND "
    "(therapeutics[tiab] OR clinical trial[pt])"
)

# 2. Search recent publications (last 30 days)
today = datetime.date.today()
start_date = today - datetime.timedelta(days=30)
query = f"{topic_query} AND {start_date.strftime('%Y/%m/%d')}:{today.strftime('%Y/%m/%d')}[dp]"

resp = pubmed_request("esearch.fcgi", {
    "db": "pubmed", "term": query,
    "retmax": 100, "retmode": "json", "sort": "pub_date"
})
pmids = resp.json()["esearchresult"]["idlist"]

# 3. Get summaries for new articles
if pmids:
    resp = pubmed_request("esummary.fcgi", {
        "db": "pubmed", "id": ",".join(pmids), "retmode": "json"
    })
    for uid in pmids:
        info = resp.json()["result"].get(uid, {})
        print(f"[{uid}] {info.get('title', 'N/A')[:80]}")
        print(f"  {info.get('fulljournalname', '')} — {info.get('pubdate', '')}")
```

## Key Parameters

| Parameter | Endpoint | Default | Effect |
|-----------|----------|---------|--------|
| `term` | ESearch | Required | Search query with Boolean/field tags |
| `retmax` | ESearch/EFetch | 20 | Max records returned (up to 10,000) |
| `retstart` | ESearch/EFetch | 0 | Offset for pagination |
| `rettype` | EFetch | `full` | Output type: `abstract`, `medline`, `xml`, `uilist` |
| `retmode` | All | `xml` | Output format: `xml`, `json`, `text` |
| `sort` | ESearch | `relevance` | Sort order: `relevance`, `pub_date`, `first_author` |
| `usehistory` | ESearch | `n` | Enable history server: `y` for large result sets |
| `api_key` | All | None | NCBI API key for 10 req/sec (vs 3 without) |
| `cmd` | ELink | `neighbor` | Link type: `neighbor`, `neighbor_score`, `prlinks` |
| `datetype` | ESearch | `pdat` | Date field: `pdat` (publication), `edat` (entrez) |

## Best Practices

1. **Always use an API key** — register at NCBI for 10 req/sec instead of 3
2. **Use history server for >500 results** — avoids URL length limits and enables batch fetching
3. **Include rate limiting** — `time.sleep(0.1)` with API key, `time.sleep(0.34)` without
4. **Cache results locally** — PubMed data changes slowly; cache responses to minimize API calls
5. **Use MeSH terms + free text** — combine `[mh]` and `[tiab]` for comprehensive coverage: `(diabetes mellitus[mh] OR diabetes[tiab])`
6. **Document search strategies** — for systematic reviews, record exact queries, dates, and result counts
7. **Parse XML for structured data** — text output is human-readable but XML preserves field structure for automated extraction

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 429 (Too Many Requests) | Exceeding rate limit | Add `time.sleep()`; use API key for higher limit |
| HTTP 414 (URI Too Long) | Too many PMIDs in URL | Use history server (`usehistory=y`) or EPost |
| Empty result set | Overly restrictive query | Remove filters one at a time; check ATM with EInfo |
| Unexpected MeSH mapping | Automatic Term Mapping | Use explicit field tags: `term[tiab]` instead of bare `term` |
| Missing abstracts | Pre-1975 articles or certain types | Filter: `hasabstract[text]` |
| XML parsing errors | Malformed response | Check `retmode=xml` and `rettype=xml`; handle encoding |
| Stale history server | Session expired (8h inactivity) | Re-run ESearch with `usehistory=y` to get new WebEnv |
| Truncated results | Default `retmax=20` | Set `retmax=100` or higher (max 10,000) |

## Common Recipes

### Recipe: Download Abstracts for a Gene Set

```python
import requests
import time

def fetch_abstracts(gene_list, max_per_gene=5):
    """Retrieve PubMed abstracts for each gene in a list."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    records = []
    for gene in gene_list:
        r = requests.get(f"{base}/esearch.fcgi",
                         params={"db": "pubmed", "term": f"{gene}[gene] AND Homo sapiens[orgn]",
                                 "retmax": max_per_gene, "retmode": "json"})
        ids = r.json()["esearchresult"]["idlist"]
        if ids:
            fetch = requests.get(f"{base}/efetch.fcgi",
                                 params={"db": "pubmed", "id": ",".join(ids), "rettype": "abstract"})
            records.append({"gene": gene, "pmids": ids, "text": fetch.text[:500]})
        time.sleep(0.34)
    return records

results = fetch_abstracts(["BRCA1", "TP53", "EGFR"])
for r in results:
    print(f"{r['gene']}: {r['pmids']}")
```

### Recipe: Track New Publications via Date Filter

```python
import requests
from datetime import date, timedelta

# Find papers published in the last 7 days on a topic
week_ago = (date.today() - timedelta(days=7)).strftime("%Y/%m/%d")
today = date.today().strftime("%Y/%m/%d")

resp = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                    params={"db": "pubmed", "term": "CRISPR AND cancer",
                            "datetype": "pdat", "mindate": week_ago, "maxdate": today,
                            "retmax": 20, "retmode": "json"})
data = resp.json()["esearchresult"]
print(f"New CRISPR+cancer papers this week: {data['count']}")
print("PMIDs:", data["idlist"])
```

## Bundled Resources

- `references/search_syntax.md` — Complete field tag reference, Boolean/wildcard/proximity syntax, automatic term mapping rules, all filter types (age groups, species, languages), and clinical query filters
- `references/common_queries.md` — Ready-to-use query templates organized by domain (disease-specific, population-specific, methodology, drug research, epidemiology) with ~40 example patterns

Not migrated from original: `references/api_reference.md` (298 lines) — endpoint parameter details are consolidated into Core API sections 2-5 and the E-utilities Endpoint Summary table in Key Concepts.

## References

- PubMed Help: https://pubmed.ncbi.nlm.nih.gov/help/
- E-utilities Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- NCBI API Key Registration: https://www.ncbi.nlm.nih.gov/account/

## Related Skills

- **biopython** — higher-level Python wrapper (`Bio.Entrez`) for E-utilities
- **openalex-database** — broader academic literature beyond biomedical
- **literature-review** — systematic review methodology and PRISMA framework