---
name: scientific-literature-search
description: "Systematic strategies for searching scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers PICO framework for clinical questions, three-tiered search (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH, boolean query construction, and full-text extraction. Use when planning a literature search or choosing a search tier."
license: CC-BY-4.0
---

# Scientific Literature Search

## Overview

Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.

## Key Concepts

### The PICO Framework

For clinical and biomedical questions, structure queries using the PICO framework:

- **P** (Population): Who are you studying? (e.g., "Diabetes Mellitus"[MeSH])
- **I** (Intervention): What treatment or exposure? (e.g., "Metformin"[MeSH])
- **C** (Comparison): What is the alternative? (e.g., placebo, standard care)
- **O** (Outcome): What result are you measuring? (e.g., "Cardiovascular Diseases"[MeSH])

PICO queries can be combined with publication type filters to target specific evidence levels:

```
"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH]
AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])
```

### Three-Tiered Search Strategy

Literature search is most effective when approached in tiers of increasing breadth:

**Tier 1 -- Database-Specific Searches (Most Reliable)**

Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.

- PubMed (via Biopython `Bio.Entrez`): Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
- arXiv (via the `arxiv` package): Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than peer-reviewed journals.
- Google Scholar (via the `scholarly` package): Broadest coverage across all academic disciplines. Note: has aggressive rate limits on automated queries.

Best for: finding specific papers, systematic reviews, clinical evidence, preprints.

**Tier 2 -- AI-Assisted Web Search (Comprehensive)**

Use the Claude API with the `web_search_20250305` server-side tool to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (e.g., via the `duckduckgo-search` package) for protocols, tutorials, and software documentation.

Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.

Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.

**Tier 3 -- Direct Content Extraction (Deep Dive)**

Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using `trafilatura` (HTML article extraction), `pypdf` (PDF text), and the Crossref API (DOI → supplementary file URLs).

Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.
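Code examples throughout this guide call a `query_pubmed` helper. It is not a library function; a minimal sketch of one possible implementation, assuming Biopython's `Bio.Entrez` as introduced under Tier 1, might look like this:

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"  # NCBI requires a contact email

def query_pubmed(term: str, max_papers: int = 20) -> list[str]:
    """Search PubMed and return up to max_papers matching PubMed IDs."""
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_papers)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])
```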
### PubMed Field Tags

PubMed supports field-specific searching to improve precision:

| Tag | Description | Example |
|-----|-------------|---------|
| `[MeSH]` | Medical Subject Heading (controlled vocabulary) | `"Neoplasms"[MeSH]` |
| `[Title]` | Title field only | `"CRISPR"[Title]` |
| `[Title/Abstract]` | Title or abstract | `"gene therapy"[Title/Abstract]` |
| `[Author]` | Author name | `"Zhang F"[Author]` |
| `[Journal]` | Journal name | `"Nature"[Journal]` |
| `[Publication Type]` | Article type filter | `"Review"[Publication Type]` |
| `[Date - Publication]` | Publication date range | `"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]` |
| `[MeSH Major Topic]` | MeSH term as major focus of the article | `"CRISPR-Cas Systems"[MeSH Major Topic]` |

### Boolean Operators

Boolean operators control how search terms combine:

```python
# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")

# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")

# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")
```

Use parentheses to group OR terms together before combining with AND.

### arXiv Subject Categories

arXiv organizes preprints by subject category. Biology-related categories include:

| Category | Description |
|----------|-------------|
| `q-bio.BM` | Biomolecules |
| `q-bio.CB` | Cell Behavior |
| `q-bio.GN` | Genomics |
| `q-bio.MN` | Molecular Networks |
| `q-bio.NC` | Neurons and Cognition |
| `q-bio.QM` | Quantitative Methods |
| `cs.AI` | Artificial Intelligence |
| `cs.LG` | Machine Learning |
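Categories can be combined with keywords directly in the query string using the `cat:` field prefix. A short sketch with the `arxiv` package (the keyword and category here are illustrative):

```python
import arxiv

# Restrict a keyword search to the Genomics category via the cat: prefix
search = arxiv.Search(
    query='cat:q-bio.GN AND all:"variant calling"',
    max_results=10,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
for paper in arxiv.Client().results(search):
    print(paper.published.date(), paper.title)
```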
## Decision Framework

Use this tree to determine which search tier and database to start with:

```
What type of question are you answering?
├── Clinical / biomedical question
│   ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│   ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│   └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│   ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│   ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│   └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│   ├── Current state of a field → Tier 2: AI-assisted web search
│   ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│   └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│   ├── Known paper details → Tier 1: any database by title/author/DOI
│   ├── Methodology or protocol → Tier 3: full-text extraction
│   └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
    ├── Lab protocol → Tier 2: web search for protocols.io, etc.
    └── Validated reagents → Tier 2: AI-assisted web search
```

| Scenario | Recommended Tier and Database | Rationale |
|----------|-------------------------------|-----------|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Need full-text access to methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |

## Best Practices

1. **Use controlled vocabulary (MeSH) for PubMed searches**: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.

   ```python
   # Free text misses synonyms
   query_pubmed("heart attack treatment")

   # MeSH captures all synonyms
   query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
   ```

2. **Include synonyms and alternative terms with OR**: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.

   ```python
   query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
   ```

3. **Use phrase searching for multi-word concepts**: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.

   ```python
   query_pubmed('"single cell RNA sequencing" AND methods')
   ```

4. **Filter by publication type when seeking specific evidence**: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use `[Publication Type]` to target the evidence level you need.

   ```python
   query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
   ```

5. **Start broad, then narrow iteratively**: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.

   ```python
   # Step 1: Broad
   results = query_pubmed("CRISPR base editing iPSC", max_papers=20)

   # Step 2: Add MeSH and specificity
   results = query_pubmed(
       '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
       max_papers=20,
   )

   # Step 3: Filter by date
   results = query_pubmed(
       '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
       max_papers=20,
   )
   ```

6. **Cross-reference multiple databases**: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage (see the deduplication sketch after this list).

7. **Assess result quality systematically**: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
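For practice 6, cross-referencing in code amounts to merging hits from several databases and dropping duplicates. A minimal sketch, assuming each database's hits have already been reduced to plain title strings; the crude title normalization is illustrative, and real deduplication should prefer DOIs where available:

```python
import string

def normalize(title: str) -> str:
    """Crude title key: lowercase with punctuation stripped."""
    return title.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def merge_results(*title_lists: list[str]) -> list[str]:
    """Merge title lists from several databases, keeping first occurrences."""
    seen, merged = set(), []
    for titles in title_lists:
        for title in titles:
            key = normalize(title)
            if key not in seen:
                seen.add(key)
                merged.append(title)
    return merged

# Illustrative inputs; in practice these come from PubMed, arXiv, and Scholar hits
pubmed_titles = ["Base editing in human iPSCs."]
arxiv_titles = ["Base Editing in Human iPSCs", "A transformer model for variant effects"]
print(merge_results(pubmed_titles, arxiv_titles))
```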
## Common Pitfalls

1. **Overly long and specific queries**: Packing too many terms into a single query causes missed results because all terms must match simultaneously.
   - *How to avoid*: Limit queries to core concepts (3-5 terms). Run separate searches for sub-topics and combine results manually.

   ```python
   # Too specific -- misses relevant papers
   query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")

   # Better -- core concepts only
   query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
   ```

2. **Relying on a single database**: PubMed has a biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.
   - *How to avoid*: Always search at least two databases. For computational biology, combine PubMed and arXiv. For cross-disciplinary topics, include Google Scholar.

3. **Ignoring publication dates**: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.
   - *How to avoid*: Check publication dates in all results. For methods papers, prefer the last 3-5 years. For foundational concepts, older papers are acceptable, but verify with recent reviews.

4. **Skipping title and abstract review before deep-diving**: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.
   - *How to avoid*: Always screen titles and abstracts first. Only extract full text (Tier 3) for papers that pass screening.

5. **Using NOT operators too aggressively**: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.
   - *How to avoid*: Use NOT sparingly. Prefer adding positive terms to narrow results rather than excluding terms. When you must use NOT, verify that excluded results are genuinely irrelevant.

6. **Ignoring Google Scholar rate limits**: Google Scholar aggressively rate-limits automated queries, which can block further searches.
   - *How to avoid*: Use Google Scholar sparingly. Add delays between requests. Prefer PubMed or arXiv for bulk searching and reserve Google Scholar for cross-disciplinary checks.

7. **Not documenting the search strategy**: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.
   - *How to avoid*: Record your search terms, databases queried, date ranges, and number of results at each stage. This is essential for systematic reviews and good practice for all searches (see the logging sketch after this list).
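For pitfall 7, the log can be as simple as one CSV row per executed query. A minimal sketch; the file name and columns are illustrative, and `query_pubmed` is the helper sketched earlier:

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("search_log.csv")

def log_search(database: str, query: str, n_results: int) -> None:
    """Append one row per executed search: date, database, exact query, hit count."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "database", "query", "n_results"])
        writer.writerow([date.today().isoformat(), database, query, n_results])

term = '"CRISPR-Cas Systems"[MeSH] AND "base editing"'
ids = query_pubmed(term)
log_search("PubMed", term, len(ids))
```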
## Workflow

1. **Step 1: Define the research question**
   - Identify the main concept, population/model, intervention/method, desired outcome, and time frame
   - For clinical questions, map to the PICO framework
   - Example: "Find recent papers on CRISPR base editing efficiency in human iPSCs" decomposes to: main concept = CRISPR base editing, model = human iPSCs, outcome = efficiency, time frame = last 3 years

2. **Step 2: Construct and execute database queries (Tier 1)**
   - Start with PubMed for biomedical topics, arXiv for computational topics
   - Begin with a broad query using 2-3 core terms
   - Refine with MeSH terms, field tags, date filters, and publication type filters

   ```python
   from Bio import Entrez
   import arxiv
   from scholarly import scholarly

   Entrez.email = "your.email@example.com"  # NCBI requires a contact email

   # PubMed: biomedical literature
   handle = Entrez.esearch(
       db="pubmed",
       term='"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
       retmax=20,
   )
   pubmed_ids = Entrez.read(handle)["IdList"]
   handle.close()

   # arXiv: computational biology preprints
   search = arxiv.Search(query="protein structure prediction", max_results=10)
   arxiv_results = list(arxiv.Client().results(search))

   # Google Scholar: broad cross-disciplinary coverage
   scholar_results = scholarly.search_pubs("single cell RNA sequencing analysis methods")
   ```

3. **Step 3: Supplement with AI-assisted search (Tier 2)**
   - Use AI-assisted web search for landscape overviews and recent developments
   - Use general web search for protocols, tutorials, and documentation

   ```python
   from anthropic import Anthropic

   client = Anthropic()
   response = client.messages.create(
       model="claude-opus-4-7",
       max_tokens=4096,
       tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
       messages=[{
           "role": "user",
           "content": "What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
       }],
   )
   print(response.content)
   ```

4. **Step 4: Evaluate and filter results**
   - Screen titles and abstracts for relevance (see the abstract-fetch sketch after this workflow)
   - Prioritize by recency, journal quality, citation count, and study design
   - For clinical evidence, prioritize RCTs, systematic reviews, and meta-analyses
   - For methods, prioritize protocol papers and method comparisons
   - Decision point: If too many results, add more specific terms or filters. If too few, broaden terms and add synonyms.

5. **Step 5: Deep dive into key papers (Tier 3)**
   - Extract full text from high-priority papers
   - Download supplementary materials for data and protocols
   - Check reference lists for additional relevant papers

   ```python
   import io
   import os
   from pathlib import Path
   from urllib.parse import urlparse

   import requests
   import trafilatura
   from pypdf import PdfReader

   # Extract article content from URL (clean main text, drops nav/ads)
   downloaded = trafilatura.fetch_url("https://www.nature.com/articles/nature12373")
   article_text = trafilatura.extract(downloaded)

   # Extract text from a PDF
   pdf_bytes = requests.get("https://arxiv.org/pdf/1706.03762.pdf", timeout=30).content
   reader = PdfReader(io.BytesIO(pdf_bytes))
   pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

   # Download supplementary files via Crossref DOI metadata
   doi = "10.1038/nature12373"
   meta = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30).json()
   out_dir = Path("./supplementary_materials")
   out_dir.mkdir(exist_ok=True)
   for link in meta.get("message", {}).get("link", []):
       url = link.get("URL")
       if not url:
           continue
       fname = os.path.basename(urlparse(url).path) or "supplement.bin"
       (out_dir / fname).write_bytes(requests.get(url, timeout=60).content)
   ```

6. **Step 6: Document and iterate**
   - Record all search terms, databases, filters, and result counts
   - If gaps remain, revisit Steps 2-3 with refined queries
   - For systematic reviews, follow PRISMA guidelines for reporting
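Screening in Step 4 needs titles and abstracts, while `Entrez.esearch` returns only IDs. A minimal sketch of fetching the screening text with `Entrez.efetch` in plain-text abstract format (IDs here come from Step 2):

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

def fetch_abstracts(pubmed_ids: list[str]) -> str:
    """Retrieve plain-text titles and abstracts for a list of PubMed IDs."""
    handle = Entrez.efetch(
        db="pubmed",
        id=",".join(pubmed_ids),
        rettype="abstract",
        retmode="text",
    )
    text = handle.read()
    handle.close()
    return text

print(fetch_abstracts(pubmed_ids[:5]))  # pubmed_ids from Step 2
```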
## Common Search Scenarios

The following scenarios illustrate how to combine the three tiers for typical research questions.

### Finding Methods and Protocols

Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.

```python
from Bio import Entrez
from duckduckgo_search import DDGS

Entrez.email = "your.email@example.com"

# Search for methodology papers in PubMed
handle = Entrez.esearch(
    db="pubmed",
    term='"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
    retmax=10,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()

# Check web for step-by-step protocols
web_hits = DDGS().text("Western blot protocol for membrane proteins", max_results=5)
```

### Understanding Disease Mechanisms

Begin with review articles for a broad overview, then drill into specific mechanistic studies.

```python
# Find review articles first for an overview
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
    max_papers=10,
)

# Then find specific mechanistic studies
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND ("Amyloid beta-Peptides"[MeSH] OR tau) AND mechanism',
    max_papers=20,
)
```

### Finding Drug and Treatment Information

Use publication type filters to separate clinical trial evidence from systematic reviews.

```python
# Clinical trials for a specific drug-condition pair
results = query_pubmed(
    '"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
    max_papers=20,
)

# Systematic reviews and meta-analyses
results = query_pubmed(
    '"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
    max_papers=10,
)
```

### Tracking Latest Developments

Combine AI-assisted search for synthesis with database searches for recent indexed publications.

```python
from anthropic import Anthropic
from Bio import Entrez

client = Anthropic()
Entrez.email = "your.email@example.com"

# AI-assisted synthesis of recent advances (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{
        "role": "user",
        "content": "What are the most significant advances in CAR-T cell therapy in 2024?",
    }],
)

# Supplement with recent PubMed results
handle = Entrez.esearch(
    db="pubmed",
    term='"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
    retmax=20,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()
```

### Finding Specific Reagents and Materials

Use AI-assisted search for validated reagent recommendations, supplemented by general web search.

```python
from anthropic import Anthropic
from duckduckgo_search import DDGS

client = Anthropic()

# Search for validated reagents (Claude API + web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 2}],
    messages=[{
        "role": "user",
        "content": "validated antibodies for Western blot detection of p53 protein",
    }],
)

# Search supplier databases
supplier_hits = DDGS().text("p53 antibody Western blot validated", max_results=5)
```

### Comparative Analysis Across Methods

Use AI-assisted search for synthesized comparisons of techniques or tools.
```python
from anthropic import Anthropic

client = Anthropic()

# Compare approaches with AI synthesis (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
    messages=[{
        "role": "user",
        "content": "Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
    }],
)
print(response.content)
```

## Quality Assessment Checklist

When evaluating search results, apply these criteria:

- **Source reliability**: Is the paper from a peer-reviewed journal?
- **Author credentials**: Are the authors established experts in the field?
- **Recency**: Is the information current enough for your purpose?
- **Study design**: Is the design appropriate for the question (e.g., RCT for efficacy, cohort for risk)?
- **Sample size**: Is it adequate for the conclusions drawn?
- **Reproducibility**: Are methods described clearly enough to replicate?
- **Conflicts of interest**: Are any conflicts declared?
- **Citation count**: Has the paper been well-cited by subsequent work?

## Further Reading

- [PubMed Help](https://pubmed.ncbi.nlm.nih.gov/help/) -- Official guide to PubMed search syntax, field tags, filters, and advanced features
- [arXiv Help Pages](https://info.arxiv.org/help/index.html) -- Documentation on arXiv search, subject categories, and submission process
- [MeSH Browser](https://meshb.nlm.nih.gov/) -- NLM tool for browsing and searching the Medical Subject Headings controlled vocabulary
- [PRISMA Statement](http://www.prisma-statement.org/) -- Guidelines for transparent reporting of systematic reviews and meta-analyses
- [Cochrane Handbook for Systematic Reviews](https://training.cochrane.org/handbook) -- Gold-standard methodology for systematic literature reviews

## Related Skills

- `pubmed-database` -- Direct PubMed API access for programmatic literature retrieval
- `scientific-manuscript-writing` -- Structuring literature review sections within manuscripts
- `research-question-formulation` -- Frameworks for defining answerable research questions