---
name: "gene-database"
description: "NCBI Gene via E-utilities: curated records across 1M+ taxa. Official symbols, aliases, RefSeq IDs, summaries, coordinates, GO, interactions. Use for gene ID resolution and cross-species function queries. For sequences use Ensembl; for expression use geo-database."
license: "CC0-1.0"
---

# NCBI Gene Database

## Overview

NCBI Gene is the authoritative curated database for gene-centric information, covering 1M+ genes across hundreds of thousands of taxa. Each gene record includes the official symbol, aliases, full name, functional summary, genomic coordinates (GRCh38/GRCh37), RefSeq accessions, GO annotations, interaction partners, and links to related databases. Access is free via E-utilities REST API (no API key required, though recommended).

## When to Use

- Resolving gene aliases and synonyms to the current official HGNC/NCBI symbol
- Fetching the NCBI Gene ID (integer) for a gene symbol for downstream API calls (e.g., dbSNP, ClinVar, GEO)
- Retrieving curated gene summaries and function descriptions programmatically
- Pulling RefSeq mRNA (NM_) and protein (NP_) accessions associated with a gene
- Querying GO functional annotations (Biological Process, Molecular Function, Cellular Component)
- Cross-species gene queries using the same Gene ID space
- For expression profiles across conditions use `geo-database`; for variant annotations use `clinvar-database` or `ensembl-database`

## Prerequisites

- **Python packages**: `requests`, `xml.etree.ElementTree` (stdlib), `pandas` (optional)
- **Data requirements**: gene symbols, NCBI Gene IDs, or tax IDs
- **Environment**: internet connection; NCBI email required (set `email` parameter)
- **Rate limits**: 3 req/s unauthenticated; 10 req/s with free NCBI API key

```bash
pip install requests pandas
```

## Quick Start

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def gene_search(query, retmax=5):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

# Find human BRCA1 gene ID
ids = gene_search("BRCA1[sym] AND Homo sapiens[orgn]")
print(f"Gene IDs for BRCA1: {ids}")  # → ['672']
```

## Core API

### Query 1: Search by Symbol, Name, or Function

Use ESearch with field tags for precise queries.

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Exact symbol match for human gene
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "TP53[sym] AND Homo sapiens[orgn] AND alive[prop]"})
ids = r.json()["esearchresult"]["idlist"]
print(f"TP53 Gene ID: {ids}")  # → ['7157']
```

```python
# Search by function keyword
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "CRISPR[title] AND Homo sapiens[orgn]", "retmax": 5})
ids = r.json()["esearchresult"]["idlist"]
print(f"CRISPR-related gene IDs: {ids}")
```

### Query 2: Fetch Gene Summary (JSON/ESummary)

Retrieve key metadata fields for a list of Gene IDs.

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esummary_gene(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

result = esummary_gene(["672", "675", "7157"])  # BRCA1, BRCA2, TP53

for uid in result.get("uids", []):
    g = result[uid]
    print(f"\n{g.get('name')} (ID {uid})")
    print(f"  Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
    print(f"  Chr location    : {g.get('maplocation')}")
    print(f"  Summary (first 100): {g.get('summary', '')[:100]}...")
    print(f"  Aliases: {g.get('otheraliases', 'none')}")
```

### Query 3: Fetch Full Gene Record (XML)

Retrieve the complete gene record in XML for RefSeq accessions, GO terms, and interaction data.

```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_gene_xml(gene_id):
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "gene", "id": gene_id,
                             "rettype": "gene_table", "retmode": "text", "email": EMAIL})
    r.raise_for_status()
    return r.text

# Get gene table (tab-delimited overview)
table = efetch_gene_xml("672")
print(table[:500])
```

```python
# XML for RefSeq accession extraction
r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "672",
                         "rettype": "xml", "retmode": "xml", "email": EMAIL})
root = ET.fromstring(r.text)

# Extract RefSeq mRNA accessions
for ref in root.iter("Gene-commentary"):
    acc = ref.find("Gene-commentary_accession")
    ver = ref.find("Gene-commentary_version")
    typ = ref.find("Gene-commentary_type")
    if acc is not None and acc.text and acc.text.startswith("NM_"):
        print(f"RefSeq mRNA: {acc.text}.{ver.text if ver is not None else ''}")
```

### Query 4: Batch Symbol-to-ID Mapping

Map a list of gene symbols to NCBI Gene IDs efficiently.

```python
import requests, time

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def symbols_to_ids(symbols, organism="Homo sapiens"):
    """Map gene symbols to NCBI Gene IDs. Returns dict {symbol: gene_id}."""
    mapping = {}
    for sym in symbols:
        r = requests.get(f"{BASE}/esearch.fcgi",
                         params={"db": "gene", "email": EMAIL, "retmode": "json",
                                 "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
        ids = r.json()["esearchresult"]["idlist"]
        mapping[sym] = ids[0] if ids else None
        time.sleep(0.1)
    return mapping

genes = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN"]
id_map = symbols_to_ids(genes)
for sym, gid in id_map.items():
    print(f"{sym:10s} → Gene ID {gid}")
```

### Query 5: GO Annotation Retrieval

Parse GO terms from the gene XML record.

```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "7157",
                         "rettype": "xml", "retmode": "xml", "email": EMAIL})
root = ET.fromstring(r.text)

# Extract GO annotations
go_terms = []
for ref in root.iter("Gene-commentary"):
    heading = ref.find("Gene-commentary_heading")
    label = ref.find("Gene-commentary_label")
    if heading is not None and "Gene Ontology" in heading.text:
        if label is not None:
            go_terms.append(label.text)

print(f"TP53 GO terms ({len(go_terms)} found):")
for term in go_terms[:10]:
    print(f"  {term}")
```

### Query 6: Cross-Species Gene Query

Find orthologs across species using NCBI Gene IDs.

```python
import requests, time

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def find_ortholog(human_gene_id, target_organism):
    """Find ortholog Gene ID in target species via NCBI Gene homologs."""
    r = requests.get(f"{BASE}/elink.fcgi",
                     params={"dbfrom": "gene", "db": "gene",
                             "id": human_gene_id, "linkname": "gene_gene_homolog",
                             "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    linksets = r.json().get("linksets", [])
    if not linksets:
        return []
    homolog_ids = [str(l["id"]) for l in linksets[0].get("linksetdbs", [{}])[0].get("links", [])]
    return homolog_ids[:10]

# Human TP53 (7157) homologs
homolog_ids = find_ortholog("7157", "Mus musculus")
print(f"Homolog Gene IDs for TP53: {homolog_ids}")
```

## Key Concepts

### NCBI Gene ID vs. HGNC ID vs. Ensembl ID

NCBI Gene IDs are integers assigned per gene per organism (e.g., human TP53 = 7157). These are distinct from HGNC IDs (e.g., HGNC:11998) and Ensembl IDs (ENSG00000141510). Many downstream NCBI databases (ClinVar, dbSNP, GEO) use NCBI Gene IDs internally.

### `alive[prop]` Filter

NCBI Gene records for discontinued genes have `status=discontinued`. Always add `AND alive[prop]` to symbol queries to exclude retired entries and avoid retrieving stale data.

## Common Workflows

### Workflow 1: Build a Gene Annotation Table

**Goal**: For a list of gene symbols, retrieve Gene ID, official name, chromosomal location, and description.

```python
import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_gene(sym, organism="Homo sapiens"):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "email": EMAIL, "retmode": "json",
                             "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
    ids = r.json()["esearchresult"]["idlist"]
    return ids[0] if ids else None

def batch_summary(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    return r.json()["result"]

symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC", "KRAS", "PTEN"]

# Step 1: Symbol → Gene ID
id_map = {}
for sym in symbols:
    gid = search_gene(sym)
    id_map[sym] = gid
    time.sleep(0.12)

# Step 2: Batch summary
valid_ids = [v for v in id_map.values() if v]
result = batch_summary(valid_ids)

rows = []
sym_to_id = {v: k for k, v in id_map.items() if v}
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({
        "symbol": sym_to_id.get(uid, g.get("name")),
        "gene_id": uid,
        "full_name": g.get("description"),
        "chr_location": g.get("maplocation"),
        "summary": g.get("summary", "")[:200],
    })

df = pd.DataFrame(rows)
df.to_csv("gene_annotations.csv", index=False)
print(df[["symbol", "gene_id", "full_name", "chr_location"]].to_string(index=False))
```

### Workflow 2: Find All Genes in a Pathway Keyword

**Goal**: Retrieve all human genes associated with a biological keyword from the NCBI Gene summary field.

```python
import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

keyword = "DNA mismatch repair"
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "retmax": 50,
                         "term": f"{keyword}[title/abstract] AND Homo sapiens[orgn] AND alive[prop]"})
ids = r.json()["esearchresult"]["idlist"]
print(f"Found {len(ids)} genes related to '{keyword}'")

# Fetch summaries
r2 = requests.post(f"{BASE}/esummary.fcgi",
                   data={"db": "gene", "id": ",".join(ids), "retmode": "json", "email": EMAIL})
result = r2.json()["result"]

rows = []
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({"gene_id": uid, "symbol": g.get("name"),
                 "description": g.get("description"),
                 "location": g.get("maplocation")})

df = pd.DataFrame(rows)
print(df.to_string(index=False))
df.to_csv(f"{keyword.replace(' ', '_')}_genes.csv", index=False)
```

## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `retmax` | ESearch | `20` | `1`–`10000` | Max records returned |
| `retmode` | ESearch/ESummary | `"xml"` | `"json"`, `"xml"` | Response format |
| `rettype` | EFetch | depends | `"xml"`, `"gene_table"`, `"text"` | Record format for full fetch |
| `[sym]` field tag | ESearch | — | gene symbol | Match exact official symbol only |
| `[orgn]` field tag | ESearch | — | organism name or tax ID | Filter by taxonomy |
| `alive[prop]` | ESearch | — | boolean flag | Exclude discontinued gene records |

## Best Practices

1. **Always add `alive[prop]`**: Discontinued gene records remain in the database. Without this filter, symbol searches may return outdated records.

2. **Use Gene IDs in pipelines**: Downstream NCBI databases (ClinVar, dbSNP, GEO) accept Gene IDs; avoid re-searching by symbol in each call.

3. **Use ESummary for metadata, EFetch for full records**: ESummary returns JSON with all common fields; EFetch XML is needed only for RefSeq accessions, GO terms, or interaction links.

4. **Register for a free API key**: Triple your rate limit (3 → 10 req/s) at https://www.ncbi.nlm.nih.gov/account/. Pass as `api_key` parameter.

5. **Batch with ESummary**: POST up to 200 Gene IDs per call to ESummary instead of querying one at a time.

## Common Recipes

### Recipe: Gene ID to RefSeq NM Accession

When to use: Get the canonical mRNA accession for a protein-coding gene.

```python
import requests, re

EMAIL = "your@email.com"
GENE_ID = "672"  # BRCA1

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    params={"db": "gene", "id": GENE_ID, "rettype": "gene_table",
            "retmode": "text", "email": EMAIL}
)
nm_accessions = re.findall(r"NM_\d+\.\d+", r.text)
print(f"RefSeq mRNA accessions: {list(set(nm_accessions))}")
```

### Recipe: Retrieve Gene Aliases

When to use: Resolve legacy/alias symbols to the current official NCBI symbol.

```python
import requests

EMAIL = "your@email.com"

# P53 is an alias for TP53
r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json",
            "term": "p53[sym] AND Homo sapiens[orgn] AND alive[prop]"}
)
ids = r.json()["esearchresult"]["idlist"]

r2 = requests.post("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
                   data={"db": "gene", "id": ",".join(ids[:1]),
                         "retmode": "json", "email": EMAIL})
g = r2.json()["result"][ids[0]]
print(f"Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
print(f"Other aliases   : {g.get('otheraliases')}")
print(f"Designations    : {g.get('otherdesignations', '')[:100]}")
```

### Recipe: List All Genes on a Chromosome

When to use: Get all protein-coding genes on a specific human chromosome.

```python
import requests

EMAIL = "your@email.com"

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json", "retmax": 5,
            "term": "17[chr] AND Homo sapiens[orgn] AND protein coding[filter] AND alive[prop]"}
)
result = r.json()["esearchresult"]
print(f"Protein-coding genes on chr17: {result['count']} total")
print(f"Sample IDs: {result['idlist']}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Empty `idlist` for known symbol | Symbol is an alias, not the official term | Use `[gene name]` or `[title]` field tag; check aliases via ESummary |
| Wrong species returned | Missing organism filter | Add `AND Homo sapiens[orgn]` or target tax ID (`9606[taxid]`) |
| Discontinued gene returned | Missing `alive[prop]` filter | Append `AND alive[prop]` to all symbol queries |
| `HTTP 429` rate limit | Too many requests | Add `time.sleep(0.35)` between calls; use NCBI API key |
| ESummary missing `uids` key | All IDs invalid/absent | Check `id` values are valid integers, not empty strings |
| XML parse error | Malformed XML for rare genes | Wrap ET.fromstring in try/except; retry with `rettype=text` |

## Related Skills

- `geo-database` — Gene Expression Omnibus for retrieving expression data linked to genes found here
- `clinvar-database` — Clinical variant data indexed by NCBI Gene IDs
- `ensembl-database` — Complementary gene annotations with VEP and comparative genomics
- `biopython-molecular-biology` — Biopython Entrez module wraps E-utilities with typed return values

## References

- [NCBI Gene database](https://www.ncbi.nlm.nih.gov/gene/) — Official homepage and search interface
- [E-utilities documentation](https://www.ncbi.nlm.nih.gov/books/NBK25499/) — Complete API reference for ESearch, ESummary, EFetch
- [NCBI Gene field tags](https://www.ncbi.nlm.nih.gov/books/NBK3840/) — Field tag reference for constructing Entrez queries
- [NCBI API Key registration](https://www.ncbi.nlm.nih.gov/account/) — Free registration for 10 req/s rate limit