---
name: "ensembl-database"
description: "Ensembl REST API for gene/transcript/variant annotations in 300+ species. Gene info by symbol/ID, sequence, cross-refs (HGNC, RefSeq, UniProt), regulatory features. For bulk local use pyensembl; for pathways use kegg-database."
license: "Apache-2.0"
---

# Ensembl Genome Database

## Overview

Ensembl is a comprehensive genome annotation database covering 300+ vertebrate and non-vertebrate species. The Ensembl REST API provides programmatic access to gene models, transcript/protein sequences, variant annotations, cross-references, regulatory features, and comparative genomics without requiring any login or API key.

## When to Use

- Retrieving official gene and transcript annotations (stable IDs, biotype, genomic coordinates) for human or model organism genes
- Converting between gene identifier namespaces (HGNC symbol ↔ Ensembl ID ↔ RefSeq ↔ UniProt)
- Fetching genomic or cDNA/CDS/protein sequences for a gene or transcript
- Looking up variant consequences and functional impact (VEP) for a list of SNPs
- Querying regulatory features (promoters, enhancers, CTCF sites) in a genomic region
- Performing comparative genomics queries (orthologs, paralogs, gene trees) across species
- For local offline access to large genomic annotations, use `pyensembl` instead
- For pathway and metabolic annotations, use `kegg-database` or `reactome-database` instead

## Prerequisites

- **Python packages**: `requests`
- **Data requirements**: gene symbols, Ensembl stable IDs (ENSG…/ENST…/ENSP…), or genomic coordinates
- **Environment**: internet connection required; no API key needed
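- **Rate limits**: about 15 requests/second on average (55,000 per hour); use batch POST endpoints and `expand=1` to minimize calls, and wait for the `Retry-After` interval when the server answers with HTTP 429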

```bash
pip install requests
```
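
Loops over single-ID requests can exceed the rate limit above, so a retry wrapper is often useful. The following is a minimal sketch, not part of any Ensembl client: `retry_delay` and `get_with_retry` are hypothetical helper names, the retry count and backoff values are assumptions, and the caller supplies a `requests.Session`. It waits for the `Retry-After` value when the server sends one and falls back to exponential backoff otherwise.

```python
import time

BASE = "https://rest.ensembl.org"

def retry_delay(headers, attempt, base_delay=1.0, max_delay=60.0):
    """Seconds to wait before the next attempt.

    Uses the Retry-After header when it holds a number; otherwise
    doubles base_delay on each attempt, capped at max_delay.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return min(float(value), max_delay)
        except ValueError:
            pass  # non-numeric header: fall back to backoff
    return min(base_delay * (2 ** attempt), max_delay)

def get_with_retry(session, endpoint, params=None, max_retries=5, sleep=time.sleep):
    """GET an endpoint as JSON, retrying on HTTP 429 and 503.

    `session` is expected to behave like requests.Session (assumption).
    """
    for attempt in range(max_retries + 1):
        r = session.get(
            f"{BASE}{endpoint}",
            headers={"Content-Type": "application/json"},
            params=params,
            timeout=30,
        )
        if r.status_code in (429, 503) and attempt < max_retries:
            sleep(retry_delay(r.headers, attempt))
            continue
        r.raise_for_status()  # raises on the final failed attempt
        return r.json()
```

A call would look like `get_with_retry(requests.Session(), "/lookup/symbol/homo_sapiens/BRCA1", params={"expand": 1})`. Reusing one session also keeps the connection open between requests.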

## Quick Start

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

def ensembl_get(endpoint, params=None):
    r = requests.get(f"{BASE}{endpoint}", headers=HEADERS, params=params)
    r.raise_for_status()
    return r.json()

# Look up human BRCA1
gene = ensembl_get("/lookup/symbol/homo_sapiens/BRCA1", params={"expand": 1})
print(f"ID: {gene['id']}, Chr: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
```

## Core API

### Query 1: Gene Lookup by Symbol or Stable ID

Retrieve gene metadata from a gene symbol or Ensembl stable ID.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# By gene symbol
r = requests.get(
    f"{BASE}/lookup/symbol/homo_sapiens/TP53",
    headers=HEADERS,
    params={"expand": 1}
)
gene = r.json()
print(f"Ensembl ID : {gene['id']}")
print(f"Location   : {gene['seq_region_name']}:{gene['start']}-{gene['end']} ({gene['strand']})")
print(f"Biotype    : {gene['biotype']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
```

```python
# By stable ID (works for genes, transcripts, proteins)
r = requests.get(
    f"{BASE}/lookup/id/ENSG00000141510",
    headers=HEADERS,
    params={"expand": 0}
)
obj = r.json()
print(f"Symbol: {obj.get('display_name')}, Species: {obj.get('species')}")
```

### Query 2: Batch Lookup

Retrieve information for multiple IDs in one call (POST endpoint).

```python
import requests, json

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Batch lookup by symbols
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC"]
r = requests.post(
    f"{BASE}/lookup/symbol/homo_sapiens",
    headers=HEADERS,
    data=json.dumps({"symbols": symbols})
)
results = r.json()
for sym, data in results.items():
    if data:
        print(f"{sym}: {data['id']} ({data['seq_region_name']}:{data['start']}-{data['end']})")
```

### Query 3: Sequence Retrieval

Fetch genomic, cDNA, CDS, or protein sequences.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "text/plain"}

# Protein sequence for canonical transcript
r = requests.get(
    f"{BASE}/sequence/id/ENST00000269305",
    headers=HEADERS,
    params={"type": "protein"}
)
seq = r.text
print(f"Protein sequence ({len(seq)} aa): {seq[:60]}...")
```

```python
# Genomic region sequence
HEADERS_JSON = {"Content-Type": "application/json"}
r = requests.get(
    f"{BASE}/sequence/region/human/17:43044295..43125364",
    headers=HEADERS_JSON,
    params={"coord_system_version": "GRCh38"}
)
result = r.json()
print(f"Retrieved {len(result['seq'])} bp of genomic sequence")
```

### Query 4: Cross-References (ID Mapping)

Map Ensembl IDs to external database identifiers.

```python
import requests
from collections import defaultdict

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# All xrefs for a gene
r = requests.get(
    f"{BASE}/xrefs/id/ENSG00000141510",
    headers=HEADERS
)
r.raise_for_status()
xrefs = r.json()

# Group primary IDs by source database
by_db = defaultdict(list)
for x in xrefs:
    by_db[x["dbname"]].append(x["primary_id"])

# dbname strings are case-sensitive and vary by object type and release;
# print(sorted(by_db)) to see which ones are present for this ID
for db in ["HGNC", "RefSeq_gene_name", "Uniprot_gn", "MIM_GENE"]:
    if db in by_db:
        print(f"{db}: {by_db[db]}")
```

### Query 5: Variant Consequence Annotation (VEP)

Predict the functional consequences of variants via the REST VEP endpoints.

```python
import requests, json

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Annotate a list of hgvs notations
variants = ["17:g.43094692C>T", "13:g.32929387C>T"]
r = requests.post(
    f"{BASE}/vep/human/hgvs",
    headers=HEADERS,
    data=json.dumps({"hgvs_notations": variants})
)
for v in r.json():
    print(f"\nVariant: {v.get('input')}")
    for tc in v.get("transcript_consequences", [])[:2]:
        print(f"  Gene: {tc.get('gene_symbol')}, Impact: {tc.get('impact')}, Consequence: {tc.get('consequence_terms')}")
```

```python
# Annotate by rsID
r = requests.get(
    f"{BASE}/vep/human/id/rs699",
    headers=HEADERS
)
v = r.json()[0]
print(f"rsID rs699 in gene: {v['transcript_consequences'][0]['gene_symbol']}")
print(f"Consequence: {v['transcript_consequences'][0]['consequence_terms']}")
```

### Query 6: Regulatory Features

Query regulatory build features in a genomic region.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Regulatory features in BRCA1 region
r = requests.get(
    f"{BASE}/overlap/region/human/17:43044000-43126000",
    headers=HEADERS,
    params={"feature": "regulatory"}
)
features = r.json()
print(f"Found {len(features)} regulatory features")
for f in features[:5]:
    print(f"  {f.get('feature_type')}: {f.get('start')}-{f.get('end')} ({f.get('description', 'n/a')})")
```

### Query 7: Comparative Genomics (Orthologs / Gene Trees)

Find orthologs and paralogs across species.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Get mouse ortholog for human TP53
r = requests.get(
    f"{BASE}/homology/symbol/human/TP53",
    headers=HEADERS,
    params={"target_species": "mus_musculus", "type": "orthologues"}
)
data = r.json()
for homo in data["data"][0]["homologies"][:3]:
    tgt = homo["target"]
    print(f"Mouse ortholog: {tgt['id']} ({tgt.get('perc_id', 'n/a')}% identity)")
```

## Key Concepts

### Stable IDs and Versioning

Ensembl uses stable IDs with optional version suffixes (e.g., `ENSG00000141510.17`). Genes (`ENSG`), transcripts (`ENST`), proteins (`ENSP`), and exons (`ENSE`) each have their own prefix. IDs are preserved across releases when possible; retired IDs can still be resolved via the archive API.
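
Annotation files often carry versioned IDs, so it helps to split the version off before querying. The sketch below is illustrative only: `parse_stable_id` is a hypothetical helper, and the pattern covers just the four feature prefixes named above (other Ensembl ID families are not handled). It assumes the common layout of `ENS`, an optional species code, a one-letter feature code, 11 digits, and an optional `.version`.

```python
import re

# ENS + optional species code (e.g. MUS) + feature letter + 11 digits + optional .version
STABLE_ID = re.compile(r"^(ENS[A-Z]*?)([GTPE])(\d{11})(?:\.(\d+))?$")
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

def parse_stable_id(stable_id):
    """Split a stable ID into its unversioned form, feature type, and version."""
    m = STABLE_ID.match(stable_id.strip())
    if m is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    prefix, code, digits, version = m.groups()
    return {
        "id": f"{prefix}{code}{digits}",           # unversioned ID
        "type": FEATURE_TYPES[code],               # gene / transcript / protein / exon
        "version": int(version) if version else None,
    }

# Example: parse_stable_id("ENSG00000141510.17")["id"] gives "ENSG00000141510",
# which can then be passed to /lookup/id/ or /archive/id/.
```

Keeping the version as a separate field makes it possible to compare it later against the version reported by the archive endpoint.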

### Assembly Versions

The main server (`rest.ensembl.org`) serves the current human assembly, GRCh38. The legacy GRCh37 assembly is served by a separate host, `grch37.rest.ensembl.org`. Always confirm which assembly your coordinates belong to before making region-based queries.
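
One way to keep assembly and host in sync is to derive the URL from the assembly name. This is a small sketch under stated assumptions: `server_for`, `region_string`, and `overlap_url` are hypothetical helpers, and only the two human assemblies mentioned above are mapped.

```python
SERVERS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}

def server_for(assembly):
    """Return the REST host for a human assembly name."""
    try:
        return SERVERS[assembly]
    except KeyError:
        raise ValueError(f"unsupported assembly: {assembly!r}") from None

def region_string(chrom, start, end):
    """Format an Ensembl-style region such as 17:43044295-43125364.

    Strips a UCSC-style 'chr' prefix; it does not rename chrM to MT.
    """
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return f"{chrom}:{start}-{end}"

def overlap_url(chrom, start, end, assembly="GRCh38", species="human"):
    """Build the /overlap/region URL on the host that matches the assembly."""
    return f"{server_for(assembly)}/overlap/region/{species}/{region_string(chrom, start, end)}"
```

Passing the assembly explicitly at every call site makes a mismatch visible in code review instead of surfacing as unexpected genes in the results.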

## Common Workflows

### Workflow 1: Gene-to-Protein Information Pipeline

**Goal**: Retrieve all key annotations for a gene list — coordinates, transcripts, xrefs, and canonical protein sequence.

```python
import requests, json

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

def batch_lookup(symbols, species="homo_sapiens"):
    """POST a list of gene symbols; returns {symbol: gene record}."""
    r = requests.post(
        f"{BASE}/lookup/symbol/{species}",
        headers=HEADERS,
        params={"expand": 1},  # sent as a query parameter, as in the GET examples
        data=json.dumps({"symbols": symbols})
    )
    r.raise_for_status()
    return r.json()

def canonical_transcript(gene_data):
    """Return the canonical protein-coding transcript record (a dict), or None.

    Prefers a transcript flagged `is_canonical` when the response carries
    that flag; otherwise falls back to the longest translation as a proxy.
    """
    transcripts = gene_data.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    if not coding:
        return None
    flagged = [t for t in coding if t.get("is_canonical") == 1]
    if flagged:
        return flagged[0]
    return max(coding, key=lambda t: t.get("Translation", {}).get("length", 0))

genes = ["BRCA1", "BRCA2", "TP53"]
lookup = batch_lookup(genes)  # one request covers the whole list

for sym in genes:
    g = lookup.get(sym)
    if not g:
        print(f"{sym}: not found")
        continue
    canon = canonical_transcript(g)
    print(f"\n{sym} ({g['id']})")
    print(f"  Location: {g['seq_region_name']}:{g['start']}-{g['end']}")
    if canon:
        prot_len = canon.get("Translation", {}).get("length", "n/a")
        print(f"  Canonical transcript: {canon['id']} ({prot_len} aa)")
```

### Workflow 2: Variant Annotation Pipeline

**Goal**: Annotate a list of variants in HGVS notation with gene, consequence, and impact.

```python
import requests, json, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

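# Input: list of HGVS notations. Coordinates and reference alleles must match
# the assembly served by the host you query (GRCh38 on rest.ensembl.org).
hgvs_list = [
    "17:g.43094692C>T",
    "17:g.43063873A>G",
    "13:g.32929387C>T",
]

# Annotate one batch per POST; split longer lists into batches (e.g. 200) first
def vep_batch(hgvs_batch):
    r = requests.post(
        f"{BASE}/vep/human/hgvs",
        headers=HEADERS,
        params={"canonical": 1},  # ask VEP to flag canonical transcripts
        data=json.dumps({"hgvs_notations": hgvs_batch})
    )
    r.raise_for_status()
    return r.json()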

records = []
for ann in vep_batch(hgvs_list):
    for tc in ann.get("transcript_consequences", []):
        if tc.get("canonical") == 1:
            records.append({
                "variant": ann["input"],
                "gene": tc.get("gene_symbol"),
                "consequence": ",".join(tc.get("consequence_terms", [])),
                "impact": tc.get("impact"),
                "biotype": tc.get("biotype"),
            })

df = pd.DataFrame(records)
print(df.to_string(index=False))
df.to_csv("vep_results.csv", index=False)
print(f"\nSaved {len(df)} variant annotations → vep_results.csv")
```
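
The workflow above sends the whole list in a single POST, which only works for short lists. A minimal sketch of the batching step follows. `chunked` and `annotate_all` are hypothetical helpers, the batch size of 200 is taken from the comment in the workflow and is an assumption rather than a documented limit, and `annotate_batch` stands for any function with the shape of `vep_batch`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]

def annotate_all(hgvs_list, annotate_batch, batch_size=200):
    """Run `annotate_batch` over the list in batches and concatenate the results.

    `annotate_batch` takes a list of notations and returns a list of annotations.
    """
    results = []
    for batch in chunked(list(hgvs_list), batch_size):
        results.extend(annotate_batch(batch))
    return results

# Example: annotations = annotate_all(hgvs_list, vep_batch)
```

Passing the annotator in as an argument keeps the batching logic testable without network access.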

## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `expand` | Lookup | `0` | `0` or `1` | Include nested transcripts/translations |
| `type` | Sequence | `"genomic"` | `"genomic"`, `"cdna"`, `"cds"`, `"protein"` | Sequence type to return |
| `target_species` | Homology | `None` | Species name or alias (use `target_taxon` for an NCBI taxon ID) | Filter homologs to target species |
| `feature` | Overlap | required | `"gene"`, `"transcript"`, `"regulatory"`, `"variation"` | Feature type to retrieve |
| `coord_system_version` | Region | `"GRCh38"` | Assembly name; GRCh37 coordinates need the `grch37.rest.ensembl.org` host | Genome assembly |
| `Content-Type` | All | via header | `"application/json"`, `"text/plain"` | Response format (header, or `content-type` query parameter) |

## Best Practices

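1. **Use batch endpoints**: POST `/lookup/symbol/{species}` accepts up to 1000 symbols per request; POST `/vep/human/hgvs` has a lower per-request cap, so check the maximum POST size documented for each endpoint. Single-ID GET requests in a loop will hit rate limits quickly.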

2. **Pin assembly version**: For region-based queries always specify `coord_system_version=GRCh38` (or use `grch37.rest.ensembl.org` for legacy coordinates) to avoid silent mismatch errors.

3. **Cache responses**: Gene metadata rarely changes between Ensembl releases; cache results to disk (`joblib.Memory`) to avoid redundant API calls during development.
   ```python
   from joblib import Memory
   mem = Memory("cache/", verbose=0)
   cached_lookup = mem.cache(batch_lookup)
   ```

4. **Use `expand=0` for metadata**: When you only need gene coordinates and biotype (not transcript details), keep `expand=0` for smaller payloads and faster responses.

5. **Check canonical flag in VEP**: VEP returns consequences for all overlapping transcripts. Request the flag with the `canonical=1` parameter, then filter on `tc.get("canonical") == 1` to keep the canonical-transcript consequence for each overlapping gene.
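
When a response carries no canonical flag, a fallback is needed to reduce each variant to one consequence. The sketch below is one possible policy, not VEP's own ranking: `pick_consequence` is a hypothetical helper that prefers canonical transcripts and otherwise takes the entry with the most severe `impact`, using the four VEP impact levels.

```python
# VEP impact levels, most severe first
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}

def pick_consequence(annotation):
    """Return one transcript consequence for a VEP annotation, or None.

    Canonical transcripts are preferred; ties and the no-canonical case
    are resolved by impact severity. Unknown impacts sort last.
    """
    tcs = annotation.get("transcript_consequences", [])
    if not tcs:
        return None  # e.g. intergenic variants have no transcript consequences
    canonical = [tc for tc in tcs if tc.get("canonical") == 1]
    pool = canonical or tcs
    return min(pool, key=lambda tc: IMPACT_RANK.get(tc.get("impact"), len(IMPACT_RANK)))
```

Returning `None` for variants without transcript consequences lets the caller decide whether to drop them or record them as intergenic.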

## Common Recipes

### Recipe: Symbol → Ensembl ID Mapping Table

When to use: Build a lookup table from gene symbols to Ensembl IDs for downstream analysis.

```python
import requests, json, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

symbols = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN", "AKT1", "MYC", "RB1"]
r = requests.post(
    f"{BASE}/lookup/symbol/homo_sapiens",
    headers=HEADERS,
    data=json.dumps({"symbols": symbols})
)
data = r.json()
rows = [{"symbol": s, "ensembl_id": d["id"] if d else None,
         "chrom": d["seq_region_name"] if d else None} for s, d in data.items()]
df = pd.DataFrame(rows)
df.to_csv("symbol_to_ensembl.csv", index=False)
print(df.to_string(index=False))
```

### Recipe: Region Gene Overlap

When to use: Find all genes overlapping a genomic interval (e.g., a GWAS locus).

```python
import requests, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

chrom, start, end = "17", 43044295, 43125364
r = requests.get(
    f"{BASE}/overlap/region/human/{chrom}:{start}-{end}",
    headers=HEADERS,
    params={"feature": "gene", "biotype": "protein_coding"}
)
genes = r.json()
df = pd.DataFrame([{
    "id": g["id"], "name": g.get("external_name"),
    "start": g["start"], "end": g["end"], "strand": g["strand"]
} for g in genes])
print(df.to_string(index=False))
print(f"\n{len(df)} protein-coding genes in region")
```

### Recipe: Species List

When to use: Check which species are available in Ensembl before querying.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

r = requests.get(f"{BASE}/info/species", headers=HEADERS)
species_list = r.json()["species"]
print(f"Total species: {len(species_list)}")
vertebrates = [s for s in species_list if s.get("division") == "EnsemblVertebrates"]
print(f"Vertebrates: {len(vertebrates)}")
for s in vertebrates[:5]:
    print(f"  {s['common_name']} ({s['name']}): {s['assembly']}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `HTTP 429 Too Many Requests` | Exceeding ~15 req/s rate limit | Wait for the number of seconds given in the `Retry-After` response header, then retry; add `time.sleep(0.1)` between requests; use batch POST endpoints |
| `HTTP 400 Bad Request` on VEP | Malformed HGVS notation | Verify format: `chr:g.posREF>ALT` (e.g., `17:g.43094692C>T`) |
| `Gene not found` | Gene symbol not in Ensembl | Try alternative symbol; check species name (use `homo_sapiens` not `human` for symbols) |
| Region query returns wrong genes | Assembly mismatch | Set `coord_system_version=GRCh38` or use `grch37.rest.ensembl.org` |
| Old ID not resolving | Retired Ensembl ID | Query `GET /archive/id/{id}` to get current mapping |
| `HTTP 503 Service Unavailable` | Server maintenance | Retry after a few minutes; check Ensembl status at status.ensembl.org |

## Related Skills

- `gget-genomic-databases` — CLI/Python wrapper covering Ensembl + 20 other databases; use for quick lookups without raw API code
- `biopython-molecular-biology` — Biopython's `Entrez` module for NCBI databases (alternative for RefSeq/GenBank queries)
- `kegg-database` — Pathway/metabolic annotations for the same gene set
- `reactome-database` — Pathway enrichment and hierarchy queries

## References

- [Ensembl REST API documentation](https://rest.ensembl.org) — Interactive API explorer and endpoint reference
- [Ensembl Help & Documentation](https://www.ensembl.org/info/docs/api/rest/rest_access.html) — REST API overview
- [Ensembl stable IDs guide](https://www.ensembl.org/info/genome/stable_ids/index.html) — ID versioning policy
- [VEP documentation](https://www.ensembl.org/info/docs/tools/vep/index.html) — Variant Effect Predictor full reference