--- name: kegg-database description: "KEGG REST API (academic only). Pathways, genes, compounds, enzymes, diseases, drugs via 7 ops (info/list/find/get/conv/link/ddi). ID conversion (NCBI/UniProt/PubChem). Use bioservices for multi-DB Python." license: Non-academic use of KEGG requires a commercial license --- # KEGG Database — Biological Pathway & Molecular Network Queries ## Overview KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis, molecular interaction networks, and cross-database ID conversion. Access is via a direct REST API with no authentication — all operations use simple HTTP GET requests returning tab-delimited text. ## When to Use - Mapping genes to biological pathways (e.g., "which pathways involve TP53?") - Retrieving metabolic pathway details, gene lists, or compound structures - Converting identifiers between KEGG, NCBI Gene, UniProt, and PubChem - Checking drug-drug interactions from KEGG's pharmacological database - Building pathway enrichment context (all genes per pathway for an organism) - Cross-referencing compounds, reactions, enzymes, and pathways - For **Python-native multi-database queries** (KEGG + UniProt + Ensembl in one script), prefer `bioservices` instead - For **pathway visualization**, use KEGG Mapper (https://www.kegg.jp/kegg/mapper/) directly ## Prerequisites ```bash pip install requests ``` **API constraints**: - **Academic use only** — commercial use requires a separate KEGG license - **Max 10 entries** per `get`/`list`/`conv`/`link`/`ddi` call (image/kgml/json: 1 entry only) - **No explicit rate limit**, but add `time.sleep(0.5)` between batch requests to avoid server-side throttling - Base URL: `https://rest.kegg.jp/` ## Quick Start ```python import requests import time BASE = "https://rest.kegg.jp" def kegg_get(operation, *args): """Generic KEGG REST API caller.""" url = f"{BASE}/{operation}/{'/'.join(args)}" resp = requests.get(url) resp.raise_for_status() return resp.text # Find pathways linked to human gene TP53 pathways = kegg_get("link", "pathway", "hsa:7157") print(pathways[:200]) # hsa:7157 path:hsa04010 # hsa:7157 path:hsa04110 # ... # Get pathway details detail = kegg_get("get", "hsa04110") print(detail[:300]) ``` ## Core API ### 1. Database Information — `kegg_info` Retrieve metadata and statistics about KEGG databases. ```python import requests BASE = "https://rest.kegg.jp" # Database-level info info = requests.get(f"{BASE}/info/pathway").text print(info[:200]) # pathway Pathway # Release 112.0, Dec 2025 # Kanehisa Laboratories # ... # Organism-level info hsa_info = requests.get(f"{BASE}/info/hsa").text print(hsa_info[:200]) ``` **Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug` ### 2. Listing Entries — `kegg_list` List entry identifiers and names from any KEGG database. ```python import requests BASE = "https://rest.kegg.jp" # All human pathways hsa_pathways = requests.get(f"{BASE}/list/pathway/hsa").text for line in hsa_pathways.strip().split("\n")[:5]: pathway_id, name = line.split("\t") print(f"{pathway_id}: {name}") # path:hsa00010: Glycolysis / Gluconeogenesis - Homo sapiens (human) # ... # Specific entries (max 10, joined with +) genes = requests.get(f"{BASE}/list/hsa:10458+hsa:10459").text print(genes) ``` **Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli) ### 3. Keyword Search — `kegg_find` Search databases by keywords or molecular properties. ```python import requests import time BASE = "https://rest.kegg.jp" # Keyword search in genes results = requests.get(f"{BASE}/find/genes/p53").text print(f"Found {len(results.strip().split(chr(10)))} entries") time.sleep(0.5) # Chemical formula search (exact match) compounds = requests.get(f"{BASE}/find/compound/C7H10N4O2/formula").text print(compounds[:200]) time.sleep(0.5) # Molecular weight range search drugs = requests.get(f"{BASE}/find/drug/300-310/exact_mass").text print(drugs[:200]) ``` **Search options**: append `/formula` (exact match), `/exact_mass` (range), `/mol_weight` (range) to compound/drug queries. ### 4. Entry Retrieval — `kegg_get` Retrieve complete database entries or specific data formats. ```python import requests import time BASE = "https://rest.kegg.jp" # Full pathway entry (text format) pathway = requests.get(f"{BASE}/get/hsa00010").text print(pathway[:500]) time.sleep(0.5) # Multiple entries (max 10, joined with +) genes = requests.get(f"{BASE}/get/hsa:10458+hsa:10459").text # Protein sequence (FASTA) fasta = requests.get(f"{BASE}/get/hsa:10458/aaseq").text print(fasta[:200]) time.sleep(0.5) # Compound structure (MOL format) mol = requests.get(f"{BASE}/get/cpd:C00002/mol").text # ATP # Pathway image (PNG, single entry only) img_resp = requests.get(f"{BASE}/get/hsa05130/image") with open("pathway.png", "wb") as f: f.write(img_resp.content) print(f"Saved pathway image: {len(img_resp.content)} bytes") ``` **Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL), `kcf` (KCF), `image` (PNG), `kgml` (XML), `json` (pathway JSON). Image/KGML/JSON accept **one entry only**. ### 5. ID Conversion — `kegg_conv` Convert identifiers between KEGG and external databases. ```python import requests import time BASE = "https://rest.kegg.jp" # KEGG gene → NCBI Gene ID (specific gene) ncbi = requests.get(f"{BASE}/conv/ncbi-geneid/hsa:10458").text print(ncbi.strip()) # hsa:10458 ncbi-geneid:10458 time.sleep(0.5) # KEGG gene → UniProt uniprot = requests.get(f"{BASE}/conv/uniprot/hsa:10458").text print(uniprot.strip()) time.sleep(0.5) # Bulk conversion: all human genes → NCBI Gene IDs all_conv = requests.get(f"{BASE}/conv/ncbi-geneid/hsa").text lines = all_conv.strip().split("\n") print(f"Total conversions: {len(lines)}") # Reverse: NCBI Gene ID → KEGG reverse = requests.get(f"{BASE}/conv/hsa/ncbi-geneid:7157").text print(reverse.strip()) # TP53 ``` **Supported external databases**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi` ### 6. Cross-Referencing — `kegg_link` Find related entries within and between KEGG databases. ```python import requests import time BASE = "https://rest.kegg.jp" # Genes in glycolysis pathway genes = requests.get(f"{BASE}/link/genes/hsa00010").text gene_list = [line.split("\t")[1] for line in genes.strip().split("\n") if line] print(f"Glycolysis genes: {len(gene_list)}") time.sleep(0.5) # Pathways containing a specific gene pathways = requests.get(f"{BASE}/link/pathway/hsa:7157").text # TP53 print(pathways[:300]) time.sleep(0.5) # Compounds in a pathway compounds = requests.get(f"{BASE}/link/compound/hsa00010").text print(f"Compounds in glycolysis: {len(compounds.strip().split(chr(10)))}") # Map genes to KO (orthology) groups ko = requests.get(f"{BASE}/link/ko/hsa:10458").text print(ko.strip()) ``` **Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology) ### 7. Drug-Drug Interactions — `kegg_ddi` Check pharmacological interactions between drugs. ```python import requests BASE = "https://rest.kegg.jp" # Single drug — all known interactions interactions = requests.get(f"{BASE}/ddi/D00001").text print(f"Interactions: {len(interactions.strip().split(chr(10)))}") # Pairwise check (max 10 drugs, joined with +) pair = requests.get(f"{BASE}/ddi/D00001+D00002+D00003").text print(pair[:300]) ``` ## Key Concepts ### Identifier Formats | Type | Format | Example | |------|--------|---------| | Reference pathway | `map#####` | `map00010` (Glycolysis, generic) | | Organism pathway | `{org}#####` | `hsa00010` (Glycolysis, human) | | Gene | `{org}:{number}` | `hsa:7157` (TP53) | | Compound | `cpd:C#####` | `cpd:C00002` (ATP) | | Drug | `dr:D#####` | `dr:D00001` | | Enzyme | `ec:{EC_number}` | `ec:1.1.1.1` | | KO (orthology) | `ko:K#####` | `ko:K00001` | ### Pathway Categories KEGG organizes pathways into seven major categories: 1. **Metabolism** — `map001xx` (Glycolysis, TCA cycle, amino acid metabolism) 2. **Genetic Information Processing** — `map030xx` (Ribosome, Spliceosome, DNA repair) 3. **Environmental Information Processing** — `map040xx` (MAPK signaling, ABC transporters) 4. **Cellular Processes** — `map041xx` (Autophagy, Apoptosis, Cell cycle) 5. **Organismal Systems** — `map046xx` (Immune, Endocrine, Nervous) 6. **Human Diseases** — `map052xx` (Cancer, Neurodegenerative, Infectious) 7. **Drug Development** — Chronological and target-based classifications ## Common Workflows ### Workflow: Gene to Pathway Mapping Find all pathways associated with a gene of interest. ```python import requests import time BASE = "https://rest.kegg.jp" # Step 1: Find gene by keyword results = requests.get(f"{BASE}/find/genes/BRCA1+homo+sapiens").text print("Gene search results:") for line in results.strip().split("\n")[:5]: print(f" {line}") time.sleep(0.5) # Step 2: Get pathways linked to BRCA1 pathways = requests.get(f"{BASE}/link/pathway/hsa:672").text pathway_ids = [line.split("\t")[1].replace("path:", "") for line in pathways.strip().split("\n") if line] print(f"\nBRCA1 is in {len(pathway_ids)} pathways:") time.sleep(0.5) # Step 3: Get pathway names for pid in pathway_ids[:5]: info = requests.get(f"{BASE}/get/{pid}").text # Extract NAME field for line in info.split("\n"): if line.startswith("NAME"): print(f" {pid}: {line.replace('NAME', '').strip()}") break time.sleep(0.5) ``` ### Workflow: Pathway Enrichment Context Build a gene-set collection for all pathways of an organism. ```python import requests import time BASE = "https://rest.kegg.jp" # Step 1: List all human pathways pathways_text = requests.get(f"{BASE}/list/pathway/hsa").text pathways = {} for line in pathways_text.strip().split("\n"): pid, name = line.split("\t", 1) pathways[pid.replace("path:", "")] = name print(f"Total human pathways: {len(pathways)}") time.sleep(0.5) # Step 2: Get genes for each pathway (sample first 3 for demo) gene_sets = {} for pid in list(pathways.keys())[:3]: genes_text = requests.get(f"{BASE}/link/genes/{pid}").text gene_ids = [line.split("\t")[1] for line in genes_text.strip().split("\n") if line] gene_sets[pid] = gene_ids print(f" {pid}: {len(gene_ids)} genes") time.sleep(0.5) # Step 3: Convert to NCBI Gene IDs for enrichment tools # (use kegg_conv for bulk conversion) ``` ### Workflow: Compound-Pathway-Reaction Analysis Trace a compound through metabolic reactions and pathways. ```python import requests import time BASE = "https://rest.kegg.jp" # Step 1: Search for compound results = requests.get(f"{BASE}/find/compound/glucose").text print("Compound search:") for line in results.strip().split("\n")[:3]: print(f" {line}") time.sleep(0.5) # Step 2: Find reactions involving glucose (C00031) reactions = requests.get(f"{BASE}/link/reaction/cpd:C00031").text rxn_ids = [line.split("\t")[1] for line in reactions.strip().split("\n") if line] print(f"\nReactions involving glucose: {len(rxn_ids)}") time.sleep(0.5) # Step 3: Find pathways for a specific reaction pathways = requests.get(f"{BASE}/link/pathway/rn:R00299").text print(f"\nPathways for R00299:") print(pathways[:300]) time.sleep(0.5) # Step 4: Get pathway detail detail = requests.get(f"{BASE}/get/map00010").text print(f"\nGlycolysis pathway detail (first 500 chars):") print(detail[:500]) ``` ### Workflow: Cross-Database ID Integration Map KEGG identifiers to UniProt, NCBI, and PubChem for multi-database workflows. ```python import requests import time BASE = "https://rest.kegg.jp" # Step 1: Convert gene to multiple external IDs gene = "hsa:7157" # TP53 uniprot = requests.get(f"{BASE}/conv/uniprot/{gene}").text.strip() print(f"UniProt: {uniprot}") time.sleep(0.5) ncbi = requests.get(f"{BASE}/conv/ncbi-geneid/{gene}").text.strip() print(f"NCBI Gene: {ncbi}") time.sleep(0.5) # Step 2: Get protein sequence from KEGG fasta = requests.get(f"{BASE}/get/{gene}/aaseq").text print(f"\nProtein sequence (first 200 chars):\n{fasta[:200]}") time.sleep(0.5) # Step 3: Convert compounds to PubChem CIDs cpd_conv = requests.get(f"{BASE}/conv/pubchem/cpd:C00002").text.strip() # ATP print(f"\nATP PubChem: {cpd_conv}") ``` ## Key Parameters | Parameter | Function/Endpoint | Default | Options | Effect | |-----------|-------------------|---------|---------|--------| | `organism` | `list`, `link`, `conv` | None | 3-4 letter code | Filter by organism (e.g., `hsa`, `mmu`) | | `option` | `find` | None | `formula`, `exact_mass`, `mol_weight` | Search mode for compounds/drugs | | `format` | `get` | text | `aaseq`, `ntseq`, `mol`, `kcf`, `image`, `kgml`, `json` | Output format | | `+` separator | `get`, `list`, `ddi` | — | Max 10 entries | Batch query (join IDs with `+`) | | `target_db` | `conv` | — | `ncbi-geneid`, `uniprot`, `pubchem`, `chebi` | External database for ID conversion | | `target_db` | `link` | — | `pathway`, `genes`, `compound`, `ko`, `enzyme` | Related KEGG database | ## Best Practices 1. **Add delays between batch requests**: No explicit rate limit, but `time.sleep(0.5)` between requests prevents throttling and is courteous to the shared academic resource. 2. **Anti-pattern — fetching all entries without filtering**: Use `kegg_list` to enumerate IDs first, then `kegg_get` for specific entries. Avoid downloading entire databases when you need a subset. 3. **Parse tab-delimited output consistently**: All KEGG responses use `\t` as field separator and `\n` as record separator. Always `.strip()` before splitting. 4. **Respect the 10-entry batch limit**: `kegg_get`, `kegg_list`, `kegg_conv`, `kegg_link`, `kegg_ddi` accept max 10 entries (joined with `+`). Image/KGML/JSON formats accept only 1. 5. **Use organism-specific pathway IDs**: `hsa00010` (human glycolysis) returns organism-specific gene mappings; `map00010` (reference) returns generic entries. Always prefer organism-specific when analyzing a known organism. 6. **Cache frequently-used conversions**: Full organism ID conversions (`kegg_conv('ncbi-geneid', 'hsa')`) return large results. Cache locally rather than repeating. ## Common Recipes ### Recipe: Parse KEGG Flat-File Entry ```python def parse_kegg_entry(text): """Parse a KEGG flat-file entry into a dictionary.""" entry = {} current_key = None for line in text.split("\n"): if line.startswith("///"): break if line[:12].strip(): # New field current_key = line[:12].strip() entry[current_key] = line[12:].strip() elif current_key: # Continuation entry[current_key] += "\n" + line[12:].strip() return entry import requests pathway = requests.get("https://rest.kegg.jp/get/hsa00010").text parsed = parse_kegg_entry(pathway) print(f"Name: {parsed.get('NAME', 'N/A')}") print(f"Description: {parsed.get('DESCRIPTION', 'N/A')[:200]}") ``` ### Recipe: Organism Comparison ```python import requests import time BASE = "https://rest.kegg.jp" organisms = {"hsa": "Human", "mmu": "Mouse", "sce": "Yeast"} pathway = "00010" # Glycolysis for org, name in organisms.items(): genes = requests.get(f"{BASE}/link/genes/{org}{pathway}").text count = len([l for l in genes.strip().split("\n") if l]) print(f"{name} ({org}): {count} genes in Glycolysis") time.sleep(0.5) # Human (hsa): 68 genes in Glycolysis # Mouse (mmu): 67 genes in Glycolysis # Yeast (sce): 31 genes in Glycolysis ``` ### Recipe: Build Gene-to-Pathway Mapping Table ```python import requests import time BASE = "https://rest.kegg.jp" # Get all human gene-pathway links links = requests.get(f"{BASE}/link/pathway/hsa").text gene_pathways = {} for line in links.strip().split("\n"): if not line: continue gene, pathway = line.split("\t") gene_pathways.setdefault(gene, []).append(pathway.replace("path:", "")) print(f"Genes with pathway annotations: {len(gene_pathways)}") # Show top genes by pathway count top = sorted(gene_pathways.items(), key=lambda x: -len(x[1]))[:5] for gene, paths in top: print(f" {gene}: {len(paths)} pathways") ``` ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `404 Not Found` | Entry or database doesn't exist | Verify ID format and organism code; use `kegg_list` to check valid IDs | | `400 Bad Request` | Malformed API URL | Check URL path: `/{operation}/{arg1}/{arg2}`; no query params | | Empty response | Search term too specific or no matches | Broaden keywords; try partial matches; check organism code | | Image/KGML returns error | Batch query with image/kgml/json format | These formats accept **one entry only** — remove `+` joins | | `403 Forbidden` | Server-side rate limiting | Add `time.sleep(1)` between requests; reduce batch frequency | | Wrong gene IDs returned | Using reference pathway (`map`) instead of organism-specific | Use organism prefix: `hsa00010` not `map00010` for gene links | | ID conversion returns empty | External DB doesn't cover that entry | Not all KEGG entries have UniProt/NCBI mappings; check with `kegg_list` first | | Response encoding issues | Non-ASCII characters in compound names | Use `resp.encoding = 'utf-8'` or `resp.text` (requests auto-detects) | ## Related Skills - **gget-genomic-databases** — unified Python interface to Ensembl, NCBI, UniProt; use for gene-level queries when KEGG pathway context isn't needed - **biopython-molecular-biology** — BioPython's `Bio.KEGG` module provides an alternative Python API for KEGG parsing - **pubchem-compound-search** — for compound property lookups beyond KEGG's structural data; use `kegg_conv('pubchem', ...)` to bridge IDs ## References - [KEGG REST API documentation](https://www.kegg.jp/kegg/rest/keggapi.html) — official API specification - [KEGG website](https://www.kegg.jp/) — pathway browser, KEGG Mapper, BlastKOALA - [KEGG organism codes](https://www.kegg.jp/kegg/catalog/org_list.html) — full list of 3-4 letter organism codes - Kanehisa, M. et al. (2023) "KEGG for taxonomy-based analysis of pathways and genomes" *Nucleic Acids Research* 51:D483-D489