---
name: kegg-database
description: "Direct REST API access to KEGG (academic use only). Pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion. For Python workflows with multiple databases, prefer bioservices. Use this for direct HTTP/REST work or KEGG-specific control."
---

# KEGG Database

## Overview

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis and molecular interaction networks.

**Important**: KEGG API is made available only for academic use by academic users.

## When to Use This Skill

This skill should be used when querying pathways, genes, compounds, enzymes, diseases, and drugs across multiple organisms using KEGG's REST API.

## Quick Start

The skill provides:

1. Python helper functions (`scripts/kegg_api.py`) for all KEGG REST API operations
2. Comprehensive reference documentation (`references/kegg_reference.md`) with detailed API specifications

When users request KEGG data, determine which operation is needed and use the appropriate function from `scripts/kegg_api.py`.

## Core Operations

### 1. Database Information (`kegg_info`)

Retrieve metadata and statistics about KEGG databases.

**When to use**: Understanding database structure, checking available data, getting release information.

**Usage**:
```python
from scripts.kegg_api import kegg_info

# Get pathway database info
info = kegg_info('pathway')

# Get organism-specific info
hsa_info = kegg_info('hsa')  # Human genome
```

**Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug`

### 2. Listing Entries (`kegg_list`)

List entry identifiers and names from KEGG databases.

**When to use**: Getting all pathways for an organism, listing genes, retrieving compound catalogs.

**Usage**:
```python
from scripts.kegg_api import kegg_list

# List all reference pathways
pathways = kegg_list('pathway')

# List human-specific pathways
hsa_pathways = kegg_list('pathway', 'hsa')

# List specific genes (max 10)
genes = kegg_list('hsa:10458+hsa:10459')
```

**Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli)

### 3. Searching (`kegg_find`)

Search KEGG databases by keywords or molecular properties.

**When to use**: Finding genes by name/description, searching compounds by formula or mass, discovering entries by keywords.

**Usage**:
```python
from scripts.kegg_api import kegg_find

# Keyword search
results = kegg_find('genes', 'p53')
shiga_toxin = kegg_find('genes', 'shiga toxin')

# Chemical formula search (exact match)
compounds = kegg_find('compound', 'C7H10N4O2', 'formula')

# Exact mass range search
drugs = kegg_find('drug', '300-310', 'exact_mass')
```

**Search options**: `formula` (exact match), `exact_mass` (range), `mol_weight` (range)

### 4. Retrieving Entries (`kegg_get`)

Get complete database entries or specific data formats.

**When to use**: Retrieving pathway details, getting gene/protein sequences, downloading pathway maps, accessing compound structures.

**Usage**:
```python
from scripts.kegg_api import kegg_get

# Get pathway entry
pathway = kegg_get('hsa00010')  # Glycolysis pathway

# Get multiple entries (max 10)
genes = kegg_get(['hsa:10458', 'hsa:10459'])

# Get protein sequence (FASTA)
sequence = kegg_get('hsa:10458', 'aaseq')

# Get nucleotide sequence
nt_seq = kegg_get('hsa:10458', 'ntseq')

# Get compound structure
mol_file = kegg_get('cpd:C00002', 'mol')  # ATP in MOL format

# Get pathway as JSON (single entry only)
pathway_json = kegg_get('hsa05130', 'json')

# Get pathway image (single entry only)
pathway_img = kegg_get('hsa05130', 'image')
```

**Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL format), `kcf` (KCF format), `image` (PNG), `kgml` (XML), `json` (pathway JSON)

**Important**: Image, KGML, and JSON formats allow only one entry at a time.

### 5. ID Conversion (`kegg_conv`)

Convert identifiers between KEGG and external databases.

**When to use**: Integrating KEGG data with other databases, mapping gene IDs, converting compound identifiers.

**Usage**:
```python
from scripts.kegg_api import kegg_conv

# Convert all human genes to NCBI Gene IDs
conversions = kegg_conv('ncbi-geneid', 'hsa')

# Convert specific gene
gene_id = kegg_conv('ncbi-geneid', 'hsa:10458')

# Convert to UniProt
uniprot_id = kegg_conv('uniprot', 'hsa:10458')

# Convert compounds to PubChem
pubchem_ids = kegg_conv('pubchem', 'compound')

# Reverse conversion (NCBI Gene ID to KEGG)
kegg_id = kegg_conv('hsa', 'ncbi-geneid')
```

**Supported conversions**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi`

### 6. Cross-Referencing (`kegg_link`)

Find related entries within and between KEGG databases.

**When to use**: Finding pathways containing genes, getting genes in a pathway, mapping genes to KO groups, finding compounds in pathways.

**Usage**:
```python
from scripts.kegg_api import kegg_link

# Find pathways linked to human genes
pathways = kegg_link('pathway', 'hsa')

# Get genes in a specific pathway
genes = kegg_link('genes', 'hsa00010')  # Glycolysis genes

# Find pathways containing a specific gene
gene_pathways = kegg_link('pathway', 'hsa:10458')

# Find compounds in a pathway
compounds = kegg_link('compound', 'hsa00010')

# Map genes to KO (orthology) groups
ko_groups = kegg_link('ko', 'hsa:10458')
```

**Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology)

### 7. Drug-Drug Interactions (`kegg_ddi`)

Check for drug-drug interactions.

**When to use**: Analyzing drug combinations, checking for contraindications, pharmacological research.

**Usage**:
```python
from scripts.kegg_api import kegg_ddi

# Check single drug
interactions = kegg_ddi('D00001')

# Check multiple drugs (max 10)
interactions = kegg_ddi(['D00001', 'D00002', 'D00003'])
```

## Common Analysis Workflows

### Workflow 1: Gene to Pathway Mapping

**Use case**: Finding pathways associated with genes of interest (e.g., for pathway enrichment analysis).

```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get

# Step 1: Find gene ID by name
gene_results = kegg_find('genes', 'p53')

# Step 2: Link gene to pathways
pathways = kegg_link('pathway', 'hsa:7157')  # TP53 gene

# Step 3: Get detailed pathway information
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[1].replace('path:', '')
        pathway_info = kegg_get(pathway_id)
        # Process pathway information
```

### Workflow 2: Pathway Enrichment Context

**Use case**: Getting all genes in organism pathways for enrichment analysis.

```python
from scripts.kegg_api import kegg_list, kegg_link

# Step 1: List all human pathways
pathways = kegg_list('pathway', 'hsa')

# Step 2: For each pathway, get associated genes
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[0]
        genes = kegg_link('genes', pathway_id)
        # Process genes for enrichment analysis
```

### Workflow 3: Compound to Pathway Analysis

**Use case**: Finding metabolic pathways containing compounds of interest.

```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get

# Step 1: Search for compound
compound_results = kegg_find('compound', 'glucose')

# Step 2: Link compound to reactions
reactions = kegg_link('reaction', 'cpd:C00031')  # Glucose

# Step 3: Link reactions to pathways
pathways = kegg_link('pathway', 'rn:R00299')  # Specific reaction

# Step 4: Get pathway details
pathway_info = kegg_get('map00010')  # Glycolysis
```

### Workflow 4: Cross-Database Integration

**Use case**: Integrating KEGG data with UniProt, NCBI, or PubChem databases.

```python
from scripts.kegg_api import kegg_conv, kegg_get

# Step 1: Convert KEGG gene IDs to external database IDs
uniprot_map = kegg_conv('uniprot', 'hsa')
ncbi_map = kegg_conv('ncbi-geneid', 'hsa')

# Step 2: Parse conversion results
for line in uniprot_map.split('\n'):
    if line:
        kegg_id, uniprot_id = line.split('\t')
        # Use external IDs for integration

# Step 3: Get sequences using KEGG
sequence = kegg_get('hsa:10458', 'aaseq')
```

### Workflow 5: Organism-Specific Pathway Analysis

**Use case**: Comparing pathways across different organisms.

```python
from scripts.kegg_api import kegg_list, kegg_get

# Step 1: List pathways for multiple organisms
human_pathways = kegg_list('pathway', 'hsa')
mouse_pathways = kegg_list('pathway', 'mmu')
yeast_pathways = kegg_list('pathway', 'sce')

# Step 2: Get reference pathway for comparison
ref_pathway = kegg_get('map00010')  # Reference glycolysis

# Step 3: Get organism-specific versions
hsa_glycolysis = kegg_get('hsa00010')
mmu_glycolysis = kegg_get('mmu00010')
```

## Pathway Categories

KEGG organizes pathways into seven major categories. Use them when interpreting pathway IDs or recommending pathways to users:

1. **Metabolism** (e.g., `map00010` - Glycolysis, `map00190` - Oxidative phosphorylation)
2. **Genetic Information Processing** (e.g., `map03010` - Ribosome, `map03040` - Spliceosome)
3. **Environmental Information Processing** (e.g., `map04010` - MAPK signaling, `map02010` - ABC transporters)
4. **Cellular Processes** (e.g., `map04140` - Autophagy, `map04210` - Apoptosis)
5. **Organismal Systems** (e.g., `map04610` - Complement cascade, `map04910` - Insulin signaling)
6. **Human Diseases** (e.g., `map05200` - Pathways in cancer, `map05010` - Alzheimer disease)
7. **Drug Development** (chronological and target-based classifications)

Reference `references/kegg_reference.md` for detailed pathway lists and classifications.

## Important Identifiers and Formats

### Pathway IDs
- `map#####` - Reference pathway (generic, not organism-specific)
- `hsa#####` - Human pathway
- `mmu#####` - Mouse pathway

### Gene IDs
- Format: `organism:gene_number` (e.g., `hsa:10458`)

### Compound IDs
- Format: `cpd:C#####` (e.g., `cpd:C00002` for ATP)

### Drug IDs
- Format: `dr:D#####` (e.g., `dr:D00001`)

### Enzyme IDs
- Format: `ec:EC_number` (e.g., `ec:1.1.1.1`)

### KO (KEGG Orthology) IDs
- Format: `ko:K#####` (e.g., `ko:K00001`)

## API Limitations

Respect these constraints when using the KEGG API:

1. **Entry limits**: Maximum 10 entries per operation (except image/kgml/json: 1 entry only)
2. **Academic use**: API is for academic use only; commercial use requires licensing
3. **HTTP status codes**: Check for 200 (success), 400 (bad request), 404 (not found)
4. **Rate limiting**: No explicit limit, but avoid rapid-fire requests

## Detailed Reference

For comprehensive API documentation, database specifications, organism codes, and advanced usage, refer to `references/kegg_reference.md`. This includes:

- Complete list of KEGG databases
- Detailed API operation syntax
- All organism codes
- HTTP status codes and error handling
- Integration with Biopython and R/Bioconductor
- Best practices for API usage

## Troubleshooting

**404 Not Found**: Entry or database doesn't exist; verify IDs and organism codes

**400 Bad Request**: Syntax error in API call; check parameter formatting

**Empty results**: Search term may not match entries; try broader keywords

**Image/KGML errors**: These formats only work with single entries; remove batch processing

## Additional Tools

For interactive pathway visualization and annotation:

- **KEGG Mapper**: https://www.kegg.jp/kegg/mapper/
- **BlastKOALA**: Automated genome annotation
- **GhostKOALA**: Metagenome/metatranscriptome annotation
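
## Example: Batching and Status Handling

The 10-entry limit and the status codes listed under API Limitations and Troubleshooting can be handled roughly as follows. This is a minimal standard-library sketch, not the contents of `scripts/kegg_api.py`: the helper names (`chunk_entries`, `build_url`, `parse_pairs`, `fetch`) are hypothetical, and it assumes the public endpoint `https://rest.kegg.jp`.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

KEGG_BASE = "https://rest.kegg.jp"  # assumed public KEGG REST endpoint
MAX_ENTRIES = 10                    # entry limit stated under API Limitations


def chunk_entries(entries, size=MAX_ENTRIES):
    """Split a list of entry IDs into batches of at most `size` (hypothetical helper)."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_url(operation, *args):
    """Join an operation and its arguments into a REST URL (hypothetical helper)."""
    # Keep ':' and '+' literal: they separate db prefixes and batched entries.
    parts = [quote(str(arg), safe=":+-") for arg in args]
    return "/".join([KEGG_BASE, operation, *parts])


def parse_pairs(text):
    """Parse tab-separated two-column output, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            left, right = line.split("\t")[:2]
            pairs.append((left, right))
    return pairs


def fetch(url, timeout=30):
    """Fetch a URL and translate the documented 400/404 codes into clear errors."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 400:
            raise ValueError(f"Bad request, check parameter formatting: {url}") from err
        if err.code == 404:
            raise LookupError(f"Not found, verify IDs and organism codes: {url}") from err
        raise


# Build batched 'get' URLs for 25 illustrative gene IDs (no network access here).
ids = [f"hsa:{n}" for n in range(1, 26)]
urls = [build_url("get", "+".join(batch)) for batch in chunk_entries(ids)]
```

Only `fetch` touches the network. Calling it once per URL in `urls`, with a short pause between calls, stays within the entry limit and avoids rapid-fire requests.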