---
name: "metabolomics-workbench-database"
description: "Query Metabolomics Workbench REST API (4,200+ NIH studies) for metabolite ID, study discovery, RefMet standardization, m/z precursor searches, MetStat filtering, gene/protein annotations. Use hmdb-database for local XML; pubchem-compound-search for compounds."
license: "Unknown"
---

# Metabolomics Workbench Database: REST API Access

## Overview

Query the Metabolomics Workbench (MW) REST API to access 4,200+ metabolomics studies hosted at UCSD under NIH Common Fund sponsorship. The API provides seven query contexts: compound/metabolite lookups, study metadata and experimental data retrieval, RefMet standardized nomenclature, MetStat study filtering by species/disease/analysis type, m/z precursor ion searches for compound identification, and gene and protein annotation (two separate contexts) from the Metabolomics Gene/Protein (MGP) database.

## When to Use

- Searching for metabolite structures, identifiers, or chemical properties by PubChem CID, KEGG ID, InChI key, or formula
- Discovering metabolomics studies by species, disease, analysis type, or polarity
- Standardizing metabolite names to RefMet nomenclature for cross-study comparison
- Identifying unknown compounds from mass spectrometry m/z values with adduct type matching
- Retrieving experimental metabolomics data (concentrations, abundances) from published studies
- Querying gene or protein annotations linked to metabolomics pathways
- Downloading study data in mwTab format for local analysis
- For local metabolite database parsing (220K+ entries, NMR/MS spectra), use `hmdb-database` instead
- For live compound property searches (110M+ compounds), use `pubchem-compound-search` instead

## Prerequisites

- **Python packages**: `requests`, `pandas`
- **No API key required**: MW REST API is publicly accessible without authentication
- **Rate limits**: MW does not enforce strict rate limits for reasonable use.
  For bulk queries (100+ requests), add a 0.5-1 s delay between requests
- **Base URL**: `https://www.metabolomicsworkbench.org/rest`

```bash
pip install requests pandas
```

## Quick Start

```python
import requests
import time

BASE = "https://www.metabolomicsworkbench.org/rest"

def mw_query(context, input_item, input_value, output_item="all", fmt="json"):
    """Query Metabolomics Workbench REST API.

    URL pattern: /context/input_item/input_value/output_item/output_format
    """
    url = f"{BASE}/{context}/{input_item}/{input_value}/{output_item}/{fmt}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json() if fmt == "json" else resp.text

# Example: look up glucose by name
result = mw_query("compound", "name", "glucose")
print(result)
# {'regno': '...', 'formula': 'C6H12O6', 'exactmass': '180.063388...', ...}
```

## Core API

### Module 1: Compound Queries

Search metabolite records by various identifiers. Returns chemical properties, structure info, and cross-references.

```python
# Search by PubChem CID
compound = mw_query("compound", "pubchem_cid", "5793")
print(compound.get("name"), compound.get("formula"))
# Glucose C6H12O6

# Search by KEGG compound ID
compound = mw_query("compound", "kegg_id", "C00031")
print(compound.get("name"), compound.get("exactmass"))
# D-Glucose 180.06338810

# Search by InChI key
compound = mw_query("compound", "inchi_key", "WQZGKKKJIJFFOK-GASJEMHNSA-N")

# Search by molecular formula
matches = mw_query("compound", "formula", "C6H12O6")
# Returns all compounds with this formula

# Search by registry number (MW internal ID)
compound = mw_query("compound", "regno", "11")

# Available input_items: regno, formula, name, pubchem_cid, kegg_id, inchi_key, lm_id, hmdb_id, bmrb_id
```

### Module 2: Study Access

Retrieve study metadata, experimental factors, and data from deposited metabolomics studies.
```python
# Get study summary by study ID
study = mw_query("study", "study_id", "ST000001", "summary")
print(study.get("study_title"), study.get("institute"))

# Get study metadata (species, analysis type, etc.)
study_meta = mw_query("study", "study_id", "ST000001")
print(study_meta.get("species"), study_meta.get("analysis_type"))

# Get study factors (experimental conditions)
factors = mw_query("study", "study_id", "ST000001", "factors")
# Returns factor names and levels for the study

# Get study data (metabolite measurements)
data = mw_query("study", "study_id", "ST000001", "data")
# Returns concentration/abundance values per sample

# Get analysis details
analysis = mw_query("study", "study_id", "ST000001", "analysis")
print(analysis.get("analysis_type"), analysis.get("instrument_name"))

# Download mwTab file (tab-delimited study format)
mwtab_text = mw_query("study", "study_id", "ST000001", "mwtab", fmt="txt")
# Returns full mwTab formatted text
```

### Module 3: RefMet Nomenclature

Standardize metabolite names using RefMet (Reference Metabolomics) classification. RefMet provides a hierarchical nomenclature: super_class > main_class > sub_class.

```python
# Standardize a metabolite name to RefMet
refmet = mw_query("refmet", "name", "Palmitic acid")
print(refmet.get("refmet_name"), refmet.get("super_class"))
# Palmitic acid Fatty Acyls

# Get all metabolites in a main_class
fatty_acids = mw_query("refmet", "main_class", "Fatty acids")
print(f"Found {len(fatty_acids) if isinstance(fatty_acids, list) else 1} entries")

# Get all metabolites in a sub_class
lg_fa = mw_query("refmet", "sub_class", "Long-chain fatty acids")

# Match a free-text name against RefMet (input_item "match")
refmet_match = mw_query("refmet", "match", "Palmitic acid")
print(refmet_match.get("formula"), refmet_match.get("exactmass"))
# C16H32O2 256.24023
```

### Module 4: MetStat Filtering

Filter studies using semicolon-delimited filter strings.
MetStat queries enable discovery of studies by analysis type, polarity, species, and disease.

```python
# MetStat filter format: "analysis_type:value;polarity:value;species:value;disease:value"
# Each field is optional; separate multiple filters with semicolons

# Find all LC-MS studies in human
results = mw_query("metstat", "filter", "analysis_type:LC-MS;species:Human")
print(f"Found {len(results) if isinstance(results, list) else 1} studies")

# Filter by disease
diabetes_studies = mw_query("metstat", "filter", "disease:Diabetes;species:Human")

# Filter by polarity
pos_studies = mw_query("metstat", "filter", "analysis_type:LC-MS;polarity:positive")

# Combined multi-field filter
filtered = mw_query("metstat", "filter",
                    "analysis_type:LC-MS;polarity:positive;species:Human;disease:Cancer")

# Available filter fields and common values:
# analysis_type: LC-MS, GC-MS, CE-MS, NMR
# polarity: positive, negative
# species: Human, Mouse, Rat, etc.
# disease: Cancer, Diabetes, Obesity, etc.
```

### Module 5: m/z Search (Moverz)

Search for metabolites by precursor ion m/z value. Essential for compound identification from mass spectrometry data.

```python
# Search by m/z with adduct type and tolerance
# Format: moverz/mz_value/tolerance/ion_type/...
# The values below are glucose (neutral mass 180.063) observed as different adducts
mz_results = mw_query("moverz", "mz", "181.071/0.005/M+H")
# Returns candidate compounds matching the m/z within tolerance

# Negative mode search
mz_neg = mw_query("moverz", "mz", "179.056/0.005/M-H")

# Sodium adduct search
mz_na = mw_query("moverz", "mz", "203.053/0.01/M+Na")

# Search with wider tolerance for low-resolution instruments
mz_wide = mw_query("moverz", "mz", "181.071/0.5/M+H")

print("Candidates:", mz_results)
# Returns: compound name, formula, exact mass, delta (mass error)
```

### Module 6: Gene Information

Query gene annotations from the Metabolomics Gene/Protein (MGP) database.
```python
# Search by gene symbol
gene = mw_query("gene", "gene_symbol", "HMGCR")
print(gene.get("gene_name"), gene.get("taxonomy"))
# 3-hydroxy-3-methylglutaryl-CoA reductase Homo sapiens

# Search by gene ID
gene_by_id = mw_query("gene", "gene_id", "3156")
print(gene_by_id.get("gene_symbol"))

# Search by taxonomy
human_genes = mw_query("gene", "taxonomy", "Homo sapiens")
```

### Module 7: Protein Data

Retrieve protein sequence and annotation data from the MGP database.

```python
# Search by UniProt ID
protein = mw_query("protein", "uniprot_id", "P04035")
print(protein.get("protein_name"), protein.get("gene_symbol"))

# Search by gene symbol for protein info
protein_by_gene = mw_query("protein", "gene_symbol", "HMGCR")
print(protein_by_gene.get("sequence")[:50] if protein_by_gene.get("sequence") else "No seq")

# Search by MGP ID
protein_mgp = mw_query("protein", "mgp_id", "MGP000001")
```

## Key Concepts

### API URL Structure

All MW REST API endpoints follow the same pattern:

```
https://www.metabolomicsworkbench.org/rest/{context}/{input_item}/{input_value}/{output_item}/{format}
```

| Component | Description | Example Values |
|-----------|-------------|----------------|
| `context` | Query domain | `compound`, `study`, `refmet`, `metstat`, `moverz`, `gene`, `protein` |
| `input_item` | Search field | `name`, `pubchem_cid`, `study_id`, `mz`, `gene_symbol` |
| `input_value` | Search term | `glucose`, `5793`, `ST000001` |
| `output_item` | Data to return | `all`, `summary`, `factors`, `data`, `analysis`, `mwtab` |
| `format` | Response format | `json`, `txt` |

### RefMet Classification Hierarchy

RefMet standardizes metabolite naming with three classification levels:

| Super Class | Main Class (examples) | Sub Class (examples) |
|-------------|----------------------|---------------------|
| Fatty Acyls | Fatty acids, Eicosanoids | Short/Medium/Long/Very long-chain FA |
| Glycerolipids | Monoradylglycerols, Diradylglycerols | Monoacylglycerols, Diacylglycerols |
| Glycerophospholipids | Glycerophosphocholines, -ethanolamines | Lysophosphatidylcholines |
| Sphingolipids | Sphingoid bases, Ceramides | Ceramide phosphocholines |
| Steroids | Cholesterol esters, Bile acids | C18/C19/C21 steroids |
| Prenol Lipids | Isoprenoids, Quinones | Ubiquinones, Terpenes |
| Organic acids | Amino acids, Carboxylic acids | Alpha amino acids |
| Nucleosides | Purine nucleosides, Pyrimidine | Adenosine, Cytidine |
| Carbohydrates | Monosaccharides, Disaccharides | Hexoses, Pentoses |

### Ion Adduct Types (Moverz)

Common adduct types for m/z searches (mass spectrometry). For singly charged adducts, the observed m/z is the neutral mass M plus the listed shift:

| Adduct | Mode | Mass Shift (Da) | Use When |
|--------|------|-----------------|----------|
| M+H | Positive | +1.0073 | Default positive mode |
| M+Na | Positive | +22.9892 | Sodium adducts (common in ESI) |
| M+K | Positive | +38.9632 | Potassium adducts |
| M+NH4 | Positive | +18.0338 | Ammonium adducts (lipids) |
| M-H | Negative | -1.0073 | Default negative mode |
| M-H-H2O | Negative | -19.0178 | Dehydrated anions |
| M+Cl | Negative | +34.9694 | Chloride adducts |
| M+FA-H | Negative | +44.9982 | Formate adducts (LC-MS) |
| M+2H | Positive | m/z = M/2 + 1.0073 (z=2) | Doubly charged ions |
| M-2H | Negative | m/z = M/2 - 1.0073 (z=2) | Doubly charged negative |

### MetStat Filter Syntax

MetStat uses semicolon-delimited key:value pairs. All fields are optional:

```
analysis_type:{value};polarity:{value};species:{value};disease:{value}
```

- Omit any field to leave it unfiltered
- Values are case-sensitive (use exact values: `Human` not `human`)
- Combine as many fields as needed

## Common Workflows

### Workflow 1: Metabolite Identification Pipeline

Standardize a metabolite name, find related studies, and retrieve experimental data.
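The workflows in this section branch on whether a response is a dict or a list. A small normalizer can keep that handling in one place. The sketch below is illustrative only: `as_records` is a hypothetical helper, and the row-keyed dict shape (`Row1`, `Row2`, ...) is an assumption about how some multi-row responses may arrive, so check it against real responses before relying on it.

```python
def as_records(response):
    """Return a list of record dicts for the response shapes handled here.

    Hypothetical helper. Handled shapes:
      - list of dicts                      -> the dict items, as a new list
      - dict whose values are all dicts    -> its values (assumed row-keyed shape)
      - any other non-empty dict           -> wrapped in a one-element list
      - empty dict, empty list, or None    -> empty list
    """
    if not response:
        return []
    if isinstance(response, list):
        return [r for r in response if isinstance(r, dict)]
    if isinstance(response, dict):
        values = list(response.values())
        if values and all(isinstance(v, dict) for v in values):
            return values
        return [response]
    return []


print(as_records({"name": "Glucose", "formula": "C6H12O6"}))
print(as_records({"Row1": {"name": "A"}, "Row2": {"name": "B"}}))
print(as_records({}))
```

With a helper like this, `pd.DataFrame(as_records(study_data))` would cover both the single-record and multi-record cases in the steps that follow.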
```python
import pandas as pd

# Step 1: Standardize name via RefMet
refmet = mw_query("refmet", "name", "Palmitic acid")
std_name = refmet.get("refmet_name", "Palmitic acid")
formula = refmet.get("formula")
print(f"Standardized: {std_name}, Formula: {formula}")

# Step 2: Search compound database for cross-references
compound = mw_query("compound", "name", std_name)
print(f"PubChem CID: {compound.get('pubchem_cid')}, "
      f"KEGG: {compound.get('kegg_id')}, HMDB: {compound.get('hmdb_id')}")

# Step 3: List human studies via MetStat as a starting point.
# This filter does not select by metabolite; narrow the list client-side
# to the studies that report the metabolite of interest.
studies = mw_query("metstat", "filter", "species:Human")
if isinstance(studies, list):
    print(f"Found {len(studies)} human metabolomics studies")

# Step 4: Get data from a specific study
study_data = mw_query("study", "study_id", "ST000001", "data")
if isinstance(study_data, list):
    df = pd.DataFrame(study_data)
    print(f"Data shape: {df.shape}")
    print(df.head())
```

### Workflow 2: MS Compound Identification

Identify unknown compounds from mass spectrometry m/z values.
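The m/z value to search depends on the adduct: for a singly charged ion, the observed m/z is the neutral monoisotopic mass plus the adduct's mass shift. The sketch below works through that arithmetic with shifts taken from the Ion Adduct Types table. `ADDUCT_SHIFTS`, `expected_mz`, and `ppm_error` are hypothetical names used for illustration and are not part of the MW API.

```python
# Mass shifts (Da) for a few singly charged adducts,
# copied from the Ion Adduct Types table in this document.
ADDUCT_SHIFTS = {
    "M+H": 1.0073,
    "M+Na": 22.9892,
    "M+NH4": 18.0338,
    "M-H": -1.0073,
}


def expected_mz(neutral_mass, adduct):
    """Expected m/z of a singly charged adduct: neutral mass + shift."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]


def ppm_error(observed_mz, theoretical_mz):
    """Mass error in parts per million relative to the theoretical m/z."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


glucose = 180.0634  # monoisotopic mass of C6H12O6, rounded to 4 decimals
for adduct in ("M+H", "M+Na", "M-H"):
    print(f"{adduct}: {expected_mz(glucose, adduct):.4f}")
# M+H: 181.0707
# M+Na: 203.0526
# M-H: 179.0561
```

Expressing the search tolerance in ppm via `ppm_error` can make results comparable across the mass range, since a fixed 0.005 Da window is relatively tighter at high m/z than at low m/z.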
```python
# Step 1: Search positive mode m/z
target_mz = "256.240"
tolerance = "0.01"
candidates_pos = mw_query("moverz", "mz", f"{target_mz}/{tolerance}/M+H")
time.sleep(0.5)

# Step 2: Also check sodium adduct
candidates_na = mw_query("moverz", "mz", f"{target_mz}/{tolerance}/M+Na")
time.sleep(0.5)

# Normalize to a list so a single-hit dict is handled like a multi-hit list
if not isinstance(candidates_pos, list):
    candidates_pos = [candidates_pos] if candidates_pos else []

# Step 3: For each candidate, get full compound info
for candidate in candidates_pos[:5]:  # Top 5 candidates
    name = candidate.get("name", "Unknown")
    delta = candidate.get("delta", "N/A")
    print(f"Candidate: {name}, Mass error: {delta}")

    # Get detailed compound info
    detail = mw_query("compound", "name", name)
    print(f"  Formula: {detail.get('formula')}, "
          f"KEGG: {detail.get('kegg_id')}")
    time.sleep(0.5)

# Step 4: Standardize top candidate via RefMet
if candidates_pos:
    top_name = candidates_pos[0].get("name", "")
    refmet = mw_query("refmet", "name", top_name)
    print(f"RefMet class: {refmet.get('super_class')} > "
          f"{refmet.get('main_class')} > {refmet.get('sub_class')}")
```

### Workflow 3: Disease Metabolomics Exploration

Discover metabolomics studies for a disease and extract experimental data.
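Loops over many studies, like the one in this workflow, can benefit from the throttling and caching suggested under Best Practices. One possible shape for that is sketched below. `make_polite_fetcher` is a hypothetical wrapper, and `fetch` stands for any callable that takes the same positional arguments as `mw_query`; a stub is used here so the example runs without network access.

```python
import time


def make_polite_fetcher(fetch, delay=0.5):
    """Wrap `fetch` with an in-memory cache and a pause before each real call.

    Hypothetical helper: `fetch` is any callable taking positional string
    arguments (for example, the mw_query function from Quick Start).
    Keyword arguments are not handled in this sketch.
    """
    cache = {}

    def polite_fetch(*args):
        if args in cache:
            return cache[args]   # repeated query: no request, no delay
        time.sleep(delay)        # throttle only real requests
        cache[args] = fetch(*args)
        return cache[args]

    return polite_fetch


# Stub standing in for mw_query so the sketch runs offline.
calls = []


def fake_fetch(*args):
    calls.append(args)
    return {"study_id": args[2]}


fetch = make_polite_fetcher(fake_fetch, delay=0.01)
for sid in ["ST000001", "ST000002", "ST000001"]:
    print(fetch("study", "study_id", sid, "summary"))
print(f"Real requests made: {len(calls)}")  # Real requests made: 2
```

In practice, `fetch = make_polite_fetcher(mw_query)` could stand in for direct `mw_query` calls inside the loop, which would also remove the need for explicit `time.sleep` calls there.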
```python
import pandas as pd

# Step 1: Filter studies by disease and analysis type
diabetes_lc = mw_query("metstat", "filter",
                       "disease:Diabetes;analysis_type:LC-MS;species:Human")
print(f"Found {len(diabetes_lc) if isinstance(diabetes_lc, list) else 1} studies")

# Step 2: Get study details for top results
if isinstance(diabetes_lc, list):
    for study_entry in diabetes_lc[:3]:
        sid = study_entry.get("study_id", "")
        if sid:
            summary = mw_query("study", "study_id", sid, "summary")
            print(f"\n{sid}: {summary.get('study_title', 'N/A')}")
            print(f"  Institute: {summary.get('institute', 'N/A')}")
            time.sleep(0.5)

# Step 3: Get experimental factors and data from one study
target_study = "ST000001"
factors = mw_query("study", "study_id", target_study, "factors")
print(f"\nFactors for {target_study}:", factors)

data = mw_query("study", "study_id", target_study, "data")
if isinstance(data, list):
    df = pd.DataFrame(data)
    print(f"Dataset: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.describe())
```

## Key Parameters

| Function/Endpoint | Parameter | Description | Example Values |
|-------------------|-----------|-------------|----------------|
| `compound` | `input_item` | Search field | `name`, `pubchem_cid`, `kegg_id`, `formula`, `inchi_key`, `hmdb_id`, `lm_id`, `bmrb_id`, `regno` |
| `study` | `output_item` | Data to retrieve | `summary`, `factors`, `data`, `analysis`, `mwtab`, `all` |
| `refmet` | `input_item` | Classification level | `name`, `main_class`, `sub_class`, `super_class`, `match` |
| `metstat` | filter string | Semicolon-delimited | `analysis_type:LC-MS;species:Human;disease:Cancer` |
| `moverz` | `mz` value | m/z / tolerance / adduct | `181.071/0.005/M+H` |
| `gene` | `input_item` | Gene search field | `gene_symbol`, `gene_id`, `taxonomy` |
| `protein` | `input_item` | Protein search field | `uniprot_id`, `gene_symbol`, `mgp_id` |
| All | `fmt` | Response format | `json` (default), `txt` |

## Best Practices

1. **Use RefMet for standardization**: Always standardize metabolite names through RefMet before cross-study comparisons. Different studies may use synonyms for the same compound
2. **Add delays for bulk queries**: Insert `time.sleep(0.5)` between requests when making more than 100 requests to avoid overloading the server
3. **Check response types**: The API may return a dict (single result) or list (multiple results). Always handle both: `results if isinstance(results, list) else [results]`
4. **Use specific output_items**: Request `summary`, `factors`, or `data` individually rather than `all` to reduce response size and parse time
5. **Validate m/z tolerance**: Use tight tolerance (0.005 Da) for high-resolution instruments (Orbitrap, TOF) and wider tolerance (0.5 Da) for low-resolution instruments
6. **MetStat values are case-sensitive**: Use exact values (`Human` not `human`, `LC-MS` not `lc-ms`). Check available values via the MW web interface if unsure
7. **Cache compound lookups**: Compound data changes infrequently.
   Cache results locally to avoid redundant API calls during iterative analysis

## Common Recipes

### Recipe: Batch Metabolite Standardization via RefMet

```python
import pandas as pd

metabolite_names = ["palmitic acid", "oleic acid", "stearic acid",
                    "linoleic acid", "arachidonic acid"]

standardized = []
for name in metabolite_names:
    result = mw_query("refmet", "name", name)
    standardized.append({
        "original": name,
        "refmet_name": result.get("refmet_name", name),
        "super_class": result.get("super_class", ""),
        "main_class": result.get("main_class", ""),
        "formula": result.get("formula", "")
    })
    time.sleep(0.5)

df_std = pd.DataFrame(standardized)
print(df_std.to_string(index=False))
```

### Recipe: Cross-Database ID Mapping

```python
# Map a compound across PubChem, KEGG, HMDB
compound = mw_query("compound", "name", "L-Alanine")
id_map = {
    "MW_regno": compound.get("regno"),
    "PubChem_CID": compound.get("pubchem_cid"),
    "KEGG_ID": compound.get("kegg_id"),
    "HMDB_ID": compound.get("hmdb_id"),
    "LipidMaps_ID": compound.get("lm_id"),
    "Formula": compound.get("formula"),
    "Exact_Mass": compound.get("exactmass")
}
for db, val in id_map.items():
    print(f"  {db}: {val}")
```

### Recipe: Export Study Data to DataFrame

```python
import pandas as pd

study_id = "ST000001"

# Get study data and convert to DataFrame
raw_data = mw_query("study", "study_id", study_id, "data")
if isinstance(raw_data, list):
    df = pd.DataFrame(raw_data)
elif isinstance(raw_data, dict):
    df = pd.DataFrame([raw_data])
else:
    df = pd.DataFrame()

# Get study metadata for context
meta = mw_query("study", "study_id", study_id, "summary")
print(f"Study: {meta.get('study_title', study_id)}")
print(f"Species: {meta.get('species')}, Analysis: {meta.get('analysis_type')}")
print(f"Data shape: {df.shape}")
print(df.head())

# Export to CSV
df.to_csv(f"{study_id}_data.csv", index=False)
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Empty JSON response `{}` | Invalid input_item or input_value | Verify the context/input_item combination is valid (see Key Parameters table). Check spelling and case |
| `ConnectionError` or timeout | MW server temporarily unavailable | Retry after 30s. MW occasionally has maintenance windows. Add `timeout=30` to requests |
| MetStat returns no results | Case-sensitive filter values | Use exact case: `Human` not `human`, `LC-MS` not `lc-ms`. Check available values on MW website |
| m/z search returns too many hits | Tolerance too wide | Reduce tolerance from 0.5 to 0.01 or 0.005 Da for high-resolution instruments |
| m/z search returns no hits | Wrong adduct type or too-tight tolerance | Try alternative adducts (M+H, M+Na, M-H). Widen tolerance. Verify the m/z value is correct |
| `JSONDecodeError` on response | Endpoint returns text, not JSON | Some endpoints (e.g., `mwtab` output) return plain text. Use `fmt="txt"` instead of `"json"` |
| Study data missing columns | Study uses different data format | Check `analysis` output first to understand the study's data structure. Not all studies have uniform column names |
| RefMet name not found | Metabolite not in RefMet database | Try alternative names or synonyms. RefMet covers ~120K standardized names but some rare metabolites may be absent |

## Bundled Resources

This entry is self-contained.
The original `references/api_reference.md` (494 lines) covering all 7 API contexts (Compound, Study, RefMet, MetStat, Moverz, Gene, Protein) has been fully consolidated inline:

- **Compound endpoint**: input_items and output fields consolidated into Core API Module 1 + Key Parameters table
- **Study endpoint**: output_items (summary, factors, data, analysis, mwtab) consolidated into Core API Module 2
- **RefMet endpoint**: classification hierarchy consolidated into Core API Module 3 + Key Concepts RefMet table
- **MetStat endpoint**: filter syntax consolidated into Core API Module 4 + Key Concepts MetStat section
- **Moverz endpoint**: adduct types consolidated into Core API Module 5 + Key Concepts Ion Adduct table
- **Gene/Protein endpoints**: consolidated into Core API Modules 6 and 7
- Omitted: raw curl examples (replaced with Python helper function), HTML output format examples (rarely used programmatically)

## Related Skills

- **hmdb-database** -- local XML parsing for 220K+ metabolites with NMR/MS spectral data; use when MW does not have the metabolite or you need spectral peak lists
- **pubchem-compound-search** -- broader compound property lookups (110M+ compounds) via PubChemPy; use for general chemistry queries beyond metabolomics
- **matchms-spectral-matching** -- spectral similarity scoring for metabolite identification from MS/MS data; complementary to MW m/z searches
- **pyopenms-mass-spectrometry** -- full LC-MS/MS data processing pipeline; use for raw spectra processing before querying MW for identification
- **kegg-database** -- pathway and compound queries; use KEGG IDs from MW compound lookups for pathway context

## References

- Metabolomics Workbench REST API: https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf
- MW REST Interactive URL Creator: https://www.metabolomicsworkbench.org/databases/metabolites/mw-rest.php
- Sud et al. "Metabolomics Workbench: An international repository for metabolomics data" Nucleic Acids Research (2016) https://doi.org/10.1093/nar/gkv1042
- RefMet nomenclature: https://www.metabolomicsworkbench.org/databases/refmet/index.php