---
name: tooluniverse-data-wrangling
description: Universal data access patterns for downloading and parsing scientific data when ToolUniverse tools don't cover the source, only return metadata, or you need bulk records. Use for VCF/h5ad/BAM/SDF/GCT parsing, multi-step API workflows (search to filter to download to parse), thousands of records at once, or sources with no dedicated tool. Write Python code via Bash for every step.
disable-model-invocation: true
---

# Data Wrangling: Universal Access Patterns

Reference for downloading and parsing scientific data from any source. Write and run Python code via Bash for every step.

## When to Use

- ToolUniverse tool returned metadata/search results but you need **raw or bulk data**
- Data is in a format tools don't parse (VCF, h5ad, BAM, SDF, GCT)
- You need a **multi-step API workflow** (search -> filter -> download -> parse)
- The data source has **no ToolUniverse tool** at all
- You need **thousands of records**, not the 10-100 a tool returns

## Decision: Tool vs Code

| Situation | Use |
|-----------|-----|
| Single record lookup, simple search, <100 results | ToolUniverse tool (`execute_tool`) |
| Bulk download, custom filtering, format conversion | Write Python code |
| Tool exists but returns truncated results | Write code using the same API the tool wraps |
| No tool exists for this source | Write code directly |

---

## Section A: Format Cookbook

### Tabular

```python
import pandas as pd, io
df = pd.read_csv("data.csv")                              # CSV
df = pd.read_csv("data.tsv", sep="\t")                    # TSV
df = pd.read_sas(io.BytesIO(content), format="xport")     # SAS Transport (XPT): NHANES, CDC
df = pd.read_sas("data.sas7bdat", format="sas7bdat")      # SAS native
df = pd.read_stata("data.dta")                            # Stata: ICPSR, HRS
df = pd.read_parquet("data.parquet")                      # Parquet: MIMIC-IV
df = pd.read_excel("data.xlsx")                           # Excel
df = pd.read_spss("data.sav")                             # SPSS
df = pd.read_fwf("data.dat")                              # Fixed-width: legacy surveys
```

### Genomics

```python
from Bio import SeqIO
records = list(SeqIO.parse("seqs.fasta", "fasta"))   # FASTA
records = list(SeqIO.parse("reads.fastq", "fastq"))  # FASTQ

# VCF (no cyvcf2 needed): drop the "##" meta lines, keep the "#CHROM" header row
with open("vars.vcf") as fh:
    vcf_lines = [l for l in fh if not l.startswith("##")]
df = pd.read_csv(io.StringIO("".join(vcf_lines)), sep="\t")

df = pd.read_csv("genes.gff3", sep="\t", comment="#",    # GFF/GTF
    names=["seqid","source","type","start","end","score","strand","phase","attrs"])
df = pd.read_csv("regions.bed", sep="\t", header=None,   # BED
    names=["chrom","start","end","name","score","strand"])

import pysam                                             # BAM (requires pysam)
bam = pysam.AlignmentFile("aligned.bam", "rb")
for read in bam.fetch("chr1", 1000, 2000):
    print(read.query_name)
```

### Structural

```python
from Bio.PDB import PDBParser, MMCIFParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("prot", "structure.pdb")   # PDB
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("prot", "structure.cif")   # mmCIF

from rdkit import Chem                                      # SDF/MOL (requires rdkit)
supplier = Chem.SDMolSupplier("compounds.sdf")
mols = [m for m in supplier if m is not None]
```

### Omics Matrices

```python
import anndata
adata = anndata.read_h5ad("expression.h5ad")                # AnnData (scRNA-seq, spatial)

import scipy.io
mat = scipy.io.mmread("matrix.mtx")                         # 10X Genomics MTX
barcodes = pd.read_csv("barcodes.tsv", header=None)[0].tolist()
features = pd.read_csv("features.tsv", sep="\t", header=None)[1].tolist()

df = pd.read_csv("expression.gct", sep="\t", skiprows=2)    # GCT (gene expression)

import loompy                                               # Loom (legacy single-cell)
ds = loompy.connect("data.loom")
```

### Mass Spectrometry & Flow Cytometry

```python
from pyteomics import mzml                                  # mzML (proteomics, requires pyteomics)
spectra = list(mzml.read("spectra.mzML"))

import fcsparser                                            # FCS (flow cytometry, requires fcsparser)
meta, data = fcsparser.parse("sample.fcs", reformat_meta=True)
```

### Neuroimaging

```python
import nibabel as nib                                       # NIfTI (requires nibabel)
img = nib.load("brain.nii.gz")
data = img.get_fdata()                                      # 3D/4D numpy array

# DICOM (requires pydicom)
import pydicom
dcm = pydicom.dcmread("scan.dcm")
pixel_data = dcm.pixel_array
```

### Phylogenetics & Systems Biology

```python
from Bio import Phylo                                       # Newick/Nexus (BioPython)
tree = Phylo.read("tree.nwk", "newick")
tree = Phylo.read("tree.nex", "nexus")

import libsbml                                              # SBML (systems biology, requires python-libsbml)
reader = libsbml.SBMLReader()
doc = reader.readSBML("model.xml")
model = doc.getModel()
```

### Serialized

```python
import json, xml.etree.ElementTree as ET, h5py
data = json.load(open("data.json"))                         # JSON
df = pd.read_json("records.json")                           # JSON -> DataFrame
tree = ET.parse("data.xml"); root = tree.getroot()          # XML
f = h5py.File("data.h5", "r"); dataset = f["group/data"][:] # HDF5
```

### Compressed

```python
df = pd.read_csv("data.csv.gz")                             # gzip (pandas auto-detects)
df = pd.read_csv("data.tsv.gz", sep="\t")                   # gzip TSV

import zipfile
with zipfile.ZipFile(io.BytesIO(content)) as z:             # ZIP
    df = pd.read_csv(z.open(z.namelist()[0]))

import tarfile
with tarfile.open("archive.tar.gz") as t:                   # tar.gz
    f = t.extractfile(t.getnames()[0])
    df = pd.read_csv(f)
```

---

## Section B: API Patterns by Domain

Each category lists the ToolUniverse tools that exist and shows how to go beyond them with direct API calls.

### 1. NCBI E-utilities (Gene, Nucleotide, Protein, SRA, GEO)

Tools: `NCBIGene_search`, `NCBI_search_nucleotide`, `SRA_search_experiments`, `geo_search_datasets`

```python
import requests
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
# Search -> get IDs -> fetch records in batches
ids = requests.get(f"{base}/esearch.fcgi?db=gene&term=BRCA1+AND+human&retmax=500&retmode=json").json()
id_list = ids["esearchresult"]["idlist"]
# Fetch in batches of 500
for i in range(0, len(id_list), 500):
    batch = ",".join(id_list[i:i+500])
    data = requests.get(f"{base}/efetch.fcgi?db=gene&id={batch}&retmode=xml").text
```

### 2. EBI APIs (UniProt, PDBe, ChEMBL, Ensembl, InterPro)

Tools: `UniProt_search`, `PDBe_*`, `ChEMBL_*`, `Ensembl_*`, `InterPro_*`

```python
# UniProt bulk TSV download with cursor pagination
url = "https://rest.uniprot.org/uniprotkb/search?query=organism_id:9606+AND+keyword:kinase&format=tsv&size=500"
all_rows = []
while url:
    resp = requests.get(url)
    all_rows.append(resp.text)  # each page repeats the TSV header row
    # requests parses the Link header; the "next" entry is absent on the last page
    url = resp.links.get("next", {}).get("url")
```

### 3. NCI GDC (TCGA/TARGET Cancer Data)

Tools: `GDC_search_cases`, `GDC_list_files`, `GDC_get_clinical_data`

```python
# Bulk clinical data with filters
filters = {"op":"and","content":[
    {"op":"=","content":{"field":"project.project_id","value":"TCGA-BRCA"}},
    {"op":"=","content":{"field":"demographic.vital_status","value":"Dead"}}
]}
cases = requests.post("https://api.gdc.cancer.gov/cases", json={
    "filters": filters,
    "fields": "demographic.vital_status,diagnoses.days_to_death",
    "size": 1000, "from": 0
}).json()["data"]["hits"]
```

### 4. CDC Health Surveys (NHANES, BRFSS, WONDER)

Tools: `NHANES_download_and_parse`, `cdc_data_search_datasets`

```python
# Direct NHANES XPT download (any cycle, any component)
cycle, component = "2017-2018", "DEMO_J"  # the file suffix (_J) must match the cycle
start_year = cycle.split("-")[0]          # the URL directory is the cycle's first year
url = f"https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/{start_year}/DataFiles/{component}.XPT"
df = pd.read_sas(io.BytesIO(requests.get(url).content), format="xport")
```

### 5. GWAS & Genetics (GWAS Catalog, gnomAD, ClinVar)

Tools: `gwas_search_associations`, `gnomAD_*`, `ClinVar_*`

```python
# GWAS Catalog full download (37MB TSV, all associations)
url = "https://www.ebi.ac.uk/gwas/api/search/downloads/alternative"
df = pd.read_csv(url, sep="\t")
# Filter locally
hits = df[df["DISEASE/TRAIT"].str.contains("diabetes", case=False, na=False)]
```

### 6. Chemical (PubChem, ChEMBL, KEGG)

Tools: `PubChem_*`, `ChEMBL_*`, `KEGG_*`

```python
# PubChem batch property retrieval (up to 100 CIDs at once)
cids = "2244,5988,3672"  # aspirin, sucrose, ibuprofen
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cids}/property/MolecularWeight,XLogP,TPSA/JSON"
props = requests.get(url).json()["PropertyTable"]["Properties"]
```

### 7. Expression (GEO, ArrayExpress, GTEx)

Tools: `geo_search_datasets`, `arrayexpress_search_experiments`

```python
# GEO series matrix direct download
geo_id = "GSE12345"
# The series directory replaces the last three digits with "nnn" (GSE12345 -> GSE12nnn)
url = f"https://ftp.ncbi.nlm.nih.gov/geo/series/{geo_id[:-3]}nnn/{geo_id}/matrix/{geo_id}_series_matrix.txt.gz"
df = pd.read_csv(url, sep="\t", comment="!", index_col=0)

# GTEx bulk expression (median TPM per tissue)
url = "https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz"
df = pd.read_csv(url, sep="\t", skiprows=2)
```

### 8. Clinical (ClinicalTrials.gov, FDA/OpenFDA, FAERS)

Tools: `search_clinical_trials`, `OpenFDA_*`

```python
# ClinicalTrials.gov v2 API with pagination
all_studies = []
token = None
while True:
    params = {"query.cond": "lung cancer", "query.intr": "immunotherapy", "pageSize": 100}
    if token:
        params["pageToken"] = token
    resp = requests.get("https://clinicaltrials.gov/api/v2/studies", params=params).json()
    all_studies.extend(resp.get("studies", []))
    token = resp.get("nextPageToken")
    if not token:
        break
```

### 9. Literature (PubMed, PMC, EuropePMC)

Tools: `PubMed_search_articles`, `EuropePMC_search_articles`

```python
# EuropePMC full-text search with cursor
cursor = "*"
all_results = []
while cursor:
    resp = requests.get("https://www.ebi.ac.uk/europepmc/webservices/rest/search",
        params={"query": "BRCA1 AND resistance", "format": "json",
                "pageSize": 100, "cursorMark": cursor}).json()
    all_results.extend(resp.get("resultList", {}).get("result", []))
    # Stop once every hit has been collected
    cursor = resp.get("nextCursorMark") if len(all_results) < resp.get("hitCount", 0) else None
```

### 10. Data Repositories (Zenodo, Figshare, Dryad, DataCite)

Tools: `DataCite_search_dois`, `Zenodo_search_records`, `Dryad_search_datasets`

```python
# Zenodo: search + download files
record = requests.get("https://zenodo.org/api/records",
    params={"q": "proteomics cancer", "size": 5}).json()["hits"]["hits"][0]
for f in record["files"]:
    content = requests.get(f["links"]["self"]).content  # download each file
```

### 11-24. Specialized Domains

For these 14 additional domains, read [references/specialized-domains.md](references/specialized-domains.md) when you need the specific API pattern:

| # | Domain | Key APIs/Tools | When to Read |
|---|--------|---------------|--------------|
| 11 | Proteomics | PRIDE, MassIVE, ProteomeXchange | Mass spec data download |
| 12 | Metabolomics | MetaboLights, Metabolomics Workbench, HMDB | Metabolite/spectra data |
| 13 | Microbiome | MGnify, GMREPO | Metagenome profiles |
| 14 | Ecology | GBIF, iNaturalist, OBIS | Species occurrence data |
| 15 | Model Organisms | FlyBase, WormBase, ZFIN, RGD | Gene data for non-human species |
| 16 | Pathways & Networks | Reactome, STRING, BioGRID | Network/pathway export |
| 17 | Ontologies | OLS, GO, HPO | Term hierarchy traversal |
| 18 | Immunology | IEDB, VDJdb, ImmPort | Epitope/receptor data |
| 19 | Drug & Pharma | PharmGKB, DGIdb, SIDER | Drug-gene interactions |
| 20 | Imaging & Atlases | TCIA, HPA, Allen Brain Atlas | Imaging collections |
| 21 | Protein Structure | RCSB PDB, AlphaFold | PDB/CIF file download |
| 22 | Clinical Genomics | ClinVar, ClinGen, CIViC | Variant interpretation bulk |
| 23 | Single-Cell | cellxgene, ARCHS4 | scRNA-seq data portals |
| 24 | Toxicology | CTD, EPA CompTox | Chemical-gene-disease |

---

## Section C: Restricted/Uncovered Data Sources

These sources require registration or have no ToolUniverse tool. For each, the table shows the access requirements, typical wait time, file formats, and contents.

**Note**: ToolUniverse has 2300+ tools; use `find_tools("your topic")` to discover tools not listed above. Section B covers the most common API *patterns*; many more databases use the same patterns (e.g., all EBI databases follow the EBI REST pattern in #2).

| Source | Access | Wait Time | Format | Contents |
|--------|--------|-----------|--------|----------|
| **UK Biobank** | Restricted (institutional) | 2-6 months | CSV/Bulk | 500K participants, genetics + imaging + health records |
| **dbGaP** | Controlled (PI application) | 1-3 months | SRA/VCF/phenotype | GWAS genotypes + phenotypes from thousands of studies |
| **MIMIC-IV** | Credentialed (PhysioNet) | 1-2 weeks | CSV/Parquet | ICU clinical data, 300K+ admissions |
| **ICPSR** | Registration | 1-3 days | Stata/CSV | Social/health science archives (10K+ studies) |
| **HRS** | Registration | 1-3 days | Stata | Health & Retirement Study, 20K+ older Americans, biennial |
| **ELSA** | Registration | 1-3 days | Stata/SPSS | English Longitudinal Study of Ageing |
| **SHARE** | Registration | 1-2 weeks | Stata | Survey of Health, Ageing, Retirement in Europe (28 countries) |
| **Materials Project** | Free API key | Instant | JSON | 150K+ computed materials properties |
| **Human Cell Atlas** | Open | Instant | h5ad/loom | Single-cell atlas across human tissues |
| **ADNI** | Application | 1-2 months | DICOM/CSV | Alzheimer's neuroimaging + biomarkers + cognition |
| **OpenNeuro** | Open | Instant | NIfTI/BIDS | 800+ neuroimaging datasets |
| **CIBERSORTx** | Free registration | Instant | GCT/TSV | Cell type deconvolution from bulk expression |
| **FlowRepository** | Open | Instant | FCS | Flow cytometry experiments |
| **SynBioHub** | Open | Instant | SBOL/GenBank | Synthetic biology parts and designs |

For restricted sources: search the literature (PubMed) for published analyses that use the dataset. Papers cite their data source and often deposit derived data in public repositories (GEO, SRA, Zenodo).

---

## Section D: Universal Patterns

### Pagination

```python
# Pattern 1: offset + limit (most REST APIs)
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=500", timeout=30).json()
    batch = resp.get("data", resp.get("results", resp.get("hits", [])))
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)

# Pattern 2: cursor/token (EuropePMC, ClinicalTrials.gov, UniProt)
token = None
while True:
    params = {"pageSize": 100}
    if token:
        params["pageToken"] = token
    resp = requests.get(api_url, params=params).json()
    all_records.extend(resp["results"])
    token = resp.get("nextPageToken")
    if not token:
        break
```

### Rate Limiting & Retries

```python
import time

def fetch_with_retry(url, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=30, **kwargs)
        if resp.status_code == 200:
            return resp
        if resp.status_code == 429:  # rate limited
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
        else:
            time.sleep(2 ** attempt)  # exponential backoff on other errors
    raise RuntimeError(f"Failed after {max_retries} retries: {url}")
```

### Authentication

```python
import os
# API key in header (most common)
headers = {"Authorization": f"Bearer {os.environ.get('API_KEY', '')}"}
# API key as query param
params = {"api_key": os.environ.get("API_KEY", "")}
# No auth needed for most scientific APIs (NCBI, EBI, PubChem, GDC, CDC)
```

### Bulk Download with Streaming

```python
def download_large_file(url, output_path):
    with requests.get(url, stream=True, timeout=300) as r:
        r.raise_for_status()
        with open(output_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
```

### Error Handling

```python
resp = requests.get(url, timeout=30)
if resp.status_code != 200:
    raise ValueError(f"HTTP {resp.status_code}: {resp.text[:200]}")
# Guard against HTML error pages (CDC, NCBI return 200 with HTML for missing files)
if resp.content[:5] in (b"