--- name: tooluniverse-regulatory-variant-analysis description: Non-coding/regulatory variant interpretation — GWAS association lookup, eQTL evidence (GTEx), chromatin state (ENCODE), regulatory variant scoring (RegulomeDB, CADD), and TF-binding disruption. Use for non-coding GWAS hit interpretation, eQTL-based gene assignment, and regulatory mechanism reasoning. Distinct from coding-variant tools. disable-model-invocation: true --- ## COMPUTE, DON'T DESCRIBE When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it. # Regulatory Variant Analysis Skill Systematic regulatory variant interpretation: discover trait associations from GWAS, map eQTL effects, annotate chromatin context, assess regulatory element overlap, and produce evidence-graded functional impact predictions for non-coding variants. ## When to Use - "What GWAS associations exist for rs12913832?" - "Find eQTLs for the APOE locus in brain tissue" - "What regulatory elements overlap this variant region?" - "Which SNPs are associated with type 2 diabetes from GWAS?" - "Is this intronic variant in an active enhancer?" - "What is the RegulomeDB score for rs429358?" - "Find ENCODE histone marks at the BRCA1 promoter region" - "Map trait ontology terms for 'blood pressure' to EFO IDs" **NOT for** (use other skills instead): - Coding variant pathogenicity -> Use `tooluniverse-variant-interpretation` - Full clinical variant classification (ACMG) -> Use `tooluniverse-variant-interpretation` - Gene-disease associations (not variant-specific) -> Use `tooluniverse-gene-disease-association` - Pharmacogenomic variant annotation -> Use `tooluniverse-pharmacogenomics` - Epigenomics data processing (BED/narrowPeak files) -> Use `tooluniverse-epigenomics` --- ## Non-Coding Variant Impact Reasoning When evaluating a non-coding variant, build evidence across four questions: **1. Is the variant in a regulatory element?** Use RegulomeDB to assess whether the variant overlaps TF binding sites, chromatin accessibility peaks, or known regulatory annotations. A low RegulomeDB score (categories 1a-2a) indicates strong evidence that the position is functionally active. Confirm with ENCODE histone marks: H3K27ac signals active enhancers and active promoters; H3K4me1 alone marks poised enhancers; H3K4me3 marks active promoters; H3K27me3 marks silenced regions. **2. Does it alter a transcription factor binding site?** Check RegulomeDB's TF binding evidence and ENCODE TF ChIP-seq experiments. A variant that falls within a TF footprint and disrupts the consensus motif is mechanistically actionable, especially if the TF is known to be relevant in the disease tissue. **3. Is there eQTL evidence linking it to a gene?** Query GTEx to determine whether the variant (or variants in tight LD) modulates expression of a nearby gene in a tissue-specific or ubiquitous manner. A tissue-specific eQTL suggests cell-type-specific regulation; a ubiquitous eQTL suggests a core regulatory element. The direction of the NES (positive = alternative allele increases expression, negative = decreases) and effect size matter for interpretation. **4. Is there GWAS evidence for trait association?** Search the GWAS Catalog for the rsID or the surrounding locus. Genome-wide significant associations (p < 5×10⁻⁸) in relevant traits anchor the variant's biological importance. Cross-reference with OpenTargets for locus-to-gene mapping from multiple GWAS studies. **Synthesizing the evidence**: Build a multi-layer case. A variant with GWAS significance + eQTL evidence + RegulomeDB score 1a-2a + active chromatin (H3K27ac) in the relevant tissue represents high-confidence regulatory impact. Two or three converging lines of evidence (e.g., eQTL plus active enhancer) constitute moderate confidence. A single line, or a variant only in a poised but not active regulatory context, represents lower confidence. --- ## Workflow Overview ``` Input (rsID, genomic coordinates, trait/disease, gene) | v Phase 0: Variant/Trait Resolution Resolve rsIDs, map trait names to EFO/MONDO IDs via OLS | v Phase 1: GWAS Association Lookup GWAS Catalog associations, p-values, effect sizes, study metadata | v Phase 2: eQTL Analysis GTEx tissue-specific eQTLs, target gene identification | v Phase 3: Regulatory Element Annotation ENCODE histone marks, RegulomeDB scores, chromatin state | v Phase 4: OpenTargets GWAS Integration OpenTargets GWAS study aggregation, locus-to-gene mapping | v Phase 5: Functional Impact Synthesis Integrate all evidence, assign regulatory impact level | v Phase 6: Report Evidence-graded regulatory variant report ``` --- ## Phase 0: Variant/Trait Resolution Use `ols_search_terms` to resolve trait names to ontology IDs before GWAS queries. Restrict to `ontology="efo"` for GWAS traits; OpenTargets prefers MONDO IDs (e.g., MONDO_0005148 for type 2 diabetes rather than EFO_0001360). Use `EnsemblVEP_annotate_rsid` (param is `variant_id`, not `rsid`) for initial consequence annotation and nearest gene identification. --- ## Phase 1: GWAS Association Lookup `gwas_search_associations` is the primary tool: accepts `disease_trait` (free text), `efo_id` (preferred for precision), `rs_id`, and `p_value` threshold. Use `p_value=5e-8` for genome-wide significance. For locus-level discovery, `gwas_get_variants_for_trait` retrieves all SNPs for a trait. `gwas_get_snps_for_gene` finds GWAS-cataloged SNPs mapped to a specific gene. **Reasoning tip**: When GWAS Catalog returns empty for a free-text trait, switch to the `efo_id` parameter — the catalog uses controlled vocabulary and free-text matching is imprecise. --- ## Phase 2: eQTL Analysis `GTEx_query_eqtl` accepts a gene symbol (auto-resolved to GENCODE ID) or Ensembl gene ID. It returns tissue-specific SNP-gene associations with NES (normalized effect size) and p-value per tissue. When interpreting results, ask: does the eQTL effect occur in the tissue most relevant to the disease? A brain-specific eQTL for a neurodegenerative disease variant is more compelling than a ubiquitous one. Use `GTEx_get_median_gene_expression` to confirm that the target gene is actually expressed in the relevant tissue before placing weight on eQTL evidence. **Note**: GTEx API uses v8 data; gtex_v10 endpoints may return empty for some queries. --- ## Phase 3: Regulatory Element Annotation `RegulomeDB_query_variant` (param: `rsid`) returns a regulatory score and feature annotations. Scores in categories 1a–2a indicate strong regulatory evidence (eQTL overlap + TF binding + chromatin accessibility). Scores 3a–6 represent progressively weaker evidence. `ENCODE_search_histone_experiments` accepts `histone_mark` (e.g., "H3K27ac") and `biosample_term_name` (tissue or cell line name — NOT a disease name; ENCODE uses biological sample names like "liver" or "breast epithelium"). Use `assay_title="TF ChIP-seq"` (not just "ChIP-seq") when querying TF binding data. **Reasoning tip**: RegulomeDB aggregates ENCODE, Roadmap, and other data. If ENCODE doesn't have the specific biosample, RegulomeDB may still have aggregate evidence from related cell types. --- ## Phase 4: OpenTargets GWAS Integration `OpenTargets_search_gwas_studies_by_disease` takes `diseaseIds` as an array of MONDO IDs. It provides locus-to-gene (L2G) scores from multiple GWAS studies, which go beyond simple proximity to incorporate colocalisation, eQTL, and chromatin data. Use `OpenTargets_multi_entity_search_by_query_string` or `OpenTargets_get_disease_id_description_by_name` to resolve disease names to MONDO/EFO IDs first. --- ## Phase 5: Functional Impact Synthesis After collecting evidence, reason through the layers: - **High impact**: GWAS genome-wide significant + eQTL with meaningful NES + RegulomeDB score ≤ 2 + active chromatin (H3K27ac) in relevant tissue. Multiple independent lines converge on the same locus and gene. - **Moderate impact**: Two to three lines of evidence (e.g., eQTL + active enhancer overlap, or GWAS significant + RegulomeDB ≤ 3) without full convergence. - **Low impact**: Single line of evidence, or only computational annotation (VEP consequence category) without functional data. - **No evidence**: No regulatory annotations in any source; the variant may be in a non-functional region or the relevant cell type is not represented in available datasets. --- ## Fallback Strategies - **GWAS Catalog returns empty**: Switch from free-text `disease_trait` to `efo_id`; broaden the trait term. - **GTEx eQTL empty for gene**: Verify gene symbol spelling; try Ensembl ID; increase `size` parameter. - **RegulomeDB returns no data**: Query ENCODE directly; the variant may lack regulatory annotations in available data. - **OpenTargets GWAS returns None**: Verify MONDO/EFO ID format; try `OpenTargets_multi_entity_search_by_query_string` first to confirm the correct ID. - **ENCODE tissue not found**: ENCODE uses specific biosample names; RegulomeDB aggregates data from many cell types and may cover the gap. --- ## Example Workflows ### GWAS Variant Functional Annotation (rs429358 / APOE) ``` Step 1: gwas_search_associations(rs_id="rs429358") -> All trait associations (Alzheimer's disease, LDL cholesterol, etc.) Step 2: GTEx_query_eqtl(gene_symbol="APOE") -> Tissue-specific eQTL evidence; note effect in brain vs liver Step 3: RegulomeDB_query_variant(rsid="rs429358") -> Regulatory score and TF binding annotations Step 4: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name="brain") -> Active enhancer context near the variant Step 5: Synthesize: does GWAS significance + eQTL + active chromatin converge on one gene? ``` ### Non-Coding Variant Assessment (Intronic/UTR Variant) ``` Step 1: EnsemblVEP_annotate_rsid(variant_id="rs12345678") -> Confirm non-coding consequence, identify nearest gene Step 2: RegulomeDB_query_variant(rsid="rs12345678") -> Is this position in a regulatory context? Step 3: gwas_search_associations(rs_id="rs12345678") -> Any GWAS associations in relevant traits? Step 4: GTEx_query_eqtl(gene_symbol=nearest_gene) -> Does this variant or nearby variants modulate expression? Step 5: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name=relevant_tissue) -> Active chromatin confirmation Step 6: Classify impact based on convergence of evidence lines ``` --- ## Limitations - GWAS Catalog covers published GWAS only; unpublished studies are not included. - GTEx eQTL data is from v8; v10 endpoints may return empty. - RegulomeDB annotations depend on available ENCODE/Roadmap data for the specific cell type. - eQTL analysis identifies correlation, not causation; fine-mapping is needed to identify causal variants. - RegulomeDB scores are heuristic; a score of 1a does not guarantee functional impact. - GWAS associations are population-level; individual variant effects depend on genetic background.