---
name: tooluniverse-variant-to-mechanism
description: End-to-end variant-to-mechanism analysis — trace a variant (rsID/coordinates) through regulatory context, target gene(s), molecular pathway(s), and phenotypic consequences. Integrates 7+ databases across 3 evidence layers (regulatory, molecular, disease) for a mechanistic model. Use for GWAS-hit-to-mechanism, eQTL-causal-gene tracing, and full causal-chain reports.
disable-model-invocation: true
---

# Variant-to-Mechanism Analysis Skill

Trace the full causal chain from a genetic variant to its disease mechanism: regulatory context,
target gene(s), molecular pathways, and phenotypic consequences. Integrates 7+ databases across
3 evidence layers (regulatory, molecular, disease) to build an evidence-graded mechanistic model.

**IMPORTANT**: Always use English terms in tool calls. Respond in the user's language.

---

## LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.

---

## When to Use This Skill

Apply when users:
- Ask "how does rs7903146 cause type 2 diabetes?"
- Want to trace a GWAS variant to its biological mechanism
- Need to connect a non-coding variant to downstream pathways
- Ask "what gene does this variant affect and through what pathway?"
- Want a mechanistic narrative for a regulatory variant
- Need to identify the causal gene for a GWAS locus
- Ask "what is the functional impact of this intronic SNP?"

**NOT for** (use other skills instead):
- Coding variant pathogenicity / ACMG classification -> Use `tooluniverse-variant-interpretation`
- Pure regulatory element annotation without mechanism -> Use `tooluniverse-regulatory-genomics`
- Pure GWAS hit listing without mechanism -> Use `tooluniverse-gwas-snp-interpretation`
- Pharmacogenomic variant annotation -> Use `tooluniverse-pharmacogenomics`
- Gene-disease association without variant context -> Use `tooluniverse-gene-disease-association`

---

## Input Parameters

| Parameter | Required | Description | Example |
|-----------|----------|-------------|---------|
| **variant** | Yes | rsID or genomic coordinates | `"rs7903146"` or `"10:112998590:C:T"` |
| **trait** | No | Disease/trait context (helps prioritize) | `"type 2 diabetes"` |
| **tissue** | No | Tissue of interest for eQTL/expression | `"pancreas"` |

---

## Workflow Overview

**Phase 1** Variant Characterization (VEP, population frequency, CADD) ->
**Phase 2** Regulatory Context (GWAS, RegulomeDB, ENCODE, cCREs) ->
**Phase 3** Target Gene Identification (GTEx eQTL, OpenTargets L2G) ->
**Phase 4** Gene Function & Pathways (STRING, Reactome, GO) ->
**Phase 5** Disease Connection (OpenTargets, GenCC, DisGeNET) ->
**Phase 6** Mechanistic Synthesis (causal chain, evidence grading, confidence scoring)

---

## Phase 1: Variant Characterization

Resolve variant identity and get basic annotations.

```python
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# Step 1: VEP annotation
vep = tu.tools.EnsemblVEP_annotate_rsid(variant_id="rs7903146")
# NOTE: param is `variant_id`, NOT `rsid`
# Response format variable: list, {data, metadata}, or {error} -- handle all three
# Extract: consequence_type, gene, chromosome, position, alleles

# Step 2: Population frequency and pathogenicity scores
myvar = tu.tools.MyVariant_query_variants(
    query="rs7903146",  # NOTE: param is `query`, NOT `variant_id`
    fields="dbsnp,gnomad_genome,cadd,clinvar"
)
# Returns: {hits: [{_id, dbsnp, gnomad_genome: {af: {af, ...ancestry}}, cadd: {phred}, clinvar}]}
# gnomAD AF: frequency in general population
# CADD phred: >=20 = top 1% deleterious, >=30 = top 0.1%

# Step 3: Confirm gene context
gwas_snp = tu.tools.gwas_search_snps(rs_id="rs7903146")
# Returns: SNP location, mapped genes, functional class
```

---

## Phase 2: Regulatory Context

Assess the variant's regulatory environment.

### 2a: GWAS Associations

```python
# All trait associations for this variant
gwas = tu.tools.gwas_search_associations(rs_id="rs7903146", size=50)
# Returns: {data: [{disease_trait, p_value, or_per_copy_number, beta, study_accession}]}
# NOTE: Use gwas_search_associations, NOT gwas_get_associations_for_trait (BROKEN)
```

### 2b: RegulomeDB Score

```python
regulome = tu.tools.RegulomeDB_query_variant(rsid="rs7903146")
# Returns: {status, data: {score/ranking, probability, features}}
# Score 1a-2a = strong regulatory evidence
# Uses GRCh38 (NOT hg19)
```

### 2c: ENCODE Chromatin Context

```python
# Check enhancer (H3K27ac), promoter (H3K4me3), and accessibility in disease-relevant tissue
encode_h3k27ac = tu.tools.ENCODE_search_histone_experiments(
    histone_mark="H3K27ac", biosample_term_name="pancreas", limit=10)
encode_h3k4me3 = tu.tools.ENCODE_search_histone_experiments(
    histone_mark="H3K4me3", biosample_term_name="pancreas", limit=10)
encode_atac = tu.tools.ENCODE_search_chromatin_accessibility(
    biosample_term_name="pancreas", limit=10)
```

### 2d: cCRE Overlap (if coordinates known)

```python
# Check candidate cis-regulatory elements
ccres = tu.tools.UCSC_get_encode_cCREs(
    chrom="chr10",
    start=112998000,
    end=112999000
)
# Returns: cCRE type (PLS=promoter, pELS=proximal enhancer, dELS=distal enhancer, CTCF-only)
```

### 2e: OLS Trait Resolution (if trait provided)

```python
# Resolve trait to ontology ID for downstream tools
ols = tu.tools.ols_search_terms(query="type 2 diabetes", ontology="efo")
# Returns: {status, data: [{iri, label, short_form, ontology_name}]}
# For OpenTargets: use MONDO_0005148 for T2D (NOT EFO_0001360)
```

---

## Phase 3: Target Gene Identification

Identify which gene(s) the variant regulates. This is the critical link.

### 3a: GTEx eQTLs (primary evidence)

```python
# Find eQTL associations for the nearest gene
eqtls = tu.tools.GTEx_query_eqtl(gene_symbol="TCF7L2", size=100)
# Returns: {status, data: [{snpId, geneSymbol, tissueSiteDetailId, pValue, nes}]}
# Filter for the specific rsID in the results
# NES > 0: alt allele increases expression; NES < 0: decreases

# Check expression levels to prioritize tissues
expression = tu.tools.GTEx_get_median_gene_expression(gene_symbol="TCF7L2")
# Returns: median TPM across GTEx tissues
# IMPORTANT: GTEx uses v8 data (v10 endpoints may return empty)
```

### 3b: OpenTargets Locus-to-Gene (L2G)

```python
# Get credible sets and L2G predictions
credible = tu.tools.OpenTargets_get_variant_credible_sets(
    variant_id="10_112998590_C_T"  # chr_pos_ref_alt format
)
# Returns: credible set membership, L2G scores for causal genes
# L2G > 0.5: high confidence causal gene
# L2G 0.1-0.5: moderate confidence
```

### 3c: Nearest Gene Mapping

```python
# Get gene details for the candidate target
gene_info = tu.tools.MyGene_query_genes(
    query="symbol:TCF7L2",
    species="human",
    fields="symbol,ensembl.gene,entrezgene,name,summary,go",
    size=5
)
# Extract Ensembl ID for OpenTargets queries
# Filter by exact symbol match (first hit may not be correct gene)
```

### How to Reason About Target Gene Identification

The nearest gene is often NOT the causal gene for non-coding variants. Regulatory elements can act over hundreds of kilobases, skipping intervening genes entirely. Before looking at tools, reason about which gene is the likely target:

- **eQTL in disease-relevant tissue is the strongest evidence.** If a variant is an eQTL for Gene X in pancreas and the trait is diabetes, that is a direct mechanistic link. An eQTL in an unrelated tissue is weaker but still informative.
- **L2G scores integrate multiple signals** (distance, chromatin interaction, eQTL) into a single prediction. L2G > 0.5 is high confidence, but it is a computational prediction, not experimental proof. Combine it with eQTL data -- when both agree, confidence is high.
- **Proximity alone is unreliable.** Many GWAS loci have the causal gene 2-3 genes away from the lead SNP. Never default to "nearest gene" without checking eQTL and L2G.
- **When multiple candidate genes have evidence**, present all of them ranked by strength. The answer may genuinely be uncertain.

---

## Phase 4: Gene Function & Pathways

Once target gene(s) identified, characterize molecular function.

```python
# STRING PPI network — NOTE: param is `identifiers` (string), NOT `protein_ids`
ppi = tu.tools.STRING_get_interaction_partners(
    identifiers="TCF7L2", species=9606, required_score=700)

# STRING functional enrichment (GO, KEGG, Reactome)
enrichment = tu.tools.STRING_get_functional_enrichment(
    identifiers="TCF7L2", species=9606)

# Reactome pathway enrichment — space-separated STRING, NOT array
reactome = tu.tools.ReactomeAnalysis_pathway_enrichment(
    identifiers="TCF7L2 CTNNB1 APC")

# UniProt function (returns list of strings, NOT dict)
uniprot_func = tu.tools.UniProt_get_function_by_accession(accession="Q9NQB0")

# PANTHER GO enrichment — GO:0008150=BP, GO:0003674=MF, GO:0005575=CC
panther = tu.tools.PANTHER_enrichment(
    gene_list="TCF7L2,CTNNB1,APC,LEF1", organism=9606,
    annotation_dataset="GO:0008150")
```

---

## Phase 5: Disease Connection

Link the molecular mechanism to disease phenotype.

```python
# Disease associations for target gene (requires Ensembl ID)
ot_diseases = tu.tools.OpenTargets_get_diseases_phenotypes_by_target_ensembl(
    ensemblId="ENSG00000148737")

# Specific gene-disease evidence (both IDs must be pre-resolved)
ot_evidence = tu.tools.OpenTargets_target_disease_evidence(
    ensemblId="ENSG00000148737", efoId="MONDO_0005148")

# GenCC curated validity (Definitive/Strong/Moderate/Limited/Disputed/Refuted)
gencc = tu.tools.GenCC_search_gene(gene_symbol="TCF7L2")

# DisGeNET scored associations (requires DISGENET_API_KEY)
disgenet = tu.tools.DisGeNET_search_gene(gene="TCF7L2", limit=20)

# OpenTargets GWAS studies — use MONDO IDs, NOT EFO
ot_gwas = tu.tools.OpenTargets_search_gwas_studies_by_disease(
    diseaseIds=["MONDO_0005148"], size=20)
```

---

## Phase 6: Mechanistic Synthesis

### How to Build and Evaluate the Causal Chain

The goal is an explicit chain: **Variant -> changed protein or expression -> altered pathway -> disease phenotype.** Each arrow is a link that needs evidence. A chain with all links supported is strong. A chain with a missing link is a hypothesis, not a conclusion.

**Start with the variant's location.** Is this coding or non-coding? Coding variants directly change the protein -- a missense variant swaps one amino acid, a nonsense variant truncates it. Non-coding variants change REGULATION -- they affect how much protein is made, not what protein is made. This distinction determines your entire analysis path.

**For coding variants**, ask: Is this position conserved across species? Is it in a functional domain (active site, binding interface, transmembrane helix)? A variant in a conserved active site residue is more likely pathogenic than one in a disordered loop. Check ClinVar and CADD for existing assessments, but reason about the structural context yourself.

**For non-coding variants**, ask: Is there an eQTL linking this variant to a nearby gene? Is the variant in a TFBS, enhancer, or promoter (from RegulomeDB and ENCODE)? A variant in open chromatin with an eQTL in disease-relevant tissue has a clear regulatory mechanism. A variant in a gene desert with no eQTL is much harder to interpret -- acknowledge this uncertainty.

**Evaluate each link independently.** For each arrow in the chain, ask: what is the evidence? Is it replicated? Is it from the right tissue? Then assess the chain as a whole:
- All links well-supported with convergent evidence from multiple databases = **Established** mechanism
- Most links strong but one relies on a single source = **Strong** mechanism
- Evidence is mixed or only moderate across links = **Moderate** mechanism
- One or more links have no direct evidence, only inference = **Preliminary** mechanism
- The connection rests on proximity or text-mining alone = **Speculative** mechanism

**The evidence hierarchy matters.** Clinical observation (ClinVar pathogenic with clinical data) is stronger than functional assays (eQTL, chromatin marks), which are stronger than computational predictions (CADD score, L2G model). A CADD score alone is weak evidence. Multiple convergent computational predictions are moderate. An eQTL replicated across studies in disease-relevant tissue is strong.

**Always consider alternative mechanisms.** If eQTL points to Gene A but L2G favors Gene B, present both models. If the variant could act through splicing rather than expression, note that. Honest uncertainty is more useful than false confidence.

---

## Evidence Grading

Grade each piece of evidence by how directly it supports a mechanistic link:

- **T1 (Strongest)**: Replicated GWAS association combined with functional validation -- an eQTL in the right tissue plus regulatory chromatin marks at the variant position. This is convergent evidence from independent sources.
- **T2**: Curated biological knowledge -- the target gene is in a known disease-relevant pathway (Reactome, GenCC Definitive). This does not prove the variant acts through this pathway, but it makes the mechanism biologically plausible.
- **T3**: Computational predictions -- L2G scores, PPI network enrichment, GO term enrichment. These are hypothesis-generating, not confirmatory. Useful when T1/T2 evidence is sparse.
- **T4**: Single-source or indirect evidence -- nearest gene assignment, literature co-mention, text-mining associations. These should never be the sole basis for a mechanistic claim.

---

## Common Mistakes and Fallbacks

**Parameter pitfalls** (correct param names are shown in code blocks above):
- `EnsemblVEP_annotate_rsid` uses `variant_id`, NOT `rsid`
- `MyVariant_query_variants` uses `query`, NOT `variant_id`
- `ReactomeAnalysis_pathway_enrichment` takes space-separated STRING, NOT array
- OpenTargets disease search uses `queryString`, NOT `query`
- Use MONDO IDs (e.g., MONDO_0005148 for T2D), NOT EFO IDs in OpenTargets
- `gwas_get_associations_for_trait` is BROKEN; use `gwas_search_associations`
- ENCODE biosample names are tissues/cell lines, not disease names

**When data is missing**, adapt the analysis:
- No eQTL: check additional tissues, use L2G alone, note reduced confidence
- No GWAS hits: variant may be sub-threshold or in LD with the true signal
- No RegulomeDB data: check ENCODE directly for chromatin marks at the position
- No L2G data: fall back to eQTL + nearest gene with appropriate caveats
- Gene not in STRING/Reactome: use UniProt function + GO annotations
- Multiple candidate genes: present all ranked by evidence, do not force a single answer

---

## Reasoning Through Common Scenarios

### Scenario 1: Non-coding variant with clear regulatory signal

When VEP shows an intronic or intergenic variant, GWAS shows strong trait associations, and RegulomeDB returns a high score (1a-2a), the path is straightforward: the variant disrupts a regulatory element. The key question becomes WHICH GENE it regulates. Check eQTL in disease-relevant tissue first, then L2G. When both converge on the same gene, follow the pathway analysis through to disease. This is the "happy path" and yields Established confidence.

### Scenario 2: Non-coding variant with ambiguous target gene

When the variant is intergenic with multiple nearby genes and eQTL/L2G evidence is weak or conflicting, resist the temptation to pick the nearest gene. Instead: check eQTLs for ALL candidate genes in the locus, compare L2G scores, and consider whether one candidate has known disease-relevant biology. Present multiple models ranked by evidence. Report as Preliminary or Moderate.

### Scenario 3: No functional signal at all

When there is no eQTL, no L2G support, and minimal regulatory annotation, acknowledge the gap honestly. The variant may act through a mechanism not captured by current databases (cell-type-specific regulation, splicing QTL, 3D genome interactions). State what was checked, what was negative, and what experiments would resolve the question. This is a valid and useful finding.

---

## Programmatic Access (Beyond Tools)

When ToolUniverse tools return partial annotations, aggregate from multiple sources directly:

```python
import requests, pandas as pd

rsid = "rs12345"

# Ensembl VEP REST — functional consequence prediction
vep = requests.get(f"https://rest.ensembl.org/vep/human/id/{rsid}?content-type=application/json").json()

# GTEx eQTL — tissue-specific expression effects
eqtl = requests.get(f"https://gtexportal.org/api/v2/association/singleTissueEqtl",
    params={"snpId": rsid, "datasetId": "gtex_v8"}).json()

# ClinVar bulk (variant_summary.txt.gz, ~200MB, all variants)
# Download once, filter locally:
# df = pd.read_csv("https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz", sep="\t")
# hits = df[df["RS# (dbSNP)"] == int(rsid.replace("rs",""))]

# Combine into unified annotation DataFrame
annotations = {"rsid": rsid, "vep_consequence": vep[0]["most_severe_consequence"] if vep else None,
    "eqtl_tissues": len(eqtl.get("singleTissueEqtl", [])),
    "sources_checked": ["VEP", "GTEx", "ClinVar"]}
```

See `tooluniverse-data-wrangling` skill for API patterns and error handling.

---

## Limitations

- GTEx eQTLs are bulk tissue (v8 data); cell-type-specific and splicing effects may be missed.
- L2G and CADD are computational predictions, not experimental proof.
- GWAS associations are correlative; LD tagging means the lead SNP may not be the causal variant.
- Long-range regulatory effects (>1Mb) and 3D genome interactions are poorly captured by current tools.
- OpenTargets variant credible sets require chr_pos_ref_alt format; get alleles from VEP first.