--- name: tooluniverse-variant-predictor-dms-validation description: "Validate a variant-effect predictor (AlphaMissense, ESM-C SAE, ESM logits, EVE, conservation scores, or any per-variant numeric score) against experimental deep mutational scanning (DMS) data. Computes per-variant predictor scores, splits variants into neutral vs disruptive groups by DMS effect, runs a Mann-Whitney U test on the predictor scores, and sweeps the stratification thresholds for robustness. Use when you need to know whether a predictor's scores track real functional disruption on a specific protein." disable-model-invocation: true --- # Variant-effect predictor benchmarking against DMS The core user question: **"I have a variant-effect predictor — does it actually correlate with experimental DMS measurements on this protein?"** The predictor can be anything that assigns a numeric score to single missense variants: - ESM-C 6B Sparse Autoencoder (SAE) feature drops at the mutation site - AlphaMissense pathogenicity scores - ESM logits-based variant scoring (`ESM_score_sequence`) - EVE / EVE++ scores - Conservation scores (ConSurf, Rate4Site) - DynaMut2 ΔΔG predictions - A custom in-house model This skill validates ALL of these against a DMS dataset with the same statistical framework. SAE is shown as the worked example because the surrounding skills in this collection are SAE-themed, but the procedure is predictor-agnostic. --- ## When to use this skill - You picked a variant-effect predictor and want to know if it's worth trusting on your protein of interest - You're comparing two predictors on the same DMS dataset (run this skill twice, compare the resulting per-K Mann-Whitney U p-values + effect sizes) - Reviewers want robustness evidence — the parameter sweep (K, neutral band, disruptive quantile) is exactly that - You're publishing a new variant-effect method and need a benchmark figure **Not for**: - Per-position / per-feature interpretation — use `tooluniverse-residue-functional-mechanism-interpretation` - Single-variant interpretation (you have one variant, no DMS) — use `tooluniverse-protein-sae-variant-interpretation` or `tooluniverse-protein-lof-mechanism` - Building a predictor from scratch — out of scope --- ## Required inputs | Input | Format | Example | |---|---|---| | DMS effect matrix | (20 amino acids × n_positions) `np.array`, NaN for unmeasured | from `MaveDB_get_effect_matrix` | | Disruptive tail convention | `"top"` (ΔΔG positive = destabilizing) or `"bottom"` (fitness low = LoF) | metadata from DMS retrieval step | | Per-variant predictor scores | (20 × n_positions) `np.array` matching DMS layout | computed from your chosen predictor — see Step 2 | | Aggregation K | `int`, mean of top-K when the predictor outputs many sub-scores per variant | only relevant for multi-feature predictors like SAE; ignore otherwise | --- ## Workflow ### Step 1: Retrieve DMS as an effect matrix One call returns a ready-to-analyze `(20 × n_positions)` matrix — HGVS parsing, single-missense filtering, score-field detection, and (optional) UniProt numbering verification are done inside the tool: ```python r = MaveDB_get_effect_matrix( urn="urn:mavedb:00000115-a-7", uniprot_id="P01116", # optional but recommended — enables numbering check ) matrix = np.array(r["data"]["matrix"], dtype=np.float32) # (20, n_positions) positions = r["data"]["positions"] amino_acid_order = r["data"]["amino_acid_order"] # always 'ACDEFGHIKLMNPQRSTVWY' # Audit fields — surface in your report: # r["data"]["n_parsed_single_missense"], n_dropped, score_field_used, # numbering_offset, numbering_check ``` If `numbering_offset != 0`, the MaveDB position numbering differs from the UniProt canonical sequence — apply the offset before joining to any other source (PDB / SAE features / AlphaMissense). ### Step 2: Compute per-variant predictor scores Pick **one** of these predictor sources. The choice changes Step 2 only; Steps 3–5 are identical. #### Predictor option A — ESM-C 6B SAE (the worked example) For every variant, sum SAE activations across the residue window, compute drop = max(0, WT − mut), and aggregate to one score per variant via the top-K mean of drops: ```python import numpy as np def sae_drop_per_variant(wt_pooled, variant_pooled, K=3): """SAE drop for one variant = mean of the K largest feature drops.""" drops = np.maximum(0.0, wt_pooled - variant_pooled) # (n_features,) sorted_desc = -np.sort(-drops) return float(sorted_desc[:K].mean()) ``` **Two ways to score a saturation sweep** — pick based on batch size: | When | Use | Forge cost (for 19 alts × N positions) | |---|---|---| | Saturation at ≤100 variants OR you only need top-K-per-variant deltas | `ESM_score_variant_sae_batch(sequence, variants=[...], top_k_features=10)` | **1 + 19N** (1 ref + 1 per mut) | | Full-protein per-feature tensor (e.g. for downstream PCA / clustering) | Loop `ESM_get_sae_features(sequence=mutant)`, cache by `(sequence, position)` | **2 × 19N** (cached reruns free) | The batch tool is the right default — it halves Forge cost vs the per-variant disruption pattern, and the cap of 100 variants per call covers saturation mutagenesis at one position (19) or short positional sweeps (e.g. positions 10-15: 90 variants). For longer sweeps, split into multiple batch calls. For the full per-residue × per-feature tensor needed by some predictor analyses, fall back to the loop pattern. The full library scale is well-tested: `tests/integration/test_dms_pipeline_e2e_kras.py` runs ~300 mutants for KRAS positions 10–25. #### Predictor option B — AlphaMissense (hegelab proxy: categorical, not per-substitution numeric) The TU AlphaMissense tools proxy a public API (hegelab.org) that returns **residue-level categorical assignments**, not per-(position, alt_aa) numeric scores. Three calls are available; pick the one that matches your scale: | Tool | Signature | Returns | |---|---|---| | `AlphaMissense_get_variant_score` | `uniprot_id`, `variant` (e.g. `"p.G12V"`) | Residue-level data INCLUDING the alt_aa's bin (`benign` / `ambiguous` / `pathogenic`); also `mean`/`mean_all` | | `AlphaMissense_get_residue_scores` | `uniprot_id`, `position` | Same residue-level data (per-position lookup, cheaper than 19 variant calls) | | `AlphaMissense_get_protein_scores` | `uniprot_id` | Whole-protein dump in one call (cheapest for full-protein DMS analysis) | The response shape is the same for all three. Example for KRAS pos 12: ```python r = AlphaMissense_get_residue_scores(uniprot_id="P01116", position=12) # r["data"]["scores"]: # {"uid":"P01116","aa":"G","resi":12, # "benign":"", "ambiguous":"", # "pathogenic":"6:A,C,D,R,S,V", # ← SNV-reachable subs only # "pathogenic_all":"19:A,C,D,E,F,...,Y", # ← all 19 substitutions # "mean":0.9885, # mean over SNV-reachable subs # "mean_all":0.9950} # mean over all 19 substitutions # r["data"]["thresholds"]: # {"pathogenic":"> 0.564","ambiguous":"0.34 - 0.564","benign":"< 0.34"} ``` **To populate the `(20, n_positions)` predictor matrix**, parse the bin strings and map each alt_aa to a numeric score (bin-midpoint is the standard imputation since the tool doesn't expose true per-substitution numerics): ```python BIN_MIDPOINTS = {"benign": 0.17, "ambiguous": 0.452, "pathogenic": 0.782} def parse_bin_list(bin_str): """'6:A,C,D,R,S,V' → ['A','C','D','R','S','V']; '' → [].""" if not bin_str: return [] _, aas = bin_str.split(":", 1) return aas.split(",") def am_per_variant_matrix(uniprot_id, positions, aa_index): AAS = list(aa_index) M = np.full((20, len(positions)), np.nan, dtype=np.float32) # One whole-protein call is much cheaper than n_positions calls p = AlphaMissense_get_protein_scores(uniprot_id=uniprot_id) per_res = {row["resi"]: row for row in p["data"]["scores"]} for pos_idx, pos in enumerate(positions): row = per_res.get(pos, {}) for cat in ("benign", "ambiguous", "pathogenic"): for alt in parse_bin_list(row.get(f"{cat}_all", "")): if alt in aa_index: M[aa_index[alt], pos_idx] = BIN_MIDPOINTS[cat] return M ``` **Higher-resolution alternative — DeepMind bulk CSV** (only if you genuinely need true per-substitution numerics rather than bin-midpoints): ```python # The official DeepMind release contains per-variant continuous scores # (not just bin assignments). Not currently wrapped by a TU tool — fetch directly: # https://alphafold.ebi.ac.uk/files/AF--F1-aa-substitutions.csv # or stream-filter https://storage.googleapis.com/dm_alphamissense/AlphaMissense_aa_substitutions.tsv.gz ``` Use the bulk CSV when you need rank-correlation analysis (Spearman benefits from continuous values, not 3-bin midpoints). Use the TU proxy when you only need the binary "is this variant in the pathogenic bin" signal. Both options are free, no API key required. #### Predictor option C — ESM logits-based score ```python ESM_score_sequence( sequence=mutant_sequence, model="esmc-600m-2024-12", ) # returns per-residue logits; compute mutant-vs-WT log-odds at the mutation site ``` #### Predictor option D — any external score Bring your own. Just produce a `(20, n_positions)` `np.ndarray` aligned to the DMS matrix. ### Step 3: Stratify DMS into neutral vs disruptive ```python def categorize(dms_matrix, disruptive_tail, neutral_abs=0.1, disruptive_quantile=0.05): """Split variants into neutral and disruptive masks. disruptive_tail: 'top' (positive = destabilizing, e.g. folding ΔΔG) 'bottom' (low = LoF, e.g. fitness) """ flat = dms_matrix[~np.isnan(dms_matrix)] if disruptive_tail == "top": cut = np.quantile(flat, 1 - disruptive_quantile) disruptive = dms_matrix >= cut elif disruptive_tail == "bottom": cut = np.quantile(flat, disruptive_quantile) disruptive = dms_matrix <= cut else: raise ValueError("disruptive_tail must be 'top' or 'bottom'") neutral = np.abs(dms_matrix) <= neutral_abs disruptive = disruptive & ~np.isnan(dms_matrix) neutral = neutral & ~np.isnan(dms_matrix) & ~disruptive return neutral, disruptive ``` **Keep the neutral band tight** (`|effect| ≤ 0.1`). A loose neutral band leaks weakly-disruptive variants into the "neutral" group and erodes the contrast. **Sign matters** — get `disruptive_tail` from the DMS-retrieval skill's metadata; a flipped sign silently inverts every conclusion. ### Step 3.5 (MANDATORY): Pre-MWU sanity gate — catch silent NaN failures **Before running any statistical test**, verify the predictor scores actually populated. Silent NaN matrices are the most common failure mode in this workflow — a batch SAE compute that errored midway, an AlphaMissense fetch that skipped variants, an ESM forge call that timed out — and they produce "successful" runs that report meaningless statistics. ```python neutral, disruptive = categorize(dms_matrix, disruptive_tail) s_neutral_all = predictor_scores[neutral] s_disruptive_all = predictor_scores[disruptive] s_neutral_finite = s_neutral_all[~np.isnan(s_neutral_all)] s_disruptive_finite = s_disruptive_all[~np.isnan(s_disruptive_all)] # Hard gate: NaN coverage check coverage_n = len(s_neutral_finite) / max(len(s_neutral_all), 1) coverage_d = len(s_disruptive_finite) / max(len(s_disruptive_all), 1) if len(s_neutral_finite) < 5 or len(s_disruptive_finite) < 5: raise ValueError( f"Insufficient predictor scores: only {len(s_neutral_finite)} neutral " f"and {len(s_disruptive_finite)} disruptive non-NaN values. " f"Coverage = {coverage_n:.0%} / {coverage_d:.0%}. " f"Predictor matrix may be empty / failed to populate. " f"INVESTIGATE THE PREDICTOR COMPUTATION STEP before continuing." ) if coverage_n < 0.5 or coverage_d < 0.5: print(f"WARNING: predictor coverage only {coverage_n:.0%} (neutral) / " f"{coverage_d:.0%} (disruptive). Results may be biased toward the " f"non-NaN subset.") ``` If you hit the `ValueError`: do not paper over with "predictor X wins by default". The right response is to debug the predictor computation step (re-run, check API keys, check log files) and report what failed. **Sign-convention double-check** (also mandatory if the dataset is new): verify `disruptive_tail` against the data using an internal landmark. The metadata field can be misleading on subsets (e.g. a window of TP53 DBD might run the opposite direction from the full TP53 abundance assay). The cheapest check is Spearman correlation between DMS effect and a predictor known to align in a fixed direction (AlphaMissense pathogenicity score is always positive=more-damaging): ```python from scipy.stats import spearmanr flat_dms = dms_matrix.ravel() flat_pred = predictor_scores.ravel() mask = ~(np.isnan(flat_dms) | np.isnan(flat_pred)) rho, p_corr = spearmanr(flat_dms[mask], flat_pred[mask]) expected_sign = "+" if disruptive_tail == "top" else "-" got_sign = "+" if rho > 0 else "-" if got_sign != expected_sign and abs(rho) > 0.1: print(f"WARNING: Spearman rho = {rho:.3f} (sign={got_sign}) but " f"disruptive_tail='{disruptive_tail}' expects sign={expected_sign}. " f"Sign convention may be inverted for THIS subset. Verify before " f"interpreting MWU.") ``` ### Step 4: One-sided Mann-Whitney U test ```python from scipy.stats import mannwhitneyu u, p = mannwhitneyu(s_disruptive_finite, s_neutral_finite, alternative="greater") print(f"disruptive median = {np.median(s_disruptive_finite):.4f}, " f"neutral median = {np.median(s_neutral_finite):.4f}, p = {p:.3g}") ``` For SAE-style multi-feature predictors with a top-K parameter, run this for K ∈ {1, 3, 10} and report all three — the best K is usually different across predictors and you want the comparison transparent, not tuned. **MWU is the discrimination test, but it's not the only valid analysis.** Consider also reporting (one or both): - **Spearman rank correlation** between predictor and DMS effect — captures graded calibration the binary neutral/disruptive split discards. Best for "is this predictor calibrated?" questions, not just "does it discriminate?" - **AUROC** at a clinically-meaningful disruption threshold (e.g. ΔΔG > 1 kcal/mol) — gives a 0-1 number practitioners recognize and lets you compare to literature predictors directly. The eval that motivated this skill (KRAS folding AlphaMissense benchmark) got *qualitatively similar* answers from MWU (p=0.20, "not reliable") and Spearman (ρ=0.23, p<1e-30, "weakly calibrated") — but Spearman's reading was richer. If the user just asks "is X reliable?", give both unless one is clearly inappropriate for the data shape. ### Step 5: Robustness sweep ```python sweep = [] for neutral_abs in (0.05, 0.1, 0.2): for q in (0.05, 0.1): neut, disr = categorize(dms_matrix, disruptive_tail, neutral_abs=neutral_abs, disruptive_quantile=q) s_n = predictor_scores[neut][~np.isnan(predictor_scores[neut])] s_d = predictor_scores[disr][~np.isnan(predictor_scores[disr])] if len(s_n) < 5 or len(s_d) < 5: continue _u, p = mannwhitneyu(s_d, s_n, alternative="greater") sweep.append({"neutral_abs": neutral_abs, "disruptive_q": q, "p": p, "n_n": len(s_n), "n_d": len(s_d)}) ``` A predictor that passes only at one (neutral_abs, q) point is suspect. A predictor that passes across the grid is robust. ### Step 6: Visualize + interpret ```python import matplotlib.pyplot as plt fig, ax = plt.subplots(figsize=(4, 4)) ax.boxplot( [s_neutral, s_disruptive], labels=[f"neutral (n={len(s_neutral)})", f"disruptive (n={len(s_disruptive)})"] ) ax.set_ylabel("Predictor score") ax.set_title(f"p = {p:.3g}") plt.tight_layout() plt.savefig("predictor_vs_dms.png", dpi=150) ``` --- ## Interpretation | Result | What it means | |---|---| | **p < 0.01 AND robust across sweep** | Predictor reliably distinguishes disruptive from neutral on this protein. Safe to use for prioritization (not classification — see limits) | | **p < 0.05 but flips signs in sweep** | Borderline. Effect exists but is sensitive to stratification — needs more data or tighter neutral band | | **p > 0.05** | Predictor is not informative on this DMS assay. Possible causes: wrong `disruptive_tail`, predictor mis-calibrated for this protein family, DMS measures something the predictor wasn't trained for | | Significant **at K=1 but not K=10** (multi-feature predictors only) | Disruption is concentrated in a few features (K=1 best) vs distributed across many (K=10) | --- ## Comparing two predictors Run this skill twice on the same DMS dataset (e.g. SAE drops AND AlphaMissense), keep the same stratification (`disruptive_tail`, `neutral_abs`, `disruptive_quantile`), and compare: | Metric | Better predictor has | |---|---| | Lower p-value | More confidence of discrimination | | Larger median gap (disruptive − neutral) | Larger effect size | | Fewer NaNs in coverage | Predicts more variants | | Robust across sweep | More reliable in different conditions | --- ## Honest limitations 1. **Per-protein test only**. Generalizing "predictor X is good" requires running this on multiple proteins from different families. 2. **Per-assay only**. A predictor calibrated for stability might fail on binding-fitness assays — same protein, different DMS, different result. 3. **Coarse signal**. This says "predictor responds to mutational disruptiveness." It does NOT say "predictor identifies the *right* residues" — for that, run `tooluniverse-residue-functional-mechanism-interpretation`. 4. **MWU assumes independence**. DMS variants at the same position are mildly correlated (shared structural context); the p-values are slightly optimistic. 5. **Tail definitions are arbitrary**. 5% disruptive cutoff and 0.1 neutral band are reasonable defaults; the right thresholds depend on the assay's signal-to-noise. The sweep is what gives you robustness, not any single choice. --- ## Cross-references | Step | Tool / Skill | |---|---| | DMS retrieval | `MaveDB_get_effect_matrix` | | Per-variant SAE scoring (≤100 variants) | `ESM_score_variant_sae_batch` (preferred — N+1 calls) | | Per-variant SAE scoring (full tensor / unlimited) | `ESM_get_sae_features` (loop + cache), `ESM_score_variant_sae_disruption` (single variant) | | Per-variant AlphaMissense (single lookup) | `AlphaMissense_get_variant_score(uniprot_id, variant)` | | Per-position AlphaMissense (saturation) | `AlphaMissense_get_residue_scores(uniprot_id, position)` | | Whole-protein AlphaMissense (cheapest for DMS) | `AlphaMissense_get_protein_scores(uniprot_id)` | | Per-variant ESM logits | `ESM_score_sequence` | | Structural prior (for predictor analysis) | `Structure_annotate_per_residue` | | Next step: per-hotspot mechanism | `tooluniverse-residue-functional-mechanism-interpretation` | | Final visualization | Step 7 of `tooluniverse-residue-functional-mechanism-interpretation` (annotated heatmap + callouts) |