---
name: tooluniverse-population-genetics
description: Population genetics analysis — allele frequencies (gnomAD, 1000 Genomes), Hardy-Weinberg equilibrium testing, Fst between populations, GWAS associations, evolutionary constraint scores. Use for cross-population variant comparison, ancestry-aware allele frequency lookups, and population-level evolutionary analysis.
disable-model-invocation: true
---

# Population Genetics Analysis

**MC Strategy**: Multiple-choice (MC) population genetics questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match it to the options. Don't try to reason about which option "sounds right."

Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.

## When to Use

Activate this skill when the user asks about:
- Allele frequencies across populations (gnomAD, 1000 Genomes)
- GWAS associations for diseases/traits
- Clinical variant interpretation (ClinVar, VEP)
- Gene-level constraint metrics (pLI, LOEUF, o/e ratios)
- Selection, drift, linkage disequilibrium, or population structure
- Variant annotation and functional consequences

## LOOK UP, DON'T GUESS

Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the `PopGen_hwe_test`, `PopGen_fst`, `PopGen_inbreeding`, and `PopGen_haplotype_count` tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run `popgen_calculator.py` directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.

---

## Tool Quick Reference

| Tool | Key Parameters | Notes |
|------|---------------|-------|
| `gnomad_search_variants` | `query` (REQUIRED) | Resolve rsID to variant_id format "CHR-POS-REF-ALT" |
| `gnomad_get_variant` | `variant_id` (REQUIRED), `dataset` | Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest |
| `gnomad_get_gene_constraints` | `gene_symbol` (REQUIRED) | pLI, o/e ratios. May timeout -- retry once |
| `MyVariant_query_variants` | `query` (REQUIRED) | Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates |
| `EnsemblVEP_annotate_rsid` | `variant_id` (REQUIRED) | Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid" |
| `EnsemblVEP_variant_recoder` | `variant_id` (REQUIRED) | Convert between rsID/HGVS/VCF/SPDI |
| `gwas_get_snps_for_gene` | `gene_symbol` (REQUIRED) | All GWAS SNPs for a gene |
| `gwas_search_associations` | `query` (REQUIRED) | GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes) |
| `gwas_get_variants_for_trait` | `trait` (REQUIRED) | Variants associated with a trait |
| `ClinVar_search_variants` | `gene`, `condition`, `significance` | At least one filter required |
| `RegulomeDB_query_variant` | `rsid` (REQUIRED) | Regulatory scoring (1a=strongest to 7=minimal) |

### Critical Gotchas

1. **gnomAD variant_id**: Format is `"CHR-POS-REF-ALT"` (no "chr" prefix). Always resolve rsIDs via `gnomad_search_variants` first.
2. **gwas_search_associations**: Takes disease/trait names ONLY. Gene names will fail. Use `gwas_get_snps_for_gene` for gene-based lookups.
3. **gwas_search_snps**: BROKEN (HTTP 500). Use `gwas_get_snps_for_gene` instead.
4. **VEP/ClinVar responses**: Format is variable (list, `{data, metadata}`, or `{error}`). Handle all three.
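
Gotchas 1 and 4 can be handled with two small helpers. This is a minimal sketch, not part of ToolUniverse: `normalize_variant_id` and `unwrap_response` are hypothetical names, and the three response shapes are assumed to look exactly as described above.

```python
from typing import Any


def normalize_variant_id(raw: str) -> str:
    """Return a 'CHR-POS-REF-ALT' ID with no 'chr' prefix (hypothetical helper)."""
    parts = raw.strip().replace(":", "-").split("-")
    if len(parts) != 4:
        raise ValueError(f"expected CHR-POS-REF-ALT, got {raw!r}")
    chrom, pos, ref, alt = parts
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}-{int(pos)}-{ref.upper()}-{alt.upper()}"


def unwrap_response(resp: Any) -> list:
    """Normalize a list, {data, metadata}, or {error} response into a list."""
    if isinstance(resp, list):
        return resp
    if isinstance(resp, dict):
        if "error" in resp:
            raise RuntimeError(f"tool returned an error: {resp['error']}")
        if "data" in resp:
            data = resp["data"]
            return data if isinstance(data, list) else [data]
    raise TypeError(f"unrecognized response shape: {type(resp).__name__}")
```

Raising on `{error}` rather than returning an empty list keeps a failed lookup from being mistaken for "no variants found".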

---

## Workflow Patterns

**Variant frequency**: `gnomad_search_variants` -> `gnomad_get_variant(dataset="gnomad_r4")` -> `MyVariant_query_variants` (1000G pop breakdowns) -> `EnsemblVEP_annotate_rsid`

**GWAS for disease**: `gwas_search_associations` -> `gwas_get_variants_for_trait` -> `gnomad_get_variant` for top hits -> `EuropePMC_search_articles`

**Gene characterization**: `gnomad_get_gene_constraints` -> `gwas_get_snps_for_gene` -> `ClinVar_search_variants` -> `PubMed_search_articles`

**Pathogenicity assessment**: `EnsemblVEP_annotate_rsid` -> `MyVariant_query_variants` (CADD, ClinVar) -> `gnomad_get_variant` (frequency) -> `RegulomeDB_query_variant` (if non-coding)

---

## Theoretical Reasoning (CRITICAL for computation problems)

These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.

### Allele Frequency Change Under Selection (delta-q)

For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):
```
delta_q = -s * q^2 * p / (1 - s * q^2)
```
where p = freq(A), q = freq(a), s = selection coefficient.

For a dominant deleterious allele a (fitness: AA=1, Aa=1-s, aa=1-s):
```
delta_q = -s * q * p^2 / (1 - s * q * (2 - q))
```

For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):
```
equilibrium: q_hat = s1 / (s1 + s2)
```
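Example: s1 = 0.1 and s2 = 0.3 give q_hat = 0.1/(0.1+0.3) = 0.25 (and p_hat = s2/(s1+s2) = 0.75). The allele whose homozygote is selected against more strongly ends up rarer.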

**Selection against recessives is slow at low q** because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
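
When unsure which closed form applies, the general one-generation recursion can be computed directly from the three genotype fitnesses. The sketch below assumes random mating and viability selection; the function names are illustrative and are not part of `popgen_calculator.py`.

```python
def next_q(q: float, w_AA: float, w_Aa: float, w_aa: float) -> float:
    """Frequency of allele a after one generation of viability selection."""
    p = 1.0 - q
    # Mean fitness: HWE genotype frequencies weighted by genotype fitness.
    w_bar = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # The a allele comes from half of the Aa survivors plus all aa survivors.
    return (p * q * w_Aa + q * q * w_aa) / w_bar


def delta_q(q: float, w_AA: float, w_Aa: float, w_aa: float) -> float:
    """Change in q over one generation."""
    return next_q(q, w_AA, w_Aa, w_aa) - q


if __name__ == "__main__":
    q, s = 0.1, 0.5
    print("recessive:", delta_q(q, 1, 1, 1 - s))
    print("dominant: ", delta_q(q, 1, 1 - s, 1 - s))
    # At the heterozygote-advantage equilibrium the change should be ~0.
    print("overdominant at q_hat:", delta_q(0.25, 0.9, 1, 0.7))
```

Comparing the output to the closed-form expressions is a quick way to catch a mislabeled allele or a swapped s1/s2.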

### Genetic Drift in Small Populations

**Variance in allele frequency per generation**: Var(delta_p) = p*q / (2*Ne)

**Probability of fixation** of a new neutral mutation: 1/(2*N), i.e., its initial frequency (commonly written 1/(2*Ne) when N ≈ Ne)

**Time to fixation** (given it fixes): ~4*Ne generations for neutral alleles

**Heterozygosity decay**: H_t = H_0 * (1 - 1/(2*Ne))^t

After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))

**Effective population size (Ne)** adjustments:
- Unequal sex ratio: Ne = 4*Nf*Nm / (Nf + Nm)
- Fluctuating size: Ne = harmonic mean of N across generations
- Bottleneck: dominated by the smallest generation size

**Drift vs selection**: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
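
The drift formulas above translate directly into a few one-line functions. This is a sketch with illustrative names (not the bundled script), and the population sizes in the usage lines are made-up inputs.

```python
def heterozygosity(h0: float, ne: float, t: int) -> float:
    """H_t = H_0 * (1 - 1/(2*Ne))^t."""
    return h0 * (1 - 1 / (2 * ne)) ** t


def ne_sex_ratio(n_females: float, n_males: float) -> float:
    """Ne = 4*Nf*Nm / (Nf + Nm) for an unequal breeding sex ratio."""
    return 4 * n_females * n_males / (n_females + n_males)


def ne_harmonic(sizes: list) -> float:
    """Ne across generations = harmonic mean of the per-generation sizes."""
    return len(sizes) / sum(1 / n for n in sizes)


def drift_dominates(s: float, ne: float) -> bool:
    """True when |s| < 1/(2*Ne), i.e. the variant behaves as nearly neutral."""
    return abs(s) < 1 / (2 * ne)


if __name__ == "__main__":
    print(ne_sex_ratio(90, 10))            # 90 females, 10 males
    print(ne_harmonic([1000, 10, 1000]))   # one bottleneck generation
    print(heterozygosity(0.5, 50, 10))
    print(drift_dominates(0.01, 40), drift_dominates(0.01, 60))
```

The harmonic-mean example shows why a single bottleneck generation dominates: one generation of 10 pulls Ne far below the arithmetic mean of the three sizes.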

### Linkage Disequilibrium (LD) Decay

**D** = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.

**Decay with recombination**: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.

**Half-life of LD**: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).

**r-squared** (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.

**Expected r^2 in finite population at equilibrium**: E[r^2] = 1 / (1 + 4*Ne*r) (for drift-recombination balance).

**Practical implications**:
- Tightly linked loci (r < 0.01): LD half-life is at least ~69 generations (0.693/0.01), so detectable LD persists for hundreds of generations
- Loosely linked (r = 0.5, independent assortment): LD halves every generation
- GWAS tag SNPs work because LD extends over blocks; block size depends on Ne and recombination rate
- African populations have shorter LD blocks (larger historical Ne) -> need denser SNP arrays
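
A short sketch of the LD quantities defined in this section, useful for checking hand calculations. Function names are illustrative; inputs are haplotype and allele frequencies as defined above.

```python
import math


def ld_d(p_ab: float, p_a: float, p_b: float) -> float:
    """D = freq(AB) - freq(A) * freq(B)."""
    return p_ab - p_a * p_b


def ld_r2(d: float, p_a: float, p_b: float) -> float:
    """r^2 = D^2 / (pA * pa * pB * pb)."""
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_after(d0: float, r: float, t: int) -> float:
    """D_t = D_0 * (1 - r)^t."""
    return d0 * (1 - r) ** t


def ld_half_life(r: float) -> float:
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2) / math.log(1 - r)


def expected_r2(ne: float, r: float) -> float:
    """Drift-recombination equilibrium: E[r^2] = 1 / (1 + 4*Ne*r)."""
    return 1 / (1 + 4 * ne * r)


if __name__ == "__main__":
    d = ld_d(0.4, 0.5, 0.5)
    print(d, ld_r2(d, 0.5, 0.5))
    print(ld_half_life(0.01), ld_half_life(0.5))
    print(expected_r2(10_000, 0.001))
```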

### Hardy-Weinberg Equilibrium

For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.

**Chi-square test**: df=1 (2 alleles). Preferred: use `PopGen_hwe_test` tool. Fallback: `popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3`.

**Causes of HWE departure**: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
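
If neither the tool nor the script is available, the chi-square test can be reproduced with the standard library alone. This sketch is not the implementation behind `PopGen_hwe_test`; it uses the identity that the df=1 chi-square survival function equals `erfc(sqrt(x/2))`.

```python
import math


def hwe_chi_square(n_AA: int, n_Aa: int, n_aa: int):
    """Return (chi2, p_value) for a two-allele HWE goodness-of-fit test (df=1)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # allele frequency estimated from the sample
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip classes with zero expectation (monomorphic sample) to avoid 0/0.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # exactly at HWE proportions
    print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit
```

df is 1 rather than 2 because one parameter (p) is estimated from the same genotype counts.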

### Heritability

- **H^2 (broad-sense)** = V_G / V_P; **h^2 (narrow-sense)** = V_A / V_P
- V_G includes ALL genetic variance: additive + dominance + epistasis. Trap: "broad-sense" is not just additive.
- Under HWE with two alleles (p, q): genotype frequencies are p^2, 2pq, q^2
- Phenotype frequency from genotype: sum(genotype_freq * penetrance) for each genotype class
- For quantitative traits: V_P = V_G + V_E (no covariance assumed)
- With dominance: assign genotypic values (e.g., AA=a, Aa=d, aa=-a), compute mean, then V_G from freq-weighted squared deviations
- **PGS vs SNP-h² trap**: PGS R² is NOT necessarily ≤ h²_SNP. With large GWAS, PGS can exceed SNP-h² by tagging rare causal variants through LD with common SNPs. The word "necessarily" makes this claim False. h²_SNP is estimated from common variants; PGS can capture additional variance.
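
The single-locus dominance case can be worked numerically. The sketch below computes V_G from frequency-weighted squared deviations as described above, and also splits it into V_A and V_D using the standard single-locus decomposition (average effect alpha = a + d(q - p)); that decomposition is an addition not stated in the list above, and `v_e` in the usage lines is a made-up value.

```python
def variance_components(p: float, a: float, d: float):
    """Return (mean, V_G, V_A, V_D) for genotypic values AA=a, Aa=d, aa=-a under HWE."""
    q = 1 - p
    freqs = (p * p, 2 * p * q, q * q)
    values = (a, d, -a)
    mean = sum(f * g for f, g in zip(freqs, values))
    # V_G: frequency-weighted squared deviations from the population mean.
    v_g = sum(f * (g - mean) ** 2 for f, g in zip(freqs, values))
    alpha = a + d * (q - p)          # average effect of an allele substitution
    v_a = 2 * p * q * alpha ** 2     # additive variance
    v_d = (2 * p * q * d) ** 2       # dominance variance
    return mean, v_g, v_a, v_d


if __name__ == "__main__":
    mean, v_g, v_a, v_d = variance_components(p=0.5, a=1.0, d=1.0)
    v_e = 0.25                       # hypothetical environmental variance
    v_p = v_g + v_e
    print("H^2 =", v_g / v_p, "h^2 =", v_a / v_p)
```

With complete dominance (d = a) at p = 0.5, a third of V_G is dominance variance, which is why H^2 and h^2 differ in this example.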

### Path Analysis (Causal Diagrams)

- Trace ALL paths from cause to effect through the diagram (direct + indirect)
- Each path's contribution = product of path coefficients along that path
- Total effect (correlation) = sum of contributions from all paths
- Indirect effects can mask (suppression) or inflate (confounding) the direct effect
- Unanalyzed correlations (double-headed arrows) count as valid path segments
- **Never ignore indirect paths** — the total is rarely just the direct arrow
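
For diagrams with only single-headed arrows, the tracing rules above reduce to summing coefficient products over every directed path. A minimal sketch, assuming the diagram is acyclic; it does not handle double-headed (unanalyzed correlation) arrows, and the edge values are made-up.

```python
def total_effect(edges: dict, source: str, target: str) -> float:
    """Sum, over all directed paths source -> target, of the product of coefficients.

    `edges` maps (from_node, to_node) -> path coefficient. Assumes no cycles.
    """
    def walk(node: str, product: float) -> float:
        if node == target:
            return product
        return sum(
            walk(dst, product * coef)
            for (src, dst), coef in edges.items()
            if src == node
        )

    return walk(source, 1.0)


if __name__ == "__main__":
    edges = {("X", "Y"): 0.5, ("X", "M"): 0.6, ("M", "Y"): -0.4}
    # Direct path 0.5 plus indirect path 0.6 * -0.4.
    print(total_effect(edges, "X", "Y"))
```

Here the negative indirect path partially cancels the direct effect, which is the suppression case mentioned in the list.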

### Genetic Combinatorics (F2 crosses, haplotype counting)

For n SNPs between two inbred (homozygous) strains:
- F1 is heterozygous at all n loci
- F2 distinct haplotypes (unique chromosomes) = 2^n, counting every possible combination (each SNP contributes the parental A or B allele). Example: 5 SNPs → 2^5 = 32; if the question asks for "novel" haplotypes, subtract the 2 parental haplotypes → 30
- F2 distinct diploid genotypes = 3^n (AA, AB, BB at each locus)
- **ALWAYS write and run Python code** (`python3 -c "..."`) for these counts. Never enumerate by hand.
- For specimens/counting from field data: parse the data into a structure and compute programmatically.
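
A brute-force enumeration confirms the closed-form counts for small n. This sketch assumes every allele combination is reachable (recombination possible between all adjacent SNPs); `f2_counts` is an illustrative name.

```python
from itertools import product


def f2_counts(n_snps: int) -> dict:
    """Enumerate haplotypes and per-locus genotype combinations for n SNPs."""
    haplotypes = set(product("AB", repeat=n_snps))
    parental = {tuple("A" * n_snps), tuple("B" * n_snps)}
    genotypes = set(product(("AA", "AB", "BB"), repeat=n_snps))
    return {
        "haplotypes": len(haplotypes),                   # 2^n
        "novel_haplotypes": len(haplotypes - parental),  # 2^n - 2
        "genotypes": len(genotypes),                     # 3^n
    }


if __name__ == "__main__":
    print(f2_counts(5))
```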

### Mutation-Selection Balance

Equilibrium frequency of a deleterious allele:
- Recessive lethal: q_hat = sqrt(mu/s) ~ sqrt(mu) when s=1
- Dominant lethal: q_hat = mu/s
- Example: mu=1e-5, s=1 (recessive lethal) -> q_hat = sqrt(1e-5) ~ 0.0032 (carrier freq 2pq ~ 0.0063)
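
The equilibrium formulas above can be checked in a few lines. A sketch with illustrative function names:

```python
import math


def q_hat_recessive(mu: float, s: float) -> float:
    """Mutation-selection balance for a recessive deleterious allele: sqrt(mu/s)."""
    return math.sqrt(mu / s)


def q_hat_dominant(mu: float, s: float) -> float:
    """Mutation-selection balance for a dominant deleterious allele: mu/s."""
    return mu / s


def carrier_frequency(q: float) -> float:
    """Heterozygote frequency 2pq under HWE."""
    return 2 * (1 - q) * q


if __name__ == "__main__":
    q = q_hat_recessive(1e-5, 1.0)
    print(q, carrier_frequency(q))
    print(q_hat_dominant(1e-5, 1.0))
```

The recessive equilibrium is orders of magnitude higher than the dominant one at the same mu because selection only sees the rare aa homozygotes.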

### F-statistics and Population Structure

- **Fis**: Inbreeding within subpopulations (heterozygote deficit within demes)
- **Fst**: Differentiation between subpopulations. Fst = Var(p) / (p_bar * q_bar)
- **Fit**: Total inbreeding. (1-Fit) = (1-Fis)(1-Fst)
- Fst interpretation: <0.05 little, 0.05-0.15 moderate, 0.15-0.25 great, >0.25 very great differentiation
- Preferred: use `PopGen_fst` tool. Fallback: `popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2`
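
For a quick hand check, the variance-based definition above can be computed directly from subpopulation allele frequencies. Note this is the simple unweighted Wright estimator, not the Weir-Cockerham estimator that the `PopGen_fst` tool is described as using, so the two can differ for finite or unequal sample sizes.

```python
def wright_fst(freqs: list) -> float:
    """Fst = Var(p) / (p_bar * q_bar), equal weight per subpopulation."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    if p_bar in (0.0, 1.0):
        raise ValueError("locus is monomorphic across subpopulations")
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k  # population variance
    return var_p / (p_bar * (1 - p_bar))


def f_it(f_is: float, f_st: float) -> float:
    """Total inbreeding from (1 - Fit) = (1 - Fis) * (1 - Fst)."""
    return 1 - (1 - f_is) * (1 - f_st)


if __name__ == "__main__":
    print(wright_fst([0.2, 0.8]))
    print(wright_fst([0.0, 1.0]))   # fixed difference between demes
    print(f_it(0.1, 0.2))
```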

---

## Mendelian Genetics Reasoning Framework

For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.

### Step 1: Identify genes, locations, and allele relationships
- List every gene involved in the cross
- Determine chromosomal location: autosomal vs X-linked (X-linked genes show different inheritance in males vs females)
- Determine allele relationships: dominant/recessive, codominant, incomplete dominance
- Note any epistasis, suppressor, or modifier interactions between genes

### Step 2: Write parental genotypes explicitly
- Use standard notation (e.g., Aa Bb for autosomal; X^w X^+ for X-linked)
- For X-linked genes, males are hemizygous (X^w Y), not homozygous
- If parental genotypes are not given, deduce them from phenotypes and pedigree context

### Step 3: Draw Punnett square(s) for each gene
- For multi-gene crosses, handle each gene independently (if unlinked) then combine
- For linked genes, use recombination frequency to adjust gamete ratios
- For X-linked genes, remember that fathers pass X to all daughters and Y to all sons

### Step 4: Calculate expected phenotypic ratios
- Multiply independent gene ratios (e.g., 3:1 x 3:1 = 9:3:3:1)
- For X-linked: calculate male and female ratios separately, then combine or report separately as required

### Step 5: Verify ratios sum to 1.0
- Convert all ratios to fractions and confirm they sum to 1
- If they don't sum to 1, there is an error in the Punnett square or gamete calculation

### Step 6: Apply phenotype modification rules AFTER computing genotypic ratios
- For epistasis: first compute the full genotypic ratios (e.g., 9:3:3:1), then collapse genotype classes that produce the same phenotype
- For suppressor genes: a suppressor homozygote (su/su) restores wild-type in an otherwise mutant background. Apply suppression AFTER determining which individuals carry the mutant allele
- Example: 9 A_B_ : 3 A_bb : 3 aaB_ : 1 aabb with recessive epistasis (aa masks B) becomes 9:3:4
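
Steps 3 to 6 can be run as an enumeration: build the 16 equally likely gamete pairings first, then collapse them with a phenotype rule. A sketch for an AaBb x AaBb cross with unlinked genes; the phenotype labels are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def f2_phenotypes(phenotype_of) -> dict:
    """Phenotype fractions for AaBb x AaBb, given a rule (gene_a, gene_b) -> label."""
    gametes = list(product("Aa", "Bb"))  # 4 equally likely gametes per parent
    counts = Counter()
    for (a1, b1), (a2, b2) in product(gametes, repeat=2):
        counts[phenotype_of(a1 + a2, b1 + b2)] += Fraction(1, 16)
    return dict(counts)


def recessive_epistasis(gene_a: str, gene_b: str) -> str:
    """aa masks the B locus; otherwise B is dominant over b."""
    if "A" not in gene_a:
        return "aa (B masked)"
    return "A_B_" if "B" in gene_b else "A_bb"


if __name__ == "__main__":
    result = f2_phenotypes(recessive_epistasis)
    print(result)
    print("sum =", sum(result.values()))  # Step 5: must equal 1
```

Swapping in a different `phenotype_of` rule covers other modified dihybrid ratios without redoing the Punnett square.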

---

## E. coli Hfr Mapping Framework

For bacterial conjugation and Hfr mapping problems:

### Core Principles
- In Hfr x F- crosses, the Hfr chromosome is transferred linearly starting from the origin of transfer (oriT)
- **Gene transfer order = chromosomal order from the origin**
- Early markers (entering first) are closest to the origin of transfer
- Late markers (entering last) are farthest from the origin

### Interrupted Mating Experiments
- Genes that appear in recombinants at earlier time points are closer to oriT
- The time of entry gives the order and approximate distance between genes
- Recombinants require integration by homologous recombination (double crossover)

### Recombination Frequency Between Markers
- **KEY TRAP**: Highest recombination frequency occurs between markers that are FARTHEST APART on the transferred segment
- This is because more DNA separates distant markers (reflected in the longer time between their entry), so a crossover is more likely to fall between them
- Conversely, markers that enter close together in time show LOW recombination between them
- Do NOT confuse "highest recombination frequency" with "first markers to enter" -- these are opposite concepts

### Ordering Markers from Hfr Data
1. Use time-of-entry data to establish gene order relative to oriT
2. Use recombination frequency data between pairs of selected markers to confirm/refine order
3. Multiple Hfr strains with different origins can be used to build a circular map

---

## MCQ Elimination Strategy for Genetics

### General MCQ Protocol
1. **ALWAYS evaluate ALL options** before choosing an answer
2. Never select the first option that seems correct -- there may be a better or more precise answer
3. Read the question stem carefully for qualifiers: "MOST likely", "LEAST likely", "NOT true", "ALWAYS", "NEVER"

### "Which is NOT true" Questions
- Evaluate EACH statement independently as True or False
- Mark each option with T or F before selecting
- The answer is the statement marked F
- Double-check: verify the "false" statement is genuinely false, not just misleadingly worded

### "Which mechanism" Questions
- Test each proposed mechanism against ALL observations given in the question
- A correct mechanism must explain every observation, not just some
- Eliminate mechanisms that contradict even one observation

### Specific Traps to Watch For
- **Subfunctionalization vs neofunctionalization**: Subfunctionalization = partitioning of EXISTING ancestral functions between duplicates (both copies needed to perform original function). Neofunctionalization = one copy acquires a genuinely NEW function not present in the ancestor
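- **Copy-neutral LOH**: In somatic cells this usually arises by mitotic recombination and is segmental (affects part of a chromosome), in contrast to whole-chromosome uniparental disomy (UPD). Some sources call segmental copy-neutral LOH "acquired UPD", so check how the question defines its terms before eliminating an option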
- **Penetrance vs expressivity**: Penetrance = fraction of individuals with genotype who show ANY phenotype. Expressivity = degree/severity of phenotype among those who show it. These are distinct concepts
- **Complementation vs recombination**: Complementation = two mutations in DIFFERENT genes restore wild-type in trans. Recombination = exchange between two mutations in the SAME or different genes. Complementation is tested in F1 (heterozygote); recombination is tested in progeny

---

## Common Genetics Reasoning Traps

These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.

### Suppressor Genetics
- A suppressor mutation, when homozygous, restores wild-type phenotype in an otherwise mutant background
- In F2 crosses involving both the original mutation and an autosomal recessive suppressor:
  - Treat as a dihybrid cross — the primary mutation and the suppressor segregate independently
  - Only 1/4 of F2 are homozygous for the suppressor
  - The suppressor only acts in individuals that are also homozygous for the primary mutation
  - Use a Punnett square to enumerate all genotypic classes, then apply the suppression rule to determine phenotypes
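
The dihybrid enumeration can be done programmatically. This sketch assumes an F1 intercross (m/+ ; su/+ x m/+ ; su/+) in which both the primary mutation and the suppressor are recessive, autosomal, and unlinked; the genotype labels are illustrative.

```python
from fractions import Fraction
from itertools import product

# Genotype distribution at one locus from a het x het cross ("-" = mutant allele).
LOCUS = {"+/+": Fraction(1, 4), "+/-": Fraction(1, 2), "-/-": Fraction(1, 4)}


def suppressor_f2() -> dict:
    """Phenotype fractions when su/su suppresses the m/m mutant phenotype."""
    totals = {"wild-type": Fraction(0), "mutant": Fraction(0)}
    for (m_gt, m_prob), (su_gt, su_prob) in product(LOCUS.items(), repeat=2):
        # Mutant only if homozygous for m AND not homozygous for the suppressor.
        is_mutant = m_gt == "-/-" and su_gt != "-/-"
        totals["mutant" if is_mutant else "wild-type"] += m_prob * su_prob
    return totals


if __name__ == "__main__":
    print(suppressor_f2())
```

Under these assumptions the F2 splits 13 wild-type : 3 mutant, since 1/16 of progeny are m/m ; su/su and are rescued.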

### Non-disjunction (Bridges' Experiments)
- Bridges used non-disjunction to prove the chromosome theory of inheritance
- X0 males arise from female meiosis non-disjunction events
- **Meiosis I non-disjunction**: both X chromosomes go to one pole -> XX egg + O egg (nullo-X)
- **Meiosis II non-disjunction**: sister chromatids fail to separate -> XX egg from one secondary oocyte
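- The classic Bridges result (white-eyed female X^w X^w x red-eyed male X^+ Y): exceptional white-eyed females are X^w X^w Y (XX egg + Y sperm), and exceptional red-eyed males are X^+ 0 (nullo-X egg + X^+ sperm; X0 males are sterile). Nullo-X egg + Y sperm (Y0) is lethal, and XX egg + X sperm (XXX) rarely survives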
- Key distinction: know which type of non-disjunction (MI vs MII) produces which specific gamete types

### GWAS LD Blocks
- SNPs WITHIN the same LD block are correlated and can inflate false positive associations (one causal SNP drags along non-causal tag SNPs)
- SNPs ACROSS different LD blocks are largely independent and do NOT create misleading cross-locus associations
- LD block structure varies by population (shorter in African populations due to larger historical Ne)
- Fine-mapping within an LD block is needed to distinguish the causal variant from hitchhiking tag SNPs

### Gene Retention After Whole-Genome Duplication
- **Neofunctionalization**: One copy acquires a NEW function -> most commonly cited reason for gene RETENTION after duplication (preserves both copies because each is now essential)
- **Subfunctionalization**: Ancestral functions are PARTITIONED between copies -> explains DIVERGENCE of duplicate copies, but both copies must be retained to maintain the full ancestral function
- **Dosage balance**: Some genes are retained in duplicate to maintain stoichiometric balance in protein complexes
- Trap: Questions may ask "what explains retention" vs "what explains divergence" -- these have different best answers
- For retention: neofunctionalization (new function makes both copies essential)
- For divergence of expression/function: subfunctionalization (partitioning of ancestral roles)

---

## Advanced Genetics Traps v2

### PGS vs Heritability: "Necessarily True" Logic
For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.

### Path Diagram Sign Assignment Protocol
Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:
1. **Establish reference direction**: What varies? What is increasing?
2. **For each path X→Y**: Ask ONLY "when X increases, does Y increase (+) or decrease (-)?"
3. **Use the question's experimental context** (knockout/control comparisons, provided data) to determine signs — not intuition
4. **Expect negative paths**: Path diagrams test your ability to identify negative relationships. All-positive is almost always wrong. Direct residual paths (e) often have opposite sign from expectation.

### Chi-Square: "Most Likely to Reject" Protocol
Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
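
The protocol can be scripted so each answer choice is scored the same way. A sketch using standard 5% critical values for df 1 to 3; the observed counts in the usage lines are made-up.

```python
# Chi-square critical values at alpha = 0.05.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_vs_ratio(observed: list, ratio: list):
    """Return (chi2, df) for observed class counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


def rejects(observed: list, ratio: list) -> bool:
    """True if the expected ratio is rejected at the 5% level."""
    chi2, df = chi_square_vs_ratio(observed, ratio)
    return chi2 > CRITICAL_05[df]


if __name__ == "__main__":
    for counts in ([90, 30, 30, 10], [100, 20, 30, 10], [70, 50, 30, 10]):
        chi2, df = chi_square_vs_ratio(counts, [9, 3, 3, 1])
        print(counts, round(chi2, 3), df, rejects(counts, [9, 3, 3, 1]))
```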

### LD and Misleading GWAS Associations
LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.

### Low-Frequency Allele Detection
Duplex sequencing (unique molecular identifiers + double-strand consensus) detects alleles at 0.01% frequency — far below standard NGS even at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants because the Illumina error rate (~0.1%) masks variants rarer than ~1% regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.

---

## Bundled Computation Script

**Script**: `skills/tooluniverse-population-genetics/scripts/popgen_calculator.py`

**Preferred**: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:
- `PopGen_hwe_test` tool -- HWE chi-square test. Fallback: `popgen_calculator.py --type hwe`
- `PopGen_fst` tool -- Weir-Cockerham Fst. Fallback: `popgen_calculator.py --type fst`
- `PopGen_inbreeding` tool -- Inbreeding coefficient from pedigree. Fallback: `popgen_calculator.py --type inbreeding`
- `PopGen_haplotype_count` tool -- Expected haplotype diversity. Fallback: `popgen_calculator.py --type haplotypes`

**Fallback script** modes (all require `--type`):
- `hwe`: `--AA N --Aa N --aa N` -- chi-square HWE test with p-value
- `fst`: `--p1 F --p2 F --n1 N --n2 N` -- Weir-Cockerham Fst
- `inbreeding`: `--pedigree TYPE --generations G` -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
- `haplotypes`: `--snps N --generations G --recomb_rate R` -- expected haplotype diversity

---

## Key Concepts

- **MAF**: Minor allele frequency. Common: >5%. Rare: <1%. Ultra-rare: <0.01%.
- **pLI**: P(LoF intolerant). >0.9 = haploinsufficient gene.
- **LOEUF**: Loss-of-function observed/expected upper bound fraction (upper bound of the 90% confidence interval on the LoF o/e ratio). <0.35 = highly constrained.
- **CADD PHRED**: >=10 top 10%, >=20 top 1%, >=30 top 0.1% most deleterious.
- **Genome-wide significance**: GWAS p < 5e-8 (Bonferroni for ~1M independent tests).
- **Effect size**: OR > 1 = risk allele, < 1 = protective. Beta > 0 = increases trait.

## Evidence Grading

- **T1**: ClinVar pathogenic/likely pathogenic, FDA pharmacogenomics
- **T2**: gnomAD frequencies, GTEx eQTLs, GWAS genome-wide significant
- **T3**: CADD/SIFT/PolyPhen predictions, RegulomeDB, constraint metrics
- **T4**: VEP consequence terms, dbSNP annotations, literature mentions