---
name: genomics
description: Genomics and transcriptomics analysis strategies
---

# Genomics and Transcriptomics Analysis

## When to Use This Skill

- When data contains gene expression measurements (RNA-seq, microarray)
- When analyzing differential gene expression
- When performing pathway or gene set enrichment analysis
- When interpreting genetic variants or mutations

## Core Concepts

### Gene Expression Data Types

**RNA-seq counts:**
- Raw read counts per gene
- Requires normalization (TPM, RPKM, DESeq2)
- Suitable for differential expression analysis

**Microarray intensities:**
- Probe fluorescence intensities
- Log-transformed, background-corrected
- Legacy platform, less common now

**Single-cell RNA-seq:**
- Expression per cell (not bulk tissue)
- High sparsity (many zeros)
- Specialized analysis methods

### Gene Nomenclature

**Human genes:**
- Official symbols: HUGO Gene Nomenclature Committee (HGNC)
- Example: *TP53* (tumor protein p53)
- Italicized in publications

**Mouse genes:**
- Similar to human but capitalization differs
- Example: *Tp53* (first letter capital, rest lowercase)

**Protein names:**
- Not italicized
- Example: p53 protein

**Always verify gene symbols** - aliases and outdated names are common.

## Differential Expression Analysis

### Workflow

```python
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Load expression data (genes × samples)
# Rows = genes, Columns = samples
expr_data = pd.read_csv("expression_data.csv", index_col=0)

# Define groups
group1_samples = ["Sample1", "Sample2", "Sample3"]
group2_samples = ["Sample4", "Sample5", "Sample6"]

results = []

for gene in expr_data.index:
    group1_expr = expr_data.loc[gene, group1_samples]
    group2_expr = expr_data.loc[gene, group2_samples]

    # T-test
    t_stat, p_value = ttest_ind(group1_expr, group2_expr)

    # Fold change
    mean1 = group1_expr.mean()
    mean2 = group2_expr.mean()
    log2fc = np.log2(mean1 / mean2) if mean2 > 0 else np.nan

    results.append({
        "gene": gene,
        "log2FC": log2fc,
        "p_value": p_value,
        "mean_group1": mean1,
        "mean_group2": mean2
    })

results_df = pd.DataFrame(results)

# Multiple testing correction
results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1]

# Define significant genes
significant = results_df[
    (results_df["p_adj"] < 0.05) &
    (abs(results_df["log2FC"]) > 1)  # 2-fold change
]

print(f"Significant genes: {len(significant)}")
print(f"Upregulated: {sum(significant['log2FC'] > 0)}")
print(f"Downregulated: {sum(significant['log2FC'] < 0)}")
```

### Volcano Plot

**Visualize differential expression:**

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(
    results_df["log2FC"],
    -np.log10(results_df["p_adj"]),
    alpha=0.5, s=10, c="gray"
)

# Highlight significant genes
sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1)
plt.scatter(
    results_df.loc[sig_mask, "log2FC"],
    -np.log10(results_df.loc[sig_mask, "p_adj"]),
    alpha=0.7, s=20, c="red", label="Significant"
)

plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(adjusted p-value)")
plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5)
plt.axvline(-1, linestyle="--", color="black", linewidth=0.5)
plt.axvline(1, linestyle="--", color="black", linewidth=0.5)
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
```

## Gene Set Enrichment

### Simple Pathway Enrichment

**When:** You have a list of significant genes and want to know which pathways are affected

```python
# Define gene sets (pathways)
gene_sets = {
    "Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...],
    "Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...],
    "DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...],
    # ... more pathways
}

# Fisher's exact test for enrichment
from scipy.stats import fisher_exact

all_genes = set(expr_data.index)
sig_genes = set(significant["gene"])

enrichment_results = []

for pathway, pathway_genes in gene_sets.items():
    pathway_genes = set(pathway_genes) & all_genes  # Only genes in dataset

    # 2x2 contingency table
    a = len(sig_genes & pathway_genes)  # Sig & in pathway
    b = len(sig_genes - pathway_genes)  # Sig & not in pathway
    c = len(pathway_genes - sig_genes)  # Not sig & in pathway
    d = len(all_genes - sig_genes - pathway_genes)  # Not sig & not in pathway

    oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater')

    enrichment_results.append({
        "pathway": pathway,
        "overlap": a,
        "pathway_size": len(pathway_genes),
        "odds_ratio": oddsratio,
        "p_value": p_value
    })

enrich_df = pd.DataFrame(enrichment_results)
enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1]
enrich_df = enrich_df.sort_values("p_adj")

print(enrich_df.head(10))
```

### Gene Ontology (GO) Terms

**Common GO categories:**
- **Biological Process (BP)**: What the gene does (e.g., "cell cycle", "apoptosis")
- **Molecular Function (MF)**: Biochemical activity (e.g., "kinase activity")
- **Cellular Component (CC)**: Where it acts (e.g., "nucleus", "mitochondrion")

**Resources:**
- Gene Ontology: http://geneontology.org/
- Enrichr: https://maayanlab.cloud/Enrichr/ (web-based enrichment)
- DAVID: https://david.ncifcrf.gov/

### KEGG Pathway Enrichment

**KEGG = Kyoto Encyclopedia of Genes and Genomes**

Provides curated pathway maps for:
- Metabolic pathways
- Signaling pathways
- Disease pathways

**Example pathways:**
- hsa04110: Cell cycle
- hsa04210: Apoptosis
- hsa04151: PI3K-Akt signaling

## Common Analysis Patterns

### Pattern 1: Transcription Factor Activity

**Observation:** Many genes upregulated

**Hypothesis:** Shared transcription factor (TF)

**Test:**
```python
# Check if significant genes share TF binding motifs
tf_targets = {
    "TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...],
    "MYC": ["CDK4", "CCND1", "E2F1", ...],
    # ... more TFs
}

# Test for enrichment (same as pathway enrichment)
```

**Interpretation:** Enrichment suggests TF is active/inactive in condition

### Pattern 2: Pathway Coordination

**Observation:** Genes in same pathway all up/down together

**Interpretation:** Pathway-level regulation (not individual genes)

**Example:**
```
All glycolysis genes ↑↑ → Increased glycolysis
All oxidative phosphorylation genes ↓↓ → Metabolic shift
```

### Pattern 3: Compensatory Response

**Observation:** Opposite regulation of related pathways

**Example:**
```
De novo biosynthesis genes ↓
Salvage pathway genes ↑
→ Metabolic switch to energy-efficient salvage
```

## Correlation Analysis

### Co-expression Networks

**When:** Identify genes that change together

```python
from scipy.stats import pearsonr

# Compute pairwise correlations
genes = significant["gene"].tolist()[:50]  # Top 50 for tractability
corr_matrix = expr_data.loc[genes].T.corr()

# Filter high correlations
high_corr = []
for i in range(len(genes)):
    for j in range(i+1, len(genes)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            high_corr.append({
                "gene1": genes[i],
                "gene2": genes[j],
                "correlation": corr_matrix.iloc[i, j]
            })

print(f"High correlations (|r| > 0.8): {len(high_corr)}")
```

**Interpretation:**
- Positive correlation → co-regulated (same pathway, shared TF)
- Negative correlation → antagonistic regulation

### Network Visualization

```python
import networkx as nx

# Build network
G = nx.Graph()
for item in high_corr:
    G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"]))

# Find communities (clusters of co-expressed genes)
from networkx.algorithms import community
communities = community.greedy_modularity_communities(G)

for i, comm in enumerate(communities):
    print(f"Community {i}: {list(comm)}")
```

## Literature Search Strategies

### Effective Queries

**For gene function:**
```
"[GENE] function"
"[GENE] role in [PROCESS]"
"[GENE] knockout phenotype"
```

**For pathway context:**
```
"[GENE] pathway"
"[GENE] interacting proteins"
"[GENE] regulation"
```

**For disease relevance:**
```
"[GENE] [DISEASE]"
"[GENE] mutation [DISEASE]"
```

### Key Databases

1. **NCBI Gene**: Gene summaries and references
2. **UniProt**: Protein function and domains
3. **STRING**: Protein-protein interactions
4. **GeneCards**: Comprehensive gene info
5. **PubMed**: Literature search

## Genomics-Specific Hypotheses

### Template Hypotheses

**H1: Transcriptional Regulation**
```
"Condition X activates transcription factor [TF], upregulating
target genes [G1, G2, G3] in pathway [P]"
```

**H2: Pathway Activation**
```
"Condition X activates [pathway], evidenced by coordinated
upregulation of pathway genes and increased activity signature"
```

**H3: Epigenetic Regulation**
```
"Condition X alters chromatin state at [locus], changing
expression of genes [G1, G2]"
```

**H4: Post-transcriptional Regulation**
```
"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2],
explaining decreased protein levels despite unchanged mRNA"
```

## Quality Control

Before interpreting results:

- [ ] Check for batch effects (PCA colored by batch)
- [ ] Verify sample labels are correct
- [ ] Check for outlier samples (hierarchical clustering)
- [ ] Confirm expression distribution (should be roughly normal after log transform)
- [ ] Verify normalization (samples should have similar distributions)

## Common Pitfalls

❌ **Ignoring log transformation**
- Expression data should be log-transformed for most analyses
- Fold changes are linear differences in log space

❌ **Using nominal p-values for many genes**
- Always correct for multiple testing (FDR)
- Use adjusted p-values for significance

❌ **Overinterpreting small fold changes**
- log2FC < 0.5 (1.4-fold) may not be biologically meaningful
- Use stricter thresholds for noisy data

❌ **Confusing gene expression with protein activity**
- mRNA ≠ protein levels
- Protein activity may require post-translational modifications

❌ **Cherry-picking genes**
- Don't select genes to fit a story
- Use unbiased pathway enrichment

## Integration with Other Data Types

### Transcriptomics + Metabolomics

**Strategy:**
```
1. Identify differentially expressed metabolic enzymes
2. Map to KEGG pathways
3. Check if corresponding metabolites are changed
4. Build integrated metabolic model
```

**Example:**
```
Gene: PHGDH (phosphoglycerate dehydrogenase) ↑
Metabolite: Serine ↑
→ Integrated finding: Increased serine biosynthesis
```

### Transcriptomics + Proteomics

**Compare mRNA vs protein changes:**
- Concordant (both up/down) → transcriptional regulation
- Discordant (mRNA ≠ protein) → post-transcriptional regulation

## Key Principle

**Gene expression is the messenger, not the message.**

mRNA changes indicate *potential* for protein changes. Always consider:
- Post-transcriptional regulation (miRNA, RNA stability)
- Translational control
- Protein stability and degradation
- Post-translational modifications

Connect expression changes to phenotype through pathways and functional validation.