# Cancer Multi-Omics Analysis

Integrated analysis of expression, mutation, copy number, and methylation data from TCGA and GEO for solid tumor characterization.

## When to Use This Skill

Activate when the user requests:
- TCGA data download and processing for any cancer type
- Differential expression analysis between tumor conditions
- Pathway and gene set enrichment analysis
- Mutation landscape analysis (oncoplot, signatures, co-occurrence)
- Copy number variation analysis
- DNA methylation analysis (450K/EPIC arrays)
- Integration of two or more omics layers

## Inputs

| Data Type | Format | Source |
|-----------|--------|--------|
| Expression | Raw counts (STAR - Counts) | TCGA via TCGAbiolinks (v2.38+), GEO via GEOquery |
| Mutations | MAF | TCGA GDC, cBioPortal |
| Copy number | Segment files, GISTIC2.0 output | TCGA GDC, GDAC Firehose |
| Methylation | IDAT files or beta-value matrices | TCGA GDC, GEO |
| Clinical | Tabular | TCGA GDC, cBioPortal |

---

## Expression Analysis

### Data Retrieval

```r
library(TCGAbiolinks)  # v2.38.0+, Bioconductor 3.22

query <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se <- GDCprepare(query, directory = "GDCdata")
# Returns SummarizedExperiment; raw counts in assay(se, "unstranded")
# TPM in assay(se, "tpm_unstrand") — use for visualization only, never for DE

# Extract clinical data
clinical <- as.data.frame(colData(se))
```

### GEO Data Retrieval (Alternative)

```r
library(GEOquery)
gse <- getGEO("GSE72094", GSEMatrix = TRUE)[[1]]
# For count data from GEO Supplementary files:
# Download counts matrix manually, read with read.csv/read.delim
```

### Normalization Decision Tree

```
Input type?
  Raw counts (HTSeq, STAR)
    -> DE analysis: feed directly to DESeq2 (handles normalization internally)
    -> Visualization (PCA, heatmap): apply vst() or rlog() from DESeq2
    -> Cross-sample comparison: vst() preferred for n > 30 samples (faster than rlog)
  TPM/FPKM (already normalized)
    -> DE analysis: STOP. Go back, get raw counts. TPM/FPKM invalid for DESeq2/edgeR.
    -> Correlation/visualization: log2(TPM + 1), acceptable
    -> Gene set scoring (ssGSEA, GSVA): TPM acceptable as input
  RSEM expected counts (non-integer)
    -> round() before DESeq2: DESeqDataSetFromMatrix(countData = round(counts), ...)
```

### Differential Expression (DESeq2 v1.50+)

```r
library(DESeq2)  # v1.50.2+, Bioconductor 3.22
set.seed(42)

counts_mat <- assay(se, "unstranded")
col_data <- colData(se)

# Ensure consistent sample ordering
stopifnot(identical(colnames(counts_mat), rownames(col_data)))

dds <- DESeqDataSetFromMatrix(
  countData = counts_mat,
  colData = col_data,
  design = ~ sample_type  # "Primary Tumor" vs "Solid Tissue Normal"
)

# Pre-filter: keep genes with >= 10 counts in at least 5 samples
# More stringent than rowSums >= 10; reduces false discoveries
keep <- rowSums(counts(dds) >= 10) >= 5
dds <- dds[keep, ]

# Set reference level explicitly
dds$sample_type <- relevel(factor(dds$sample_type), ref = "Solid Tissue Normal")

dds <- DESeq(dds)

# Check available contrasts before extracting results
resultsNames(dds)

res <- results(dds,
  name = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal",
  alpha = 0.05
)

# LFC shrinkage for ranking and visualization (not for significance testing)
# apeglm is recommended: Cauchy prior, uses likelihood directly
res_shrunk <- lfcShrink(dds,
  coef = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal",
  type = "apeglm"
)

# Export significant genes
sig <- subset(as.data.frame(res_shrunk), padj < 0.05 & abs(log2FoldChange) > 1)
sig <- sig[order(sig$padj), ]
write.csv(sig, "results/significant_degs.csv")
message(sprintf("DEGs (padj<0.05, |LFC|>1): %d up, %d down",
  sum(sig$log2FoldChange > 0), sum(sig$log2FoldChange < 0)))
```

### Multi-Factor Design (Batch + Condition)

```r
# When batch information is available, include in design
dds <- DESeqDataSetFromMatrix(
  countData = counts_mat, colData = col_data,
  design = ~ batch + sample_type  # batch first, condition last
)
# DESeq2 accounts for batch in the model; no separate correction needed for DE
```

### Batch Correction for Visualization

```
Batch correction decision:
  For DE testing -> Include batch in DESeq2 design formula. Do NOT pre-correct.
  For PCA/heatmap (log-scale data) -> limma::removeBatchEffect on vst/rlog values
  For downstream tools requiring counts -> ComBat-seq (sva package) on raw counts
  NEVER: ComBat (original) on raw counts. NEVER: removeBatchEffect before DE.
```

```r
# Visualization with batch removal
vsd <- vst(dds, blind = FALSE)
library(limma)
assay(vsd) <- removeBatchEffect(assay(vsd), batch = vsd$batch)
# Use this corrected vsd for PCA and heatmaps only

# ComBat-seq for count-level correction (when other tools need corrected counts)
library(sva)
adjusted_counts <- ComBat_seq(
  counts = as.matrix(counts_mat),
  batch = col_data$batch,
  group = col_data$sample_type  # preserve biological signal
)
```

### Gene ID Conversion

```r
# TCGA uses Ensembl IDs (ENSG...). Convert to symbols for downstream analysis.
library(org.Hs.eg.db)

# Method 1: AnnotationDbi (fast, local)
gene_symbols <- mapIds(org.Hs.eg.db,
  keys = sub("\\..*", "", rownames(res)),  # strip version suffix
  keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first")

# Remove unmapped and duplicates
res$symbol <- gene_symbols[sub("\\..*", "", rownames(res))]
res <- res[!is.na(res$symbol) & !duplicated(res$symbol), ]
```

### Pathway and Enrichment Analysis

#### Over-Representation Analysis (ORA)

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# ORA on significant DEGs — use when you have a clear gene list
sig_genes <- res$symbol[res$padj < 0.05 & abs(res$log2FoldChange) > 1]

ego <- enrichGO(
  gene = sig_genes, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
  ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.2
)

# KEGG enrichment (requires Entrez IDs)
entrez_ids <- mapIds(org.Hs.eg.db, keys = sig_genes,
  keytype = "SYMBOL", column = "ENTREZID")
ekegg <- enrichKEGG(gene = entrez_ids[!is.na(entrez_ids)],
  organism = "hsa", pvalueCutoff = 0.05)
```

#### Gene Set Enrichment Analysis (GSEA)

```r
# GSEA on the FULL ranked gene list — never pre-filter
# Rank by: sign(log2FC) * -log10(pvalue) or stat column from DESeq2
gene_list <- res_shrunk$log2FoldChange
names(gene_list) <- res_shrunk$symbol
gene_list <- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)

# Remove duplicates (keep highest absolute value)
gene_list <- gene_list[!duplicated(names(gene_list))]

gsea_go <- gseGO(
  geneList = gene_list, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
  ont = "BP", minGSSize = 15, maxGSSize = 500,
  pvalueCutoff = 0.05, eps = 0, seed = TRUE
)
```

#### MSigDB Gene Sets (Hallmark, C2, C5)

```r
library(msigdbr)
# Hallmark gene sets (most interpretable for cancer biology)
h_sets <- msigdbr(species = "Homo sapiens", category = "H")
h_list <- split(h_sets$gene_symbol, h_sets$gs_name)

gsea_h <- GSEA(
  geneList = gene_list,
  TERM2GENE = h_sets[, c("gs_name", "gene_symbol")],
  minGSSize = 15, maxGSSize = 500,
  pvalueCutoff = 0.05, eps = 0, seed = TRUE
)
```

#### Sample-Level Pathway Scoring (GSVA/ssGSEA)

```r
library(GSVA)
# GSVA transforms expression matrix to pathway activity scores per sample
# Input: log2-transformed expression (TPM or vst counts), not raw counts
expr_mat <- assay(vst(dds, blind = FALSE))

gsva_params <- gsvaParam(expr_mat, h_list, kcdf = "Gaussian")
gsva_scores <- gsva(gsva_params)
# Output: pathway x sample matrix of enrichment scores
# Use for: clustering, survival association, heatmaps
```

### Visualization

```r
library(ggplot2); library(ggrepel); library(ComplexHeatmap)

# --- PCA ---
vsd <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vsd, intgroup = "sample_type", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = sample_type)) +
  geom_point(size = 2) +
  labs(x = paste0("PC1: ", percentVar[1], "%"),
       y = paste0("PC2: ", percentVar[2], "%")) +
  theme_classic(base_size = 12)

# --- Volcano plot ---
df <- as.data.frame(res_shrunk)
df$sig <- ifelse(df$padj < 0.05 & abs(df$log2FoldChange) > 1,
  ifelse(df$log2FoldChange > 0, "Up", "Down"), "NS")
top10 <- head(df[order(df$padj), ], 10)
ggplot(df, aes(log2FoldChange, -log10(padj), color = sig)) +
  geom_point(size = 0.5, alpha = 0.5) +
  scale_color_manual(values = c(Up = "#D55E00", Down = "#0072B2", NS = "grey70")) +
  geom_text_repel(data = top10, aes(label = rownames(top10)),
    size = 3, color = "black", max.overlaps = 15) +
  theme_classic(base_size = 12)
```

### Output Specification

| Output | Format | Description |
|--------|--------|-------------|
| `significant_degs.csv` | CSV | DEGs with log2FC, padj, gene symbol |
| `dds.rds` | RDS | DESeqDataSet object for downstream use |
| `vsd.rds` | RDS | VST-transformed data for visualization |
| `gsva_scores.rds` | RDS | Pathway x sample enrichment matrix |
| `pca_plot.pdf` | PDF | PCA colored by condition |
| `volcano_plot.pdf` | PDF | Volcano with top genes labeled |

### Validation Checks

```
After running expression analysis, verify:
  DEG count: Expect 1000-5000 DEGs for tumor vs normal (varies by cancer type)
    TCGA-LUAD: ~3000-4000 DEGs at padj<0.05, |LFC|>1
    < 100 DEGs: check design formula, sample labels, reference level
    > 10000 DEGs: check for confounders, tumor purity bias
  PCA: tumor and normal should separate on PC1 or PC2
  Known markers: TP53, EGFR, KRAS mutations reflected in expression?
  Direction: oncogenes should trend up in tumor, TSGs down
  Volcano symmetry: roughly symmetric unless strong biological bias expected
```

## Mutation Analysis

### Data Retrieval

```r
library(TCGAbiolinks)
library(maftools)  # v2.22+, Bioconductor 3.22

# Method 1: GDCquery_Maf (downloads open-access MAF, hg38-aligned)
# Pipelines: mutect2, muse, varscan2, somaticsniper
maf_df <- GDCquery_Maf("LUAD", pipelines = "mutect2")
maf <- read.maf(maf = maf_df, clinicalData = clinical_df)

# Method 2: GDC query (more control over filters)
query_maf <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  access = "open"
)
GDCdownload(query_maf, directory = "GDCdata")
maf_file <- GDCprepare(query_maf, directory = "GDCdata")
maf <- read.maf(maf = maf_file, clinicalData = clinical_df)
```

### Pipeline Selection

```
Which caller to use?
  Multi-caller consensus (MC3) -> Most stringent, used in TCGA PanCancer Atlas
    Access via: maftools::tcgaLoad(study = "LUAD")
  Single caller:
    MuTect2 -> Best sensitivity for low-frequency variants
    MuSE -> Good for paired tumor-normal, models tumor heterogeneity
    VarScan2 -> Works with and without matched normal
    SomaticSniper -> Bayesian, good specificity
  For most analyses -> MuTect2 or MC3 consensus
```

### Visualization

```r
plotmafSummary(maf, rmOutlier = TRUE, addStat = "median")

oncoplot(maf, top = 20,
  clinicalFeatures = c("ajcc_pathologic_stage", "gender"),
  sortByAnnotation = TRUE)

# Lollipop: protein domain plot for a single gene
lollipopPlot(maf, gene = "TP53", AACol = "HGVSp_Short",
  showMutationRate = TRUE)

# Rainfall: inter-mutation distance, detects kataegis (localized hypermutation)
rainfallPlot(maf, detectChangePoints = TRUE, pointSize = 0.4)
```

### Tumor Mutation Burden

```r
# TMB = total nonsynonymous mutations / capture region size (MB)
laml.tmb <- tmb(maf, captureSize = 50, logScale = TRUE)

# Compare against all 33 TCGA cohorts (uses MC3 data, 35.8 MB capture)
tcgaCompare(maf, cohortName = "LUAD-study", logscale = TRUE,
  capture_size = 50)
# Useful for: immunotherapy response prediction (TMB-high cutoff varies by cancer type)
```

### Somatic Interactions

```r
# Pairwise Fisher's exact test for co-occurrence and mutual exclusivity
somaticInteractions(maf, top = 25, pvalue = c(0.05, 0.01))
# Green = co-occurring, brown = mutually exclusive
# Classic examples: TP53 + KRAS co-occurrence in LUAD; EGFR vs KRAS mutual exclusivity
```

### Clinical Enrichment

```r
# Test which mutations are enriched in clinical subgroups
ce <- clinicalEnrichment(maf, clinicalFeature = "ajcc_pathologic_stage")
ce$groupwise_comparision[p_value < 0.05]
plotEnrichmentResults(ce, pvalue = 0.05, geneFontSize = 0.8)
```

### Mutational Signatures

```r
library(BSgenome.Hsapiens.UCSC.hg38)

# Step 1: Extract trinucleotide context matrix (96 substitution classes)
# Needs >= 30 samples for reliable decomposition
tnm <- trinucleotideMatrix(maf, prefix = "chr", add = TRUE,
  ref_genome = "BSgenome.Hsapiens.UCSC.hg38")

# Step 2: Estimate optimal number of signatures (cophenetic correlation)
est <- estimateSignatures(mat = tnm, nTry = 6)
# Look for the elbow in the cophenetic plot

# Step 3: Extract n signatures via NMF
sig <- extractSignatures(mat = tnm, n = 3)  # n from step 2

# Step 4: Compare to COSMIC
# "legacy" = 30 COSMIC v2 signatures, "SBS" = 65 COSMIC v3 signatures
cosim <- compareSignatures(nmfRes = sig, sig_db = "SBS")
# cosim$cosine_similarities: sample x COSMIC matrix
# Match: cosine similarity > 0.85 is a confident assignment

plotSignatures(sig, contributions = FALSE)  # signature profiles
plotSignatures(sig, contributions = TRUE)   # per-sample contributions
```

### Driver Gene Detection

```r
# oncodrive: identifies genes with clustered mutations (positional clustering)
# Not dN/dS — uses the concept that driver mutations cluster in specific protein regions
oncodrive_res <- oncodrive(maf, AACol = "HGVSp_Short", minMut = 5,
  pvalMethod = "zscore")
plotOncodrive(oncodrive_res, fdrCutOff = 0.1, useFraction = TRUE)

# For proper dN/dS analysis, use the dndscv package separately:
# library(dndscv); dndsout <- dndscv(mutations_df)
```

## Copy Number Analysis

### Data Retrieval

```r
library(TCGAbiolinks)

# Download masked copy number segments (preferred: excludes germline-prone regions)
query_cnv <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Copy Number Variation",
  data.type = "Masked Copy Number Segment"
)
GDCdownload(query_cnv, directory = "GDCdata")
cnv_seg <- GDCprepare(query_cnv, directory = "GDCdata")
# Returns: Sample, Chromosome, Start, End, Num_Probes, Segment_Mean
# Segment_Mean is log2(copy_number / 2): 0 = diploid, >0 = gain, <0 = loss
```

### Segment Interpretation

```
Segment_Mean thresholds (log2 ratio):
  > 0.3   -> Gain
  > 0.7   -> Amplification (high-level gain)
  < -0.3  -> Loss (heterozygous deletion)
  < -0.7  -> Deep deletion (likely homozygous)
  These are conventions, not absolute cutoffs. Adjust based on tumor purity.
```

### GISTIC2.0 Analysis

```r
# GISTIC identifies recurrent focal CNV events across a cohort
# Pre-computed GISTIC results available from Firehose/GDAC for most TCGA cancers:
# https://gdac.broadinstitute.org/

# Read GISTIC output with maftools
gistic <- readGistic(
  gisticAllLesionsFile = "all_lesions.conf_99.txt",
  gisticAmpGenesFile = "amp_genes.conf_99.txt",
  gisticDelGenesFile = "del_genes.conf_99.txt",
  gisticScoresFile = "scores.gistic",
  isTCGA = TRUE
)

gisticChromPlot(gistic)            # genome-wide q-values per arm
gisticBubblePlot(gistic)           # focal events with significance
gisticOncoPlot(gistic, top = 10)   # oncoplot of recurrent CNVs
```

### Gene-Level Copy Number from Segments

```r
# Map segments to genes when GISTIC output is not available
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
seg_gr <- GRanges(
  seqnames = paste0("chr", cnv_seg$Chromosome),
  ranges = IRanges(start = cnv_seg$Start, end = cnv_seg$End),
  segment_mean = cnv_seg$Segment_Mean,
  sample = cnv_seg$Sample
)

# Find overlaps: assign segment value to each gene
hits <- findOverlaps(genes, seg_gr)
# For genes spanning multiple segments, use the segment with largest overlap
```

### Integration: Mutation + CNV

```r
# Add CNV data to MAF object for combined oncoplot
# cnv_table: data.frame with columns: Gene, Sample_name, CN (amp/del)
oncoplot(maf, top = 20,
  additionalFeature = list(cnv_data),
  additionalFeaturePch = 15,
  additionalFeatureCol = c(amp = "red", del = "blue"))
```

## Methylation Analysis

### Data Retrieval (TCGA)

```r
library(TCGAbiolinks)

# Download 450K methylation data from GDC (harmonized, hg38)
query_met <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "DNA Methylation",
  platform = "Illumina Human Methylation 450",
  sample.type = "Primary Tumor"
)
GDCdownload(query_met)
met_se <- GDCprepare(query_met)
# Returns SummarizedExperiment; beta values as assay matrix
# rowRanges contains probe genomic coordinates (hg38)

# Also download matched normals for differential analysis
query_met_normal <- GDCquery(
  project = "TCGA-LUAD",
  data.category = "DNA Methylation",
  platform = "Illumina Human Methylation 450",
  sample.type = "Solid Tissue Normal"
)
```

### Processing with minfi

```r
library(minfi)

# --- From IDAT files (non-TCGA data) ---
targets <- read.metharray.sheet("/path/to/idats/", pattern = "csv$")
rgSet <- read.metharray.exp(targets = targets, extended = TRUE)
# extended = TRUE needed for Noob background correction

# --- Detection p-values (probe reliability) ---
detP <- detectionP(rgSet)
# Remove samples with mean detection p > 0.01
keep_samples <- colMeans(detP) < 0.01
rgSet <- rgSet[, keep_samples]

# --- Normalization ---
# Funnorm: best for cancer (large methylation differences between groups)
# Uses control probes to remove technical variation via PCA
grSet <- preprocessFunnorm(rgSet, nPCs = 2, bgCorr = TRUE, dyeCorr = TRUE)
# Returns GenomicRatioSet with beta and M-values

# Alternative: preprocessQuantile for subtle differences (cell type studies)
# Alternative: preprocessNoob for single-sample normalization (clinical)
```

### Probe Filtering

```r
# Order matters: filter after normalization

# 1. Failed probes (detection p > 0.01 in > 5% of samples)
keep <- rowSums(detP[rownames(grSet), colnames(grSet)] < 0.01) >
  (ncol(grSet) * 0.95)
grSet <- grSet[keep, ]

# 2. SNP-affected probes (polymorphisms at CpG site or single-base extension)
grSet <- dropLociWithSnps(grSet, snps = c("SBE", "CpG"), maf = 0)

# 3. Cross-reactive probes (map to multiple genomic locations)
# Use DMRcate's built-in filter or maxprobes package
library(DMRcate)
grSet <- rmSNPandCH(grSet, rmcrosshyb = TRUE, rmXY = FALSE)

# 4. Sex chromosome probes (remove for mixed-sex cohorts)
grSet <- grSet[!(seqnames(grSet) %in% c("chrX", "chrY")), ]

# Typical probe counts after filtering:
# 450K: ~485K -> ~420-440K retained
# EPIC: ~866K -> ~780-810K retained
```

### Alternative: ChAMP Pipeline

```r
library(ChAMP)

# All-in-one loading with built-in filtering
# Handles detection p, SNPs, cross-reactive, sex chromosomes in one call
myLoad <- champ.load(
  directory = "/path/to/idats/",
  arraytype = "EPIC",         # or "450K"
  filterDetP = TRUE,          # detection p-value filter
  detPcut = 0.01,
  filterSNPs = TRUE,          # remove SNP-affected probes
  filterMultiHit = TRUE,      # remove cross-reactive probes
  filterXY = TRUE,            # remove sex chromosome probes
  filterBeads = TRUE,         # remove low bead count probes
  filterNoCG = TRUE           # remove non-CpG probes
)

# Normalization: BMIQ corrects Type I/II probe bias
myNorm <- champ.norm(beta = myLoad$beta, arraytype = "EPIC", method = "BMIQ")

# Batch effect detection
champ.SVD(beta = myNorm, pd = myLoad$pd)

# Batch correction if needed
myNorm <- champ.runCombat(beta = myNorm, pd = myLoad$pd, batchname = c("Slide"))
```

```
When to use which:
  minfi -> Fine-grained control, Funnorm (best for cancer), EPIC v2 support
  ChAMP -> Rapid standardized pipeline, built-in filtering + batch correction
  Both use limma internally for DMP calling
```

### Beta-Values vs M-Values

```
Beta-values: ratio = M / (M + U + 100), range [0, 1]
  Use for: biological interpretation, visualization, reporting delta-beta
  Problem: heteroscedastic (variance compressed at 0 and 1)

M-values: log2(M / U), range (-Inf, Inf)
  Use for: ALL statistical testing (limma, t-tests, linear models)
  Reason: approximately normal, satisfies linear model assumptions

Conversion: M = log2(beta / (1 - beta))
Always test on M-values, report delta-beta for biological effect size.
```

### Differentially Methylated Positions (DMPs)

```r
library(limma)

# Extract M-values for statistical testing
mVals <- getM(grSet)
bVals <- getBeta(grSet)

# Design matrix
group <- factor(pData(grSet)$sample_type,
  levels = c("Solid Tissue Normal", "Primary Tumor"))
design <- model.matrix(~ 0 + group)
colnames(design) <- c("Normal", "Tumor")

# Contrast
contMatrix <- makeContrasts(Tumor - Normal, levels = design)

# Fit — trend = TRUE is critical for M-values
fit <- lmFit(mVals, design)
fit2 <- contrasts.fit(fit, contMatrix)
fit2 <- eBayes(fit2, trend = TRUE)

# Results
dmps <- topTable(fit2, coef = 1, number = Inf, sort.by = "p")

# Add delta-beta for biological interpretation
dmps$deltaBeta <- rowMeans(bVals[rownames(dmps), group == "Primary Tumor"]) -
  rowMeans(bVals[rownames(dmps), group == "Solid Tissue Normal"])

# Filter: statistical + biological significance
sig_dmps <- dmps[dmps$adj.P.Val < 0.05 & abs(dmps$deltaBeta) > 0.2, ]
message(sprintf("Significant DMPs: %d hyper, %d hypo",
  sum(sig_dmps$deltaBeta > 0), sum(sig_dmps$deltaBeta < 0)))
```

### Differentially Methylated Regions (DMRs)

```r
library(DMRcate)

# DMRcate: recommended for arrays. Uses kernel smoothing of limma statistics.
# Faster and better benchmarked than bumphunter for 450K/EPIC data.

# Step 1: Annotate CpGs with limma statistics
myAnnotation <- cpg.annotate(
  datatype = "array",
  object = mVals,
  what = "M",
  analysis.type = "differential",
  design = design,
  contrasts = TRUE,
  cont.matrix = contMatrix,
  coef = "Tumor - Normal",
  arraytype = "450K",    # or "EPIC"
  fdr = 0.05
)

# Step 2: Find DMRs
dmResults <- dmrcate(myAnnotation,
  lambda = 1000,    # Gaussian kernel bandwidth (bp)
  C = 2,            # scaling factor
  min.cpgs = 3      # minimum CpGs per DMR
)

# Step 3: Extract results as GRanges
dmr_ranges <- extractRanges(dmResults)
# Columns: coord, no.cpgs, min_smoothed_fdr, Stouffer, HMFDR,
#   Fisher, maxdiff, meandiff, overlapping.genes
```

### Methylation Subtypes (CIMP)

```r
library(ConsensusClusterPlus)

# CIMP identification: cluster on CpG island probes with highest variance
ann <- getAnnotation(grSet)

# Select CpG island probes only
island_probes <- rownames(ann)[ann$Relation_to_Island == "Island"]
island_beta <- bVals[intersect(island_probes, rownames(bVals)), ]

# Top 5000 most variable CpG island probes
vars <- apply(island_beta, 1, var, na.rm = TRUE)
top_probes <- names(sort(vars, decreasing = TRUE))[1:5000]
mat <- island_beta[top_probes, ]

# Consensus clustering (k = 2-6, 1000 iterations)
results <- ConsensusClusterPlus(mat,
  maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1,
  clusterAlg = "hc", distance = "euclidean",
  innerLinkage = "ward.D2", finalLinkage = "ward.D2",
  title = "CIMP_clustering", plot = "pdf", seed = 42)

# Optimal k: look at delta area plot and consensus CDF
# k = 2 or 3 typical: CIMP-high, CIMP-low, (non-CIMP)
cluster_assignments <- results[[3]]$consensusClass  # for k=3
```

```
Cancer-specific CIMP markers:
  Colorectal (COAD): CACNA1G, IGF2, NEUROG1, RUNX3, SOCS1 (5-marker panel)
    CIMP-H -> BRAF V600E, MLH1 hypermethylation, MSI-H
    CIMP-L -> KRAS mutations
  Glioblastoma (GBM): G-CIMP defined by IDH1 mutation + global hypermethylation
    G-CIMP -> proneural subtype, younger patients, better survival
  Pan-cancer: 19/26 TCGA cancer types have confirmed CIMP subtypes
```

### Integration: Methylation-Expression Correlation

```r
# --- Match samples across platforms ---
met_patients <- substr(colnames(bVals), 1, 12)
exp_patients <- substr(colnames(expr_mat), 1, 12)
common <- intersect(met_patients, exp_patients)

met_idx <- match(common, met_patients)
exp_idx <- match(common, exp_patients)

# --- Build promoter methylation per gene ---
ann <- getAnnotation(grSet)
promoter <- ann[grep("TSS200|TSS1500|1stExon", ann$UCSC_RefGene_Group), ]

# Probe-to-gene mapping (probes can map to multiple genes)
probe_gene <- data.frame(
  probe = rep(rownames(promoter),
    lengths(strsplit(promoter$UCSC_RefGene_Name, ";"))),
  gene = unlist(strsplit(promoter$UCSC_RefGene_Name, ";")),
  stringsAsFactors = FALSE
)
probe_gene <- unique(probe_gene[probe_gene$gene != "", ])
probe_gene <- probe_gene[probe_gene$probe %in% rownames(bVals), ]

# Average promoter beta per gene per sample
gene_meth <- tapply(seq_len(nrow(probe_gene)), probe_gene$gene, function(idx) {
  probes <- probe_gene$probe[idx]
  colMeans(bVals[probes, met_idx, drop = FALSE], na.rm = TRUE)
})
gene_meth_mat <- do.call(rbind, gene_meth)

# --- Spearman correlation per gene ---
exp_matched <- expr_mat[, exp_idx]
shared_genes <- intersect(rownames(gene_meth_mat), rownames(exp_matched))

cor_results <- t(sapply(shared_genes, function(g) {
  m <- as.numeric(gene_meth_mat[g, ])
  e <- as.numeric(exp_matched[g, ])
  ok <- !is.na(m) & !is.na(e)
  if (sum(ok) < 10) return(c(rho = NA, pval = NA))
  ct <- cor.test(m[ok], e[ok], method = "spearman", exact = FALSE)
  c(rho = ct$estimate, pval = ct$p.value)
}))
cor_df <- as.data.frame(cor_results)
cor_df$padj <- p.adjust(cor_df$pval, method = "BH")

# Genes silenced by promoter methylation
silenced <- cor_df[which(cor_df$rho < -0.3 & cor_df$padj < 0.05), ]
```

### Pathway Analysis (Probe-Bias Corrected)

```r
library(missMethyl)

# Standard GO/KEGG enrichment is biased for methylation arrays:
# genes with more probes are more likely to be called significant.
# gometh() uses Wallenius' noncentral hypergeometric test to correct this.

sig_cpgs <- rownames(sig_dmps)
all_cpgs <- rownames(dmps)

go_meth <- gometh(
  sig.cpg = sig_cpgs,
  all.cpg = all_cpgs,
  collection = "GO",
  array.type = "450K",     # or "EPIC"
  plot.bias = TRUE,        # diagnostic: shows probe-number bias
  prior.prob = TRUE         # applies bias correction
)

kegg_meth <- gometh(
  sig.cpg = sig_cpgs, all.cpg = all_cpgs,
  collection = "KEGG", array.type = "450K", prior.prob = TRUE
)
# Never use standard enrichGO/enrichKEGG on methylation probe lists
```

### Visualization

```r
library(ComplexHeatmap); library(circlize)

# --- DMP heatmap (top 100 probes) ---
top_probes <- rownames(sig_dmps)[1:100]
mat <- bVals[top_probes, ]

col_fun <- colorRamp2(c(0, 0.5, 1), c("#2166AC", "white", "#B2182B"))
ha <- HeatmapAnnotation(
  Group = pData(grSet)$sample_type,
  col = list(Group = c("Primary Tumor" = "#D55E00", "Solid Tissue Normal" = "#0072B2"))
)

Heatmap(mat, name = "Beta", col = col_fun, top_annotation = ha,
  show_row_names = FALSE, clustering_distance_rows = "euclidean",
  clustering_method_rows = "ward.D2",
  column_title = "Top 100 Differentially Methylated Positions")

# --- Methylation-expression scatter for a single gene ---
library(ggplot2)
gene <- "MGMT"
df <- data.frame(
  methylation = as.numeric(gene_meth_mat[gene, ]),
  expression = as.numeric(exp_matched[gene, ])
)
ggplot(df, aes(methylation, expression)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(method = "lm", color = "steelblue", se = TRUE) +
  annotate("text", x = 0.8, y = max(df$expression, na.rm = TRUE),
    label = sprintf("rho = %.2f", cor(df$methylation, df$expression,
      method = "spearman", use = "complete.obs")), hjust = 1) +
  labs(x = paste0(gene, " promoter methylation (beta)"),
       y = paste0(gene, " expression (log2)")) +
  theme_classic(base_size = 12)
```

### Output Specification

| Output | Format | Description |
|--------|--------|-------------|
| `dmps.csv` | CSV | All tested probes with delta-beta, adj.P.Val, genomic annotation |
| `sig_dmps.csv` | CSV | Significant DMPs (adj.P.Val < 0.05, \|delta-beta\| > 0.2) |
| `dmr_ranges.rds` | RDS | GRanges of differentially methylated regions |
| `grSet.rds` | RDS | Filtered, normalized GenomicRatioSet |
| `cimp_clusters.csv` | CSV | Sample-to-cluster assignments from consensus clustering |
| `meth_expr_cor.csv` | CSV | Per-gene Spearman rho and adjusted p-values |
| `dmp_heatmap.pdf` | PDF | Top DMPs across samples |

### Validation Checks

```
After running methylation analysis, verify:
  DMP count: Expect 10,000-100,000 DMPs for tumor vs normal at adj.P.Val < 0.05
    < 1,000 DMPs: check normalization, sample labels, detection p filtering
    Global hypomethylation in tumors is expected (more hypo than hyper DMPs)
  Delta-beta distribution: most DMPs should have |delta-beta| between 0.1-0.5
    |delta-beta| > 0.7 at many probes: check for sample swaps or contamination
  Known markers:
    MGMT promoter methylation in GBM (predicts temozolomide response)
    MLH1 methylation in MSI-H colorectal (silences mismatch repair)
    BRCA1 methylation in basal-like breast cancer
  Probe filtering: verify 40,000-65,000 probes removed (450K) or 55,000-85,000 (EPIC)
  Integration: promoter methylation-expression correlation should be predominantly negative
    Median rho around -0.05 to -0.15 across all genes
    Gene body methylation should show weak positive correlation
```

## Common Pitfalls

### Expression
1. **DESeq2 input**: Using FPKM/TPM as input. DESeq2 requires raw integer counts.
2. **Pre-filtering GSEA input**: GSEA requires the full ranked gene list. Never pre-filter to significant genes only (that is ORA, not GSEA).
3. **LFC shrinkage for hypothesis testing**: apeglm shrinkage is for ranking and visualization. Use unshrunken results for strict significance calls unless using `lfcThreshold > 0`.
4. **Gene ID mismatch**: TCGA uses Ensembl IDs with version suffixes (ENSG00000141510.18). Strip versions before mapping: `sub("\\..*", "", ensembl_ids)`.

### Mutation
5. **Mixing callers**: Different variant callers produce different mutation sets. Pick one pipeline (or use MC3 consensus) and stick with it across the cohort.
6. **Signature sample size**: Extracting signatures from < 30 samples is unreliable. For small cohorts, fit to known COSMIC signatures instead of de novo extraction.
7. **Reference genome for signatures**: `trinucleotideMatrix()` requires the correct BSgenome. GDC harmonized data uses hg38 — using hg19 reference silently produces wrong trinucleotide contexts.
8. **TMB capture size**: `tcgaCompare()` uses 35.8 MB (Agilent SureSelect). If your data uses a different capture kit, set `capture_size` accordingly or TMB comparison is meaningless.

### Copy Number
9. **Segment_Mean interpretation**: Segment_Mean is log2(CN/2), not raw copy number. A value of 0.0 is diploid, not zero copies.
10. **CNV thresholds and purity**: Fixed cutoffs (>0.3 gain, <-0.3 loss) assume high tumor purity. Low-purity samples compress the signal — a real amplification may show Segment_Mean of only 0.2.

### Methylation
11. **Beta-values for statistics**: Using beta-values in limma or t-tests. Beta-values are heteroscedastic — use M-values for all statistical testing, report delta-beta for effect size.
12. **Skipping cross-reactive probe removal**: ~40K probes on 450K and ~43K on EPIC map to multiple genomic locations. These produce false DMPs at the wrong loci.
13. **Standard enrichment on methylation results**: Genes with more probes on the array are more likely to appear significant. Use `missMethyl::gometh()` instead of `enrichGO()` for methylation pathway analysis.
14. **Mixing 450K and EPIC without harmonization**: EPIC has ~380K additional probes. Restrict to shared probes before combining cohorts, or use ComBat on the intersection.
15. **Ignoring probe type bias**: Type I and Type II probes have different dynamic ranges. BMIQ or Funnorm corrects this; raw beta-values without type correction are biased.

### General
16. **Genome build mismatch**: TCGA legacy archive uses hg19, GDC harmonized uses hg38. Never mix builds across data types. Check with `GenomeInfoDb::genome(se)`.
17. **Tumor purity**: TCGA samples range 20-95% purity. Use ESTIMATE or ABSOLUTE purity scores to filter (> 60%) or include as covariate.
18. **TCGA barcode matching**: When integrating data types, match on patient barcode (first 12 characters), not full barcode. A patient may have multiple aliquots.

## Related Skills

- [`immune-deconvolution`](../immune-deconvolution/SKILL.md): Estimate immune cell composition from the expression data produced here
- [`survival-analysis`](../survival-analysis/SKILL.md): Build prognostic models using mutation, expression, and methylation features as Cox covariates

## Public Datasets for Testing

| Dataset | Samples | Use Case |
|---------|---------|----------|
| TCGA-LUAD | 585 | Lung adenocarcinoma, well-characterized drivers (EGFR, KRAS, ALK) |
| TCGA-BRCA | 1098 | Breast cancer, strong molecular subtypes (Luminal A/B, HER2, Basal) |
| TCGA-GBM | 617 | Glioblastoma, classic for survival and methylation subtyping |
| TCGA-COAD | 521 | Colorectal, MSI-H vs MSS comparison |