# Survival Analysis

Time-to-event analysis for cancer clinical data. Covers Kaplan-Meier, Cox proportional hazards, competing risks, restricted mean survival time, and optimal cutpoint selection using the survival, ggsurvfit, tidycmprsk, and survRM2 packages.

## When to Use This Skill

Activate when the user requests:
- Kaplan-Meier survival curves with risk tables
- Log-rank or stratified log-rank tests between groups
- Cox proportional hazards modeling (univariable or multivariable)
- Proportional hazards assumption checking
- Competing risks analysis (cause-specific or Fine-Gray)
- Restricted mean survival time (RMST) comparisons
- Optimal biomarker cutpoint selection for survival
- Forest plots for multivariate Cox models

## Inputs

| Data Type | Format | Source |
|-----------|--------|--------|
| Clinical | Tabular (time, event, covariates) | TCGA GDC via TCGAbiolinks, cBioPortal |
| Time variable | Days/months to event or last follow-up | `days_to_death`, `days_to_last_follow_up` |
| Event indicator | Binary (0 = censored, 1 = event) | Derived from `vital_status` |
| Covariates | Categorical or continuous | Age, stage, gene expression, mutations |

---

## Preparing TCGA Survival Data

```r
library(TCGAbiolinks)
library(SummarizedExperiment)

query <- GDCquery(
  project = "TCGA-GBM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se <- GDCprepare(query, directory = "GDCdata")
clinical <- as.data.frame(colData(se))

# Construct survival endpoint from TCGA clinical fields
# TCGA stores time in days. Convert to months for readability.
clinical$os_time <- ifelse(!is.na(clinical$days_to_death),
  as.numeric(clinical$days_to_death),
  as.numeric(clinical$days_to_last_follow_up))
clinical$os_event <- ifelse(clinical$vital_status == "Dead", 1, 0)
clinical$os_months <- clinical$os_time / 30.44

# Drop samples with missing survival data
clinical <- clinical[!is.na(clinical$os_months) & clinical$os_months > 0, ]
```

## Kaplan-Meier Estimation

```r
library(survival)    # v3.8+
library(ggsurvfit)   # v1.2+, pharmaverse — modern replacement for survminer

# survfit2() tracks the calling environment for cleaner legend labels
# Use survfit2() with ggsurvfit; use survfit() with base R plotting
km_fit <- survfit2(Surv(os_months, os_event) ~ gender, data = clinical)

ggsurvfit(km_fit) +
  add_confidence_interval() +
  add_risktable() +
  add_pvalue(location = "annotation") +
  scale_x_continuous(breaks = seq(0, 60, 12)) +
  labs(x = "Months", y = "Overall survival probability") +
  theme_classic(base_size = 12)

# Median survival per group
summary(km_fit)$table[, "median"]

# Median follow-up: reverse KM method (censor events, event the censored)
km_fu <- survfit(Surv(os_months, 1 - os_event) ~ 1, data = clinical)
summary(km_fu)$table["median"]
# Always report median follow-up alongside survival estimates
```

## Log-Rank Test

```r
# Two-group comparison
survdiff(Surv(os_months, os_event) ~ idh1_status, data = clinical)
# Returns chi-square and p-value. Tests H0: survival curves are identical.

# Stratified log-rank — adjust for a confounder without estimating its effect
survdiff(Surv(os_months, os_event) ~ treatment + strata(age_group), data = clinical)
```

```
When log-rank is appropriate:
  Groups have proportional hazards (non-crossing curves)
    -> log-rank is most powerful
  Curves cross or separate late (common in immunotherapy)
    -> log-rank has low power. Use RMST or weighted log-rank.
  > 2 groups
    -> log-rank gives a global test. Follow up with pairwise comparisons
       using p.adjust() for multiple testing correction.
```

## Cox Proportional Hazards

### Univariable

```r
# Continuous covariate
cox_age <- coxph(Surv(os_months, os_event) ~ age_at_diagnosis, data = clinical)
summary(cox_age)
# HR per 1-year increase. exp(coef) = HR, lower/upper .95 = CI

# Categorical covariate — set reference level explicitly
clinical$stage <- relevel(factor(clinical$stage), ref = "Stage I")
cox_stage <- coxph(Surv(os_months, os_event) ~ stage, data = clinical)
```

### Multivariable

```r
cox_multi <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + gender + idh1_status + mgmt_methylation,
  data = clinical
)
summary(cox_multi)

# Concordance index (C-statistic): 0.5 = random, 1.0 = perfect discrimination
# TCGA-GBM with IDH + MGMT: expect C-index ~0.65-0.75
concordance(cox_multi)
```

### Design Considerations

```
Variable selection:
  Include variables that are clinically established prognostic factors.
  Do NOT use stepwise selection (AIC/BIC) for final models in clinical contexts —
  it inflates Type I error and produces unstable models.
  Stepwise is acceptable for exploratory analysis with clear disclosure.

Events per variable (EPV):
  Classic rule: >= 10 events per predictor in the model.
  With 50 events, fit at most 5 covariates.
  More recent work (Riley 2019) recommends >= 20 EPV for adequate precision.
  Too many covariates relative to events = overfitting and unstable HRs.

Continuous vs categorical:
  Keep continuous variables continuous in Cox models. Dichotomizing age at
  median discards information and reduces power.
  If a cutpoint is needed for clinical communication, derive it separately
  (see Optimal Cutpoints below) and validate externally.
```

## Proportional Hazards Assumption

```r
# Schoenfeld residual test
zph <- cox.zph(cox_multi, transform = "km")
print(zph)
# Per-covariate p-values + global test
# p > 0.05: PH assumption not rejected (but doesn't prove PH holds)
# p < 0.05: evidence of non-proportionality for that covariate

# Visual check — more informative than the p-value alone
# Horizontal band of residuals -> PH holds
# Systematic trend (slope) -> PH violated, effect changes over time
plot(zph)
# Or with survminer: survminer::ggcoxzph(zph)
```

```
Interpreting Schoenfeld residuals:
  Positive slope: effect increases over time (HR drifts upward)
  Negative slope: effect weakens over time (early benefit that fades)
  U-shape or crossing zero: complex time-varying effect

  The p-value is sample-size dependent: large datasets reject PH
  for clinically negligible departures. Always look at the plot.
```

### Handling PH Violations

```r
# Option 1: Time-varying coefficient with tt()
# Preferred when a specific covariate violates PH
cox_tv <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + tt(treatment),
  data = clinical,
  tt = function(x, t, ...) x * log(t + 1)
)
# Tests whether treatment effect changes with log(time)
# Significant tt() term confirms time-varying effect

# Option 2: Stratification — for nuisance variables
# Each stratum gets its own baseline hazard; no HR estimated for that variable
cox_strat <- coxph(
  Surv(os_months, os_event) ~ age_at_diagnosis + mgmt_methylation + strata(center),
  data = clinical
)
# Re-run cox.zph() after stratification to confirm remaining covariates satisfy PH

# Option 3: Time-splitting with tmerge() — for time-varying covariates
# When a covariate value changes during follow-up (e.g., treatment switch)
library(survival)
td_data <- tmerge(
  data1 = clinical, data2 = clinical,
  id = patient_id,
  os_event = event(os_months, os_event),
  treatment_switch = tdc(switch_time, new_treatment)
)
cox_td <- coxph(Surv(tstart, tstop, os_event) ~ treatment_switch + age_at_diagnosis,
  data = td_data)
```

## Competing Risks

```
Two hazard models — they answer different questions:

Cause-specific hazard (CSH):
  "Does treatment affect the rate of cancer death among patients still alive?"
  Standard coxph() treating competing events as censored.
  Use for: etiological questions, causal inference.

Fine-Gray subdistribution hazard (SDH):
  "Does treatment affect the cumulative probability of dying from cancer
   when competing events are accounted for?"
  Use for: prediction, prognostic models, clinical decision-making.

Run both and report side-by-side.
A covariate can have a significant CSH but non-significant SDH (or vice versa)
because they measure different things.
```

```r
library(tidycmprsk)  # v1.1+, wraps cmprsk with tidy interface
library(ggsurvfit)

# Event coding for competing risks:
# 0 = censored, 1 = event of interest, 2 = competing event
# Example: 1 = cancer death, 2 = death from other cause
clinical$event_type <- factor(
  ifelse(clinical$os_event == 0, "censored",
    ifelse(clinical$cause_of_death == "cancer", "cancer_death", "other_death")),
  levels = c("censored", "cancer_death", "other_death")
)

# Cumulative incidence function (CIF)
cif <- cuminc(Surv(os_months, event_type) ~ treatment, data = clinical)

ggcuminc(cif, outcome = "cancer_death") +
  add_risktable() +
  labs(x = "Months", y = "Cumulative incidence of cancer death") +
  theme_classic(base_size = 12)

# Fine-Gray regression (subdistribution hazard)
fg <- crr(Surv(os_months, event_type) ~ age + treatment + stage,
  data = clinical, failcode = "cancer_death")
summary(fg)

# Cause-specific Cox (standard Cox, treating competing events as censored)
clinical$cancer_event <- ifelse(clinical$event_type == "cancer_death", 1, 0)
cs <- coxph(Surv(os_months, cancer_event) ~ age + treatment + stage, data = clinical)
summary(cs)
```

## Restricted Mean Survival Time

```r
library(survRM2)  # v1.0-4

# RMST: area under the survival curve up to time tau
# Avoids the PH assumption entirely — compares mean survival time between groups

rmst <- rmst2(
  time = clinical$os_months,
  status = clinical$os_event,
  arm = as.numeric(clinical$treatment == "experimental"),
  tau = 24  # restrict to 24 months — must be pre-specified
)
print(rmst)
# Returns:
#   RMST difference (treatment - control), CI, p-value
#   RMST ratio
#   RMTL ratio (restricted mean time lost)

# tau must be less than the minimum of the largest observed time in each group.
# Pre-specify tau based on clinical context (e.g., 12, 24, 36 months).
# Run sensitivity analysis across multiple tau values.
```

```
When to use RMST vs HR:
  Proportional hazards hold?
    -> HR is more efficient. Report both HR and RMST.
  Curves cross or separate late (immunotherapy, delayed effect)?
    -> RMST captures the difference that HR misses.
  Need a clinically interpretable number?
    -> RMST: "Treatment gives 3.2 extra months of survival on average"
       HR: "Treatment reduces the hazard by 30%" (less intuitive)
```

## Optimal Cutpoint Selection

```r
library(survminer)  # v0.5+, provides surv_cutpoint

# maxstat approach: maximally selected log-rank statistic
# Finds the cutpoint that maximally separates high/low groups
cp <- surv_cutpoint(clinical, time = "os_months", event = "os_event",
  variables = "biomarker_score")
summary(cp)
cat(sprintf("Optimal cutpoint: %.2f (corrected p = %.4f)\n",
  cp$cutpoint$cutpoint, cp$cutpoint$statistic))

# Categorize samples at the cutpoint
clinical$biomarker_group <- surv_categorize(cp)$biomarker_score

# Plot with survminer (surv_cutpoint is survminer-specific, not ggsurvfit)
fit_cp <- survfit(Surv(os_months, os_event) ~ biomarker_group, data = clinical)
ggsurvplot(fit_cp, data = clinical, pval = TRUE, risk.table = TRUE)
```

```
Cutpoint pitfalls — read before using:
  1. Inflated significance: searching all cutpoints maximizes the test statistic.
     The p-value from maxstat is corrected for this, but the effect size (HR)
     is still biased upward. A random variable can appear highly prognostic.

  2. Non-reproducibility: data-driven cutpoints rarely replicate.
     Any cutpoint found this way MUST be validated in an independent cohort
     or via cross-validation. Without validation, it is a hypothesis, not a finding.

  3. Information loss: dichotomizing a continuous biomarker at any cutpoint
     discards information. A continuous covariate in Cox regression is almost
     always more powerful than a dichotomized one.

  4. When cutpoints are justified:
     - Clinical decision-making (treat/don't treat needs a threshold)
     - An established cutpoint exists (Ki-67 > 14%, PSA > 4 ng/mL)
     - Communication to non-statistical audiences

  Use continuous modeling for primary analysis, cutpoints for clinical translation.
```

## Forest Plots

```r
library(forestmodel)  # v0.6+

# Directly from a coxph object — one function call
forest_model(cox_multi)
# Produces a ggplot2-based forest plot with HRs, 95% CIs, p-values

# Customize
forest_model(cox_multi,
  factor_separate_line = TRUE,       # categorical levels on separate lines
  format_options = forest_model_format_options(
    text_size = 3.5, point_size = 3
  )
)

# Alternative: survminer::ggforest() for quick plots
# Limitation: does not handle interactions or spline terms properly
survminer::ggforest(cox_multi, data = clinical)
```

## Output Specification

| Output | Format | Description |
|--------|--------|-------------|
| `km_curves.pdf` | PDF | Kaplan-Meier curves with risk table and p-value |
| `cox_summary.csv` | CSV | HR, 95% CI, p-value for each covariate |
| `cox_zph.pdf` | PDF | Schoenfeld residual plots for PH assumption |
| `forest_plot.pdf` | PDF | Multivariate Cox forest plot |
| `cif_curves.pdf` | PDF | Cumulative incidence curves for competing risks |
| `rmst_results.csv` | CSV | RMST difference, ratio, CI, p-value |
| `cutpoint_analysis.csv` | CSV | Optimal cutpoint, corrected p-value, HR at cutpoint |

## Validation Checks

```
After running survival analysis on TCGA-GBM, verify:

Overall survival:
  Median OS for all GBM: ~14-16 months
  If median OS > 24 months: likely includes IDH-mutant cases (not true GBM under WHO 2021)
  2-year survival: ~20-25%

IDH1 mutation (strongest single prognostic factor in glioma):
  IDH-mutant median OS: ~31 months
  IDH-wildtype median OS: ~14-16 months
  Log-rank p-value: should be < 0.001
  Cox HR for IDH-mutant vs wildtype: ~0.3-0.4 (strong protective effect)
  IDH1 R132H mutation frequency: ~6% of original TCGA-GBM cohort

MGMT promoter methylation:
  MGMT methylated: ~42% of IDH-wildtype GBMs
  Methylated vs unmethylated: HR ~0.5-0.6
  This is both prognostic (survival) and predictive (TMZ response)

Cox model quality:
  C-index with IDH + MGMT + age: ~0.65-0.75
  EPV check: with ~400 events and 3-5 covariates, EPV is adequate (> 80)

Proportional hazards:
  Age typically satisfies PH in GBM (chronic risk factor)
  Treatment effect may violate PH (delayed separation with TMZ)
  If global cox.zph p < 0.05, identify the violating covariate

WHO 2021 note:
  "Glioblastoma" is now restricted to IDH-wildtype tumors.
  IDH-mutant grade 4 gliomas are reclassified as "Astrocytoma, IDH-mutant, Grade 4."
  When using TCGA-GBM data, reclassify or note this in your analysis.
```

## Common Pitfalls

### Kaplan-Meier
1. **Median follow-up from the KM estimate**: The median of the KM curve is median survival time, not median follow-up. Calculate median follow-up using the reverse KM method: `survfit(Surv(time, 1-event) ~ 1)`. Short follow-up with few events makes survival estimates unreliable beyond the last observed event.
2. **Interpreting the tail**: KM curves at the right end are based on very few patients. Wide confidence intervals there mean the curve is unreliable. Truncate plots at the time when < 10% of patients remain at risk.

### Cox Regression
3. **Ignoring the PH assumption**: Running `coxph()` and reporting HRs without checking `cox.zph()` first. If PH is violated, the reported HR is an average over time and may be misleading for both early and late effects.
4. **Overfitting**: Including too many covariates relative to events. With 50 events and 10 covariates, the model is unstable. Follow the EPV >= 10 rule (preferably >= 20).
5. **Stepwise selection for final models**: Forward/backward stepwise selection inflates Type I error and produces non-reproducible models. Pre-specify covariates based on clinical knowledge.
6. **Ignoring collinearity**: Stage and tumor size are correlated. Including both in a Cox model inflates standard errors and produces unstable HRs. Check VIF or pairwise correlations before fitting.

### Competing Risks
7. **Standard KM for competing risks endpoints**: KM treats competing events as censored, which inflates cumulative incidence of the event of interest. Use cumulative incidence functions (CIF) via `tidycmprsk::cuminc()` when competing events are present.
8. **Interpreting Fine-Gray as causal**: Fine-Gray SDH measures the effect on cumulative incidence, not on the rate among those at risk. A covariate that increases competing event risk will artificially appear to decrease the subdistribution hazard for the primary event.

### RMST and Cutpoints
9. **Post-hoc tau selection**: Choosing tau after seeing the data invalidates the RMST test. Pre-specify tau based on clinical context (standard follow-up duration, planned analysis timepoint).
10. **Cutpoint without validation**: An optimal cutpoint found by maxstat will appear highly significant in the discovery cohort even for a non-prognostic biomarker. Without independent validation, the cutpoint is a hypothesis, not a result.

## Related Skills

- [`cancer-multiomics`](../cancer-multiomics/SKILL.md): TCGA data retrieval, mutation status and gene expression for Cox covariates
- [`immune-deconvolution`](../immune-deconvolution/SKILL.md): Immune cell fraction estimates as continuous survival predictors

## Public Datasets for Testing

| Dataset | Samples | Use Case |
|---------|---------|----------|
| TCGA-GBM | 611 | Glioblastoma, strong IDH1/MGMT prognostic markers, poor prognosis |
| TCGA-BRCA | 1098 | Breast cancer, molecular subtype-specific survival differences |
| TCGA-LUAD | 585 | Lung adenocarcinoma, EGFR/KRAS mutation-driven survival |
| TCGA-OV | 608 | Ovarian cancer, classic for competing risks (progression vs death) |