--- title: "Survival Analysis" editor: markdown: wrap: 72 --- ## Survival analysis - So far, have seen: - response variable counted or measured (regression) - response variable categorized (logistic regression) - But what if response is time until event (eg. time of survival after surgery)? - Additional complication: event might not have happened at end of study (eg. patient still alive). But knowing that patient has "not died yet" presumably informative. Such data called *censored*. - Enter *survival analysis*, in particular the "Cox proportional hazards model". - Explanatory variables in this context often called *covariates*. ## Packages - Install package `survival` if not done. Also use `broom` and `marginaleffects` from earlier. ```{r} library(tidyverse) library(survival) library(broom) library(marginaleffects) ``` ## Example: still dancing? - 12 women who have just started taking dancing lessons are followed for up to a year, to see whether they are still taking dancing lessons, or have quit. The "event" here is "quit". - This might depend on: - a treatment (visit to a dance competition) - woman's age (at start of study). ## Data \normalsize ``` Months Quit Treatment Age 1 1 0 16 2 1 0 24 2 1 0 18 3 0 0 27 4 1 0 25 7 1 1 26 8 1 1 36 10 1 1 38 10 0 1 45 12 1 1 47 ``` \normalsize ## About the data - `months` and `quit` are kind of combined response: - `Months` is number of months a woman was actually observed dancing - `quit` is 1 if woman quit, 0 if still dancing at end of study. - Treatment is 1 if woman went to dance competition, 0 otherwise. - Fit model and see whether `Age` or `Treatment` have effect on survival. - Want to do predictions for probabilities of still dancing as they depend on whatever is significant, and draw plot. ## Read data - Column-aligned: \normalsize ```{r bSurvival-2} url <- "http://ritsokiguess.site/datafiles/dancing.txt" dance <- read_table(url) ``` \normalsize ## The data \small ```{r bSurvival-3} dance ``` \normalsize ## Fit model - Response variable has to incorporate both the survival time (`Months`) and whether or not the event, quitting, happened (that is, if `Quit` is 1). - This is made using `Surv` from `survival` package, with two inputs: - the column that has the survival times - something that is `TRUE` or 1 if the event happened. - Easiest for us to create this when we fit the model, predicting response from explanatories: ```{r bSurvival-5 } dance.1 <- coxph(Surv(Months, Quit) ~ Treatment + Age, data = dance) ``` ## Output looks a lot like regression \scriptsize ```{r bSurvival-6} summary(dance.1) ``` \normalsize ## Conclusions - Use $\alpha=0.10$ here since not much data. - Three tests at bottom like global F-test. Consensus that something predicts survival time (whether or not dancer quit and/or how long it took). - `Age` (definitely), `Treatment` (marginally) both predict survival time. ## Behind the scenes - All depends on *hazard rate*, which is based on probability that event happens in the next short time period, given that event has not happened yet: - $X$ denotes time to event, $\delta$ is small time interval: - $h(t) = P(X \le t + \delta | X \ge t) / \delta$ - if $h(t)$ large, event likely to happen soon (lifetime short) - if $h(t)$ small, event unlikely to happen soon (lifetime long). ## Modelling lifetime - want to model hazard rate - but hazard rate always positive, so actually model *log* of hazard rate - modelling how (log-)hazard rate depends on other things eg $X_1 =$ age, $X_2 =$ treatment, with the $\beta$ being regression coefficients: - Cox model $h(t)=h_0(t)\exp(\beta_0+\beta_1X_1+\beta_2 X_2+\cdots)$, or: - $\log(h(t)) = \log(h_0(t)) + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$ - like a generalized linear model with log link. ## Predictions with `marginaleffects` - `marginaleffects` knows about survival models (with sufficient care) - Predicted survival probabilities depend on: - the combination of explanatory variables you are looking at - the time at which you are looking at them (when more time has passed, it is more likely that the event has happened, so the "survival probability" should be lower). - look at effect of age by comparing ages 20 and 40, and later look at the effect of treatment (values 1 and 0). - Also have to provide some times to predict for, in `Months`. ## Effect of age ```{r} new <- datagrid(model = dance.1, Age = c(20, 40), Months = c(3, 5, 7)) new ``` These are actually for women who *did not* go to the dance competition. ## The predictions ```{r} cbind(predictions(dance.1, newdata = new, type = "survival")) %>% select(Age, Treatment, Months, estimate) ``` The estimated survival probabilities go down over time. For example a 20-year-old woman here has estimated probability 0.0293 of still dancing after 5 months. ## A graph We can plot the predictions over time for an experimental condition such as age. The key for `plot_predictions` is to put time *first* in the `condition`: ```{r} plot_predictions(dance.1, condition = c("Months", "Age"), type = "survival") ``` ## Comments - The plot picks some representative ages. - It is (usually) best to be up and to the right (has the highest chance of surviving longest). - Hence the oldest women have the best chance to still be dancing longest (the youngest women are most likely to quit soonest). ## The effect of treatment The same procedure will get predictions for women who did or did not go to the dance competition, at various times: ```{r} new <- datagrid(model = dance.1, Treatment = c(0, 1), Months = c(3, 5, 7)) new ``` The age used for predictions is the mean of all ages. ## The predictions ```{r} cbind(predictions(dance.1, newdata = new, type = "survival")) %>% select(Age, Treatment, Months, estimate) ``` Women of this age have a high (0.879) chance of still dancing after 7 months if they went to the dance competition, but much lower (0.165) if they did not. ## A graph In `condition`, put the time variable first, and then the effect of interest: ```{r} plot_predictions(dance.1, condition = c("Months", "Treatment"), type = "survival") ``` ## Comments - The survival curve for Treatment 1 is higher all the way along - Hence at any time, the women who went to the dance competition have a higher chance of still dancing than those who did not. ## The model summary again ```{r} summary(dance.1) ``` ## Comments - The numbers in the `coef` column describe effect of that variable on log-hazard of quitting. - Both numbers are negative, so a higher value on both variables goes with a lower hazard of quitting: - an older woman is less likely to quit soon (more likely to be still dancing) - a woman who went to the dance competition (`Treatment = 1`) is less likely to quit soon vs. a woman who didn't (more likely to be still dancing). ## Model checking - With regression, usually plot residuals against fitted values. - Not quite same here (nonlinear model), but "martingale residuals" should have no pattern vs. "linear predictor". - Use `broom` ideas to get them, in `.resid` and `.fitted` as below. - Martingale residuals can go very negative, so won't always look normal. ## Martingale residuals ```{r} dance.1 %>% augment(dance) %>% ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth() ``` ## A more realistic example: lung cancer - When you load in an R package, get data sets to illustrate functions in the package. - One such is `lung`. Data set measuring survival in patients with advanced lung cancer. - Along with survival time, number of "performance scores" included, measuring how well patients can perform daily activities. - Sometimes high good, but sometimes bad! - Variables below, from the data set help file (`?lung`). ## The variables ![](lung-cancer-data.png) ## Uh oh, missing values \scriptsize ```{r bSurvival-13} lung ``` \normalsize ## A closer look ```{r bSurvival-14, echo=F} options(width = 70) ``` \tiny ```{r bSurvival-15} summary(lung) ``` \normalsize ## Remove obs with *any* missing values ```{r bSurvival-16} lung %>% drop_na() -> lung.complete lung.complete %>% select(meal.cal:wt.loss) %>% slice(1:10) ``` Missing values seem to be gone. ## Check! ```{r bSurvival-17} summary(lung.complete) ``` No missing values left. ## Model 1: use everything except `inst` \footnotesize ```{r bSurvival-18} names(lung.complete) ``` \normalsize - Event was death, goes with `status` of 2: ```{r bSurvival-19 } lung.1 <- coxph(Surv(time, status == 2) ~ . - inst - time - status, data = lung.complete ) ``` "Dot" means "all the other variables". ## `summary` of model 1 \tiny ```{r bSurvival-20} summary(lung.1) ``` \normalsize ## Overall significance The three tests of overall significance: \small ```{r bSurvival-21} glance(lung.1) %>% select(starts_with("p.value")) ``` \normalsize All strongly significant. *Something* predicts survival. ## Coefficients for model 1 \small ```{r bSurvival-22 } tidy(lung.1) %>% select(term, p.value) %>% arrange(p.value) ``` \normalsize - `sex` and `ph.ecog` definitely significant here - `age`, `pat.karno` and `meal.cal` definitely not - Take out definitely non-sig variables, and try again. ## Model 2 \normalsize ```{r bSurvival-23} lung.2 <- update(lung.1, . ~ . - age - pat.karno - meal.cal) tidy(lung.2) %>% select(term, p.value) ``` \normalsize ## Compare with first model: \normalsize ```{r bSurvival-24} anova(lung.2, lung.1) ``` \normalsize - No harm in taking out those variables. ## Model 3 Take out `ph.karno` and `wt.loss` as well. ```{r bSurvival-25} lung.3 <- update(lung.2, . ~ . - ph.karno - wt.loss) ``` ```{r tidy-lung-3} tidy(lung.3) %>% select(term, estimate, p.value) summary(lung.3) ``` ## Check whether that was OK ```{r bSurvival-26} anova(lung.3, lung.2) ``` *Just* OK. ## Commentary - OK (just) to take out those two covariates. - Both remaining variables strongly significant. - Nature of effect on survival time? Consider later. - Picture? ## Plotting survival probabilities - Assess (separately) the effect of `sex` and `ph.ecog` score using `plot_predictions` - Don't forget to add time (here actually called `time`) to the `condition`. ## Effect of `sex`: ```{r} plot_predictions(lung.3, condition = c("time", "sex"), type = "survival") ``` - Females (`sex = 2`) have better survival than males. - This graph from a mean `ph.ecog` score, but the male-female comparison is the same for any score. ## Effect of `ph.ecog` score: ```{r} plot_predictions(lung.3, condition = c("time", "ph.ecog"), type = "survival") ``` ## Comments - A lower `ph.ecog` score is better. - For example, a patient with a score of 0 has almost a 50-50 chance of living 500 days, but a patient with a score of 3 has almost no chance to survive that long. - Is this for males or females? See over. (The comparison of scores is the same for both.) ## Sex and `ph.ecog` score ```{r} plot_predictions(lung.3, condition = c("time", "ph.ecog", "sex"), type = "survival") ``` ## Comments - I think the previous graph was males. - This pair of graphs shows the effect of `ph.ecog` score (above and below on each facet), and the effect of males (left) vs. females (right). - The difference between males and females is about the same as 1 point on the `ph.ecog` scale (compare the red curve on the left facet with the green curve on the right facet). ## The summary again ```{r} summary(lung.3) ``` ## Comments - A higher-numbered sex (female) has a lower hazard of death (negative coef). That is, females are more likely to survive longer than males. - A higher `ph.ecog` score goes with a *higher* hazard of death (positive coef). So patients with a *lower* score are more likely to survive longer. - These are consistent with the graphs we drew. ## Martingale residuals for this model No problems here: ```{r bSurvival-32, fig.height=3} lung.3 %>% augment(lung.complete) %>% ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth() ``` ## When the Cox model fails (optional) - Invent some data where survival is best at middling age, and worse at high *and* low age: ```{r bSurvival-33 } age <- seq(20, 60, 5) survtime <- c(10, 12, 11, 21, 15, 20, 8, 9, 11) stat <- c(1, 1, 1, 1, 0, 1, 1, 1, 1) d <- tibble(age, survtime, stat) d %>% mutate(y = Surv(survtime, stat)) -> d ``` - Small survival time 15 in middle was actually censored, so would have been longer if observed. ## Fit Cox model \footnotesize ```{r bSurvival-34 } y.1 <- coxph(y ~ age, data = d) summary(y.1) ``` \normalsize ## Martingale residuals Down-and-up indicates incorrect relationship between age and survival: ```{r bSurvival-35, fig.height=3.4, message=F} #| warning = FALSE y.1 %>% augment(d) %>% ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth() ``` ## Attempt 2 Add squared term in age: ```{r bSurvival-36} y.2 <- coxph(y ~ age + I(age^2), data = d) tidy(y.2) %>% select(term, estimate, p.value) ``` - (Marginally) helpful. ## Martingale residuals this time Not great, but less problematic than before: ```{r bSurvival-37, fig.height=3.2, message=F} #| warning = FALSE y.2 %>% augment(d) %>% ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth() ```