---
title: "Survival Analysis"
editor: 
  markdown: 
    wrap: 72
---

## Survival analysis

-   So far, have seen:

    -   response variable counted or measured (regression)

    -   response variable categorized (logistic regression)

-   But what if response is time until event (eg. time of survival after
    surgery)?

-   Additional complication: event might not have happened at end of
    study (eg. patient still alive). But knowing that patient has "not
    died yet" presumably informative. Such data called *censored*.

-   Enter *survival analysis*, in particular the "Cox proportional
    hazards model".

-   Explanatory variables in this context often called *covariates*.

## Packages

-   Install package `survival` if not done. Also use `broom` and
    `marginaleffects` from earlier.

```{r}
library(tidyverse)
library(survival)
library(broom)
library(marginaleffects)
```

## Example: still dancing?

-   12 women who have just started taking dancing lessons are followed
    for up to a year, to see whether they are still taking dancing
    lessons, or have quit. The "event" here is "quit".

-   This might depend on:

    -   a treatment (visit to a dance competition)

    -   woman's age (at start of study).

## Data

\normalsize

```         
Months  Quit   Treatment Age
1        1        0      16
2        1        0      24
2        1        0      18
3        0        0      27
4        1        0      25
7        1        1      26
8        1        1      36
10       1        1      38
10       0        1      45
12       1        1      47
```

\normalsize

## About the data

-   `months` and `quit` are kind of combined response:

    -   `Months` is number of months a woman was actually observed
        dancing

    -   `quit` is 1 if woman quit, 0 if still dancing at end of study.

-   Treatment is 1 if woman went to dance competition, 0 otherwise.

-   Fit model and see whether `Age` or `Treatment` have effect on
    survival.

-   Want to do predictions for probabilities of still dancing as they
    depend on whatever is significant, and draw plot.

## Read data

-   Column-aligned:

\normalsize

```{r bSurvival-2}
url <- "http://ritsokiguess.site/datafiles/dancing.txt"
dance <- read_table(url)
```

\normalsize

## The data

\small

```{r bSurvival-3}
dance
```

\normalsize

## Fit model

-   Response variable has to incorporate both the survival time
    (`Months`) and whether or not the event, quitting, happened (that
    is, if `Quit` is 1).
-   This is made using `Surv` from `survival` package, with two inputs:
    -   the column that has the survival times
    -   something that is `TRUE` or 1 if the event happened.
-   Easiest for us to create this when we fit the model, predicting
    response from explanatories:

```{r bSurvival-5 }
dance.1 <- coxph(Surv(Months, Quit) ~ Treatment + Age, data = dance)
```

## Output looks a lot like regression

\scriptsize

```{r bSurvival-6}
summary(dance.1)
```

\normalsize

## Conclusions

-   Use $\alpha=0.10$ here since not much data.

-   Three tests at bottom like global F-test. Consensus that something
    predicts survival time (whether or not dancer quit and/or how long
    it took).

-   `Age` (definitely), `Treatment` (marginally) both predict survival
    time.

## Behind the scenes

-   All depends on *hazard rate*, which is based on probability that
    event happens in the next short time period, given that event has
    not happened yet:

-   $X$ denotes time to event, $\delta$ is small time interval:

-   $h(t) = P(X \le t + \delta | X \ge t) / \delta$

-   if $h(t)$ large, event likely to happen soon (lifetime short)

-   if $h(t)$ small, event unlikely to happen soon (lifetime long).

## Modelling lifetime

-   want to model hazard rate

-   but hazard rate always positive, so actually model *log* of hazard
    rate

-   modelling how (log-)hazard rate depends on other things eg $X_1 =$
    age, $X_2 =$ treatment, with the $\beta$ being regression
    coefficients:

-   Cox model $h(t)=h_0(t)\exp(\beta_0+\beta_1X_1+\beta_2 X_2+\cdots)$,
    or:

-   $\log(h(t)) = \log(h_0(t)) + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$

-   like a generalized linear model with log link.

## Predictions with `marginaleffects`

-   `marginaleffects` knows about survival models (with sufficient care)
-   Predicted survival probabilities depend on:
    -   the combination of explanatory variables you are looking at
    -   the time at which you are looking at them (when more time has
        passed, it is more likely that the event has happened, so the
        "survival probability" should be lower).
-   look at effect of age by comparing ages 20 and 40, and later look at
    the effect of treatment (values 1 and 0).
-   Also have to provide some times to predict for, in `Months`.

## Effect of age

```{r}
new <- datagrid(model = dance.1, Age = c(20, 40), Months = c(3, 5, 7))
new
```

These are actually for women who *did not* go to the dance competition.

## The predictions

```{r}
cbind(predictions(dance.1, newdata = new, type = "survival")) %>% 
  select(Age, Treatment, Months, estimate)
```

The estimated survival probabilities go down over time. For example a
20-year-old woman here has estimated probability 0.0293 of still dancing
after 5 months.

## A graph

We can plot the predictions over time for an experimental condition such
as age. The key for `plot_predictions` is to put time *first* in the
`condition`:

```{r}
plot_predictions(dance.1, condition = c("Months", "Age"), 
                 type = "survival")
```

## Comments

-   The plot picks some representative ages.
-   It is (usually) best to be up and to the right (has the highest
    chance of surviving longest).
-   Hence the oldest women have the best chance to still be dancing
    longest (the youngest women are most likely to quit soonest).

## The effect of treatment

The same procedure will get predictions for women who did or did not go
to the dance competition, at various times:

```{r}
new <- datagrid(model = dance.1, Treatment = c(0, 1), Months = c(3, 5, 7))
new
```

The age used for predictions is the mean of all ages.

## The predictions

```{r}
cbind(predictions(dance.1, newdata = new, type = "survival")) %>% 
  select(Age, Treatment, Months, estimate)
```

Women of this age have a high (0.879) chance of still dancing after 7
months if they went to the dance competition, but much lower (0.165) if
they did not.

## A graph

In `condition`, put the time variable first, and then the effect of
interest:

```{r}
plot_predictions(dance.1, condition = c("Months", "Treatment"), 
                 type = "survival")
```

## Comments

-   The survival curve for Treatment 1 is higher all the way along
-   Hence at any time, the women who went to the dance competition have
    a higher chance of still dancing than those who did not.

## The model summary again

```{r}
summary(dance.1)
```

## Comments

-   The numbers in the `coef` column describe effect of that variable on
    log-hazard of quitting.
-   Both numbers are negative, so a higher value on both variables goes
    with a lower hazard of quitting:
    -   an older woman is less likely to quit soon (more likely to be
        still dancing)
    -   a woman who went to the dance competition (`Treatment = 1`) is
        less likely to quit soon vs. a woman who didn't (more likely to
        be still dancing).

## Model checking

-   With regression, usually plot residuals against fitted values.

-   Not quite same here (nonlinear model), but "martingale residuals"
    should have no pattern vs. "linear predictor".

-   Use `broom` ideas to get them, in `.resid` and `.fitted` as below.

-   Martingale residuals can go very negative, so won't always look
    normal.

## Martingale residuals

```{r}
dance.1 %>% augment(dance) %>% 
  ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```

## A more realistic example: lung cancer

-   When you load in an R package, get data sets to illustrate functions
    in the package.

-   One such is `lung`. Data set measuring survival in patients with
    advanced lung cancer.

-   Along with survival time, number of "performance scores" included,
    measuring how well patients can perform daily activities.

-   Sometimes high good, but sometimes bad!

-   Variables below, from the data set help file (`?lung`).

## The variables

![](lung-cancer-data.png)

## Uh oh, missing values

\scriptsize

```{r bSurvival-13}
lung
```

\normalsize

## A closer look

```{r bSurvival-14, echo=F}
options(width = 70)
```

\tiny

```{r bSurvival-15}
summary(lung)
```

\normalsize

## Remove obs with *any* missing values

```{r bSurvival-16}
lung %>% drop_na() -> lung.complete
lung.complete %>%
  select(meal.cal:wt.loss) %>%
  slice(1:10)
```

Missing values seem to be gone.

## Check!

```{r bSurvival-17}
summary(lung.complete)
```

No missing values left.

## Model 1: use everything except `inst`

\footnotesize

```{r bSurvival-18}
names(lung.complete)
```

\normalsize

-   Event was death, goes with `status` of 2:

```{r bSurvival-19 }
lung.1 <- coxph(Surv(time, status == 2) ~ . - inst - time - status,
  data = lung.complete
)
```

"Dot" means "all the other variables".

## `summary` of model 1

\tiny

```{r bSurvival-20}
summary(lung.1)
```

\normalsize

## Overall significance

The three tests of overall significance: \small

```{r bSurvival-21}
glance(lung.1) %>% select(starts_with("p.value"))
```

\normalsize

All strongly significant. *Something* predicts survival.

## Coefficients for model 1

\small

```{r bSurvival-22 }
tidy(lung.1) %>% select(term, p.value) %>% arrange(p.value)
```

\normalsize

-   `sex` and `ph.ecog` definitely significant here

-   `age`, `pat.karno` and `meal.cal` definitely not

-   Take out definitely non-sig variables, and try again.

## Model 2

\normalsize

```{r bSurvival-23}
lung.2 <- update(lung.1, . ~ . - age - pat.karno - meal.cal)
tidy(lung.2) %>% select(term, p.value)
```

\normalsize

## Compare with first model:

\normalsize

```{r bSurvival-24}
anova(lung.2, lung.1)
```

\normalsize

-   No harm in taking out those variables.

## Model 3

Take out `ph.karno` and `wt.loss` as well.

```{r bSurvival-25}
lung.3 <- update(lung.2, . ~ . - ph.karno - wt.loss)
```

```{r tidy-lung-3}
tidy(lung.3) %>% select(term, estimate, p.value)
summary(lung.3)
```

## Check whether that was OK

```{r bSurvival-26}
anova(lung.3, lung.2)
```

*Just* OK.

## Commentary

-   OK (just) to take out those two covariates.

-   Both remaining variables strongly significant.

-   Nature of effect on survival time? Consider later.

-   Picture?

## Plotting survival probabilities

-   Assess (separately) the effect of `sex` and `ph.ecog` score using
    `plot_predictions`
-   Don't forget to add time (here actually called `time`) to the
    `condition`.

## Effect of `sex`:

```{r}
plot_predictions(lung.3, condition = c("time", "sex"), type = "survival")
```

-   Females (`sex = 2`) have better survival than males.
-   This graph from a mean `ph.ecog` score, but the male-female
    comparison is the same for any score.

## Effect of `ph.ecog` score:

```{r}
plot_predictions(lung.3, condition = c("time", "ph.ecog"), type = "survival")
```

## Comments

-   A lower `ph.ecog` score is better.
-   For example, a patient with a score of 0 has almost a 50-50 chance
    of living 500 days, but a patient with a score of 3 has almost no
    chance to survive that long.
-   Is this for males or females? See over. (The comparison of scores is
    the same for both.)

## Sex and `ph.ecog` score

```{r}
plot_predictions(lung.3, condition = c("time", "ph.ecog", "sex"), type = "survival")
```

## Comments

-   I think the previous graph was males.
-   This pair of graphs shows the effect of `ph.ecog` score (above and
    below on each facet), and the effect of males (left) vs. females
    (right).
-   The difference between males and females is about the same as 1
    point on the `ph.ecog` scale (compare the red curve on the left
    facet with the green curve on the right facet).

## The summary again

```{r}
summary(lung.3)
```

## Comments

-   A higher-numbered sex (female) has a lower hazard of death (negative
    coef). That is, females are more likely to survive longer than
    males.
-   A higher `ph.ecog` score goes with a *higher* hazard of death
    (positive coef). So patients with a *lower* score are more likely to
    survive longer.
-   These are consistent with the graphs we drew.

## Martingale residuals for this model

No problems here:

```{r bSurvival-32, fig.height=3}
lung.3 %>% augment(lung.complete) %>% 
  ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```

## When the Cox model fails (optional)

-   Invent some data where survival is best at middling age, and worse
    at high *and* low age:

```{r bSurvival-33 }
age <- seq(20, 60, 5)
survtime <- c(10, 12, 11, 21, 15, 20, 8, 9, 11)
stat <- c(1, 1, 1, 1, 0, 1, 1, 1, 1)
d <- tibble(age, survtime, stat)
d %>% mutate(y = Surv(survtime, stat)) -> d
```

-   Small survival time 15 in middle was actually censored, so would
    have been longer if observed.

## Fit Cox model

\footnotesize

```{r bSurvival-34 }
y.1 <- coxph(y ~ age, data = d)
summary(y.1)
```

\normalsize

## Martingale residuals

Down-and-up indicates incorrect relationship between age and survival:

```{r bSurvival-35, fig.height=3.4, message=F}
#| warning = FALSE

y.1 %>% augment(d) %>% 
  ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```

## Attempt 2

Add squared term in age:

```{r bSurvival-36}
y.2 <- coxph(y ~ age + I(age^2), data = d)
tidy(y.2) %>% select(term, estimate, p.value)
```

-   (Marginally) helpful.

## Martingale residuals this time

Not great, but less problematic than before:

```{r bSurvival-37, fig.height=3.2, message=F}
#| warning = FALSE
y.2 %>% augment(d) %>% 
  ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```