---
title: "MatSurv: Survival analysis and visualization in MATLAB"
tags:
- MATLAB
- survival analysis
authors:
- name: Jordan H Creed
affiliation: 1
- name: Travis A Gerke
affiliation: 1
- name: Anders E Berglund
affiliation: 1
affiliations:
- name: Moffitt Cancer Center, Tampa, Florida, United States
index: 1
date: 14 October 2019
bibliography: paper.bib
---
# Summary
Survival analysis is a set of methods for evaluating time-to-event
data that is widely applied across research disciplines. For example,
it is commonly used in clinical trials to compare the effect of treatments.
In cancer biology, it can be used to understand how low- or high-expression
of genes affect the aggressiveness of the tumor. Time-to-event data
frequently include censored data points, samples where no event was observed.
An event is, for example, death, relapse of disease, or a new metastatic tumor.
If none of these events occur during the study period, the time to-to-event is
unknown, we only know that no events were observed during the study time.
The methods described below were developed for this kind of data. For an
in-depth introduction to survival analysis, we can recommend the book by
Kleinbaum and David [@Kleinbaum1998]. In fact, much of the code used in MatSurv is
based on the equations given in the book. Commonly
reported elements of survival analysis include log-rank tests, hazard
ratios (HR) and Kaplan-Meier (KM) curves. KM-curves are used to compare
survival durations between two or more groups and give users a particular
estimate of survival probability at a given time; log-rank tests are
used to conduct statistical inference on survival durations between
groups; and HRs provide a ratio of the hazard rates between groups.
To further improve the KM-plot, it has been suggested that the KM-plot
should always be accomplished by a table that describes the number of
patients that are still “at-risk” at a specific timepoint [@morris2019].
MATLAB [@MATLAB:2019] currently lacks functions to easily create
KM-plots with accompanying risk tables. Furthermore, MATLAB does not
have a built-in log-rank test, nor is one available in any of the
existing toolboxes, including the Statistics and Machine Learning
Toolbox. The Statistics and Machine Learning Toolbox support Cox proportional
hazards regression using the *coxphfit* function and KM-plots can be created using
the *plot* or *stairs* functions. Our goal for MatSurv is to provide an easy-to-use tool that
creates publication quality KM-plots with corresponding risk tables. The
statistical procedures built into MatSurv can be used to compare two or
multiple groups. In addition, MatSurv allows the user to easily modify
the appearance of the created figure. The graphics were inspired by the
`survminer` R-package [@survminer].
# MatSurv Use and Features
MatSurv creates three different items, a KM-plot, a risk table and
statistical results. The KM-plot shows the events and also censoring
for the different groups and it is customizable using different input
parameters. The risk table shows the number of patients “at-risk” for
different time points and it is linked to the KM-plot. The statistical
results are reported as a structure and the different values are described
below. MatSurv uses the log-rank (Mantel-Cox) test to calculate the Chi-
squared test statistic and corresponding p-value describing evidence against
the null hypothesis that the curves are identical.
Users have two options for calculating HRs: the log-rank
or Mantel-Haenszel approach. HR can only be calculated when there are two
groups being compared. In the log-rank approach, HR =
(Oa/Ea)/(Ob/Eb), where
Oa & Ob are the observed events in each group and
Ea & Eb are the number of expected events. In the
Mantel-Haenszel approach, HR = exp((O1-E1)/V),
where O1 is the number of observed events in a group,
E1 is the expected number of events in the same group and V
is the total variance. The two methods give similar results but the log-rank
results will not work of there is no events in one of the groups. The 95%
confidence interval for the HR’s are also reported together with the inverse
of all the values.
In order to use MatSurv, simply put MatSurv.m in any directory of your
choice and make sure it is added to your path. At a minimum, the user
should provide `TimeVar`, a vector with numeric time to event, either
observed or censored, `EventVar`, a vector or cell array defining events
or censoring, and `GroupVar`, a vector or cell array defining the
comparison groups (see example code below).
```
[p,fh,stats]=MatSurv([], [], [], 'Xstep', 4, 'Title', 'MatSurv KM-Plot');
```
The function returns three pieces `p`, the log-rank p-value, `fh`, the
KM-plot figure handle, and `stats`, which are additional statistics from
the log-rank test. The user can further customize the style of their
KM-plot (line colors, labels, ticks, etc.) by making changes to the
figure handle.
When MatSurv is creating the groups based on the median value, the
default option uses values less than the median compared to all other
values, however this is a parameter that can be changed by the user.
The MatSurv software has no dependencies on toolboxes and runs
completely on base MATLAB functions.
# Comparison
The MatSurv output is comparable to that from `proc lifetest` in SAS and
`ggsurvplot` in R. Code for reproducing similar output in R and SAS are
shown below as well as the output from all 3 statistical programs (R,
SAS and MatSurv). The data used is from a classic and frequently used
example by Freireich [@freireich]. In addition, we have also used
the Acute Myeloid Leukemia (LAML) dataset [@laml] from The Cancer
Genome Atlas (TCGA). The following examples, use three different risk
groups as well as the effect of HGF gene expression has on survival.
### R
```
fit <- survfit(survobj ~ RISK_CYTO, data=dat)
ggsurvplot(fit,
risk.table=TRUE,
pval=TRUE,
risk.table.y.text.col=TRUE,
risk.table.y.text=FALSE,
palette = c("#445694", "#A23A2E", "#01665E"),
break.time.by=24)
```
### SAS
```
proc lifetest data=lamlv2(where=(RISK_CYTO ^= 'N.D.'))
intervals=(0 to 120 by 24) timelist = (0 to 120 by 24)
plots=survival(atrisk=0 to 120 by 24 test);
time OS_MONTHS*Surv(0);
strata RISK_CYTO/test=logrank;
run;
```
### MatSurv
```
load laml_RC_data.mat
[p,fh,stats]=MatSurv(laml_RC_TimeVar, laml_RC_EventVar,
laml_RC_GroupVar,... 'GroupsToUse',
{'Good', 'Intermediate', 'Poor'}, 'Xstep', 24,…
'LineColor',[0.2667,0.3373,0.5804;0.6353,0.2275,0.1804;0.0039,0.4000,0.3686]);
```
![Output for Survminer (A), SAS (B) and MatSurv (C). All three produce the same log-rank p-value of 4.02E-6](figure_20191226.png)
The results from MatSurv have been compared against both SAS and R and
found to return similar estimates. The Chi-Sq values and p-values for a
log-rank test in MatSurv, SAS, and R are provided below (Table 1).
| Data | Groups |MatSurv|MatSurv |SAS | SAS |Survminer| Survminer|
|:---------:| :------------:|:-----:|:------:|:-----:|:------:|:-------:|:--------:|
| | |chi-sq |p |chi-sq |p |chi-sq |p |
| Freireich | Groups |16.79 |4.17E-5 |16.79 |4.17E-5 |16.8 |4.17E-5 |
| LAML | RISK_CYTO |24.85 |4.02E-6 |24.85 |< 0.001 |24.8 |4.02E-6 |
| LAML | HGF Median |6.63 |0.01 |6.63 |0.01 |6.6 |0.01 |
| LAML | HGF Quartiles |13.01 |3.09E-4 |13.01 |3.09E-4 |13.0 |3.09E-4 |
| LAML | HGF [6,12] |16.78 |2.27E-4 |16.78 |2.27E-4 |16.8 |2.24E-4 |
# Acknowledgements
This work was supported in part by NCI Cancer Center Support Grant (P30-CA076292).
Patrick Leo, CCIPD, for contributing to this project.
# References