{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use Case 9: Survival Analysis of Endometrial Cancer--PODXL, RAC2, and Tumor Stage\n",
"\n",
"Through modern statistical methods, we can determine survival risk based on a variety of factors. In this tutorial, we will walk through a small example of something you could do with our data to understand what factors relate with survival in various different types of cancer. In this use case, we will be looking at Endometrial Cancer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Import Data and Dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import cptac\n",
"import cptac.utils as ut\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import scipy\n",
"import lifelines\n",
"from lifelines import KaplanMeierFitter\n",
"from lifelines import CoxPHFitter\n",
"from lifelines.statistics import proportional_hazard_test\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"cptac warning: Your version of cptac (1.5.1) is out-of-date. Latest is 1.5.0. Please run 'pip install --upgrade cptac' to update it. (C:\\Users\\sabme\\anaconda3\\lib\\threading.py, line 910)\n"
]
}
],
"source": [
"en = cptac.Ucec()\n",
"clinical = en.get_clinical('mssm')\n",
"proteomics = en.get_proteomics('umich')\n",
"follow_up = en.get_followup('mssm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Gather Data from CPTAC API\n",
"The Endometrial cancer dataset contains months of follow-up data, including whether a patient is still alive (Survival Status) at each follow-up period. We will first merge the clinical and follow-up tables together for analysis. Then we will choose a few attributes to focus on, and narrow our dataframe to those attributes. While you could study a wide variety of factors related to survival, we will be focusing on tumor stage, grade and a proteins of interest listed below in *omics_genes*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we will join the *clinical* and *proteomics* dataframes to contain protein data for proteins of interest, and clinical data for each patient in the same dataframe."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"cols = list(clinical.columns)\n",
"omics_genes = ['RAC2', 'PODXL']\n",
"\n",
"clinical_and_protein = en.join_metadata_to_omics(metadata_name=\"clinical\",\n",
" metadata_source=\"mssm\",\n",
" metadata_cols=cols,\n",
" omics_name=\"proteomics\",\n",
" omics_source=\"umich\",\n",
" omics_genes=omics_genes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will rename the foreign key (\"PPID\" -> \"Patient_ID\") on the follow_up table to allow us to easily join that data with the dataframe of clinical and protein data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
"Name tumor_code discovery_study type_of_analyzed_samples_mssm_clinical \\\n",
"Patient_ID \n",
"C3L-00006 UCEC Yes Tumor_and_Normal \n",
"C3L-00008 UCEC Yes Tumor \n",
"C3L-00032 UCEC Yes Tumor \n",
"C3L-00084 UCEC Yes Tumor \n",
"C3L-00090 UCEC Yes Tumor \n",
"... ... ... ... \n",
"NX5.N NaN NaN NaN \n",
"NX6.N NaN NaN NaN \n",
"NX7.N NaN NaN NaN \n",
"NX8.N NaN NaN NaN \n",
"NX9.N NaN NaN NaN \n",
"\n",
"Name type_of_analyzed_samples_mssm_clinical confirmatory_study age \\\n",
"Patient_ID \n",
"C3L-00006 NaN NaN 64 \n",
"C3L-00008 NaN NaN 58 \n",
"C3L-00032 NaN NaN 50 \n",
"C3L-00084 NaN NaN 74 \n",
"C3L-00090 NaN NaN 75 \n",
"... ... ... ... \n",
"NX5.N NaN NaN NaN \n",
"NX6.N NaN NaN NaN \n",
"NX7.N NaN NaN NaN \n",
"NX8.N NaN NaN NaN \n",
"NX9.N NaN NaN NaN \n",
"\n",
"Name sex race ethnicity \\\n",
"Patient_ID \n",
"C3L-00006 Female White Not Hispanic or Latino \n",
"C3L-00008 Female White Not Hispanic or Latino \n",
"C3L-00032 Female White Not Hispanic or Latino \n",
"C3L-00084 Female White Not Hispanic or Latino \n",
"C3L-00090 Female White Not Hispanic or Latino \n",
"... ... ... ... \n",
"NX5.N NaN NaN NaN \n",
"NX6.N NaN NaN NaN \n",
"NX7.N NaN NaN NaN \n",
"NX8.N NaN NaN NaN \n",
"NX9.N NaN NaN NaN \n",
"\n",
"Name ethnicity_race_ancestry_identified ... \\\n",
"Patient_ID ... \n",
"C3L-00006 White ... \n",
"C3L-00008 White ... \n",
"C3L-00032 White ... \n",
"C3L-00084 White ... \n",
"C3L-00090 White ... \n",
"... ... ... \n",
"NX5.N NaN ... \n",
"NX6.N NaN ... \n",
"NX7.N NaN ... \n",
"NX8.N NaN ... \n",
"NX9.N NaN ... \n",
"\n",
"Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_loco-regional \\\n",
"Patient_ID \n",
"C3L-00006 NaN \n",
"C3L-00008 NaN \n",
"C3L-00032 NaN \n",
"C3L-00084 NaN \n",
"C3L-00090 NaN \n",
"... ... \n",
"NX5.N NaN \n",
"NX6.N NaN \n",
"NX7.N NaN \n",
"NX8.N NaN \n",
"NX9.N NaN \n",
"\n",
"Name number_of_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor_event_metastasis \\\n",
"Patient_ID \n",
"C3L-00006 NaN \n",
"C3L-00008 NaN \n",
"C3L-00032 NaN \n",
"C3L-00084 NaN \n",
"C3L-00090 NaN \n",
"... ... \n",
"NX5.N NaN \n",
"NX6.N NaN \n",
"NX7.N NaN \n",
"NX8.N NaN \n",
"NX9.N NaN \n",
"\n",
"Name Recurrence-free survival, days \\\n",
"Patient_ID \n",
"C3L-00006 NaN \n",
"C3L-00008 NaN \n",
"C3L-00032 NaN \n",
"C3L-00084 NaN \n",
"C3L-00090 50.0 \n",
"... ... \n",
"NX5.N NaN \n",
"NX6.N NaN \n",
"NX7.N NaN \n",
"NX8.N NaN \n",
"NX9.N NaN \n",
"\n",
"Name Recurrence-free survival from collection, days \\\n",
"Patient_ID \n",
"C3L-00006 NaN \n",
"C3L-00008 NaN \n",
"C3L-00032 NaN \n",
"C3L-00084 NaN \n",
"C3L-00090 56.0 \n",
"... ... \n",
"NX5.N NaN \n",
"NX6.N NaN \n",
"NX7.N NaN \n",
"NX8.N NaN \n",
"NX9.N NaN \n",
"\n",
"Name Recurrence status (1, yes; 0, no) Overall survival, days \\\n",
"Patient_ID \n",
"C3L-00006 0.0 737.0 \n",
"C3L-00008 0.0 898.0 \n",
"C3L-00032 0.0 1710.0 \n",
"C3L-00084 0.0 335.0 \n",
"C3L-00090 1.0 1281.0 \n",
"... ... ... \n",
"NX5.N NaN NaN \n",
"NX6.N NaN NaN \n",
"NX7.N NaN NaN \n",
"NX8.N NaN NaN \n",
"NX9.N NaN NaN \n",
"\n",
"Name Overall survival from collection, days \\\n",
"Patient_ID \n",
"C3L-00006 737.0 \n",
"C3L-00008 898.0 \n",
"C3L-00032 1710.0 \n",
"C3L-00084 335.0 \n",
"C3L-00090 1287.0 \n",
"... ... \n",
"NX5.N NaN \n",
"NX6.N NaN \n",
"NX7.N NaN \n",
"NX8.N NaN \n",
"NX9.N NaN \n",
"\n",
"Name Survival status (1, dead; 0, alive) RAC2_umich_proteomics \\\n",
"Patient_ID \n",
"C3L-00006 0.0 -0.182830 \n",
"C3L-00008 0.0 -0.793159 \n",
"C3L-00032 0.0 0.583774 \n",
"C3L-00084 0.0 -0.193889 \n",
"C3L-00090 1.0 -0.361299 \n",
"... ... ... \n",
"NX5.N NaN 0.864272 \n",
"NX6.N NaN 0.841041 \n",
"NX7.N NaN 0.430521 \n",
"NX8.N NaN -0.000459 \n",
"NX9.N NaN 0.448693 \n",
"\n",
"Name PODXL_umich_proteomics \n",
"Patient_ID \n",
"C3L-00006 0.731055 \n",
"C3L-00008 0.451984 \n",
"C3L-00032 1.344697 \n",
"C3L-00084 -1.994844 \n",
"C3L-00090 0.154995 \n",
"... ... \n",
"NX5.N -0.980967 \n",
"NX6.N -0.260866 \n",
"NX7.N -0.498802 \n",
"NX8.N -0.140857 \n",
"NX9.N -0.370379 \n",
"\n",
"[152 rows x 126 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"follow_up = follow_up.rename({'PPID' : 'Patient_ID'}, axis='columns')\n",
"clin_prot_follow = pd.merge(clinical_and_protein, follow_up, on = \"Patient_ID\")\n",
"clinical_and_protein"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"#Determine columns to focus on, and create a subset to work with\n",
"columns_to_focus_on = ['Survival status (1, dead; 0, alive)',\n",
" 'number_of_days_from_date_of_collection_to_date_of_last_contact', \n",
" 'number_of_days_from_date_of_collection_to_date_of_death',\n",
" 'tumor_stage_pathological']\n",
"\n",
"#This adds the protein data that we got from the clinical and proteomics join\n",
"#so that it will be present in our subset of data to work with\n",
"for i in range(len(omics_genes)):\n",
" omics_genes[i] += '_umich_proteomics'\n",
" columns_to_focus_on.append(omics_genes[i])\n",
"\n",
"focus_group = clin_prot_follow[columns_to_focus_on].copy().drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Kaplan Meier Plotting\n",
"Kaplan Meier plots show us the probability of some event occuring over a given length of time, based on some attribute(s). Oftentimes, they are used to plot the probability of death for clinical attributes, however they could also be used in a variety of other contexts. "
]
},
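{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what a Kaplan Meier curve computes, the sketch below applies the product-limit formula to made-up data. The function name `kaplan_meier_by_hand` and the toy durations are assumptions for illustration only; they are not part of *cptac* or *lifelines*. At each time where at least one event occurs, the running survival estimate is multiplied by (1 - deaths / number still at risk). Censored patients never trigger a step; they only reduce the number at risk at later times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def kaplan_meier_by_hand(durations, events):\n",
"    # Product-limit estimate: returns (event_times, survival_estimates)\n",
"    durations = np.asarray(durations, dtype=float)\n",
"    events = np.asarray(events, dtype=bool)\n",
"    event_times = np.unique(durations[events])\n",
"    survival = 1.0\n",
"    estimates = []\n",
"    for t in event_times:\n",
"        at_risk = np.sum(durations >= t)  # patients still under observation at t\n",
"        deaths = np.sum((durations == t) & events)  # events observed exactly at t\n",
"        survival *= 1.0 - deaths / at_risk\n",
"        estimates.append(survival)\n",
"    return event_times, np.array(estimates)\n",
"\n",
"# Made-up follow-up times (days) and event flags (True = died, False = censored)\n",
"toy_durations = [5, 8, 8, 12, 15, 20]\n",
"toy_events = [True, False, True, True, False, False]\n",
"toy_times, toy_survival = kaplan_meier_by_hand(toy_durations, toy_events)\n",
"for t, s in zip(toy_times, toy_survival):\n",
"    print(f'day {t:.0f}: estimated survival {s:.3f}')"
]
},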
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will start by finding all patients that have died during the follow-up period and update the column to contain boolean values, where True denotes that the event occurred ('Deceased'), and False denotes that it did not ('Living'). We will then combine the two columns containing timeframe data ('Days_Between_Collection_And_Last_Contact', and 'Days_Between_Collection_And_Death'), to help us with plotting survival curves. These steps are necessary to fit the requirements of the *lifelines* package."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#Make the Survival Status column boolean\n",
"focus_group['Survival status (1, dead; 0, alive)'] = focus_group['Survival status (1, dead; 0, alive)'].replace(0, False)\n",
"focus_group['Survival status (1, dead; 0, alive)'] = focus_group['Survival status (1, dead; 0, alive)'].replace(1, True)\n",
"focus_group['Survival status (1, dead; 0, alive)'] = focus_group['Survival status (1, dead; 0, alive)'].astype('bool')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"cols = ['number_of_days_from_date_of_collection_to_date_of_last_contact', 'number_of_days_from_date_of_collection_to_date_of_death']\n",
"\n",
"focus_group = focus_group.assign(Days_Until_Last_Contact_Or_Death=focus_group[cols].sum(1)).drop(cols, axis=1)"
]
},
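{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two preparation steps above can be sketched on a made-up table to show how missing values behave. The column names `status`, `days_to_last_contact`, and `days_to_death` are hypothetical stand-ins for the real columns. Two points are worth checking in your own data: casting to boolean turns a missing status into True, and summing two missing durations gives 0 unless `min_count=1` is passed. This is only a sketch of one possible safeguard, not the required approach."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up data: one living patient, one deceased patient, one with nothing recorded\n",
"toy = pd.DataFrame({\n",
"    'status': [0.0, 1.0, np.nan],\n",
"    'days_to_last_contact': [700.0, np.nan, np.nan],\n",
"    'days_to_death': [np.nan, 300.0, np.nan],\n",
"})\n",
"\n",
"# NaN is truthy, so a plain cast records the third patient as deceased\n",
"naive_status = toy['status'].astype(bool)\n",
"\n",
"# Dropping rows with an unknown status before casting avoids that\n",
"toy_known = toy.dropna(subset=['status']).copy()\n",
"toy_known['status'] = toy_known['status'].astype(bool)\n",
"\n",
"# Each patient has only one of the two durations, so a row-wise sum combines them;\n",
"# min_count=1 keeps the result missing when both durations are missing\n",
"toy_known['days_until_last_contact_or_death'] = toy_known[\n",
"    ['days_to_last_contact', 'days_to_death']\n",
"].sum(axis=1, min_count=1)\n",
"print(toy_known[['status', 'days_until_last_contact_or_death']])"
]
},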
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will create a general Kaplan Meier Plot of overall survival for our cohort, using the KaplanMeierFitter() from the *lifelines* package."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"time = focus_group['Days_Until_Last_Contact_Or_Death']\n",
"status = focus_group['Survival status (1, dead; 0, alive)']\n",
"\n",
"kmf = KaplanMeierFitter()\n",
"kmf.fit(time, event_observed = status)\n",
"kmf.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 4 Prepare Data for Multivariate Kaplan Meier Plots and Cox's Proportional Hazard Test\n",
"We will now group our columns of interest into 3-4 distinct categories each, and assign them numeric values. It is necessary for the requirements of the *lifelines* package that the categories are assigned numeric values (other data types, including category, are not compatible with the functions we will be using)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df_genes = focus_group.copy()\n",
"\n",
"#Here, we are separating the protein abundance values for each of our proteins\n",
"#of interest into 3 groups, based on relative abundance of the protein\n",
"for col in omics_genes:\n",
" lower_25_filter = df_genes[col] <= df_genes[col].quantile(.25)\n",
" upper_25_filter = df_genes[col] >= df_genes[col].quantile(.75)\n",
"\n",
" df_genes[col] = np.where(lower_25_filter, \"Lower_25%\", df_genes[col])\n",
" df_genes[col] = np.where(upper_25_filter, \"Upper_25%\", df_genes[col])\n",
" df_genes[col] = np.where(~lower_25_filter & ~upper_25_filter, \"Middle_50%\", df_genes[col])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"#Here, we map numeric values to correspond with our 3 protein categories\n",
"proteomics_map = {\"Lower_25%\" : 1, \"Middle_50%\" : 2, \"Upper_25%\" : 3}\n",
"for gene in omics_genes:\n",
" df_genes[gene] = df_genes[gene].map(proteomics_map)"
]
},
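{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a more compact sketch of the same quartile grouping on made-up values. The helper name `quartile_group` is hypothetical. One difference from the cells above: comparisons with NaN evaluate to False, so the `np.where` chain places any missing abundance in the middle group, whereas this version keeps it missing so that a later `dropna` removes it. Whether that matters depends on whether your proteins of interest have missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def quartile_group(values):\n",
"    # Code a numeric Series as 1 (lowest 25%), 2 (middle 50%), or 3 (highest 25%)\n",
"    lower = values.quantile(0.25)\n",
"    upper = values.quantile(0.75)\n",
"    conditions = [(values <= lower).to_numpy(), (values >= upper).to_numpy()]\n",
"    codes = np.select(conditions, [1.0, 3.0], default=2.0)\n",
"    # Restore missing values instead of leaving them in the default (middle) group\n",
"    return pd.Series(codes, index=values.index).where(values.notna())\n",
"\n",
"# Made-up abundance values, including one missing measurement\n",
"toy_abundance = pd.Series([-1.2, -0.4, 0.0, 0.3, 0.9, np.nan])\n",
"toy_groups = quartile_group(toy_abundance)\n",
"print(toy_groups.tolist())"
]
},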
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"#Here we map numeric values to corresponding tumor stages\n",
"figo_map = {\"Stage III\" : 3, \"Stage IV\" : 4, \n",
" \"Not Reported/ Unknown\" : np.nan,\n",
" \"Stage I\" : 1, \"Stage II\" : 2}\n",
"\n",
"df_genes['tumor_stage_pathological'] = df_genes['tumor_stage_pathological'].map(figo_map)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name Survival status (1, dead; 0, alive) tumor_stage_pathological \\\n",
"Patient_ID \n",
"C3L-00006 False 1.0 \n",
"C3L-00008 False 1.0 \n",
"C3L-00032 False 1.0 \n",
"C3L-00084 False 1.0 \n",
"C3L-00090 True 1.0 \n",
"... ... ... \n",
"C3N-01520 True 1.0 \n",
"C3N-01521 False 1.0 \n",
"C3N-01537 False 2.0 \n",
"C3N-01802 False 2.0 \n",
"C3N-01825 False 1.0 \n",
"\n",
"Name RAC2_umich_proteomics PODXL_umich_proteomics \\\n",
"Patient_ID \n",
"C3L-00006 2 3 \n",
"C3L-00008 1 3 \n",
"C3L-00032 3 3 \n",
"C3L-00084 2 1 \n",
"C3L-00090 2 2 \n",
"... ... ... \n",
"C3N-01520 2 2 \n",
"C3N-01521 3 3 \n",
"C3N-01537 3 2 \n",
"C3N-01802 1 1 \n",
"C3N-01825 3 1 \n",
"\n",
"Name Days_Until_Last_Contact_Or_Death \n",
"Patient_ID \n",
"C3L-00006 737.0 \n",
"C3L-00008 898.0 \n",
"C3L-00032 1710.0 \n",
"C3L-00084 335.0 \n",
"C3L-00090 1287.0 \n",
"... ... \n",
"C3N-01520 278.0 \n",
"C3N-01521 681.0 \n",
"C3N-01537 671.0 \n",
"C3N-01802 740.0 \n",
"C3N-01825 661.0 \n",
"\n",
"[103 rows x 5 columns]\n"
]
}
],
"source": [
"#Then we will drop missing values, as missing values \n",
"# will throw an error in the lifelines functions\n",
"print(df_genes)\n",
"df_clean = df_genes.dropna(axis=0, how='any').copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verify that your columns are the correct data types. They may appear to be correct up front, but could actually be hidden as slightly different data types. The event of interest, in this case *Survival Status* needs to contain boolean values, and all other columns in the table must be of a numeric type (either int64 or float64)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Survival status (1, dead; 0, alive) : bool\n",
"tumor_stage_pathological : float64\n",
"RAC2_umich_proteomics : int64\n",
"PODXL_umich_proteomics : int64\n",
"Days_Until_Last_Contact_Or_Death : float64\n"
]
}
],
"source": [
"for col in df_clean.columns:\n",
" print(col, \":\", df_clean[col].dtype)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
Name
\n",
"
Survival status (1, dead; 0, alive)
\n",
"
tumor_stage_pathological
\n",
"
RAC2_umich_proteomics
\n",
"
PODXL_umich_proteomics
\n",
"
Days_Until_Last_Contact_Or_Death
\n",
"
\n",
"
\n",
"
Patient_ID
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
C3L-00006
\n",
"
False
\n",
"
1.0
\n",
"
2
\n",
"
3
\n",
"
737.0
\n",
"
\n",
"
\n",
"
C3L-00008
\n",
"
False
\n",
"
1.0
\n",
"
1
\n",
"
3
\n",
"
898.0
\n",
"
\n",
"
\n",
"
C3L-00032
\n",
"
False
\n",
"
1.0
\n",
"
3
\n",
"
3
\n",
"
1710.0
\n",
"
\n",
"
\n",
"
C3L-00084
\n",
"
False
\n",
"
1.0
\n",
"
2
\n",
"
1
\n",
"
335.0
\n",
"
\n",
"
\n",
"
C3L-00090
\n",
"
True
\n",
"
1.0
\n",
"
2
\n",
"
2
\n",
"
1287.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Name Survival status (1, dead; 0, alive) tumor_stage_pathological \\\n",
"Patient_ID \n",
"C3L-00006 False 1.0 \n",
"C3L-00008 False 1.0 \n",
"C3L-00032 False 1.0 \n",
"C3L-00084 False 1.0 \n",
"C3L-00090 True 1.0 \n",
"\n",
"Name RAC2_umich_proteomics PODXL_umich_proteomics \\\n",
"Patient_ID \n",
"C3L-00006 2 3 \n",
"C3L-00008 1 3 \n",
"C3L-00032 3 3 \n",
"C3L-00084 2 1 \n",
"C3L-00090 2 2 \n",
"\n",
"Name Days_Until_Last_Contact_Or_Death \n",
"Patient_ID \n",
"C3L-00006 737.0 \n",
"C3L-00008 898.0 \n",
"C3L-00032 1710.0 \n",
"C3L-00084 335.0 \n",
"C3L-00090 1287.0 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 5: Multivariate Survival Risk Plotting\n",
"\n",
"With the CoxPHFitter from the lifelines package we can create covariate survival plots, as shown below. The variables we are interested in exploring are Tumor Stage, RAC2 abundance, and PODXL abundance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we will fit our model to the data we have prepared using the CoxPHFitter() class from the lifelines module."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cph = CoxPHFitter()\n",
"cph.fit(df_clean, duration_col = \"Days_Until_Last_Contact_Or_Death\", \n",
" event_col = \"Survival status (1, dead; 0, alive)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we will plot each of the attributes to see how different levels of protein or different tumor stages affect survival outcomes in Endometrial Cancer patients."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"attributes = ['tumor_stage_pathological', 'PODXL_umich_proteomics', 'RAC2_umich_proteomics']"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for attribute in attributes:\n",
" plot_title = \"Endometrial Cancer Survival Risk: \" + attribute\n",
" cph.plot_partial_effects_on_outcome(attribute, [1,2,3], cmap='coolwarm', \n",
" title=plot_title)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Results\n",
"These different analyses tend to follow the baseline survival function, however, there are some differences in varying levels of each attribute. For example, FIGO Stage I tumors tend to have a higher survival rate over time comparatively to Stage III tumors. We can explore these differences with the CoxPHFitter object's *print_summary* function (which prints out results for multivariate linear regression)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
model
\n",
"
lifelines.CoxPHFitter
\n",
"
\n",
"
\n",
"
duration col
\n",
"
'Days_Until_Last_Contact_Or_Death'
\n",
"
\n",
"
\n",
"
event col
\n",
"
'Survival status (1, dead; 0, alive)'
\n",
"
\n",
"
\n",
"
baseline estimation
\n",
"
breslow
\n",
"
\n",
"
\n",
"
number of observations
\n",
"
103
\n",
"
\n",
"
\n",
"
number of events observed
\n",
"
12
\n",
"
\n",
"
\n",
"
partial log-likelihood
\n",
"
-45.702
\n",
"
\n",
"
\n",
"
time fit was run
\n",
"
2023-09-13 20:43:11 UTC
\n",
"
\n",
"
\n",
"
model
\n",
"
untransformed variables
\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
coef
\n",
"
exp(coef)
\n",
"
se(coef)
\n",
"
coef lower 95%
\n",
"
coef upper 95%
\n",
"
exp(coef) lower 95%
\n",
"
exp(coef) upper 95%
\n",
"
cmp to
\n",
"
z
\n",
"
p
\n",
"
-log2(p)
\n",
"
\n",
" \n",
" \n",
"
\n",
"
tumor_stage_pathological
\n",
"
0.759
\n",
"
2.136
\n",
"
0.274
\n",
"
0.222
\n",
"
1.295
\n",
"
1.249
\n",
"
3.652
\n",
"
0.000
\n",
"
2.773
\n",
"
0.006
\n",
"
7.490
\n",
"
\n",
"
\n",
"
RAC2_umich_proteomics
\n",
"
0.170
\n",
"
1.185
\n",
"
0.409
\n",
"
-0.632
\n",
"
0.971
\n",
"
0.531
\n",
"
2.641
\n",
"
0.000
\n",
"
0.415
\n",
"
0.678
\n",
"
0.560
\n",
"
\n",
"
\n",
"
PODXL_umich_proteomics
\n",
"
-0.173
\n",
"
0.841
\n",
"
0.416
\n",
"
-0.989
\n",
"
0.643
\n",
"
0.372
\n",
"
1.903
\n",
"
0.000
\n",
"
-0.415
\n",
"
0.678
\n",
"
0.560
\n",
"
\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
Concordance
\n",
"
0.754
\n",
"
\n",
"
\n",
"
Partial AIC
\n",
"
97.403
\n",
"
\n",
"
\n",
"
log-likelihood ratio test
\n",
"
8.276 on 3 df
\n",
"
\n",
"
\n",
"
-log2(p) of ll-ratio test
\n",
"
4.621
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/latex": [
"\\begin{tabular}{lrrrrrrrrrrr}\n",
" & coef & exp(coef) & se(coef) & coef lower 95% & coef upper 95% & exp(coef) lower 95% & exp(coef) upper 95% & cmp to & z & p & -log2(p) \\\\\n",
"covariate & & & & & & & & & & & \\\\\n",
"tumor_stage_pathological & 0.759 & 2.136 & 0.274 & 0.222 & 1.295 & 1.249 & 3.652 & 0.000 & 2.773 & 0.006 & 7.490 \\\\\n",
"RAC2_umich_proteomics & 0.170 & 1.185 & 0.409 & -0.632 & 0.971 & 0.531 & 2.641 & 0.000 & 0.415 & 0.678 & 0.560 \\\\\n",
"PODXL_umich_proteomics & -0.173 & 0.841 & 0.416 & -0.989 & 0.643 & 0.372 & 1.903 & 0.000 & -0.415 & 0.678 & 0.560 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\n",
" duration col = 'Days_Until_Last_Contact_Or_Death'\n",
" event col = 'Survival status (1, dead; 0, alive)'\n",
" baseline estimation = breslow\n",
" number of observations = 103\n",
"number of events observed = 12\n",
" partial log-likelihood = -45.702\n",
" time fit was run = 2023-09-13 20:43:11 UTC\n",
" model = untransformed variables\n",
"\n",
"---\n",
" coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%\n",
"covariate \n",
"tumor_stage_pathological 0.759 2.136 0.274 0.222 1.295 1.249 3.652\n",
"RAC2_umich_proteomics 0.170 1.185 0.409 -0.632 0.971 0.531 2.641\n",
"PODXL_umich_proteomics -0.173 0.841 0.416 -0.989 0.643 0.372 1.903\n",
"\n",
" cmp to z p -log2(p)\n",
"covariate \n",
"tumor_stage_pathological 0.000 2.773 0.006 7.490\n",
"RAC2_umich_proteomics 0.000 0.415 0.678 0.560\n",
"PODXL_umich_proteomics 0.000 -0.415 0.678 0.560\n",
"---\n",
"Concordance = 0.754\n",
"Partial AIC = 97.403\n",
"log-likelihood ratio test = 8.276 on 3 df\n",
"-log2(p) of ll-ratio test = 4.621"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cph.print_summary(model=\"untransformed variables\", decimals=3)"
]
},
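{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary reports the log-likelihood ratio test as a statistic and a *-log2(p)* value rather than a plain p-value. As a rough sketch, assuming the statistic is compared against a chi-squared distribution with the listed degrees of freedom, the p-value can be recovered with scipy. The numbers are copied from the summary above and the variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import chi2\n",
"\n",
"# Values copied from the model summary above\n",
"lr_statistic = 8.276\n",
"degrees_of_freedom = 3\n",
"\n",
"# Upper-tail probability of the chi-squared distribution\n",
"lr_p_value = chi2.sf(lr_statistic, degrees_of_freedom)\n",
"neg_log2_p = -np.log2(lr_p_value)\n",
"print(f'p = {lr_p_value:.4f}, -log2(p) = {neg_log2_p:.3f}')"
]
},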
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 6 Cox's Proportional Hazard Test\n",
"With the *proportional_hazard_test* function, we can now perform Cox's Proportional Hazard Test on the data to determine how each attribute contributes to our cohort's overall survival. This is shown by the hazard ratio in the column labeled *-log2(p)* below. In general, a hazard ratio of 1 suggests that an attribute has no effect on overall survival. A ratio less than 1 suggests that an attribute contributes to lower survival risk. A ratio greater than 1 suggests that an attribute contributes to higher survival risk."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
time_transform
\n",
"
rank
\n",
"
\n",
"
\n",
"
null_distribution
\n",
"
chi squared
\n",
"
\n",
"
\n",
"
degrees_of_freedom
\n",
"
1
\n",
"
\n",
"
\n",
"
model
\n",
"
<lifelines.CoxPHFitter: fitted with 103 total ...
\n",
"
\n",
"
\n",
"
test_name
\n",
"
proportional_hazard_test
\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
test_statistic
\n",
"
p
\n",
"
-log2(p)
\n",
"
\n",
" \n",
" \n",
"
\n",
"
PODXL_umich_proteomics
\n",
"
1.40
\n",
"
0.24
\n",
"
2.08
\n",
"
\n",
"
\n",
"
RAC2_umich_proteomics
\n",
"
0.31
\n",
"
0.58
\n",
"
0.79
\n",
"
\n",
"
\n",
"
tumor_stage_pathological
\n",
"
1.73
\n",
"
0.19
\n",
"
2.41
\n",
"
\n",
" \n",
"
"
],
"text/latex": [
"\\begin{tabular}{lrrr}\n",
" & test_statistic & p & -log2(p) \\\\\n",
"PODXL_umich_proteomics & 1.40 & 0.24 & 2.08 \\\\\n",
"RAC2_umich_proteomics & 0.31 & 0.58 & 0.79 \\\\\n",
"tumor_stage_pathological & 1.73 & 0.19 & 2.41 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\n",
" time_transform = rank\n",
" null_distribution = chi squared\n",
"degrees_of_freedom = 1\n",
" model = \n",
" test_name = proportional_hazard_test\n",
"\n",
"---\n",
" test_statistic p -log2(p)\n",
"PODXL_umich_proteomics 1.40 0.24 2.08\n",
"RAC2_umich_proteomics 0.31 0.58 0.79\n",
"tumor_stage_pathological 1.73 0.19 2.41"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"results = proportional_hazard_test(cph, df_clean, time_transform='rank')\n",
"results.print_summary(decimals=3, model=\"untransformed variables\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show confidence intervals for each of the hazard ratios. Since both bars include the log(HR) of 1.0 and both of their p-values were greater than 0.05, there is insufficient evidence to suggest that a specific Histologic Grade or Tumor Stage is connected with negative clinical outcomes of death or the development of a new tumor *in our cohort of Endometrial cancer tumors*."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cph.plot()\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Closing Remarks\n",
"It is important to note that there are relatively few patients who died in our cohort (7 out of 88), which is good, but with such a small sample size of death events, it is difficult to conclude with certainty that these features are not more or less connected with survival. Perhaps a sample of patients with more deaths might have different results. Alternatively, studying an event with more negative outcomes (such as tumor recurrence) may also provide more data to work with."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survival status (1, dead; 0, alive)\n",
"False 91\n",
"True 12\n",
"Name: count, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean['Survival status (1, dead; 0, alive)'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also important to note that the confidence intervals for these ratios are very large, especially since hazard ratios are standardly shown on a log-scale."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
95% lower-bound
\n",
"
95% upper-bound
\n",
"
\n",
"
\n",
"
covariate
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
tumor_stage_pathological
\n",
"
0.222404
\n",
"
1.295220
\n",
"
\n",
"
\n",
"
RAC2_umich_proteomics
\n",
"
-0.632089
\n",
"
0.971199
\n",
"
\n",
"
\n",
"
PODXL_umich_proteomics
\n",
"
-0.988706
\n",
"
0.643185
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 95% lower-bound 95% upper-bound\n",
"covariate \n",
"tumor_stage_pathological 0.222404 1.295220\n",
"RAC2_umich_proteomics -0.632089 0.971199\n",
"PODXL_umich_proteomics -0.988706 0.643185"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cph.confidence_intervals_"
]
},
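{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `confidence_intervals_` reports bounds for the coefficients (log hazard ratios), exponentiating them gives intervals on the hazard ratio scale. The sketch below copies the bounds printed above into a small DataFrame instead of relying on the fitted model, so the names `log_hr_bounds` and `hr_bounds` are illustrative. An interval that lies entirely above or entirely below 1 corresponds to a coefficient interval that excludes 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# 95% bounds for the coefficients, copied from the output above\n",
"log_hr_bounds = pd.DataFrame(\n",
"    {\n",
"        'lower': [0.222404, -0.632089, -0.988706],\n",
"        'upper': [1.295220, 0.971199, 0.643185],\n",
"    },\n",
"    index=['tumor_stage_pathological', 'RAC2_umich_proteomics', 'PODXL_umich_proteomics'],\n",
")\n",
"\n",
"# Exponentiate to move from the log hazard ratio scale to the hazard ratio scale\n",
"hr_bounds = np.exp(log_hr_bounds)\n",
"hr_bounds['excludes_hr_of_1'] = (hr_bounds['lower'] > 1) | (hr_bounds['upper'] < 1)\n",
"print(hr_bounds.round(3))"
]
},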
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is just one example of how you might use Survival Analysis to learn more about different types of cancer, and how clinical and/or genetic attributes contribute to likelihood of survival. There are many other clinical and genetic attributes, as well as several other cancer types, that can be explored using a similar process to that above. In particular, lung cancer and ovarian cancer have a larger number of negative outcomes per cohort, and would be good to look into further. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}