{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Predicting Patient Outcomes of Artificial Heart Implant\n",
"\n",
"The artificial heart, VAD (ventricular assist device), is an implantable electromechanical device used to partially replace the function of a heart. It is the last therapeutic treatment for people in end-stage heart failure. VADs were expected to extend patients lives for several years. However, many patients who received VADs died shortly after the implant.\n",
"\n",
"In this project, I mine previous VAD recipients' clinical records and outcomes, and build machine learning models that can help physicians to predict the likely outcome of each implant.\n",
"\n",
"
\n",
"\n",
"#### Data Science Problem, Goal and Prior Work: \n",
"\n",
"I used the [INTERMACS dataset](https://intermacs.uab.edu/) for this project. This dataset includes 23,787 patients' clinical data relevant to mechanical circulatory support devices (VAD is one kind of such devices) from initial hospitalization through post-implant follow-up evaluations. Their pre-implant clinical conditions served as starting places for my feature engineering, and the time interval between their implant and death/explant is the training labels.\n",
"\n",
"Comparable previous works on VAD implant prognostics using INTERMACS data most often used linear regression or Bayesian models. The accuracy of one-year mortality predictions is [83%](http://d-scholarship.pitt.edu/25529/) to [84.5%](https://www.ncbi.nlm.nih.gov/pubmed/26820445)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Table of Content: **\n",
"\n",
"1. [Data Preparation](#preprocessing)\n",
"2. [Problem Setup and Baseline](#baseline)\n",
"3. [Data Exploration, Cleaning and Visualization](#cleaning)\n",
" - [Overview and transform prediction targets](#3-1)\n",
" - [Dealing with and utilizing (lots of) missing values](#3-2)\n",
" - [Managing feature redundancy](#3-3)\n",
" - [Managing observation redundancy](#3-4)\n",
" \n",
"2. [Supervised Learning](#supervised)\n",
"3. [\"Divide and Conquer\"](#divide)\n",
"4. [Explorations for Future Work](#other)\n",
" - Prognostic Fairness\n",
" - Patient Clustering\n",
" - Deep Learning\n",
"5. [Conclusion and Lessons Learned](#sum)\n",
"6. [Reference](#reference)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# utility\n",
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"from collections import Counter\n",
"def pprintDict(d):\n",
" '''pretty print dictionary'''\n",
" print(json.dumps(d, sort_keys=True, indent=4))\n",
"import warnings\n",
"warnings.simplefilter('ignore')\n",
"\n",
"# stats and machine learning\n",
"import math\n",
"import random as rand \n",
"import scipy.stats as stats\n",
"import scipy.sparse as sp\n",
"from sklearn import dummy, preprocessing, feature_selection, feature_extraction, metrics, model_selection\n",
"from sklearn import tree, multiclass, svm, ensemble, gaussian_process, neighbors\n",
"from sklearn.linear_model import LinearRegression\n",
"# from point import Point\n",
"\n",
"from imblearn.combine import SMOTETomek # https://github.com/scikit-learn-contrib/imbalanced-learn\n",
"\n",
"# visualization\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('ggplot')\n",
"import seaborn as sns\n",
"sns.set_style(\"white\")\n",
"sns.set_palette(\"PuBuGn_d\")\n",
"from matplotlib.ticker import FuncFormatter\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 1 - Data Preparation\n",
"\n",
"### 1.1 - Consolidating INTERMACS databases\n",
"\n",
"I elicited data of the patients who have received VAD implants and have died or explanted. Please see [`dataprep.py`](https://raw.githubusercontent.com/yang-qian/15-688-Final/master/dataprep.py) for the implementation of this part.\n",
"\n",
"- Combine four INTERMACS databases. Here I took all columns as objects (str) for the moment.\n",
" - ```patient_INTERMACS_Data_Dictionary.csv```\n",
" - ```device_INTERMACS_Data_Dictionary.csv```\n",
" - ```followup_INTERMACS_Data_Dictionary.csv```\n",
"- Elicitating patients who eventually received an artificial heart;\n",
"- Focusing on first-time implant patients;\n",
"- Seperating pre-implant and post-implant variables;\n",
"- Set INT_DEAD (integer) as label."
]
},
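{
"cell_type": "markdown",
"metadata": {},
"source": [
"The consolidation steps above can be sketched roughly as follows. This is a minimal, hypothetical illustration on toy tables, not the code in `dataprep.py`: `PATIENT_ID`, `INT_DEAD`, `CC_ADVANCED_AGE_M` and `TXPL_PT` appear in this notebook, while `OPER_ID`, `DEVICE_TY`, the toy values and the `consolidate` helper are assumptions made only for the example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-ins for the INTERMACS tables (all values are made up).\n",
"# OPER_ID and DEVICE_TY are hypothetical column names for illustration.\n",
"patient = pd.DataFrame({\n",
"    'PATIENT_ID': ['12', '13', '14'],\n",
"    'CC_ADVANCED_AGE_M': ['No', 'Missing', 'Yes'],\n",
"})\n",
"device = pd.DataFrame({\n",
"    'PATIENT_ID': ['12', '12', '13', '14'],\n",
"    'OPER_ID': ['1', '2', '1', '1'],  # assumed implant sequence per patient\n",
"    'DEVICE_TY': ['VAD', 'VAD', 'VAD', 'Other'],\n",
"})\n",
"followup = pd.DataFrame({\n",
"    'PATIENT_ID': ['12', '13', '14'],\n",
"    'INT_DEAD': ['7', '30', '2'],\n",
"    'TXPL_PT': ['0.0', '0.0', 'nan'],\n",
"})\n",
"\n",
"def consolidate(patient, device, followup):\n",
"    # Treat every column as str for now, as described in the text.\n",
"    patient, device, followup = (t.astype(str) for t in (patient, device, followup))\n",
"    # Keep only VAD recipients, and only their first implant.\n",
"    first_vad = device[(device['DEVICE_TY'] == 'VAD') & (device['OPER_ID'] == '1')]\n",
"    merged = (first_vad.merge(patient, on='PATIENT_ID', how='inner')\n",
"                       .merge(followup, on='PATIENT_ID', how='inner'))\n",
"    # Separate pre-implant inputs from post-implant outcome variables.\n",
"    post_cols = [c for c in followup.columns if c != 'PATIENT_ID']\n",
"    X = merged.drop(columns=post_cols)\n",
"    Y = merged[['PATIENT_ID'] + post_cols]\n",
"    # Cast INT_DEAD to integer and use it as the label.\n",
"    y = merged['INT_DEAD'].astype(int)\n",
"    return X, Y, y\n",
"\n",
"X, Y, y = consolidate(patient, device, followup)\n",
"print(X.shape, Y.shape, y.tolist())\n",
"```\n",
"\n",
"Keeping the toy tables as strings mirrors the choice of reading every column as an object first and casting only the label."
]
},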
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Integrated and imported 551 input columns, 118 output columns and 5821 observations.\n"
]
},
{
"data": {
"text/html": [
"
| \n", " | PATIENT_ID | \n", "CC_ADVANCED_AGE_M | \n", "CC2_ADVANCED_AGE_M | \n", "CC_ALLOSENSITIZATION_M | \n", "CC2_ALLOSENSITIZATION_M | \n", "CC_CHRONIC_COAGULOPATHY | \n", "CC2_CHRONIC_COAGULOPATHY | \n", "CC_CHRONIC_INF_CONCERNS_M | \n", "CC2_CHRONIC_INF_CONCERNS_M | \n", "CC_CHRONIC_RENAL_DISEASE_M | \n", "... | \n", "OP4EXPL | \n", "OP4EXPREA | \n", "OP4INTD | \n", "OP4INTR | \n", "OP4INTT | \n", "OP4REC | \n", "OP4TXPL | \n", "TRANSFER_CARE | \n", "TREC_PT | \n", "TXPL_PT | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | \n", "12 | \n", "Missing | \n", "No | \n", "Missing | \n", "No | \n", "Missing | \n", "Missing | \n", "Missing | \n", "No | \n", "Missing | \n", "... | \n", "0.0 | \n", "nan | \n", "7.162364730299999 | \n", "7.162364730299999 | \n", "7.162364730299999 | \n", "0.0 | \n", "0.0 | \n", "nan | \n", "0.0 | \n", "0.0 | \n", "
| 3 | \n", "12 | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "... | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "
| 4 | \n", "12 | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "... | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "
| 5 | \n", "12 | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "No | \n", "Yes | \n", "Yes | \n", "No | \n", "... | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "
| 6 | \n", "13 | \n", "Missing | \n", "No | \n", "Missing | \n", "No | \n", "Missing | \n", "Missing | \n", "Missing | \n", "No | \n", "Missing | \n", "... | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "nan | \n", "0.0 | \n", "0.0 | \n", "
5 rows × 672 columns
\n", "