{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Predicting Patient Outcomes of Artificial Heart Implant\n", "\n", "The artificial heart, or VAD (ventricular assist device), is an implantable electromechanical device used to partially replace the function of a heart. It is the last-resort treatment for people in end-stage heart failure. VADs were expected to extend patients' lives by several years. However, many patients who received VADs died shortly after the implant.\n", "\n", "In this project, I mine previous VAD recipients' clinical records and outcomes, and build machine learning models that can help physicians predict the likely outcome of each implant.\n", "\n", "\n", "\n", "#### Data Science Problem, Goal and Prior Work\n", "\n", "I used the [INTERMACS dataset](https://intermacs.uab.edu/) for this project. This dataset includes clinical data for 23,787 patients relevant to mechanical circulatory support devices (a VAD is one such device), from initial hospitalization through post-implant follow-up evaluations. The patients' pre-implant clinical conditions served as the starting point for my feature engineering, and the time interval between implant and death/explant serves as the training label.\n", "\n", "Comparable prior work on VAD implant prognostics using INTERMACS data most often used linear regression or Bayesian models. The reported accuracy of one-year mortality predictions ranges from [83%](http://d-scholarship.pitt.edu/25529/) to [84.5%](https://www.ncbi.nlm.nih.gov/pubmed/26820445)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents:**\n", "\n", "1. [Data Preparation](#preprocessing)\n", "2. [Problem Setup and Baseline](#baseline)\n", "3. 
[Data Exploration, Cleaning and Visualization](#cleaning)\n", "    - [Overview and transform prediction targets](#3-1)\n", "    - [Dealing with and utilizing (lots of) missing values](#3-2)\n", "    - [Managing feature redundancy](#3-3)\n", "    - [Managing observation redundancy](#3-4)\n", "\n", "4. [Supervised Learning](#supervised)\n", "5. [\"Divide and Conquer\"](#divide)\n", "6. [Explorations for Future Work](#other)\n", "    - Prognostic Fairness\n", "    - Patient Clustering\n", "    - Deep Learning\n", "7. [Conclusion and Lessons Learned](#sum)\n", "8. [Reference](#reference)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# utility\n", "import pandas as pd\n", "import numpy as np\n", "import json\n", "from collections import Counter\n", "def pprintDict(d):\n", "    '''pretty print dictionary'''\n", "    print(json.dumps(d, sort_keys=True, indent=4))\n", "import warnings\n", "warnings.simplefilter('ignore')\n", "\n", "# stats and machine learning\n", "import math\n", "import random as rand\n", "import scipy.stats as stats\n", "import scipy.sparse as sp\n", "from sklearn import dummy, preprocessing, feature_selection, feature_extraction, metrics, model_selection\n", "from sklearn import tree, multiclass, svm, ensemble, gaussian_process, neighbors\n", "from sklearn.linear_model import LinearRegression\n", "# from point import Point\n", "\n", "from imblearn.combine import SMOTETomek # https://github.com/scikit-learn-contrib/imbalanced-learn\n", "\n", "# visualization\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "plt.style.use('ggplot')\n", "import seaborn as sns\n", "sns.set_style(\"white\")\n", "sns.set_palette(\"PuBuGn_d\")\n", "from matplotlib.ticker import FuncFormatter\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 1 - Data Preparation\n", "\n", "### 1.1 - Consolidating INTERMACS databases\n", "\n", "I extracted the records of patients who have received VAD 
implants and have since died or had the device explanted. Please see [`dataprep.py`](https://raw.githubusercontent.com/yang-qian/15-688-Final/master/dataprep.py) for the implementation of this part.\n", "\n", "- Combine four INTERMACS databases. For now, all columns are read as objects (str).\n", "    - ```patient_INTERMACS_Data_Dictionary.csv```\n", "    - ```device_INTERMACS_Data_Dictionary.csv```\n", "    - ```followup_INTERMACS_Data_Dictionary.csv```\n", "- Select patients who eventually received an artificial heart;\n", "- Focus on first-time implant patients;\n", "- Separate pre-implant and post-implant variables;\n", "- Set `INT_DEAD` (integer) as the label." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Integrated and imported 551 input columns, 118 output columns and 5821 observations.\n" ] }, { "data": { "text/html": [ "
\n", "| | PATIENT_ID | CC_ADVANCED_AGE_M | CC2_ADVANCED_AGE_M | CC_ALLOSENSITIZATION_M | CC2_ALLOSENSITIZATION_M | CC_CHRONIC_COAGULOPATHY | CC2_CHRONIC_COAGULOPATHY | CC_CHRONIC_INF_CONCERNS_M | CC2_CHRONIC_INF_CONCERNS_M | CC_CHRONIC_RENAL_DISEASE_M | ... | OP4EXPL | OP4EXPREA | OP4INTD | OP4INTR | OP4INTT | OP4REC | OP4TXPL | TRANSFER_CARE | TREC_PT | TXPL_PT |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 2 | 12 | Missing | No | Missing | No | Missing | Missing | Missing | No | Missing | ... | 0.0 | nan | 7.162364730299999 | 7.162364730299999 | 7.162364730299999 | 0.0 | 0.0 | nan | 0.0 | 0.0 |\n", "
| 3 | 12 | No | No | No | No | No | No | No | No | No | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |\n", "
| 4 | 12 | No | No | No | No | No | No | No | No | No | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |\n", "
| 5 | 12 | No | No | No | No | No | No | Yes | Yes | No | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |\n", "
| 6 | 13 | Missing | No | Missing | No | Missing | Missing | Missing | No | Missing | ... | nan | nan | nan | nan | nan | nan | nan | nan | 0.0 | 0.0 |\n", "
\n", "5 rows × 672 columns
\n", "