{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic Homework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "import math\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plots\n",
    "plots.style.use('fivethirtyeight')\n",
    "from scipy import stats\n",
    "\n",
    "import pandas\n",
    "\n",
    "sepsis = Table.read_table('sepsis.csv')\n",
    "\n",
    "import statsmodels.formula.api as sfm\n",
    "\n",
    "np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)\n",
    "\n",
    "def boolean_to_binary(x):\n",
    "    return int(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data we call `sepsis` was downloaded from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Sepsis+survival+minimal+clinical+records) and was originally published in [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0187990).  It contains information on whether patients admitted to the hospital suffering from sepsis survived.  The variable names are long, but also self-explanatory.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "    <thead>\n",
       "        <tr>\n",
       "            <th>age_years</th> <th>sex_0male_1female</th> <th>episode_number</th> <th>hospital_outcome_1alive_0dead</th>\n",
       "        </tr>\n",
       "    </thead>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <td>21       </td> <td>1                </td> <td>1             </td> <td>1                            </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>20       </td> <td>1                </td> <td>1             </td> <td>1                            </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>21       </td> <td>1                </td> <td>1             </td> <td>1                            </td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>\n",
       "<p>... (110201 rows omitted)</p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sepsis.show(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1** Run a multivariate logistic model to predict whether a person survives (1) or does not survive (0), using all the variables in the `sepsis` table.\n",
    "\n",
    "If you don't recall how to do this, review the Logistic Lab or the in-class notebook over logistic regression.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ellipsis"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Prepare the data in the necessary format over the next few lines\n",
    "\n",
    "Age = sepsis.column(\"age_years\")\n",
    "Sex = sepsis.column(\"sex_0male_1female\")\n",
    "Episode = sepsis.column(\"episode_number\")\n",
    "Survived = sepsis.column(\"hospital_outcome_1alive_0dead\")\n",
    "\n",
    "\n",
    "## On the next few lines, only edit the names of the arrays and variables\n",
    "logistic_model_data = pandas.DataFrame({'Age': Age, \"Sex\": Sex, \"Episode\": Episode, \"Survived\": Survived})\n",
    "\n",
    "## Finish the analysis\n",
    "\n",
    "...\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2** One of the variables in the so-called \"complete model\" `logistic_model1` is not significant at the 5% level.  Remove it, rerun the model and call that model `logistic_model_2`.  Print the summary of this model.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ellipsis"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 3**  Based on this data and this model, which statement is true?\n",
    "\n",
    "    a) An older person with sepsis is more likely to die than a younger person with it.\n",
    "\n",
    "    b) A younger person with sepsis is more likely to die than an older person with it.\n",
    "\n",
    "(Keep in mind that Survived is coded as 1 in this data set.)\n",
    "\n",
    "Enter either a) or b) inside the quotes for the variable `age_risk`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "age_risk = \" \""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 4** Based on this data and this model, which statement is true?\n",
    "\n",
    "    a) A male with sepsis is more likely to die than a female with it.\n",
    "\n",
    "    b) A female with sepsis is more likely to die than a male with it.\n",
    "\n",
    "(Keep in mind that male is coded as 0 and female is coded as 1 in this data set.)\n",
    "\n",
    "Enter either a) or b) inside the quotes for the variable `sex_risk`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "sex_risk = \" \""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 5** Using a threshold of 0.8, construct the confusion matrix for `logistic_model_2`.  To help, we got you started on formatting a table that can be used to create the confusion matrix.  \n",
    "\n",
    "Use the variable `confusion_matrix` for the confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "    <thead>\n",
       "        <tr>\n",
       "            <th>Actual</th> <th>0</th> <th>1</th>\n",
       "        </tr>\n",
       "    </thead>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <td>0     </td> <td>102 </td> <td>8003  </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>1     </td> <td>435 </td> <td>101664</td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "Actual | 0    | 1\n",
       "0      | 102  | 8003\n",
       "1      | 435  | 101664"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions = Table().with_columns(\"Actual\", Survived, \"bool\", logistic_model_2.predict() >=0.8)\n",
    "predictions = predictions.with_column(\"Prediction\", predictions.apply(boolean_to_binary, \"bool\")).drop(\"bool\")\n",
    "\n",
    "## keep going to make the confusion matrix\n",
    "confusion_matrix = predictions.pivot(\"Prediction\", \"Actual\")\n",
    "confusion_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 6** Find the sensitivity for this model with a threshold of 0.8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ellipsis"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sensitivity = ...\n",
    "sensitivity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 7** Find the specificity for this model with a threshold of 0.8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ellipsis"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "specificity = ...\n",
    "specificity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 8** Find the positive predictive value, PPV, for this model with a threshold of 0.8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppv = ...\n",
    "ppv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 9** Find the negative predictive value, NPV, for this model with a threshold of 0.8."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "npv = ...\n",
    "npv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 10** Which of these four individuals does this model predict has the highest probability of surviving sepsis?\n",
    "\n",
    "    a) A 20 year old female\n",
    "    b) A 20 year old male\n",
    "    c) A 40 year old female\n",
    "    d) A 40 year old male\n",
    "    \n",
    "In the cell below, replace the ellipse (...) with either a, b, c or d.  Leave no spaces between you letter choice and the quotation marks.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_chance = \"...\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary  \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The coefficients of logistic_model_1 are [ 5.63538281 -0.04438333  0.1779475  -0.0256846 ]\n",
      "The coefficients of logistic_model_2 are [ 5.59674657 -0.0443472   0.17996867]\n"
     ]
    }
   ],
   "source": [
    "print(f\"The coefficients of logistic_model_1 are {logistic_model_1._results.params}\")\n",
    "print(f\"The coefficients of logistic_model_2 are {logistic_model_2._results.params}\")\n",
    "print(f\"My choice for age_risk was {age_risk}\")\n",
    "print(f\"My choice for sex_risk was {sex_risk}\")\n",
    "print(\"My second models confusion matrix was:\")\n",
    "print(confusion_matrix)\n",
    "print(f\"This model has a sensitivity of {sensitivity}, a specificity of {specificity}, a PPV of {ppv} and a NPV of {npv}\")\n",
    "print(f\"My choice for best_chance was {best_chance}\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}