{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Homework" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "import math\n", "import warnings\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "from scipy import stats\n", "\n", "import pandas\n", "\n", "sepsis = Table.read_table('sepsis.csv')\n", "\n", "import statsmodels.formula.api as sfm\n", "\n", "# np.warnings was an unofficial alias that newer NumPy releases removed; use the standard warnings module\n", "warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)\n", "\n", "def boolean_to_binary(x):\n", " return int(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data we call `sepsis` was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Sepsis+survival+minimal+clinical+records) and was originally published [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0187990). It records whether patients admitted to the hospital with sepsis survived. The variable names are long but self-explanatory. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_years sex_0male_1female episode_number hospital_outcome_1alive_0dead
21 1 1 1
20 1 1 1
21 1 1 1
\n", "

... (110201 rows omitted)

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sepsis.show(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1** Run a multivariate logistic model to predict whether a person survives (1) or does not survive (0), using all the variables in the `sepsis` table.\n", "\n", "If you don't recall how to do this, review the Logistic Lab or the in-class notebook over logistic regression. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Prepare the data in the necessary format over the next few lines\n", "\n", "Age = sepsis.column(\"age_years\")\n", "Sex = sepsis.column(\"sex_0male_1female\")\n", "Episode = sepsis.column(\"episode_number\")\n", "Survived = sepsis.column(\"hospital_outcome_1alive_0dead\")\n", "\n", "\n", "## On the next few lines, only edit the names of the arrays and variables\n", "logistic_model_data = pandas.DataFrame({'Age': Age, \"Sex\": Sex, \"Episode\": Episode, \"Survived\": Survived})\n", "\n", "## Finish the analysis\n", "\n", "...\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2** One of the variables in the so-called \"complete model\" `logistic_model1` is not significant at the 5% level. Remove it, rerun the model and call that model `logistic_model_2`. Print the summary of this model. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "..." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 3** Based on this data and this model, which statement is true?\n", "\n", " a) An older person with sepsis is more likely to die than a younger person with it.\n", "\n", " b) A younger person with sepsis is more likely to die than an older person with it.\n", "\n", "(Keep in mind that Survived is coded as 1 in this data set.)\n", "\n", "Enter either a) or b) inside the quotes for the variable `age_risk`. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "age_risk = \" \"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 4** Based on this data and this model, which statement is true?\n", "\n", " a) A male with sepsis is more likely to die than a female with it.\n", "\n", " b) A female with sepsis is more likely to die than a male with it.\n", "\n", "(Keep in mind that male is coded as 0 and female is coded as 1 in this data set.)\n", "\n", "Enter either a) or b) inside the quotes for the variable `sex_risk`. " ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "sex_risk = \" \"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 5** Using a threshold of 0.8, construct the confusion matrix for `logistic_model_2`. To help, we got you started on formatting a table that can be used to create the confusion matrix. \n", "\n", "Use the variable `confusion_matrix` for the confusion matrix." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Actual 0 1
0 102 8003
1 435 101664
" ], "text/plain": [ "Actual | 0 | 1\n", "0 | 102 | 8003\n", "1 | 435 | 101664" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = Table().with_columns(\"Actual\", Survived, \"bool\", logistic_model_2.predict() >=0.8)\n", "predictions = predictions.with_column(\"Prediction\", predictions.apply(boolean_to_binary, \"bool\")).drop(\"bool\")\n", "\n", "## keep going to make the confusion matrix\n", "confusion_matrix = predictions.pivot(\"Prediction\", \"Actual\")\n", "confusion_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 6** Find the sensitivity for this model with a threshold of 0.8." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sensitivity = ...\n", "sensitivity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 7** Find the specificity for this model with a threshold of 0.8." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "specificity = ...\n", "specificity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 8** Find the positive predictive value, PPV, for this model with a threshold of 0.8." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ppv = ...\n", "ppv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 9** Find the negative predictive value, NPV, for this model with a threshold of 0.8." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "npv = ...\n", "npv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 10** Which of these four individuals does this model predict has the highest probability of surviving sepsis?\n", "\n", " a) A 20-year-old female\n", " b) A 20-year-old male\n", " c) A 40-year-old female\n", " d) A 40-year-old male\n", " \n", "In the cell below, replace the ellipsis (...) with a, b, c, or d. Leave no spaces between your letter choice and the quotation marks. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_chance = \"...\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary \n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The coefficients of logistic_model_1 are [ 5.63538281 -0.04438333 0.1779475 -0.0256846 ]\n", "The coefficients of logistic_model_2 are [ 5.59674657 -0.0443472 0.17996867]\n" ] } ], "source": [ "print(f\"The coefficients of logistic_model_1 are {logistic_model_1._results.params}\")\n", "print(f\"The coefficients of logistic_model_2 are {logistic_model_2._results.params}\")\n", "print(f\"My choice for age_risk was {age_risk}\")\n", "print(f\"My choice for sex_risk was {sex_risk}\")\n", "print(\"My second model's confusion matrix was:\")\n", "print(confusion_matrix)\n", "print(f\"This model has a sensitivity of {sensitivity}, a specificity of {specificity}, a PPV of {ppv} and an NPV of {npv}\")\n", "print(f\"My choice for best_chance was {best_chance}\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", 
"version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }