{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build and Evaluate a Linear Risk Model\n", "\n", "Welcome to the first assignment in Course 2!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "- [1. Import Packages](#1)\n", "- [2. Load Data](#2)\n", "- [3. Explore the Dataset](#3)\n", "- [4. Mean-Normalize the Data](#4)\n", " - [Exercise 1](#Ex-1)\n", "- [5. Build the Model](#Ex-2)\n", " - [Exercise 2](#Ex-2)\n", "- [6. Evaluate the Model Using the C-Index](#6)\n", " - [Exercise 3](#Ex-3)\n", "- [7. Evaluate the Model on the Test Set](#7)\n", "- [8. Improve the Model](#8)\n", " - [Exercise 4](#Ex-4)\n", "- [9. Evaluate the Improved Model](#9)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DU20mFeib5Kd" }, "source": [ "## Overview of the Assignment\n", "\n", "In this assignment, you'll build a risk score model for retinopathy in diabetes patients using logistic regression.\n", "\n", "As we develop the model, we will learn about the following topics:\n", "\n", "- Data preprocessing\n", " - Log transformations\n", " - Standardization\n", "- Basic Risk Models\n", " - Logistic Regression\n", " - C-index\n", " - Interaction Terms\n", " \n", "### Diabetic Retinopathy\n", "Retinopathy is an eye condition that causes changes to the blood vessels in the part of the eye called the retina.\n", "This often leads to vision changes or blindness.\n", "Diabetic patients are known to be at high risk for retinopathy. \n", " \n", "### Logistic Regression \n", "Logistic regression is an appropriate analysis to use for predicting the probability of a binary outcome. In our case, this would be the probability of having or not having diabetic retinopathy.\n", "Logistic Regression is one of the most commonly used algorithms for binary classification. 
It is used to find the best fitting model to describe the relationship between a set of features (also referred to as input, independent, predictor, or explanatory variables) and a binary outcome label (also referred to as an output, dependent, or response variable). Logistic regression has the property that the output prediction is always in the range $[0,1]$. Sometimes this output is used to represent a probability from 0% to 100%, but for straight binary classification, the output is converted to either $0$ or $1$ depending on whether it is below or above a certain threshold, usually $0.5$.\n", "\n", "It may be confusing that the term regression appears in the name even though logistic regression is actually a classification algorithm, but that's just a name it was given for historical reasons." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pzuRKOt1cU8B" }, "source": [ "\n", "## 1. Import Packages\n", "\n", "We'll first import all the packages that we need for this assignment. \n", "\n", "- `numpy` is the fundamental package for scientific computing in Python.\n", "- `pandas` is what we'll use to manipulate our data.\n", "- `matplotlib` is a plotting library." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "qHjB-KVmwmtR" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3J7NXuQadLnY" }, "source": [ "\n", "## 2. Load Data\n", "\n", "First we will load in the dataset that we will use for training and testing our model.\n", "\n", "- Run the next cell to load the data that is stored in CSV files.\n", "- There is a function `load_data` which randomly generates data, but for consistency, please use the data from the CSV files." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "FN5Y5hU5yXnE" }, "outputs": [], "source": [ "from utils import load_data\n", "\n", "# This function creates randomly generated data\n", "# X, y = load_data(6000)\n", "\n", "# For stability, load data from files that were generated using load_data\n", "X = pd.read_csv('X_data.csv',index_col=0)\n", "y_df = pd.read_csv('y_data.csv',index_col=0)\n", "y = y_df['y']" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "5yF06E6sZMmD" }, "source": [ "`X` is a pandas DataFrame and `y` is a pandas Series; together they hold the data for 6,000 diabetic patients. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Explore the Dataset\n", "\n", "The features (`X`) include the following fields:\n", "* Age: (years)\n", "* Systolic_BP: Systolic blood pressure (mmHg)\n", "* Diastolic_BP: Diastolic blood pressure (mmHg)\n", "* Cholesterol: (mg/dL)\n", " \n", "We can use the `head()` method to display the first few records of each. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "colab_type": "code", "id": "qp1SgI7PT024", "outputId": "3ff454c2-65fb-4fea-858a-647c7a5d750d" }, "outputs": [ { "data": { "text/html": [ "
|   | Age | Systolic_BP | Diastolic_BP | Cholesterol |
|---|---|---|---|---|
| 0 | 77.196340 | 85.288742 | 80.021878 | 79.957109 |
| 1 | 63.529850 | 99.379736 | 84.852361 | 110.382411 |
| 2 | 69.003986 | 111.349455 | 109.850616 | 100.828246 |
| 3 | 82.638210 | 95.056128 | 79.666851 | 87.066303 |
| 4 | 78.346286 | 109.154591 | 90.713220 | 92.511770 |

mean and std functions. Note that in order to apply an aggregation function separately for each row or each column, you'll set the axis parameter to either 0 or 1. One produces the aggregation along columns and the other along rows, but it is easy to get them confused. So experiment with each option below to see which one you should use to get an average for each column in the dataframe.\n",
"\n",
"avg = df.mean(axis=0)\n",
"avg = df.mean(axis=1) \n",
"\n",
" \n", "
sklearn.linear_model.LogisticRegression class. If you get a warning about the solver parameter, set it explicitly with solver='lbfgs'. \n",
"