{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Ok1vxsLqqw3w" }, "source": [ "# Estimating Treatment Effect Using Machine Learning" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "B16h5bb8eFmw" }, "source": [ "Welcome to the first assignment of **AI for Medical Treatment**!\n", "\n", "You will be using different methods to evaluate the results of a [randomized control trial](https://en.wikipedia.org/wiki/Randomized_controlled_trial) (RCT).\n", "\n", "**You will learn:**\n", "- How to analyze data from a randomized control trial using both:\n", " - traditional statistical methods\n", " - and the more recent machine learning techniques\n", "- Interpreting Multivariate Models\n", " - Quantifying treatment effect\n", " - Calculating baseline risk\n", " - Calculating predicted risk reduction\n", "- Evaluating Treatment Effect Models\n", " - Comparing predicted and empirical risk reductions\n", " - Computing C-statistic-for-benefit\n", "- Interpreting ML models for Treatment Effect Estimation\n", " - Implement T-learner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This assignment covers the following topics:\n", "\n", "- [1. Dataset](#1)\n", " - [1.1 Why RCT?](#1-1)\n", " - [1.2 Data Processing](#1-2)\n", " - [Exercise 1](#ex-01)\n", " - [Exercise 2](#ex-02)\n", "- [2. Modeling Treatment Effect](#2)\n", " - [2.1 Constant Treatment Effect](#2-1)\n", " - [Exercise 3](#ex-03)\n", " - [2.2 Absolute Risk Reduction](#2-2)\n", " - [Exercise 4](#ex-04)\n", " - [2.3 Model Limitations](#2-3)\n", " - [Exercise 5](#ex-05)\n", " - [Exercise 6](#ex-06)\n", "- [3. Evaluation Metric](#3)\n", " - [3.1 C-statistic-for-benefit](#3-1)\n", " - [Exercise 7](#ex-07)\n", " - [Exercise 8](#ex-08)\n", "- [4. Machine Learning Approaches](#4)\n", " - [4.1 T-Learner](#4-1)\n", " - [Exercise 9](#ex-09)\n", " - [Exercise 10](#ex-10)\n", " - [Exercise 11](#ex-11)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Tklnk8tneq2U" }, "source": [ "## Packages\n", "\n", "We'll first import all the packages that we need for this assignment. \n", "\n", "\n", "- `pandas` is what we'll use to manipulate our data\n", "- `numpy` is a library for mathematical and scientific operations\n", "- `matplotlib` is a plotting library\n", "- `sklearn` contains a lot of efficient tools for machine learning and statistical modeling\n", "- `random` allows us to generate random numbers in Python\n", "- `lifelines` is an open-source library that implements the c-statistic\n", "- `itertools` will help us with hyperparameter search\n", "\n", "## Import Packages\n", "\n", "Run the next cell to import all the necessary packages, dependencies and custom util functions." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:29:41.602385Z", "start_time": "2020-04-04T15:29:39.274097Z" }, "colab": {}, "colab_type": "code", "id": "Z5zOXfAIH-41" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import sklearn\n", "import random\n", "import lifelines\n", "import itertools\n", "\n", "plt.rcParams['figure.figsize'] = [10, 7]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "pVEHJZ79mvQx" }, "source": [ "\n", "## 1 Dataset\n", "\n", "### 1.1 Why RCT?\n", "\n", "In this assignment, we'll be examining data from an RCT, measuring the effect of a particular drug combination on colon cancer. Specifically, we'll be looking at the effect of [Levamisole](https://en.wikipedia.org/wiki/Levamisole) and [Fluorouracil](https://en.wikipedia.org/wiki/Fluorouracil) on patients who have had surgery to remove their colon cancer. 
After surgery, the curability of the patient depends on the remaining residual cancer. In this study, it was found that this particular drug combination had a clear beneficial effect, when compared with [Chemotherapy](https://en.wikipedia.org/wiki/Chemotherapy). \n", "\n", "### 1.2 Data Processing\n", "In this first section, we will load in the dataset and calculate basic statistics. Run the next cell to load the dataset. We also do some preprocessing to convert categorical features to one-hot representations." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:29:41.612018Z", "start_time": "2020-04-04T15:29:41.602385Z" }, "colab": {}, "colab_type": "code", "id": "QOV_BJGyLtjR" }, "outputs": [], "source": [ "data = pd.read_csv(\"levamisole_data.csv\", index_col=0)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "RlqE8036sj3y" }, "source": [ "Let's look at our data to familiarize ourselves with the various fields. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:29:45.698204Z", "start_time": "2020-04-04T15:29:45.677460Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 221 }, "colab_type": "code", "id": "RPS1stb7si4N", "outputId": "a64b50c6-5df2-467a-abee-0d73f82d7825" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data Dimensions: (607, 14)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr><th></th><th>sex</th><th>age</th><th>obstruct</th><th>perfor</th><th>adhere</th><th>nodes</th><th>node4</th><th>outcome</th><th>TRTMT</th><th>differ_2.0</th><th>differ_3.0</th><th>extent_2</th><th>extent_3</th><th>extent_4</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>1</th><td>1</td><td>43</td><td>0</td><td>0</td><td>0</td><td>5.0</td><td>1</td><td>1</td><td>True</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>2</th><td>1</td><td>63</td><td>0</td><td>0</td><td>0</td><td>1.0</td><td>0</td><td>0</td><td>True</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>3</th><td>0</td><td>71</td><td>0</td><td>0</td><td>1</td><td>7.0</td><td>1</td><td>1</td><td>False</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>4</th><td>0</td><td>66</td><td>1</td><td>0</td><td>0</td><td>6.0</td><td>1</td><td>1</td><td>True</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>5</th><td>1</td><td>69</td><td>0</td><td>0</td><td>0</td><td>22.0</td><td>1</td><td>1</td><td>False</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   sex  age  obstruct  perfor  adhere  nodes  node4  outcome  TRTMT  \\\n", "1    1   43         0       0       0    5.0      1        1   True   \n", "2    1   63         0       0       0    1.0      0        0   True   \n", "3    0   71         0       0       1    7.0      1        1  False   \n", "4    0   66         1       0       0    6.0      1        1   True   \n", "5    1   69         0       0       0   22.0      1        1  False   \n", "\n", "   differ_2.0  differ_3.0  extent_2  extent_3  extent_4  \n", "1           1           0         0         1         0  \n", "2           1           0         0         1         0  \n", "3           1           0         1         0         0  \n", "4           1           0         0         1         0  \n", "5           1           0         0         1         0  " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(f\"Data Dimensions: {data.shape}\")\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ctvm6IEhauEd" }, "source": [ "Below is a description of all the fields (one-hot means a different field for each level):\n", "- `sex (binary): 1 if Male, 0 otherwise`\n", "- `age (int): age of patient at start of the study`\n", "- `obstruct (binary): obstruction of colon by tumor`\n", "- `perfor (binary): perforation of colon`\n", "- `adhere (binary): adherence to nearby organs`\n", "- `nodes (int): number of lymph nodes with detectable cancer`\n", "- `node4 (binary): more than 4 positive lymph nodes`\n", "- `outcome (binary): 1 if died within 5 years`\n", "- `TRTMT (binary): treated with levamisole + fluorouracil`\n", "- `differ (one-hot): differentiation of tumor`\n", "- `extent (one-hot): extent of local spread`" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "WTfGBXTOsq06" }, "source": [ "In particular, pay attention to the `TRTMT` and `outcome` columns. Our primary endpoint for our analysis will be the 5-year survival rate, which is captured in the `outcome` variable (`outcome` is 1 if the patient died within 5 years)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Mz2uT46QMQPc" }, "source": [ "\n", "### Exercise 01\n", "\n", "Since this is an RCT, the treatment column is randomized. 
Let's warm up by finding what the treatment probability is.\n", "\n", "$$p_{treatment} = \\frac{n_{treatment}}{n}$$\n", "\n", "- $n_{treatment}$ is the number of patients where `TRTMT = True`\n", "- $n$ is the total number of patients." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "colab_type": "code", "id": "WKpz5E_CLKQy", "outputId": "5fb60465-d681-4fc4-ae67-1dd0baa8158d" }, "outputs": [], "source": [ "# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def proportion_treated(df):\n", " \"\"\"\n", " Compute proportion of trial participants who have been treated\n", "\n", " Args:\n", " df (dataframe): dataframe containing trial results. Column\n", " 'TRTMT' is 1 if patient was treated, 0 otherwise.\n", " \n", " Returns:\n", " result (float): proportion of patients who were treated\n", " \"\"\"\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "\n", " proportion = sum(df.TRTMT==1)/len(df.TRTMT)\n", " \n", " ### END CODE HERE ###\n", "\n", " return proportion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test Case**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dataframe:\n", "\n", " outcome TRTMT\n", "0 0 0\n", "1 1 1\n", "2 1 1\n", "3 1 1\n", "\n", "\n", "Proportion of patient treated: computed 0.75, expected: 0.75\n" ] } ], "source": [ "print(\"dataframe:\\n\")\n", "example_df = pd.DataFrame(data =[[0, 0],\n", " [1, 1], \n", " [1, 1],\n", " [1, 1]], columns = ['outcome', 'TRTMT'])\n", "print(example_df)\n", "print(\"\\n\")\n", "treated_proportion = proportion_treated(example_df)\n", "print(f\"Proportion of patient treated: computed {treated_proportion}, expected: 0.75\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "BtHs90CWLinQ" }, "source": [ "Next let's run it on our trial data." 
 ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:00.222152Z", "start_time": "2020-04-04T15:30:00.219183Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "Oz9j9egVLh2k", "outputId": "3a2ce4a7-4747-4bce-efe1-f73bb8304910" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion Treated: 0.49093904448105435 ~ 49%\n" ] } ], "source": [ "p = proportion_treated(data)\n", "print(f\"Proportion Treated: {p} ~ {int(p*100)}%\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DWvZ4Qvun8p1" }, "source": [ "\n", "### Exercise 02\n", "\n", "Next, we can get a preliminary sense of the results by computing the empirical 5-year death probability for the treated arm versus the control arm. \n", "\n", "The probability of dying for patients who received the treatment is:\n", "\n", "$$p_{\\text{treatment, death}} = \\frac{n_{\\text{treatment,death}}}{n_{\\text{treatment}}}$$\n", "\n", "- $n_{\\text{treatment,death}}$ is the number of patients who received the treatment and died.\n", "- $n_{\\text{treatment}}$ is the number of patients who received treatment.\n", "\n", "The probability of dying for patients in the control group (who did not receive treatment) is:\n", "\n", "$$p_{\\text{control, death}} = \\frac{n_{\\text{control,death}}}{n_{\\text{control}}}$$\n", "- $n_{\\text{control,death}}$ is the number of patients in the control group (did not receive the treatment) who died.\n", "- $n_{\\text{control}}$ is the number of patients in the control group (did not receive treatment).\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 221 }, "colab_type": "code", "id": "etNHvX3AKleg", "outputId": "758c295e-9556-4314-e83e-c2062ee660ce" }, "outputs": [], "source": [ "# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def event_rate(df):\n", " '''\n", 
" Compute empirical rate of death within 5 years\n", " for treated and untreated groups.\n", "\n", " Args:\n", " df (dataframe): dataframe containing trial results. \n", " 'TRTMT' column is 1 if patient was treated, 0 otherwise. \n", " 'outcome' column is 1 if patient died within 5 years, 0 otherwise.\n", " \n", " Returns:\n", " treated_prob (float): empirical probability of death given treatment\n", " untreated_prob (float): empirical probability of death given control\n", " '''\n", " \n", " treated_prob = 0.0\n", " control_prob = 0.0\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", " \n", " treated_prob = sum((df.TRTMT == 1) & (df.outcome == 1)) / sum((df.TRTMT == 1))\n", " control_prob = sum((df.TRTMT == 0) & (df.outcome == 1)) / sum((df.TRTMT == 0))\n", " \n", " ### END CODE HERE ###\n", "\n", " return treated_prob, control_prob" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test Case**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEST CASE\n", "dataframe:\n", "\n", " outcome TRTMT\n", "0 0 1\n", "1 1 1\n", "2 1 1\n", "3 0 1\n", "4 1 0\n", "5 1 0\n", "6 1 0\n", "7 0 0\n", "\n", "\n", "Treated 5-year death rate, expected: 0.5, got: 0.5000\n", "Control 5-year death rate, expected: 0.75, got: 0.7500\n" ] } ], "source": [ "print(\"TEST CASE\\ndataframe:\\n\")\n", "example_df = pd.DataFrame(data =[[0, 1],\n", " [1, 1], \n", " [1, 1],\n", " [0, 1],\n", " [1, 0],\n", " [1, 0],\n", " [1, 0],\n", " [0, 0]], columns = ['outcome', 'TRTMT'])\n", "#print(\"dataframe:\\n\")\n", "print(example_df)\n", "print(\"\\n\")\n", "treated_prob, control_prob = event_rate(example_df)\n", "print(f\"Treated 5-year death rate, expected: 0.5, got: {treated_prob:.4f}\")\n", "print(f\"Control 5-year death rate, expected: 0.75, got: {control_prob:.4f}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ShpX6ABSV_Pd" }, 
"source": [ "Now let's try the function on the real data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:17.279595Z", "start_time": "2020-04-04T15:30:17.273594Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "7rw2yKymV-WD", "outputId": "9daebe7b-d0d1-4654-d3d1-764312b598d2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Death rate for treated patients: 0.3725 ~ 37%\n", "Death rate for untreated patients: 0.4822 ~ 48%\n" ] } ], "source": [ "treated_prob, control_prob = event_rate(data)\n", "\n", "print(f\"Death rate for treated patients: {treated_prob:.4f} ~ {int(treated_prob*100)}%\")\n", "print(f\"Death rate for untreated patients: {control_prob:.4f} ~ {int(control_prob*100)}%\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yoTzaBUorB-3" }, "source": [ "On average, it seemed like treatment had a positive effect. \n", "\n", "#### Sanity checks\n", "It's important to compute these basic summary statistics as a sanity check for more complex models later on. If they strongly disagree with these robust summaries and there isn't a good reason, then there might be a bug. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fywUHcbRnsQZ" }, "source": [ "### Train test split\n", "\n", "We'll now try to quantify the impact more precisely using statistical models. Before we get started fitting models to analyze the data, let's split it using the `train_test_split` function from `sklearn`. While a hold-out test set isn't required for logistic regression, it will be useful for comparing its performance to the ML models later on." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:22.034397Z", "start_time": "2020-04-04T15:30:21.942443Z" }, "colab": {}, "colab_type": "code", "id": "FUBvTfF0mQuH" }, "outputs": [], "source": [ "# As usual, split into dev and test set\n", "from sklearn.model_selection import train_test_split\n", "np.random.seed(18)\n", "random.seed(1)\n", "\n", "data = data.dropna(axis=0)\n", "y = data.outcome\n", "# notice we are dropping a column here. Now our total columns will be 1 less than before\n", "X = data.drop('outcome', axis=1) \n", "X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size = 0.25, random_state=0)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:23.071470Z", "start_time": "2020-04-04T15:30:23.068473Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "6EeBLbfeFVnk", "outputId": "bd02e605-335a-4007-f1c0-46906dc0522c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dev set shape: (455, 13)\n", "test set shape: (152, 13)\n" ] } ], "source": [ "print(f\"dev set shape: {X_dev.shape}\")\n", "print(f\"test set shape: {X_test.shape}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "2c8mLTMQEZxD" }, "source": [ "\n", "## 2 Modeling Treatment Effect" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QxHy4RGA0Goi" }, "source": [ "\n", "### 2.1 Constant Treatment Effect\n", "\n", "First, we will model the treatment effect using a standard logistic regression. If $x^{(i)}$ is the input vector, then this models the probability of death within 5 years as \n", "$$\\sigma(\\theta^T x^{(i)}) = \\frac{1}{1 + exp(-\\theta^T x^{(i)})},$$\n", "\n", "where $ \\theta^T x^{(i)} = \\sum_{j} \\theta_j x^{(i)}_j$ is an inner product. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if we have three features, $TRTMT$, $AGE$, and $SEX$, then our probability of death would be written as: \n", "\n", "$$\\sigma(\\theta^T x^{(i)}) = \\frac{1}{1 + exp(-\\theta_{TRTMT} x^{(i)}_{TRTMT} - \\theta_{AGE}x_{AGE}^{(i)} - \\theta_{SEX}x^{(i)}_{SEX})}.$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to look at logistic regression is as a linear model for the \"logit\" function, or \"log odds\": \n", "\n", "$$logit(p) = \\log \\left(\\frac{p}{1-p} \\right)= \\theta^T x^{(i)}$$\n", "\n", "- \"Odds\" is defined as the probability of an event divided by the probability of not having the event: $\\frac{p}{1-p}$. \n", "\n", "- \"Log odds\", or \"logit\" function, is the natural log of the odds: $\\log \\left(\\frac{p}{1-p} \\right)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, $x^{(i)}_{TRTMT}$ is the treatment variable. Therefore, $\\theta_{TRTMT}$ tells you what the effect of treatment is. If $\\theta_{TRTMT}$ is negative, then having treatment reduces the log-odds of death, which means death is less likely than if you did not have treatment. \n", "\n", "Note that this assumes a constant relative treatment effect, since the impact of treatment does not depend on any other covariates. \n", "\n", "Typically, a randomized control trial (RCT) will seek to establish a negative $\\theta_{TRTMT}$ (because the treatment is intended to reduce risk of death), which corresponds to an odds ratio of less than 1.\n", "\n", "An odds ratio of less than one implies that the odds of death with treatment are lower than the odds of death without treatment. It does not, on its own, imply that death is less likely than survival.\n", "\n", "$$ \\frac{Odds_{treatment}}{Odds_{baseline}} < 1 \\rightarrow Odds_{treatment} < Odds_{baseline}$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the next cell to fit your logistic regression model. 
\n", "\n", "You can use the entire dev set (and do not need to reserve a separate validation set) because this model needs no hyperparameter tuning." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:26.980302Z", "start_time": "2020-04-04T15:30:26.884988Z" }, "colab": {}, "colab_type": "code", "id": "U-2hcHYycgFJ" }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000).fit(X_dev, y_dev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the Odds ratio\n", "\n", "You are interested in finding the odds for treatment relative to the odds for the baseline.\n", "\n", "$$ OddsRatio = \\frac{Odds_{treatment}}{Odds_{baseline}}$$\n", "\n", "where\n", "$$Odds_{treatment} = \\frac{p_{treatment}}{1-p_{treatment}}$$\n", "\n", "and \n", "\n", "$$Odds_{baseline} = \\frac{p_{baseline}}{1-p_{baseline}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the expression\n", "\n", "$$\\log \\left(\\frac{p}{1-p} \\right)= \\theta^T x^{(i)} = \\theta_{treatment} \\times x_{treatment}^{(i)} + \\theta_{age} \\times x_{age}^{(i)} + \\cdots$$\n", "\n", "Here, \"$\\theta_{age} \\times x_{age}^{(i)} + \\cdots$\" stands for all the other coefficients and feature variables, that is, everything except the treatment terms $\\theta_{treatment}$ and $x_{treatment}^{(i)}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Treatment\n", "To denote that the patient received treatment, we set $x_{treatment}^{(i)} = 1$. 
This means the log odds for a treated patient are:\n", "\n", "$$ \\log( Odds_{treatment}) = \\log \\left(\\frac{p_{treatment}}{1-p_{treatment}} \\right) = \\theta_{treatment} \\times 1 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots$$\n", "\n", "To get odds from log odds, use exponentiation (raise $e$ to the power of the log odds), which is the inverse of the natural log.\n", "\n", "$$Odds_{treatment} = e^{\\log( Odds_{treatment})} = \\left(\\frac{p_{treatment}}{1-p_{treatment}} \\right) = e^{\\theta_{treatment} \\times 1 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Control (baseline)\n", "\n", "Similarly, when the patient has no treatment, this is denoted by $x_{treatment}^{(i)} = 0$. So the log odds for the untreated patient are:\n", "\n", "$$\\log(Odds_{baseline}) = \\log \\left(\\frac{p_{baseline}}{1-p_{baseline}} \\right) = \\theta_{treatment} \\times 0 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots$$\n", "\n", "$$ = 0 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots$$\n", "\n", "To get odds from log odds, use exponentiation (raise $e$ to the power of the log odds), which is the inverse of the natural log.\n", "\n", "$$Odds_{baseline} = e^{\\log(Odds_{baseline})} = \\left(\\frac{p_{baseline}}{1-p_{baseline}} \\right) = e^{0 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots}$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Odds Ratio\n", "\n", "The Odds ratio is:\n", "\n", "$$ OddsRatio = \\frac{Odds_{treatment}}{Odds_{baseline}}$$\n", "\n", "Doing some substitution:\n", "\n", "$$ OddsRatio = \\frac{e^{\\theta_{treatment} \\times 1 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots}}{e^{0 + \\theta_{age} \\times x_{age}^{(i)} + \\cdots}}$$\n", "\n", "Notice that $e^{\\theta_{age} \\times x_{age}^{(i)} + \\cdots}$ cancels on top and bottom, so that:\n", "\n", "$$ OddsRatio = \\frac{e^{\\theta_{treatment} \\times 1}}{e^{0}}$$\n", "\n", "Since $e^{0} = 1$, this simplifies to:\n", "\n", "$$ OddsRatio = 
e^{\\theta_{treatment}}$$" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JVUl6hTRzA-w" }, "source": [ "\n", "### Exercise 03: Extract the treatment effect\n", "\n", "Complete the `extract_treatment_effect` function to extract $\\theta_{treatment}$ and then calculate the odds ratio of treatment from the logistic regression model." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:29.151352Z", "start_time": "2020-04-04T15:30:29.146349Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 51 }, "colab_type": "code", "id": "vePgJgTWeclb", "outputId": "6517a03a-63b0-4780-d89e-979de53e86cd" }, "outputs": [], "source": [ "# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def extract_treatment_effect(lr, data):\n", " theta_TRTMT = 0.0\n", " TRTMT_OR = 0.0\n", " coeffs = {data.columns[i]:lr.coef_[0][i] for i in range(len(data.columns))}\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", " \n", " # get the treatment coefficient\n", " theta_TRTMT = coeffs['TRTMT']\n", " \n", " # calculate the Odds ratio for treatment\n", " TRTMT_OR = np.exp(theta_TRTMT)\n", " \n", " ### END CODE HERE ###\n", " return theta_TRTMT, TRTMT_OR\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Theta_TRTMT: -0.2885\n", "Treatment Odds Ratio: 0.7494\n" ] } ], "source": [ "# Test extract_treatment_effect function\n", "theta_TRTMT, trtmt_OR = extract_treatment_effect(lr, X_dev)\n", "print(f\"Theta_TRTMT: {theta_TRTMT:.4f}\")\n", "print(f\"Treatment Odds Ratio: {trtmt_OR:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Expected Output\n", "\n", "```CPP\n", "Theta_TRTMT: -0.2885\n", "Treatment Odds Ratio: 0.7494\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type":
"text", "id": "clf289SQtTzV" }, "source": [ "Based on this model, it seems that the treatment has a beneficial effect. \n", "- The $\\theta_{treatment} = -0.29$ is a negative value, meaning that it has the effect of reducing risk of death.\n", "- In the code above, the $OddsRatio$ is stored in the variable `TRTMT_OR`.\n", "- The $OddsRatio = 0.75$, which is less than 1. \n", "\n", "\n", "You can think of the $OddsRatio$ as a factor that multiplies the baseline odds $Odds_{baseline}$ to give the $Odds_{treatment}$. In other words, the Odds Ratio acts as a conversion factor between baseline odds and treatment odds.\n", "\n", "$$Odds_{treatment} = OddsRatio \\times Odds_{baseline}$$\n", "\n", "In this case:\n", "\n", "$$Odds_{treatment} = 0.75 \\times Odds_{baseline}$$\n", "\n", "So you can interpret this to mean that the treatment reduces the odds of death by $(1 - OddsRatio) = 1 - 0.75 = 0.25$, or about 25%.\n", "\n", "You will see how well this model fits the data in the next few sections." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kgv-HoPGsBP-" }, "source": [ "\n", "### 2.2 Absolute Risk Reduction" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hVhcO3t2yj-4" }, "source": [ "\n", "### Exercise 4: Calculate ARR\n", "\n", "A valuable quantity is the absolute risk reduction (ARR) of a treatment. If $p_{baseline}$ is the baseline probability of death, and $p_{treatment}$ is the probability of death if treated, then \n", "$$ARR = p_{baseline} - p_{treatment} $$\n", "\n", "In the case of logistic regression, here is how ARR can be computed: \n", "Recall that the Odds Ratio is defined as:\n", "\n", "$$OR = Odds_{treatment} / Odds_{baseline}$$\n", "\n", "where the \"odds\" is the probability of the event over the probability of not having the event, or $p/(1-p)$. 
\n", "\n", "$$Odds_{treatment} = \\frac{p_{treatment}}{1- p_{treatment}}$$\n", "and\n", "$$Odds_{baseline} = \\frac{p_{baseline}}{1- p_{baseline}}$$\n", "\n", "In the function below, compute the predicted absolute risk reduction (ARR) given\n", "- the odds ratio for treatment \"$OR$\", and\n", "- the baseline risk of an individual $p_{baseline}$\n", "\n", "If you get stuck, try reviewing the level 1 hints by clicking on the cell \"Hints Level 1\". If you would like more help, please try viewing \"Hints Level 2\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints Level 1\n", "\n", "

\n", "

\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints Level 2\n", "\n", "

\n", "

\n", "

" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 119 }, "colab_type": "code", "id": "CCCmR2lQjDzs", "outputId": "177ff01a-d39a-4a69-ac3a-df0b71588019" }, "outputs": [], "source": [ "# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def OR_to_ARR(p, OR):\n", " \"\"\"\n", " Compute ARR for treatment for individuals given\n", " baseline risk and odds ratio of treatment.\n", "\n", " Args:\n", " p (float): baseline probability of risk (without treatment)\n", " OR (float): odds ratio of treatment versus baseline\n", "\n", " Returns:\n", " ARR (float): absolute risk reduction for treatment \n", " \"\"\"\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "\n", " # compute baseline odds from p\n", " odds_baseline = p/(1-p)\n", "\n", " # compute odds of treatment using odds ratio\n", " odds_trtmt = OR*odds_baseline\n", "\n", " # compute new probability of death from treatment odds\n", " p_trtmt = odds_trtmt/(1+odds_trtmt)\n", "\n", " # compute ARR using treated probability and baseline probability \n", " ARR = p - p_trtmt\n", " \n", " ### END CODE HERE ###\n", " \n", " return ARR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test Case**" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEST CASES\n", "baseline p: 0.75, OR: 0.5\n", "Output: 0.1500, Expected: 0.15\n", "\n", "baseline p: 0.04, OR: 1.2\n", "Output: -0.0076, Expected: -0.0076\n" ] } ], "source": [ "print(\"TEST CASES\")\n", "test_p, test_OR = (0.75, 0.5)\n", "print(f\"baseline p: {test_p}, OR: {test_OR}\")\n", "print(f\"Output: {OR_to_ARR(test_p, test_OR):.4f}, Expected: {0.15}\\n\")\n", "\n", "test_p, test_OR = (0.04, 1.2)\n", "print(f\"baseline p: {test_p}, OR: {test_OR}\")\n", "print(f\"Output: {OR_to_ARR(test_p, test_OR):.4f}, Expected: {-0.0076}\")" ] }, { "cell_type": "markdown", "metadata": 
{ "colab_type": "text", "id": "LLxmh1h92FFe" }, "source": [ "#### Visualize the treatment effect as baseline risk varies\n", "\n", "The logistic regression model assumes that treatment has a constant effect in terms of odds ratio and is independent of other covariates. \n", "\n", "However, this does not mean that absolute risk reduction is the same at every baseline risk $\\hat{p}$. To illustrate this, we can plot absolute risk reduction as a function of baseline predicted risk $\\hat{p}$. \n", "\n", "Run the next cell to see the relationship between ARR and baseline risk for the logistic regression model." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:45.443881Z", "start_time": "2020-04-04T15:30:45.270615Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 458 }, "colab_type": "code", "id": "eQdG21ogqTWy", "outputId": "16531142-20c9-459e-8dde-f239c1e31203" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ps = np.arange(0.001, 0.999, 0.001)\n", "diffs = [OR_to_ARR(p, trtmt_OR) for p in ps]\n", "plt.plot(ps, diffs)\n", "plt.title(\"Absolute Risk Reduction for Constant Treatment OR\")\n", "plt.xlabel('Baseline Risk')\n", "plt.ylabel('Absolute Risk Reduction')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OI4QLB5l2OyZ" }, "source": [ "Note that when viewed on an absolute scale, the treatment effect is not constant, despite the fact that you used a model with no interactions between the features (we didn't multiply two features together). \n", "\n", "As shown in the plot, when the baseline risk is either very low (close to zero) or very high (close to one), the Absolute Risk Reduction from treatment is fairly low. When the baseline risk is closer to 0.5, the ARR of treatment is higher (about 0.07 for an odds ratio of 0.75).\n", "\n", "It is always important to remember that baseline risk has a natural effect on absolute risk reduction." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9bGTgLRkQZPR" }, "source": [ "\n", "### 2.3 Model Limitations\n", "\n", "We can now plot how closely the empirical (actual) risk reduction matches the risk reduction that is predicted by the logistic regression model. \n", "\n", "This is complicated by the fact that for each patient, we only observe one outcome (treatment or no treatment). 
\n", "- We can't give a patient treatment, then go back in time and measure an alternative scenario where the same patient did not receive the treatment.\n", "- Therefore, we will sort patients into groups based on their baseline risk as predicted by the model, and then plot the empirical ARR within each group of patients with similar baseline risks.\n", "- The empirical ARR is the death rate of the untreated patients in that group minus the death rate of the treated patients in that group.\n", "\n", "$$ARR_{empirical} = p_{baseline} - p_{treatment}$$" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "y7sx9hZ85jNQ" }, "source": [ "\n", "### Exercise 5: Baseline Risk\n", "In the next cell, write a function to compute the baseline risk of each patient using the logistic regression model.\n", "\n", "The baseline risk is the model's predicted probability that the patient dies if they do not receive treatment.\n", "\n", "You will later use the baseline risk of each patient to organize patients into risk groups (that have similar baseline risks). This will allow you to calculate the ARR within each risk group.\n", "\n", "$$p_{baseline} = logisticRegression(Treatment = False, Age = age_{i}, Obstruct = obstruct_{i}, \\cdots)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • A patient receives treatment if their feature x_treatment is True, and does not receive treatment when their x_treatment is False.
  • \n", "
  • For a patient who actually did receive treatment, you can ask the model to predict their risk without receiving treatment by setting the patient's x_treatment to False.
  • \n", "
  • The logistic regression predict_proba() function returns a 2D array, one row for each patient, and one column for each possible outcome (each class). In this case, the two outcomes are either no death (0), or death (1). To find out which column contains the probability for death, check the order of the classes by using lr.classes_
  • \n", "
\n", "

" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:30:49.614506Z", "start_time": "2020-04-04T15:30:49.580917Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "colab_type": "code", "id": "BrIYA-Ciu3EK", "outputId": "4c6b2802-581c-4346-8e41-da7ee2967d7d" }, "outputs": [], "source": [ "# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def base_risks(X, lr_model):\n", " \"\"\"\n", " Compute baseline risks for each individual in X.\n", "\n", " Args:\n", " X (dataframe): data from trial. 'TRTMT' column\n", " is 1 if subject received treatment, 0 otherwise\n", " lr_model (model): logistic regression model\n", " \n", " Returns:\n", " risks (np.array): array of predicted baseline risk\n", " for each subject in X\n", " \"\"\"\n", " \n", " # first make a copy of the dataframe so as not to overwrite the original\n", " X = X.copy(deep=True)\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "\n", " # Set the treatment variable to assume that the patient did not receive treatment\n", " X.TRTMT = False\n", " \n", " # Input the features into the model, and predict the probability of death.\n", " risks = lr_model.predict_proba(X)[:,1]\n", " \n", " ### END CODE HERE ###\n", "\n", " return risks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test Case**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEST CASE\n", " sex age obstruct perfor adhere nodes node4 TRTMT differ_2.0 differ_3.0 \\\n", "0 1 60 0 0 0 3 0 True 1 0 \n", "1 1 60 0 0 0 3 0 0 1 0 \n", "\n", " extent_2 extent_3 extent_4 \n", "0 0 1 0 \n", "1 0 1 0 \n", " TRTMT\n", "0 True\n", "1 0\n", "\n", "\n", "Base risks for both rows should be the same\n", "Baseline Risks: [0.43115868 0.43115868]\n" ] } ], "source": [ "example_df = pd.DataFrame(columns = X_dev.columns)\n", "example_df.loc[0, :] = 
X_dev.loc[X_dev.TRTMT == 1, :].iloc[0, :]\n", "example_df.loc[1, :] = example_df.iloc[0, :]\n", "example_df.loc[1, 'TRTMT'] = 0\n", "\n", "print(\"TEST CASE\")\n", "print(example_df)\n", "print(example_df.loc[:, ['TRTMT']])\n", "print('\\n')\n", "\n", "print(\"Base risks for both rows should be the same\")\n", "print(f\"Baseline Risks: {base_risks(example_df.copy(deep=True), lr)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Expected output\n", "\n", "```CPP\n", "Base risks for both rows should be the same\n", "Baseline Risks: [0.43115868 0.43115868]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JQsYKmVc6prz" }, "source": [ "\n", "### Exercise 6: ARR by quantile\n", "\n", "Since the effect of treatment varies depending on the baseline risk, it makes more sense to group patients who have similar baseline risks, and then look at the outcomes of those who receive treatment versus those who do not, to estimate the absolute risk reduction (ARR).\n", "\n", "You'll now implement the `lr_ARR_quantile` function to plot empirical average ARR for each quantile of base risk." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • Use pandas.cut to split the range of values into bins of equal width. For example, pd.cut(arr,5) assigns each value in the list or array 'arr' to one of 5 equal-width intervals.
  • \n", "
  • Use pandas.DataFrame.groupby to group by a selected column of the dataframe. Then select the desired variable and apply an aggregator function. For example, df.groupby('col1')['col2'].sum() groups by column 1, and then calculates the sum of column 2 for each group.
  • \n", "
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def lr_ARR_quantile(X, y, lr):\n", " \"\"\"\n", " Compute the empirical ARR within groups of similar baseline risk.\n", "\n", " Args:\n", " X (dataframe): features, including the 'TRTMT' column\n", " y (series): outcomes (1 for death, 0 for no death)\n", " lr (model): fitted logistic regression model\n", "\n", " Returns:\n", " arr_by_risk_group (series): empirical ARR per risk group,\n", " indexed by the mean baseline risk of the group\n", " \"\"\"\n", " \n", " # first make a deep copy of the features dataframe to calculate the base risks\n", " X = X.copy(deep=True)\n", " \n", " # Make another deep copy of the features dataframe to store baseline risk, risk_group, and y\n", " df = X.copy(deep=True)\n", "\n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", " # Calculate the baseline risks (use the function that you just implemented)\n", " baseline_risk = base_risks(df.copy(deep=True), lr)\n", " \n", " # bin patients into 10 risk groups of equal width based on their baseline risks\n", " risk_groups = pd.cut(baseline_risk,10)\n", " \n", " # Store the baseline risk, risk_groups, and y into the new dataframe\n", " df.loc[:, 'baseline_risk'] = baseline_risk\n", " df.loc[:, 'risk_group'] = risk_groups\n", " # use the y passed to the function, not the global y_dev\n", " df.loc[:, 'y'] = y\n", "\n", " # select the subset of patients who did not actually receive treatment\n", " df_baseline = df[df.TRTMT==False]\n", " \n", " # select the subset of patients who did actually receive treatment\n", " df_treatment = df[df.TRTMT==True]\n", " \n", " # For baseline patients, group them by risk group, select their outcome 'y', and take the mean\n", " baseline_mean_by_risk_group = df_baseline.groupby('risk_group')['y'].mean()\n", " \n", " # For treatment patients, group them by risk group, select their outcome 'y', and take the mean\n", " treatment_mean_by_risk_group = df_treatment.groupby('risk_group')['y'].mean()\n", " \n", " # Calculate the absolute risk reduction by risk group (baseline minus treatment)\n", " arr_by_risk_group = baseline_mean_by_risk_group - treatment_mean_by_risk_group\n", " \n", " # Set the index of the arr_by_risk_group Series to the average baseline risk of each risk group \n", " # Use data for all patients to calculate the average baseline risk, grouped by risk 
group.\n", " arr_by_risk_group.index = df.groupby('risk_group')['baseline_risk'].mean()\n", "\n", " ### END CODE HERE ###\n", " \n", " # Set the name of the Series to 'ARR'\n", " arr_by_risk_group.name = 'ARR'\n", " \n", "\n", " return arr_by_risk_group\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "baseline_risk\n", "0.231595 0.089744\n", "0.314713 0.042857\n", "0.386342 -0.014604\n", "0.458883 0.122222\n", "0.530568 0.142857\n", "0.626937 -0.104072\n", "0.693404 0.150000\n", "0.777353 0.293706\n", "0.836617 0.083333\n", "0.918884 0.200000\n", "Name: ARR, dtype: float64\n" ] } ], "source": [ "# Test\n", "abs_risks = lr_ARR_quantile(X_dev, y_dev, lr)\n", "\n", "# print the Series\n", "print(abs_risks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "```CPP\n", "baseline_risk\n", "0.231595 0.089744\n", "0.314713 0.042857\n", "0.386342 -0.014604\n", "0.458883 0.122222\n", "0.530568 0.142857\n", "0.626937 -0.104072\n", "0.693404 0.150000\n", "0.777353 0.293706\n", "0.836617 0.083333\n", "0.918884 0.200000\n", "Name: ARR, dtype: float64\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the ARR grouped by baseline risk" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:31:27.365631Z", "start_time": "2020-04-04T15:31:27.190715Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 458 }, "colab_type": "code", "id": "xtmp3BxtNR39", "outputId": "266dcffc-0c16-4456-c789-106465666b41" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.scatter(abs_risks.index, abs_risks, label='empirical ARR')\n", "plt.title(\"Empirical Absolute Risk Reduction vs. Baseline Risk\")\n", "plt.ylabel(\"Absolute Risk Reduction\")\n", "plt.xlabel(\"Baseline Risk Range\")\n", "ps = np.arange(abs_risks.index[0]-0.05, abs_risks.index[-1]+0.05, 0.01)\n", "diffs = [OR_to_ARR(p, trtmt_OR) for p in ps]\n", "plt.plot(ps, diffs, label='predicted ARR')\n", "plt.legend(loc='upper right')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fz8Es6q98Kjw" }, "source": [ "In the plot, the empirical absolute risk reduction is shown as circles, whereas the predicted risk reduction from the logistic regression model is given by the solid line.\n", "\n", "If ARR depended only on baseline risk, then if we plotted actual (empirical) ARR grouped by baseline risk, then it would follow the model's predictions closely (the dots would be near the line in most cases).\n", "\n", "However, you can see that the empirical absolute risk reduction (shown as circles) does not match the predicted risk reduction from the logistic regression model (given by the solid line). \n", "\n", "This may indicate that ARR may depend on more than simply the baseline risk. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "aAgIlK6Z8s2p" }, "source": [ "\n", "## 3 Evaluation Metric" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oCASYrsI1EFI" }, "source": [ "\n", "### 3.1 C-statistic-for-benefit (C-for-benefit)\n", "\n", "You'll now use a measure to evaluate the discriminative power of your models for predicting ARR. Ideally, you could use something like the regular Concordance index (also called C-statistic) from Course 2. 
Proceeding by analogy, you'd like to estimate something like:\n", "\n", "$$P(A \\text{ has higher predicted ARR than } B| A \\text{ experienced a greater risk reduction than } B).$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The ideal data cannot be observed\n", "\n", "The fundamental problem is that for each person, you can only observe either their treatment outcome or their baseline outcome. \n", "- The patient either receives the treatment, or does not receive the treatment. You can't go back in time to have the same patient undergo treatment and then not have treatment.\n", "- This means that you can't determine what their actual risk reduction was. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Estimate the treated/untreated patient using a pair of patients\n", "\n", "What you will do instead is match people across treatment and control arms based on predicted ARR. \n", "- Now, in each pair, you'll observe both outcomes, so you'll have an estimate of the true treatment effect.\n", "- In the pair of patients (A,B), \n", " - Patient A receives the treatment \n", " - Patient B does not receive the treatment.\n", "- Think of the pair of patients as a substitute for the ideal data that has the same exact patient in both the treatment and control group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The C-for-benefit\n", "\n", "$$P(\\text{$P_1$ has a predicted ARR greater than $P_2$} | \\text{$P_1$ experiences greater risk reduction than $P_2$}),$$\n", "\n", "where $P_1$ and $P_2$ are two matched pairs of patients:\n", "\n", "- Pair 1 consists of two patients (A,B), where A receives treatment, B does not.\n", "- Pair 2 is another pair of two patients (A,B), where A receives treatment, B does not.\n", "\n", "The risk reduction for each pair is:\n", "- 1 if the treated person A survives and the untreated person B does not (treatment helps). 
\n", "- -1 if the treated person A dies and the untreated person B doesn't (treatment harms)\n", "- 0 otherwise (treatment has no effect, because both patients in the pair live, or both die)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Details for calculating C-for-benefit\n", "\n", "The c-for-benefit gives you a way to evaluate the ability of models to discriminate between patient profiles which are likely to experience greater benefit from treatment. \n", "- If you are better able to predict how likely a treatment is to improve a patient's outcome, you can help the doctor and patient make a more informed decision when deciding whether to undergo treatment, considering the possible side-effects and other risks associated with treatment.\n", "\n", "Please complete the implementation of the C-statistic-for-benefit below. \n", "\n", "The dictionary that maps the outcomes of a pair to its observed benefit is given to you. \n", "```CPP\n", "obs_benefit_dict = {\n", " (0, 0): 0,\n", " (0, 1): -1,\n", " (1, 0): 1,\n", " (1, 1): 0,\n", " }\n", "```\n", "Here is the interpretation of this dictionary for a pair of patients, where A receives treatment and B does not. As stated in the `c_for_benefit_score` docstring, each pair is stored with the control-arm patient first, so the key is `(y_B, y_A)`: \n", "- When patient B does not die, and neither does patient A, `(0, 0)`, the observed benefit of treatment is 0.\n", "- When untreated patient B does not die, but treated patient A does die, `(0, 1)`, the observed benefit is -1 (the treatment was harmful).\n", "- When untreated patient B dies, but treated patient A does not die, `(1, 0)`, the observed benefit is 1 (the treatment helped).\n", "- When patient B dies and patient A dies, `(1, 1)`, the observed benefit of treatment is 0.\n", "\n", "Each patient in the pair is represented by a tuple `(ARR, y)`.\n", "- Index 0 contains the predicted ARR, which is the predicted benefit from treatment.\n", "- Index 1 contains the actual patient outcome: 0 for no death, 1 for death.\n", "\n", "So a pair of patients is represented as a tuple containing two tuples:\n", "\n", "For example, Pair_1 is `( (ARR_1_B, y_1_B),(ARR_1_A, y_1_A))`, and the data may look like:\n", "`( (0.40, 1),(0.60, 0))`. \n", "- This means that patient B (who did not receive treatment) has a predicted benefit of 0.40 and dies.\n", "- Patient A (who received treatment) has a predicted benefit of 0.60 and does not die.\n", "- The key for this pair is `(1, 0)`, so its observed benefit is 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 7: Calculate c for benefit score\n", "In `c_for_benefit_score`, you will compute the C-for-benefit given the matched pairs.\n", "\n", "$$\\text{c for benefit score} = \\frac{concordant + 0.5 \\times risk\\_ties}{permissible}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Click here for Hints!\n", "\n", "

\n", "

    \n", "
  • A pair of patients in this case are two patients whose data are used to represent a single patient.
  • \n", "
  • A pair of pairs is similar to what you think of as just a \"pair\" in the course 2 concordance index. It's a pair of pairs of patients (four patients total).
  • \n", "
  • Each patient is represented by a tuple of two values. The first value is the predicted risk reduction, and the second is the patient's outcome.
  • \n", "
  • observed benefit: for each patient pair, the first patient is the one from the control arm (did not receive treatment), and the second in the pair is the one who received treatment, as stated in the function's docstring. Observed benefit is either 0 (no effect), 1 (treatment helped), or -1 (treatment harmed).
  • \n", "
  • predicted benefit: for each patient pair, take the mean of the two patients' predicted benefits. Each patient's predicted benefit is the first value in that patient's tuple.
  • \n", "
  • permissible pair of pairs: the observed benefit is different between the two pairs of patients.
  • \n", "
  • concordant pair: the observed benefit and predicted benefit of pair 1 are both less than those for pair 2; or, the observed and predicted benefit of pair 1 are both greater than those for pair 2. Also, it should be a permissible pair of pairs.
  • \n", "
  • Risk tie: the predicted benefits of both pairs are equal, and it's also a permissible pair of pairs.
  • \n", "
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 385 }, "colab_type": "code", "id": "XYYwXThLOZKi", "outputId": "6bbb3684-89d5-4674-9147-221a26a21621" }, "outputs": [], "source": [ "# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def c_for_benefit_score(pairs):\n", " \"\"\"\n", " Compute c-statistic-for-benefit given list of\n", " individuals matched across treatment and control arms. \n", "\n", " Args:\n", " pairs (list of tuples): each element of the list is a tuple of individuals,\n", " the first from the control arm and the second from\n", " the treatment arm. Each individual \n", " p = (pred_outcome, actual_outcome) is a tuple of\n", " their predicted outcome and actual outcome.\n", " Result:\n", " cstat (float): c-statistic-for-benefit computed from pairs.\n", " \"\"\"\n", " \n", " # mapping pair outcomes to benefit\n", " obs_benefit_dict = {\n", " (0, 0): 0,\n", " (0, 1): -1,\n", " (1, 0): 1,\n", " (1, 1): 0,\n", " }\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None', 'False', and 'pass' with your code) ###\n", "\n", " # compute observed benefit for each pair\n", " obs_benefit = [obs_benefit_dict[(i[1],j[1])] for (i,j) in pairs]\n", "\n", " # compute average predicted benefit for each pair\n", " pred_benefit = [np.mean([i[0],j[0]]) for (i,j) in pairs]\n", "\n", " concordant_count, permissible_count, risk_tie_count = 0, 0, 0\n", "\n", " # iterate over pairs of pairs\n", " for i in range(len(pairs)):\n", " for j in range(i + 1, len(pairs)):\n", " \n", " # if the observed benefit is different, increment permissible count\n", " if obs_benefit[i] != obs_benefit[j]:\n", "\n", " # increment count of permissible pairs\n", " permissible_count = permissible_count + 1\n", " \n", " # if concordant, increment count\n", " concordance= ((pred_benefit[i]>pred_benefit[j] and obs_benefit[i]>obs_benefit[j]) or (pred_benefit[i]\n", "### Exercise 8: Create patient pairs and calculate 
c-for-benefit\n", "\n", "You will implement the function `c_statistic`, which prepares the patient data and uses the c-for-benefit score function to calculate the c-for-benefit:\n", "\n", "- Take as input:\n", " - The predicted risk reduction `pred_rr` (ARR)\n", " - outcomes `y` (1 for death, 0 for no death)\n", " - treatments `w` (1 for treatment, 0 for no treatment)\n", "- Collect the predicted risk reduction, outcomes and treatments into tuples, one tuple for each patient.\n", "- Filter one list of tuples where patients did not receive treatment.\n", "- Filter another list of tuples where patients received treatment.\n", "\n", "- Make sure that there is one treated patient for each untreated patient.\n", " - If there are fewer treated patients, randomly sample a subset of untreated patients, one for each treated patient.\n", " - If there are fewer untreated patients, randomly sample a subset of treated patients, one for each untreated patient.\n", " \n", "- Sort treated patients by their predicted risk reduction, and similarly sort the untreated patients by predicted risk reduction.\n", " - This allows you to match the treated patient with the highest predicted risk reduction with the untreated patient with the highest predicted risk reduction. Similarly, the second highest treated patient is matched with the second highest untreated patient.\n", " \n", "- Create pairs of treated and untreated patients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • Use zip(a,b,c) to create tuples from two or more lists of equal length, and use list(zip(a,b,c)) to store that as a list data type.
  • \n", "
  • Use filter(lambda x: x[0] == True, some_list) to filter a list (such as a list of tuples) so that the 0th item in each tuple is equal to True. Cast the result as a list using list(filter(lambda x: x[0] == True, some_list))
  • \n", "
  • Use random.sample(some_list, sub_sample_length) to sample a subset from a list without replacement.
  • \n", "
  • Use sorted(some_list, key=lambda x: x[1]) to sort a list of tuples by their value in index 1.
  • \n", "
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def c_statistic(pred_rr, y, w, random_seed=0):\n", " \"\"\"\n", " Return concordance-for-benefit, the proportion of all matched pairs with\n", " unequal observed benefit, in which the patient pair receiving greater\n", " treatment benefit was predicted to do so.\n", "\n", " Args: \n", " pred_rr (array): array of predicted risk reductions\n", " y (array): array of true outcomes\n", " w (array): array of true treatments \n", " \n", " Returns: \n", " cstat (float): calculated c-stat-for-benefit\n", " \"\"\"\n", " assert len(pred_rr) == len(w) == len(y)\n", " random.seed(random_seed)\n", " \n", " ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", " # Collect pred_rr, y, and w into tuples for each patient\n", " tuples = list(zip(pred_rr,y,w))\n", " \n", " # Collect untreated patient tuples (w is 0), stored as a list\n", " untreated = list(filter(lambda x:x[2]==False, tuples))\n", " \n", " # Collect treated patient tuples (w is 1), stored as a list\n", " treated = list(filter(lambda x:x[2]==True, tuples))\n", "\n", " # randomly subsample to ensure every person is matched\n", " \n", " # if there are more untreated than treated patients,\n", " # randomly choose a subset of untreated patients, one for each treated patient.\n", "\n", " if len(treated) < len(untreated):\n", " untreated = random.sample(untreated,k=len(treated))\n", " \n", " # if there are more treated than untreated patients,\n", " # randomly choose a subset of treated patients, one for each untreated patient.\n", " if len(untreated) < len(treated):\n", " treated = random.sample(treated,k=len(untreated))\n", " \n", " assert len(untreated) == len(treated)\n", "\n", " # Sort the untreated patients by their predicted risk reduction\n", " untreated = sorted(untreated,key=lambda x:x[0])\n", " \n", " # Sort the treated patients by their predicted risk reduction\n", 
" treated = sorted(treated,key=lambda x:x[0])\n", " \n", " # match untreated and treated patients to create pairs,\n", " # control-arm patient first, as c_for_benefit_score expects\n", " pairs = list(zip(untreated,treated))\n", "\n", " # calculate the c-for-benefit using these pairs (use the function that you implemented earlier)\n", " cstat = c_for_benefit_score(pairs)\n", " \n", " ### END CODE HERE ###\n", " \n", " return cstat" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C-for-benefit calculated is 0.6\n" ] } ], "source": [ "# Test\n", "\n", "tmp_pred_rr = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]\n", "tmp_y = [0,1,0,1,0,1,0,1,0]\n", "tmp_w = [0,0,0,0,1,1,1,1,1]\n", "\n", "tmp_cstat = c_statistic(tmp_pred_rr, tmp_y, tmp_w)\n", "\n", "print(f\"C-for-benefit calculated is {tmp_cstat}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "C-for-benefit calculated is 0.6\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XH_yDTAq3D42" }, "source": [ "### Predicted risk reduction\n", "In order to compute the c-statistic-for-benefit for any of your models, you need to compute predicted risk reduction from treatment (predicted risk reduction is the input `pred_rr` to the c-statistic function).\n", "\n", "- The easiest way to do this in general is to create a version of the data where the treatment variable is False and a version where it is True.\n", "- Then take the difference $\\text{pred\\_RR} = p_{control} - p_{treatment}$\n", "\n", "We've implemented this for you." 
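,
"

To make the two-versions idea concrete before looking at the provided helpers, here is a minimal sketch that does not use the trial data. It assumes a hypothetical one-feature logistic model with made-up coefficients (`INTERCEPT`, `COEF_AGE` and `COEF_TRTMT` are illustrative values, not the ones fitted in this notebook) and scores the same patient once with treatment switched off and once with it switched on.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
INTERCEPT = -3.0
COEF_AGE = 0.05
COEF_TRTMT = -0.5  # negative: treatment lowers the log-odds of death


def predicted_risk(age, treated):
    # Logistic model: risk = sigmoid(intercept + coef_age * age + coef_trtmt * treated)
    z = INTERCEPT + COEF_AGE * age + COEF_TRTMT * (1 if treated else 0)
    return 1.0 / (1.0 + math.exp(-z))


def predicted_risk_reduction(age):
    # Control version of the patient minus treatment version of the same patient
    return predicted_risk(age, treated=False) - predicted_risk(age, treated=True)


for age in (30, 60, 90):
    control_risk = predicted_risk(age, treated=False)
    print(age, round(control_risk, 3), round(predicted_risk_reduction(age), 3))
```

With a negative treatment coefficient the difference is positive for every patient in this sketch, but its size changes with the control risk.
"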
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:31:43.624458Z", "start_time": "2020-04-04T15:31:43.619458Z" }, "colab": {}, "colab_type": "code", "id": "arBYI7rR4lqr" }, "outputs": [], "source": [ "def treatment_control(X):\n", " \"\"\"Create treatment and control versions of data\"\"\"\n", " X_treatment = X.copy(deep=True)\n", " X_control = X.copy(deep=True)\n", " X_treatment.loc[:, 'TRTMT'] = 1\n", " X_control.loc[:, 'TRTMT'] = 0\n", " return X_treatment, X_control\n", "\n", "def risk_reduction(model, data_treatment, data_control):\n", " \"\"\"Compute predicted risk reduction for each row in data\"\"\"\n", " treatment_risk = model.predict_proba(data_treatment)[:, 1]\n", " control_risk = model.predict_proba(data_control)[:, 1]\n", " return control_risk - treatment_risk" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "E4g3JazHF1G9" }, "source": [ "Now let's compute the predicted risk reductions of the logistic regression model on the test set." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "X_test_treated, X_test_untreated = treatment_control(X_test)\n", "rr_lr = risk_reduction(lr, X_test_treated, X_test_untreated)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "uv0Yr96aGaeL" }, "source": [ "Before we evaluate the c-statistic-for-benefit, let's look at a histogram of predicted ARR." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:31:51.575460Z", "start_time": "2020-04-04T15:31:51.420183Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 444 }, "colab_type": "code", "id": "Oa0gA4rCGZtU", "outputId": "8f8b1896-8276-4101-f488-1453389c62bc" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(rr_lr, bins='auto')\n", "plt.title(\"Histogram of Predicted ARR using logistic regression\")\n", "plt.ylabel(\"count of patients\")\n", "plt.xlabel(\"ARR\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rTI2xcriG4vi" }, "source": [ "Note that although the model predicts a different absolute risk reduction for different patients, it never predicts that the treatment will increase risk. This is because the odds ratio of treatment is less than 1, so the model always predicts a decrease in the baseline risk. Run the next cell to compute the c-statistic-for-benefit on the test data." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:31:53.895737Z", "start_time": "2020-04-04T15:31:53.880107Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "HTgU2BLbGX1B", "outputId": "44bd6144-31ca-4a02-e4ce-8f11f139f46d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression evaluated by C-for-Benefit: 0.5412\n" ] } ], "source": [ "tmp_cstat_test = c_statistic(rr_lr, y_test, X_test.TRTMT)\n", "print(f\"Logistic Regression evaluated by C-for-Benefit: {tmp_cstat_test:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected Output\n", "```CPP\n", "Logistic Regression evaluated by C-for-Benefit: 0.5412\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "o6YQq4LLZdBj" }, "source": [ "Recall that a c statistic ranges from 0 to 1, and is closer to 1 when the model being evaluated is doing a good job with its predictions. A value of 0.5 is what random predictions would score.\n", "\n", "You can see that the model is not doing a great job of predicting risk reduction, given a c-for-benefit of around 0.54." 
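,
"

As a small illustration of how such a score is counted, the sketch below evaluates the formula from Exercise 7 on four hand-made pairs. The function name `toy_c_for_benefit` and all numbers are made up for this example; each pair is written as (control patient, treated patient), following the `c_for_benefit_score` docstring.

```python
def toy_c_for_benefit(pairs):
    # Each pair is ((pred_control, y_control), (pred_treated, y_treated)),
    # with y = 1 for death and 0 for no death.
    # Observed benefit: +1 if only the control patient dies, -1 if only the
    # treated patient dies, 0 if both outcomes are the same.
    obs = [c[1] - t[1] for (c, t) in pairs]
    # Predicted benefit of a pair: mean of the two predicted risk reductions.
    pred = [(c[0] + t[0]) / 2 for (c, t) in pairs]

    concordant, permissible, ties = 0, 0, 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if obs[i] == obs[j]:
                continue  # not permissible: same observed benefit
            permissible += 1
            if pred[i] == pred[j]:
                ties += 1  # risk tie: equal predicted benefit
            elif (pred[i] > pred[j]) == (obs[i] > obs[j]):
                concordant += 1  # predicted and observed order agree
    # Assumes at least one permissible pair of pairs exists.
    return (concordant + 0.5 * ties) / permissible


toy_pairs = [
    ((0.30, 1), (0.30, 0)),  # observed +1, predicted 0.30
    ((0.20, 0), (0.20, 0)),  # observed  0, predicted 0.20
    ((0.10, 0), (0.10, 1)),  # observed -1, predicted 0.10
    ((0.20, 1), (0.20, 0)),  # observed +1, predicted 0.20
]
print(toy_c_for_benefit(toy_pairs))
```

Here five of the six pairs of pairs are permissible, four of those are concordant and one is a risk tie, so the score is (4 + 0.5) / 5 = 0.9.
"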
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular c-index\n", "Let's compare this with the regular C-index which you've applied in previous assignments. Note that the regular c-statistic does not look at pairs of pairs of patients, and just compares one patient to another when evaluating the model's performance. So the regular c-index is evaluating the model's ability to predict overall patient risk, not necessarily measuring how well the model predicts benefit from treatment." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:31:55.406270Z", "start_time": "2020-04-04T15:31:55.400272Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "JRtzA6qyJ5sn", "outputId": "4ada7ef3-b746-4ba1-c208-828cf6c8f674" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression evaluated by regular C-index: 0.7785\n" ] } ], "source": [ "from lifelines.utils import concordance_index\n", "tmp_regular_cindex = concordance_index(y_test, lr.predict_proba(X_test)[:, 1])\n", "print(f\"Logistic Regression evaluated by regular C-index: {tmp_regular_cindex:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "```CPP\n", "Logistic Regression evaluated by regular C-index: 0.7785\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qRYEhMCOLDjs" }, "source": [ "You can see that even though the model accurately predicts overall risk (regular c-index), it does not necessarily do a great job predicting benefit from treatment (c-for-benefit). " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Z_4ogidoLqGd" }, "source": [ "You can also visually assess the discriminative ability of the model by checking if the people it thinks benefit the most from treatment empirically (actually) experience a benefit. 
\n", "\n", "Since you don't have counterfactual results from individuals, you'll need to aggregate patient information in some way. \n", "\n", "You can group patients by deciles (10 groups) of risk." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:32:02.277354Z", "start_time": "2020-04-04T15:32:02.107132Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 458 }, "colab_type": "code", "id": "aP8ST7ycL-I6", "outputId": "6c02ef30-8683-45b3-f3f1-dea8b39c4f79" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def quantile_benefit(X, y, arr_hat):\n", " df = X.copy(deep=True)\n", " df.loc[:, 'y'] = y\n", " df.loc[:, 'benefit'] = arr_hat\n", " benefit_groups = pd.qcut(arr_hat, 10)\n", " df.loc[:, 'benefit_groups'] = benefit_groups\n", " empirical_benefit = df.loc[df.TRTMT == 0, :].groupby('benefit_groups').y.mean() - df.loc[df.TRTMT == 1].groupby('benefit_groups').y.mean()\n", " avg_benefit = df.loc[df.TRTMT == 0, :].y.mean() - df.loc[df.TRTMT==1, :].y.mean()\n", " return empirical_benefit, avg_benefit\n", "\n", "def plot_empirical_risk_reduction(emp_benefit, av_benefit, model):\n", " plt.scatter(range(len(emp_benefit)), emp_benefit)\n", " plt.xticks(range(len(emp_benefit)), range(1, len(emp_benefit) + 1))\n", " plt.title(\"Empirical Risk Reduction vs. Predicted ({})\".format(model))\n", " plt.ylabel(\"Empirical Risk Reduction\")\n", " plt.xlabel(\"Predicted Risk Reduction Quantile\")\n", " plt.plot(range(10), [av_benefit]*10, linestyle='--', label='average RR')\n", " plt.legend(loc='lower right')\n", " plt.show()\n", "\n", "emp_benefit, avg_benefit = quantile_benefit(X_test, y_test, rr_lr)\n", "plot_empirical_risk_reduction(emp_benefit, avg_benefit, \"Logistic Regression\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "YZM3WZ2fPvOn" }, "source": [ "If the model performed well, then you would see patients in the higher deciles of predicted risk reduction (on the right) also have higher empirical risk reduction (to the top). \n", "\n", "This model using logistic regression is far from perfect. \n", "\n", "Below, you'll see if you can do better using a more flexible machine learning approach." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "JL8ET3lk9r02" }, "source": [ "\n", "## 4 Machine Learning Approaches " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-oOkd5juz5To" }, "source": [ "\n", "### 4.1 T-Learner\n", "\n", "Now you will see how recent machine learning approaches compare to the more standard analysis. The approach we'll look at is called [T-learner](https://arxiv.org/pdf/1706.03461.pdf).\n", "- \"T\" stands for \"two\". \n", "- The T-learner learns two different models, one for treatment risk, and another model for control risk.\n", "- It then takes the difference of the two risk predictions to predict the risk reduction.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 9: Complete the TLearner class. \n", "\n", "- The constructor `__init__()` sets the treatment and control estimators based on the given inputs to the constructor.\n", "- The `predict` function takes the features and uses each estimator to predict the risk of death. It then calculates the risk of death from the control estimator minus the risk of death from the treatment estimator, and returns this as the predicted risk reduction." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "class TLearner():\n", "    \"\"\"\n", "    T-Learner class.\n", "\n", "    Attributes:\n", "      treatment_estimator (object): fitted model for treatment outcome\n", "      control_estimator (object): fitted model for control outcome\n", "    \"\"\"\n", "    def __init__(self, treatment_estimator, control_estimator):\n", "        \"\"\"\n", "        Initializer for TLearner class.\n", "        \"\"\"\n", "        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "        # set the treatment estimator\n", "        self.treatment_estimator = treatment_estimator\n", "\n", "        # set the control estimator\n", "        self.control_estimator = control_estimator\n", "\n", "        ### END CODE HERE ###\n", "\n", "    def predict(self, X):\n", "        \"\"\"\n", "        Return predicted risk reduction for treatment for given data matrix.\n", "\n", "        Args:\n", "          X (dataframe): dataframe containing features for each subject\n", "\n", "        Returns:\n", "          preds (np.array): predicted risk reduction for each row of X\n", "        \"\"\"\n", "        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "        # predict the risk of death using the control estimator\n", "        risk_control = self.control_estimator.predict_proba(X)[:,1]\n", "\n", "        # predict the risk of death using the treatment estimator\n", "        risk_treatment = self.treatment_estimator.predict_proba(X)[:,1]\n", "\n", "        # the predicted risk reduction is control risk minus the treatment risk\n", "        pred_risk_reduction = risk_control - risk_treatment\n", "\n", "        ### END CODE HERE ###\n", "\n", "        return pred_risk_reduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tune the model with grid search\n", "\n", "In order to tune your two models, you will use grid search to find the desired parameters.\n", "- You will use a validation set to evaluate the model on different parameters, in order to avoid overfitting to the training 
set.\n", "\n", "To test models on all combinations of hyperparameters, you can first list out all of the values in a list of lists.\n", "For example:\n", "```CPP\n", "hyperparams = {\n", " 'n_estimators': [10, 20],\n", " 'max_depth': [2, 5],\n", " 'min_samples_leaf': [0.1, 0.2],\n", " 'random_state': [0]\n", "}\n", "```\n", "You can generate a list like this:\n", "```CPP\n", "[[10, 20],\n", " [2, 5], \n", " [0.1, 0.2]\n", "]\n", "```\n", "\n", "Next, you can get all combinations of the hyperparameter values:\n", "```CPP\n", "[(10, 2, 0.1),\n", " (10, 2, 0.2),\n", " (10, 5, 0.1),\n", " (10, 5, 0.2),\n", " (20, 2, 0.1),\n", " (20, 2, 0.2),\n", " (20, 5, 0.1),\n", " (20, 5, 0.2)]\n", "```\n", "\n", "To feed the hyperparameters into a random forest model, you can use a dictionary, so that you do not need to hard code the parameter names.\n", "For example, instead of\n", "```CPP\n", "RandomForestClassifier(n_estimators=20, max_depth=5, min_samples_leaf=0.2)\n", "```\n", "\n", "You have more flexibility if you create a dictionary and pass it into the model.\n", "```CPP\n", "args_d = {'n_estimators': 20, 'max_depth': 5, 'min_samples_leaf': 0.2}\n", "RandomForestClassifier(**args_d)\n", "```\n", "This allows you to pass in a hyperparameter dictionary for any hyperparameters, not just `n_estimators`, `max_depth`, and `min_samples_leaf`.\n", "\n", "So you'll find a way to generate a list of dictionaries, like this:\n", "```CPP\n", "[{'n_estimators': 10, 'max_depth': 2, 'min_samples_leaf': 0.1},\n", " {'n_estimators': 10, 'max_depth': 2, 'min_samples_leaf': 0.2},\n", " {'n_estimators': 10, 'max_depth': 5, 'min_samples_leaf': 0.1},\n", " {'n_estimators': 10, 'max_depth': 5, 'min_samples_leaf': 0.2},\n", " {'n_estimators': 20, 'max_depth': 2, 'min_samples_leaf': 0.1},\n", " {'n_estimators': 20, 'max_depth': 2, 'min_samples_leaf': 0.2},\n", " {'n_estimators': 20, 'max_depth': 5, 'min_samples_leaf': 0.1},\n", " {'n_estimators': 20, 'max_depth': 5, 'min_samples_leaf': 0.2}]\n", 
"```\n", "\n", "Notice how the values in both the list of tuples and the list of dictionaries are in the same order as the original hyperparams dictionary. For example, the first value in each is `n_estimators`, then `max_depth`, and then `min_samples_leaf`:\n", "```CPP\n", "# one tuple from the list of tuples\n", "(10, 2, 0.1)\n", "\n", "# the matching dictionary from the list of dictionaries\n", "{'n_estimators': 10, 'max_depth': 2, 'min_samples_leaf': 0.1}\n", "```\n", "\n", "\n", "\n", "Then for each dictionary of hyperparams:\n", "- Train a model.\n", "- Use the regular concordance index to compare the models' performance. \n", "- Identify and return the best performing model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 10: hold out grid search\n", "\n", "Implement hold out grid search. \n", "##### Note\n", "In this case, you are not going to apply k-fold cross validation. Since `sklearn.model_selection.GridSearchCV()` applies k-fold cross validation, you won't be using this to perform grid search, and you will implement your own grid search.\n", "\n", "Please see the hints if you get stuck." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • You can use the .items() or .values() method of a dictionary to get its key, value pairs or just values. Use a list() to store them inside a list.
  • \n", "
  • To get all combinations of the hyperparams, you can use itertools.product(*args_list), where args_list is a list object.
  • \n", "
  • To generate the list of dictionaries, loop through the list of tuples. The position of each value
  • \n", "
\n", "

\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def holdout_grid_search(clf, X_train_hp, y_train_hp, X_val_hp, y_val_hp, hyperparam, verbose=False):\n", "    '''\n", "    Conduct hyperparameter grid search on a hold-out validation set.\n", "    Hyperparameters are input as a dictionary mapping each hyperparameter name to the\n", "    range of values they should iterate over. Use the concordance index as your evaluation\n", "    function.\n", "\n", "    Input:\n", "        clf: sklearn classifier class (called as clf(**params), not an instance)\n", "        X_train_hp (dataframe): dataframe for training set input variables\n", "        y_train_hp (dataframe): dataframe for training set targets\n", "        X_val_hp (dataframe): dataframe for validation set input variables\n", "        y_val_hp (dataframe): dataframe for validation set targets\n", "        hyperparam (dict): hyperparameter dictionary mapping hyperparameter\n", "                           names to range of values for grid search\n", "\n", "    Output:\n", "        best_estimator (sklearn classifier): fitted sklearn classifier with best performance on\n", "                                             validation set\n", "        best_hyperparam (dict): hyperparameter values used by best_estimator\n", "    '''\n", "    # Initialize best estimator\n", "    best_estimator = None\n", "\n", "    # initialize best hyperparam\n", "    best_hyperparam = {}\n", "\n", "    # initialize the c-index best score to zero\n", "    best_score = 0.0\n", "\n", "    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "\n", "    # Get the values of the hyperparam and store them as a list of lists\n", "    hyper_param_l = list(hyperparam.values())\n", "\n", "    # Generate a list of tuples with all possible combinations of the hyperparams\n", "    combination_l_of_t = list(itertools.product(*hyper_param_l))\n", "\n", "    # Initialize the list of dictionaries for all possible combinations of hyperparams\n", "    combination_l_of_d = []\n", "\n", "    # loop through each tuple in the list of tuples\n", "    for val_tuple in combination_l_of_t: # complete this line\n", "        param_d = {}\n", 
"\n", "        # Enumerate each key in the original hyperparams dictionary\n", "        for i, k in enumerate(hyperparam): # complete this line\n", "\n", "            # add a key value pair to param_d for each value in val_tuple\n", "            param_d[k] = val_tuple[i]\n", "\n", "        # append param_d to the list of dictionaries\n", "        combination_l_of_d.append(param_d)\n", "\n", "\n", "    # For each hyperparam dictionary in the list of dictionaries:\n", "    for param_d in combination_l_of_d: # complete this line\n", "\n", "        # Set the model to the given hyperparams\n", "        estimator = clf(**param_d)\n", "\n", "        # Train the model on the training features and labels\n", "        estimator.fit(X_train_hp,y_train_hp)\n", "\n", "        # Predict the risk of death using the validation features\n", "        preds = estimator.predict_proba(X_val_hp)\n", "\n", "        # Evaluate the model's performance using the regular concordance index\n", "        estimator_score = concordance_index(y_val_hp, preds[:,1])\n", "\n", "        # if the model's c-index is better than the previous best:\n", "        if estimator_score>best_score: # complete this line\n", "\n", "            # save the new best score\n", "            best_score = estimator_score\n", "\n", "            # save the new best estimator\n", "            best_estimator = estimator\n", "\n", "            # save the new best hyperparams\n", "            best_hyperparam = param_d\n", "\n", "    ### END CODE HERE ###\n", "\n", "    if verbose:\n", "        print(\"hyperparam:\")\n", "        display(hyperparam)\n", "\n", "        print(\"hyper_param_l\")\n", "        display(hyper_param_l)\n", "\n", "        print(\"combination_l_of_t\")\n", "        display(combination_l_of_t)\n", "\n", "        print(f\"combination_l_of_d\")\n", "        display(combination_l_of_d)\n", "\n", "        print(f\"best_hyperparam\")\n", "        display(best_hyperparam)\n", "        print(f\"best_score: {best_score:.4f}\")\n", "\n", "    return best_estimator, best_hyperparam" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hyperparam:\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] }, { "data": { "text/plain": [ "{'n_estimators': [10, 20],\n", " 'max_depth': [2, 5],\n", " 'min_samples_leaf': [0.1, 0.2],\n", " 'random_state': [0]}" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "hyper_param_l\n" ] }, { "data": { "text/plain": [ "[[10, 20], [2, 5], [0.1, 0.2], [0]]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "combination_l_of_t\n" ] }, { "data": { "text/plain": [ "[(10, 2, 0.1, 0),\n", " (10, 2, 0.2, 0),\n", " (10, 5, 0.1, 0),\n", " (10, 5, 0.2, 0),\n", " (20, 2, 0.1, 0),\n", " (20, 2, 0.2, 0),\n", " (20, 5, 0.1, 0),\n", " (20, 5, 0.2, 0)]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "combination_l_of_d\n" ] }, { "data": { "text/plain": [ "[{'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.2,\n", " 
'random_state': 0}]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "best_hyperparam\n" ] }, { "data": { "text/plain": [ "{'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0}" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "best_score: 0.5928\n" ] } ], "source": [ "# Test\n", "n = X_dev.shape[0]\n", "tmp_X_train = X_dev.iloc[:int(n*0.8),:]\n", "tmp_X_val = X_dev.iloc[int(n*0.8):,:]\n", "tmp_y_train = y_dev[:int(n*0.8)]\n", "tmp_y_val = y_dev[int(n*0.8):]\n", "\n", "hyperparams = {\n", " 'n_estimators': [10, 20],\n", " 'max_depth': [2, 5],\n", " 'min_samples_leaf': [0.1, 0.2],\n", " 'random_state' : [0]\n", "}\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "control_model = holdout_grid_search(RandomForestClassifier,\n", " tmp_X_train, tmp_y_train,\n", " tmp_X_val, tmp_y_val, hyperparams, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "T-Learner is a convenient framework because it does not restrict your choice of base learners.\n", "- You will use random forests as the base learners, but are able to choose another model as well." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "hyperparam:\n", "{'n_estimators': [10, 20],\n", " 'max_depth': [2, 5],\n", " 'min_samples_leaf': [0.1, 0.2],\n", " 'random_state': [0]}\n", "hyper_param_l\n", "[[10, 20], [2, 5], [0.1, 0.2], [0]]\n", "combination_l_of_t\n", "[(10, 2, 0.1, 0),\n", " (10, 2, 0.2, 0),\n", " (10, 5, 0.1, 0),\n", " (10, 5, 0.2, 0),\n", " (20, 2, 0.1, 0),\n", " (20, 2, 0.2, 0),\n", " (20, 5, 0.1, 0),\n", " (20, 5, 0.2, 0)]\n", "combination_l_of_d\n", "[{'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 10,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0},\n", " {'n_estimators': 20,\n", " 'max_depth': 5,\n", " 'min_samples_leaf': 0.2,\n", " 'random_state': 0}]\n", "best_hyperparam\n", "{'n_estimators': 10,\n", " 'max_depth': 2,\n", " 'min_samples_leaf': 0.1,\n", " 'random_state': 0}\n", "best_score: 0.5928\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "O-BkhCwzIEYT" }, "source": [ "\n", "### Exercise 11: Training and validation, treatment and control splits\n", "\n", "- Unlike logistic regression, the machine learning algorithms used for base learners will generally require hyperparameter tuning, which means that you need to split your dev set into a training and validation set. 
\n", "- You need to also split each of the training and validation sets into *treatment* and *control* groups to train the treatment and control base learners of the T-Learner.\n", "\n", "The function below takes in a dev dataset and splits it into training and validation sets for treatment and control models, respectively. \n", "Complete the implementation. \n", "\n", "#### Note\n", "- The input X_train and X_val have the 'TRTMT' column. Please remove the 'TRTMT' column from the treatment and control features that the function returns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " Hints\n", "\n", "

\n", "

    \n", "
  • To drop a column, set the axis to 1 when calling pandas.DataFrame.drop (axis=0 is used to drop a row by its index label)
  • \n", "
  • \n", "
\n", "

" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 249 }, "colab_type": "code", "id": "QdVLM4Zxjd4L", "outputId": "9e70dbc4-afbc-46e4-d566-8e19e261bbab" }, "outputs": [], "source": [ "# UNQ_C11 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n", "def treatment_dataset_split(X_train, y_train, X_val, y_val):\n", "    \"\"\"\n", "    Separate treated and control individuals in training\n", "    and validation sets. Remember that returned\n", "    datasets should NOT contain the 'TRTMT' column!\n", "\n", "    Args:\n", "        X_train (dataframe): dataframe for subjects in training set\n", "        y_train (np.array): outcomes for each individual in X_train\n", "        X_val (dataframe): dataframe for subjects in validation set\n", "        y_val (np.array): outcomes for each individual in X_val\n", "\n", "    Returns:\n", "        X_treat_train (df): training set for treated subjects\n", "        y_treat_train (np.array): labels for X_treat_train\n", "        X_treat_val (df): validation set for treated subjects\n", "        y_treat_val (np.array): labels for X_treat_val\n", "        X_control_train (df): training set for control subjects\n", "        y_control_train (np.array): labels for X_control_train\n", "        X_control_val (df): validation set for control subjects\n", "        y_control_val (np.array): labels for X_control_val\n", "    \"\"\"\n", "\n", "    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###\n", "\n", "    # From the training set, get features of patients who received treatment\n", "    X_treat_train = X_train[X_train.TRTMT==True]\n", "\n", "    # drop the 'TRTMT' column\n", "    X_treat_train = X_treat_train.drop(columns='TRTMT')\n", "\n", "    # From the training set, get the labels of patients who received treatment\n", "    y_treat_train = y_train[X_train.TRTMT==1]\n", "\n", "    # From the validation set, get the features of patients who received treatment\n", "    X_treat_val = X_val[X_val.TRTMT==True]\n", "\n", "    # Drop the 'TRTMT' column\n", "    X_treat_val = X_treat_val.drop(columns='TRTMT')\n", "\n", "    # From the validation set, get the labels of patients who received treatment\n", "    y_treat_val = y_val[X_val.TRTMT==1]\n", "\n", "# --------------------------------------------------------------------------------------------\n", "\n", "    # From the training set, get the features of patients who did not receive treatment\n", "    X_control_train = X_train[X_train.TRTMT==False]\n", "\n", "    # Drop the 'TRTMT' column\n", "    X_control_train = X_control_train.drop(columns='TRTMT')\n", "\n", "    # From the training set, get the labels of patients who did not receive treatment\n", "    y_control_train = y_train[X_train.TRTMT==False]\n", "\n", "    # From the validation set, get the features of patients who did not receive treatment\n", "    X_control_val = X_val[X_val.TRTMT==False]\n", "\n", "    # drop the 'TRTMT' column\n", "    X_control_val = X_control_val.drop(columns='TRTMT')\n", "\n", "    # From the validation set, get the labels of patients who did not receive treatment\n", "    y_control_val = y_val[X_val.TRTMT==False]\n", "\n", "    ### END CODE HERE ###\n", "\n", "    return (X_treat_train, y_treat_train,\n", "            X_treat_val, y_treat_val,\n", "            X_control_train, y_control_train,\n", "            X_control_val, y_control_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test Case**" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tests\n", "\n", "Didn't lose any subjects: True\n", "\n", "TRTMT not in any splits: True\n", "\n", "Treated splits have all treated patients: True\n", "\n", "All subjects in control split are untreated: True\n", "\n", "No overlap between treat_train and treat_val: True\n", "\n", "No overlap between control_train and control_val: True\n", "\n", "--> Expected: All statements should be True\n" ] } ], "source": [ "# Tests\n", "example_df = pd.DataFrame(columns = ['ID', 'TRTMT'])\n", "example_df.ID = range(100)\n", 
"example_df.TRTMT = np.random.binomial(n=1, p=0.5, size=100)\n", "treated_ids = set(example_df[example_df.TRTMT==1].ID)\n", "example_y = example_df.TRTMT.values\n", "\n", "example_train, example_val, example_y_train, example_y_val = train_test_split(\n", " example_df, example_y, test_size = 0.25, random_state=0\n", ")\n", "\n", "\n", "(x_treat_train, y_treat_train,\n", " x_treat_val, y_treat_val,\n", " x_control_train, y_control_train,\n", " x_control_val, y_control_val) = treatment_dataset_split(example_train, example_y_train,\n", " example_val, example_y_val)\n", "\n", "print(\"Tests\")\n", "pass_flag = True\n", "pass_flag = (len(x_treat_train) + len(x_treat_val) + len(x_control_train) +\n", " len(x_control_val) == 100)\n", "print(f\"\\nDidn't lose any subjects: {pass_flag}\")\n", "pass_flag = ((\"TRTMT\" not in x_treat_train) and (\"TRTMT\" not in x_treat_val) and\n", " (\"TRTMT\" not in x_control_train) and (\"TRTMT\" not in x_control_val))\n", "print(f\"\\nTRTMT not in any splits: {pass_flag}\")\n", "split_treated_ids = set(x_treat_train.ID).union(set(x_treat_val.ID))\n", "pass_flag = (len(split_treated_ids.union(treated_ids)) == len(treated_ids))\n", "print(f\"\\nTreated splits have all treated patients: {pass_flag}\")\n", "split_control_ids = set(x_control_train.ID).union(set(x_control_val.ID))\n", "pass_flag = (len(split_control_ids.intersection(treated_ids)) == 0)\n", "print(f\"\\nAll subjects in control split are untreated: {pass_flag}\") \n", "pass_flag = (len(set(x_treat_train.ID).intersection(x_treat_val.ID)) == 0)\n", "print(f\"\\nNo overlap between treat_train and treat_val: {pass_flag}\")\n", "pass_flag = (len(set(x_control_train.ID).intersection(x_control_val.ID)) == 0)\n", "print(f\"\\nNo overlap between control_train and control_val: {pass_flag}\")\n", "print(f\"\\n--> Expected: All statements should be True\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will now train a T-learner model on the patient data, and evaluate its 
performance using the c-for-benefit.\n", "\n", "First, get the training and validation sets." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# Import the random forest classifier to be used as the base learner\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Split the dev data into train and validation sets\n", "X_train, X_val, y_train, y_val = train_test_split(X_dev, \n", " y_dev, \n", " test_size = 0.25,\n", " random_state = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the training set into a treatment and control set. \n", "Similarly, split the validation set into a treatment and control set." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# get treatment and control arms of training and validation sets\n", "(X_treat_train, y_treat_train, \n", " X_treat_val, y_treat_val,\n", " X_control_train, y_control_train,\n", " X_control_val, y_control_val) = treatment_dataset_split(X_train, y_train,\n", " X_val, y_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose a set of hyperparameters to perform grid search and find the best model. \n", "- Please first use these given hyperparameters so that you can get the same c-for-benefit calculation at the end of this exercise. \n", "- Afterwards, we encourage you to come back and try other ranges for these hyperparameters. 
\n", "\n", "```CPP\n", "# Given hyperparams to do grid search\n", "hyperparams = {\n", " 'n_estimators': [100, 200],\n", " 'max_depth': [2, 5, 10, 40, None],\n", " 'min_samples_leaf': [1, 0.1, 0.2],\n", " 'random_state': [0]\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# hyperparameter grid (we'll use the same one for both arms for convenience)\n", "# Note that we set random_state to zero\n", "# in order to make the output consistent each time it's run.\n", "hyperparams = {\n", " 'n_estimators': [100, 200],\n", " 'max_depth': [2, 5, 10, 40, None],\n", " 'min_samples_leaf': [1, 0.1, 0.2],\n", " 'random_state': [0]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the treatment base learner. \n", "- Perform grid search to find a random forest classifier and associated hyperparameters with the best c-index (the regular c-index)." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# perform grid search with the treatment data to find the best model \n", "treatment_model, best_hyperparam_treat = holdout_grid_search(RandomForestClassifier,\n", " X_treat_train, y_treat_train,\n", " X_treat_val, y_treat_val, hyperparams)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train the control base learner." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# perform grid search with the control data to find the best model \n", "control_model, best_hyperparam_ctrl = holdout_grid_search(RandomForestClassifier,\n", " X_control_train, y_control_train,\n", " X_control_val, y_control_val, hyperparams)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the treatment and control base learners into the T-learner." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Save the treatment and control models into an instance of the TLearner class\n", "t_learner = TLearner(treatment_model, control_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the validation set, predict each patient's risk reduction." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_val num of patients 114\n", "rr_t_val num of patient predictions 114\n" ] } ], "source": [ "# Use the t-learner to predict the risk reduction for patients in the validation set\n", "rr_t_val = t_learner.predict(X_val.drop(['TRTMT'], axis=1))\n", "\n", "print(f\"X_val num of patients {X_val.shape[0]}\")\n", "print(f\"rr_t_val num of patient predictions {rr_t_val.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xYX1rN1tIv4w" }, "source": [ "Now plot a histogram of your predicted risk reduction on the validation set. " ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:32:34.703743Z", "start_time": "2020-04-04T15:32:34.529749Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 444 }, "colab_type": "code", "id": "XISgvb6IiXnl", "outputId": "6850488a-51aa-4bad-a151-1bcf9a7573bc" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(rr_t_val, bins='auto')\n", "plt.title(\"Histogram of Predicted ARR, T-Learner, validation set\")\n", "plt.xlabel('predicted risk reduction')\n", "plt.ylabel('count of patients')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "V89cP4pxQhNo" }, "source": [ "Notice when viewing the histogram that predicted risk reduction can be negative.\n", "- This means that for some patients, the T-learner predicts that treatment will actually increase their risk (negative risk reduction). \n", "- The T-learner is more flexible than the logistic regression model, which only predicts non-negative risk reduction for all patients (view the earlier 'predicted ARR' histogram for the logistic regression model, and you'll see that the possible values are all non-negative)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "noMOc9kOI5cw" }, "source": [ "Now plot the empirical risk reduction for the validation set examples. " ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:32:38.119651Z", "start_time": "2020-04-04T15:32:37.941488Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 458 }, "colab_type": "code", "id": "S-0nbpSkJFmZ", "outputId": "13afaa75-71e8-4f7f-fa25-78da6cefe18a" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "empirical_benefit, avg_benefit = quantile_benefit(X_val, y_val, rr_t_val)\n", "plot_empirical_risk_reduction(empirical_benefit, avg_benefit, 'T Learner [val set]')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "w8F2N-Zje8dB" }, "source": [ "Recall that the predicted risk reduction is along the horizontal axis and the empirical (actual) risk reduction is along the vertical axis.\n", "\n", "A good model would predict a lower risk reduction for patients with actual lower risk reduction. Similarly, a good model would predict a higher risk reduction for patients with actual higher risk reduction (imagine a diagonal line going from the bottom left to the top right of the plot).\n", "\n", "The T-learner seems to be doing a bit better (compared to the logistic regression model) at differentiating between the people who would benefit most from treatment and the people who would benefit least from treatment." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CzcjvmxKJWlN" }, "source": [ "Compute the C-statistic-for-benefit on the validation set." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:32:40.675054Z", "start_time": "2020-04-04T15:32:40.671084Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "blwOcph5JVnV", "outputId": "4f359278-db85-4296-a717-87d6175465cc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C-for-benefit statistic of T-learner on val set: 0.5043\n" ] } ], "source": [ "c_for_benefit_tlearner_val_set = c_statistic(rr_t_val, y_val, X_val.TRTMT)\n", "print(f\"C-for-benefit statistic of T-learner on val set: {c_for_benefit_tlearner_val_set:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "C-for-benefit statistic of T-learner on val set: 0.5043\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yWo27MRmJoa0" }, "source": [ "Now, for the test set, predict each patient's risk reduction." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# predict the risk reduction for each of the patients in the test set\n", "rr_t_test = t_learner.predict(X_test.drop(['TRTMT'], axis=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the histogram of risk reduction for the test set." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot a histogram of the predicted risk reduction\n", "plt.hist(rr_t_test, bins='auto')\n", "plt.title(\"Histogram of Predicted ARR for the T-learner on test set\")\n", "plt.xlabel(\"predicted risk reduction\")\n", "plt.ylabel(\"count of patients\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the predicted versus empirical risk reduction for the test set." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the predicted versus empirical risk reduction for the test set\n", "empirical_benefit, avg_benefit = quantile_benefit(X_test, y_test, rr_t_test)\n", "plot_empirical_risk_reduction(empirical_benefit, avg_benefit, 'T Learner (test set)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the T-learner's performance using the test set." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2020-04-04T15:32:45.849067Z", "start_time": "2020-04-04T15:32:45.502487Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 970 }, "colab_type": "code", "id": "tGFuQSpLJnym", "outputId": "6cc2307e-7abf-40be-df49-8be92147e4c1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C-for-benefit statistic on test set: 0.5250\n" ] } ], "source": [ "# calculate the c-for-benefit of the t-learner on the test set\n", "c_for_benefit_tlearner_test_set = c_statistic(rr_t_test, y_test, X_test.TRTMT)\n", "print(f\"C-for-benefit statistic on test set: {c_for_benefit_tlearner_test_set:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Expected output\n", "\n", "```CPP\n", "C-for-benefit statistic on test set: 0.5250\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ihGyqKsEfJa0" }, "source": [ "The c-for-benefit of the two models were evaluated on different test sets. However, we can compare their c-for-benefit scores to get a sense of how they perform:\n", "- logistic regression: 0.5412\n", "- T-learner: 0.5250\n", "\n", "The T-learner doesn't actually do better than the logistic regression in this case. You can try to tune the hyperparameters of the T-Learner to see if you can improve it.\n", "\n", "### Note\n", "While the more flexible ML techniques may improve predictive power, the sample size is too small to be certain. 
\n", "- Models like the T-learner could still be helpful in identifying subgroups who will likely not be helped by treatment, or could even be harmed by treatment. \n", "- So doctors can study these patients in more detail to find out how to improve their outcomes. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "PHjwt4UYoqy7" }, "source": [ "## Congratulations\n", "\n", "You've finished the assignment for Course 3 Module 1! We've seen that machine learning techniques can help determine when a treatment will have greater treatment effect for a particular patient." ] } ], "metadata": { "colab": { "collapsed_sections": [ "sn8ODLuvXAyn" ], "include_colab_link": true, "name": "C3M1_Assignment.ipynb", "provenance": [], "toc_visible": true }, "coursera": { "schema_names": [ "AI4MC3-1" ] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 4 }