{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AI4M Course 2 Week 4 lecture notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "[One-hot encode categorical variables](#one-hot-encoding)\n", "\n", "[Hazard function](#hazard-function)\n", "\n", "[Permissible pairs with censoring and time](#permissible-pairs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## One-hot encode categorical variables" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which features are categorical?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " ascites edema stage cholesterol\n", "0 0 0.5 3 200.5\n", "1 1 0.0 4 180.2\n", "2 0 1.0 3 190.5\n", "3 1 0.5 4 210.3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'ascites': [0,1,0,1],\n", " 'edema': [0.5,0,1,0.5],\n", " 'stage': [3,4,3,4],\n", " 'cholesterol': [200.5,180.2,190.5,210.3]\n", " })\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this small sample dataset, 'ascites', 'edema', and 'stage' are categorical variables\n", "- ascites: value is either 0 or 1\n", "- edema: value is either 0, 0.5 or 1\n", "- stage: is either 3 or 4\n", "\n", "'cholesterol' is a continuous variable, since it can be any decimal value greater than zero." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which categorical variables to one-hot encode?\n", "\n", "Of the categorical variables, which one should be one-hot encoded (turned into dummy variables)?\n", "\n", "- ascites: is already 0 or 1, so there is not a need to one-hot encode it.\n", " - We could one-hot encode ascites, but it is not necessary when there are just two possible values that are 0 or 1.\n", " - When values are 0 or 1, 1 means a disease is present, and 0 means normal (no disease).\n", "- edema: Edema is swelling in any part of the body. This data set's 'edema' feature has 3 categories, so we will want to one-hot encode it so that there is one feature column for each of the three possible values.\n", " - 0: No edema\n", " - 0.5: Patient has edema, but did not receive diuretic therapy (which is used to treat edema)\n", " - 1: Patient has edeam, despite also receiving diuretic therapy (so the condition may be more severe).\n", "- stage: has values of 3 and 4. We will want to one-hot encode these because they are not values of 0 or 1.\n", " - the \"stage\" of cancer is either 0, 1,2,3 or 4. \n", " - Stage 0 means there is no cancer. 
\n", " - Stage 1 is cancer that is limited to a small area of the body, also known as \"early stage cancer\"\n", " - Stage 2 is cancer that has spread to nearby tissues\n", " - stage 3 is cancer that has spread to nearby tissues, but more so than stage 2\n", " - stage 4 is cancer that has spread to distant parts of the body, also known as \"metastatic cancer\".\n", " - We could convert stage 3 to 0 and stage 4 to 1 for the sake of training a model. This would may be confusing for anyone reviewing our code and data. We will one-hot encode the 'stage'.\n", " -You'll actually see that we end up with 0 representing stage 3 and 1 representing stage 4 (see the next section)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multi-collinearity of one-hot encoded features\n", "\n", "Let's see what happens when we one-hot encode the 'stage' feature.\n", "\n", "We'll use [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " stage_3 stage_4\n", "0 1 0\n", "1 0 1\n", "2 1 0\n", "3 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stage = pd.get_dummies(data=df,\n", " columns=['stage']\n", " )\n", "df_stage[['stage_3','stage_4']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the 'stage_3' and 'stage_4' features?\n", "\n", "Given that stage 3 and stage 4 are the only possible values for stage, \n", "If you know that patient 0 (row 0) has stage_3 set to 1, \n", "what can you say about that same patient's value for the stage_4 feature?\n", "- When stage_3 is 1, then stage_4 must be 0\n", "- When stage_3 is 0, then stage_4 must be 1\n", "\n", "This means that one of the feature columns is actually redundant. We should drop one of these features to avoid multicollinearity (where one feature can predict another feature)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " ascites edema cholesterol stage_3 stage_4\n", "0 0 0.5 200.5 1 0\n", "1 1 0.0 180.2 0 1\n", "2 0 1.0 190.5 1 0\n", "3 1 0.5 210.3 0 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stage" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " ascites edema cholesterol stage_4\n", "0 0 0.5 200.5 0\n", "1 1 0.0 180.2 1\n", "2 0 1.0 190.5 0\n", "3 1 0.5 210.3 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stage_drop_first = df_stage.drop(columns='stage_3')\n", "df_stage_drop_first" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, there's actually a parameter of pandas.get_dummies() that lets you drop the first one-hot encoded column. You'll practice doing this in this week's assignment!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make the numbers decimals\n", "\n", "We can cast the one-hot encoded values as floats by setting the data type to numpy.float64.\n", "- This is helpful if we are feeding data into a model, where the model expects a certain data type (such as a 64-bit float, 32-bit float etc.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " stage_4\n", "0 0\n", "1 1\n", "2 0\n", "3 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stage = pd.get_dummies(data=df,\n", " columns=['stage'],\n", " )\n", "df_stage[['stage_4']]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " stage_4\n", "0 0.0\n", "1 1.0\n", "2 0.0\n", "3 1.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stage_float64 = pd.get_dummies(data=df,\n", " columns=['stage'],\n", " dtype=np.float64\n", " )\n", "df_stage_float64[['stage_4']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is the end of this practice section.\n", "\n", "Please continue on with the lecture videos!\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Hazard function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we fit the hazard function\n", "$$\n", "\\lambda(t, x) = \\lambda_0(t)e^{\\theta^T X_i}\n", "$$\n", "\n", "So that we have the coefficients $\\theta$ for the features in $X_i$\n", "\n", "If you have a new patient, let's predict their hazard $\\lambda(t,x)$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.5, 2. ])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lambda_0 = 1\n", "coef = np.array([0.5,2.])\n", "coef" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " age cholesterol\n", "0 20 180\n", "1 30 220\n", "2 40 170" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = pd.DataFrame({'age': [20,30,40],\n", " 'cholesterol': [180,220,170]\n", " })\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- First, let's multiply the coefficients to the features.\n", "- Check the shapes of the coefficients and the features to decide which one to transpose" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2,)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coef.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 2)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like the coefficient is a 1D array, so transposing it won't do anything. \n", "- We can transpose the X so that we're multiplying a (2,) array by a (2,3) dataframe.\n", "\n", "So the formula looks more like this (transpose $X_i$ instead of $\\theta$\n", "$$\n", "\\lambda(t, x) = \\lambda_0(t)e^{\\theta X_i^T}\n", "$$\n", "\n", "- Let's multiply $\\theta X_i^T$" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([370., 455., 360.])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.dot(coef,X.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the hazard for the three patients (there are 3 rows in X)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " age cholesterol hazards\n", "0 20 180 4.886054e+160\n", "1 30 220 4.017809e+197\n", "2 40 170 2.218265e+156" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lambdas = lambda_0 * np.exp(np.dot(coef,X.T))\n", "patients_df = X.copy()\n", "patients_df['hazards'] = lambdas\n", "patients_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is the end of this practice section.\n", "\n", "Please continue on with the lecture videos!\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Permissible pairs with censoring and time" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "0 2 1 20\n", "1 4 1 40\n", "2 2 1 40\n", "3 4 1 20\n", "4 2 0 20\n", "5 4 1 40\n", "6 2 1 40\n", "7 4 0 20" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'time': [2,4,2,4,2,4,2,4],\n", " 'event': [1,1,1,1,0,1,1,0],\n", " 'risk_score': [20,40,40,20,20,40,40,20] \n", " })\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We made this data sample so that you can compare pairs of patients visually." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When at least one patient is not censored\n", "- A pair may be permissible if at least one patient is not censored.\n", "- If both pairs of patients are censored, then they are definitely not a permissible pair." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "0 2 1 20\n", "1 4 1 40" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df.iloc[0:1],df.iloc[1:2]],axis=0)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "May be a permissible pair: 0 and 1\n" ] } ], "source": [ "if df['event'][0] == 1 or df['event'][1] == 1:\n", " print(f\"May be a permissible pair: 0 and 1\")\n", "else:\n", " print(f\"Definitely not permissible pair: 0 and 1\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "4 2 0 20\n", "7 4 0 20" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df.iloc[4:5],df.iloc[7:8]],axis=0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Definitely not permissible pair: 4 and 7\n" ] } ], "source": [ "if df['event'][4] == 1 or df['event'][7] == 1:\n", " print(f\"May be a permissible pair: 4 and 7\")\n", "else:\n", " print(f\"Definitely not permissible pair: 4 and 7\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If neither patient was censored:\n", "- If both patients had an event (neither one was censored). This is definitely a permissible pair." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "0 2 1 20\n", "1 4 1 40" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df.iloc[0:1],df.iloc[1:2]],axis=0)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Definitely a permissible pair: 0 and 1\n" ] } ], "source": [ "if df['event'][0] == 1 and df['event'][1] == 1:\n", " print(f\"Definitely a permissible pair: 0 and 1\")\n", "else:\n", " print(f\"May be a permissible pair: 0 and 1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When one patient is censored:\n", "- If we know that one patient was censored and one had an event, then we can check if censored patient's time is at least as great as the uncensored patient's time. If so, it's a permissible pair as well" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "6 2 1 40\n", "7 4 0 20" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df.iloc[6:7],df.iloc[7:8]],axis=0)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Permissible pair: Censored patient 7 lasted at least as long as uncensored patient 6\n" ] } ], "source": [ "if df['time'][7] >= df['time'][6]:\n", " print(f\"Permissible pair: Censored patient 7 lasted at least as long as uncensored patient 6\")\n", "else:\n", " print(\"Not a permisible pair\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " time event risk_score\n", "4 2 0 20\n", "5 4 1 40" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df.iloc[4:5],df.iloc[5:6]],axis=0)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not a permisible pair: censored patient 4 was censored before patient 5 had their event\n" ] } ], "source": [ "if df['time'][4] >= df['time'][5]:\n", " print(f\"Permissible pair\")\n", "else:\n", " print(\"Not a permisible pair: censored patient 4 was censored before patient 5 had their event\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is the end of this practice section.\n", "\n", "Please continue on with the lecture videos!\n", "\n", "---" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }