{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Diabetes readmission\n", "by Samokhvalov Mikhail, Moscow 2018" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Research plan\n", " - [Part 1. Dataset and features description](#part1)\n", " - [Part 2. Exploratory data analysis](#part2)\n", " - [Part 3. Visual analysis of the features](#part3)\n", "     - 3.1. Univariate analysis\n", "     - 3.2. Bi-variate Analysis\n", "         - 3.2.1. Continuous & Continuous\n", "         - 3.2.2. Categorical & Categorical\n", "         - 3.2.3. Numeric & Categorical\n", " - [Part 4. Patterns, insights, peculiarities of data](#part4)\n", " - [Part 5. Data preprocessing](#part5)\n", " - [Part 6. Feature engineering and description](#part6)\n", " - [Part 7. Cross-validation, hyperparameter tuning](#part7)\n", " - [Part 8. Validation and learning curves](#part8)\n", " - [Part 9. Prediction for test samples](#part9)\n", " - [Part 10. Model evaluation with metrics description](#part10)\n", " - [Part 11. Conclusions](#part11)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1. Dataset and features description " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Dataset description from Kaggle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://www.kaggle.com/brandao/diabetes/home\n", "#### Basic Explanation\n", "It is important to know if a patient will be readmitted to a hospital. The reason is that you can change the treatment in order to avoid a readmission.\n", "\n", "In this database, you have 3 different outputs:\n", "\n", " * No readmission;\n", " * A readmission in less than 30 days (this situation is not good, because maybe your treatment was not appropriate);\n", " * A readmission in more than 30 days (this one is not good either; however, the reason can be the state of the patient).\n", " \n", "In this context, you can see different objective functions for the problem. 
You can try to figure out situations where the patient will not be readmitted, or whether they are going to be readmitted in less than 30 days (because the problem can be the treatment), etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Content\n", "\n", "\"The data set represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.\n", "\n", "It is an inpatient encounter (a hospital admission).\n", "It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.\n", "The length of stay was at least 1 day and at most 14 days.\n", "Laboratory tests were performed during the encounter.\n", "Medications were administered during the encounter.\n", "The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.\"\n", "\n", " * https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Source\n", "\n", "The data are submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. John Clore (jclore '@' vcu.edu), Krzysztof J. Cios (kcios '@' vcu.edu), Jon DeShazo (jpdeshazo '@' vcu.edu), and Beata Strack (strackb '@' vcu.edu). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. 
Feature description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's take the feature descriptions from the article and convert them to **markdown** for better readability. Let's also map them to the dataframe column names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Feature name | Name in dataframe | Type | Description and values | % missing |\n", "|---|---|---|---|---|\n", "| Encounter ID | encounter_id | Numeric | Unique identifier of an encounter | 0 |\n", "| Patient number | patient_nbr | Numeric | Unique identifier of a patient | 0 |\n", "| Race | race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2 |\n", "| Gender | gender | Nominal | Values: male, female, and unknown/invalid | 0 |\n", "| Age | age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), . . ., [90, 100) | 0 |\n", "| Weight | weight | Numeric | Weight in pounds. 
| 97 |\n", "| Admission type | admission_type_id | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0 |\n", "| Discharge disposition | discharge_disposition_id | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0 |\n", "| Admission source | admission_source_id | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0 |\n", "| Time in hospital | time_in_hospital | Numeric | Integer number of days between admission and discharge | 0 |\n", "| Payer code | payer_code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross\\Blue Shield, Medicare, and self-pay | 52 |\n", "| Medical specialty | medical_specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\\general practice, and surgeon | 53 |\n", "| Number of lab procedures | num_lab_procedures | Numeric | Number of lab tests performed during the encounter | 0 |\n", "| Number of procedures | num_procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0 |\n", "| Number of medications | num_medications | Numeric | Number of distinct generic names administered during the encounter | 0 |\n", "| Number of outpatient visits | number_outpatient | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0 |\n", "| Number of emergency visits | number_emergency | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0 |\n", "| Number of inpatient visits | number_inpatient | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0 |\n", "| Diagnosis 1 | diag_1 | Nominal | The 
primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0 |\n", "| Diagnosis 2 | diag_2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0 |\n", "| Diagnosis 3 | diag_3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1 |\n", "| Number of diagnoses | number_diagnoses | Numeric | Number of diagnoses entered to the system | 0 |\n", "| Glucose serum test result | max_glu_serum | Nominal | Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured | 0 |\n", "| A1c test result | A1Cresult | Nominal | Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. | 0 |\n", "| Change of medications | change | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” | 0 |\n", "| Diabetes medications | diabetesMed | Nominal | Indicates if there was any diabetic medication prescribed. 
Values: “yes” and “no” | 0 |\n", "| 24 features for medications | metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed | 0 |\n", "| Readmitted | readmitted | Nominal | Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission. | 0 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Output variable\n", "**The last feature, `readmitted`, is the target.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Part 2. Exploratory data analysis** \n", "### 2.1. 
Loading data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loading all necessary libraries:\n", "import zipfile\n", "import missingno as msno\n", "from tqdm import tqdm_notebook\n", "import itertools\n", "\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from scipy.stats import chi2_contingency\n", "\n", "from sklearn.impute import SimpleImputer #sklearn 0.20.1 is necessary\n", "from sklearn.model_selection import train_test_split, KFold\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can read files without unzipping!\n", "with zipfile.ZipFile(\"diabetes.zip\") as z:\n", "    with z.open(\"diabetic_data.csv\") as f:\n", "        data_df = pd.read_csv(f, encoding='utf-8')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_df.dtypes.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's take a look at the data:\n", "display(data_df.describe())\n", "data_size = len(data_df)\n", "print(f'Whole dataset size: {data_size}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Train test split\n", "Since we have the entire dataset here, we need to split it into two parts, train and test, and never peek at the test target. We will use the test target only to check our final solution.\n", "\n", "The data may have been collected in chronological order. 
Therefore, to make the experiment more realistic, we split the sample in half by row order, without shuffling.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_len = len(data_df)\n", "print('Total length: ', total_len)\n", "split_coef = 0.5\n", "split_number = int(total_len*split_coef)\n", "print('Split number: ', split_number)\n", "\n", "X_train = data_df.iloc[0:split_number]\n", "X_test = data_df.iloc[split_number:]\n", "\n", "y_train = X_train['readmitted']\n", "y_test = X_test['readmitted']\n", "\n", "X_train = X_train.drop(columns='readmitted')\n", "X_test = X_test.drop(columns='readmitted')\n", "\n", "print(X_train.shape, y_train.shape)\n", "print(X_test.shape, y_test.shape)\n", "\n", "# For the baseline, let's also encode the target classes as integers:\n", "y_target = y_train.map({'<30':0, '>30':1, 'NO':2})\n", "y_test = y_test.map({'<30':0, '>30':1, 'NO':2})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Filling missing values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's check for missing values:\n", "for col in data_df:\n", "    uniq_values = data_df[col].unique()\n", "    if '?' in uniq_values:\n", "        num_of_nan = len(data_df[data_df[col]=='?'])\n", "        print(f'Feature {col}, missed: {num_of_nan} or {num_of_nan/data_size*100:.2f} %') \n", "        # printing uniq_values here shows all of them; missing values always appear as '?'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we found that missing values in the dataset are marked as **'?'**. They appear not only in the features listed in the article, but also in `diag_1` and `diag_2`!\n", "\n", "There are several methods to fill in missing values:\n", "1. drop NaNs\n", "2. fill with constant (0, -1, ...)\n", "3. fill with mean/median/mode\n", "4. groupby and fill with mean/median of the group\n", "5. build a model to predict missing values\n", "6. 
some methods can handle missing values natively!\n", "\n", "A good example of using different methods:\n", "https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce\n", "\n", "**Important: we can't just drop rows with missing data. The model should be able to work with missing values, because we can't ignore a new patient just because he/she didn't indicate weight or race in the questionnaire.**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# an interesting way to visualize missing values:\n", "\n", "columns_nans = ['race', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3']\n", "\n", "imp = SimpleImputer(missing_values='?', strategy='constant', fill_value=np.nan)\n", "\n", "data_df_nans = pd.DataFrame(imp.fit_transform(data_df[columns_nans]), columns=columns_nans)\n", "msno.matrix(data_df_nans);\n", "msno.heatmap(data_df_nans);\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no correlation between missing values (they don't appear simultaneously).\n", "Three features have too many missing values: `weight`, `payer_code`, `medical_specialty` (from 40% to 97%).\n", "So it can be unsafe to fill them with any values.\n", "**Let's ignore them for the baseline** and try different filling methods at the tuning stage.\n", "\n", "Let's try different methods: start from the simplest one for the baseline model, then come back here and try other methods for more complex models. 
We will always write changed data to new columns and drop the excess columns before fitting each model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ " - Baseline model: fill with most frequent value" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "columns_nans = ['race', 'diag_1', 'diag_2', 'diag_3']\n", "imp_most_frequent = SimpleImputer(missing_values='?', strategy='most_frequent')\n", "X_train_nan_most_frequent = pd.DataFrame(imp_most_frequent.fit_transform(X_train[columns_nans]),\n", "                                         columns=[el+'_mf' for el in columns_nans] )\n", "X_test_nan_most_frequent = pd.DataFrame(imp_most_frequent.transform(X_test[columns_nans]),\n", "                                        columns=[el+'_mf' for el in columns_nans] )\n", "\n", "X_train = pd.concat([X_train, X_train_nan_most_frequent], axis=1)\n", "X_test = pd.concat([X_test.reset_index(drop=True), X_test_nan_most_frequent], axis=1).set_index(X_test.index)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3. Visual analysis of the features " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1. Univariate analysis\n", "Let's do some data analysis. First we check the numeric data, then the categorical data, and finish with a categorical vs. numeric comparison. 
\n", "A very good overview of general methods for data analysis: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_numeric = X_train.select_dtypes(include='int64').columns\n", "features_categorical = X_train.select_dtypes(include='object').columns\n", "\n", "print(features_numeric)\n", "print(len(features_numeric))\n", "print(features_categorical)\n", "print(len(features_categorical))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's take a look at the numeric features first ...**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in features_numeric[2:]:\n", "    print(col)\n", "    print(X_train[col].value_counts())\n", "    print(X_test[col].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sns.set(style=\"whitegrid\")\n", "sns.set(rc={'figure.figsize':(10,10)})\n", "#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[1,2,3,4,6,10]]);\n", "#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[1,2,3,4,6,10]], color=\".25\")\n", "sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,0:5]);\n", "#sns.boxplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,[0,5,7,8,9]]);\n", "#sns.swarmplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[::100,[0,5,7,8,9]], color=\".25\")\n", "plt.figure()\n", "sns.violinplot(data=X_train[features_numeric].drop(columns=['encounter_id', 'patient_nbr']).iloc[:,5:10]);" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set(rc={'figure.figsize':(15,5)})\n", "for axis in range(0,len(X_train[features_numeric[2:]].columns),3):\n", "    cols = X_train[features_numeric[2:]].columns[axis:axis+3]\n", "    f, axes = plt.subplots(1, 3, sharex=True)\n", "    palette = \"crimson\"\n", "    sns.distplot( X_train[cols[0]].values , color=palette, ax=axes[0]);\n", "    try:\n", "        sns.distplot( X_train[cols[1]].values , color=palette, ax=axes[1], label=cols[1]);\n", "    except:\n", "        pass\n", "    try:\n", "        sns.distplot( X_train[cols[2]].values , color=palette, ax=axes[2], label=cols[2]);\n", "    except:\n", "        pass\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**... and categorical.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_ignored = ['weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'race']\n", "X_train_categorical = X_train[features_categorical].drop(columns=features_ignored)\n", "features_categorical = [el for el in features_categorical if el not in features_ignored]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for axis in range(0,len(X_train_categorical.columns[:-3]),3):\n", "    cols = X_train_categorical.columns[axis:axis+3]\n", "    f, axes = plt.subplots(1, 3)\n", "    palette = \"crimson\"\n", "    sns.countplot(X_train_categorical[cols[0]] , color=palette, ax=axes[0]);\n", "    try:\n", "        sns.countplot(X_train_categorical[cols[1]] , color=palette, ax=axes[1]);\n", "    except:\n", "        pass\n", "    try:\n", "        sns.countplot(X_train_categorical[cols[2]] , color=palette, ax=axes[2]);\n", "    except:\n", "        pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"X_train_categorical[['diag_1_mf', 'diag_2_mf', 'diag_3_mf']] \\\n", "    .apply(pd.Series.value_counts) \\\n", "    .sort_values('diag_3_mf', ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in X_train_categorical.columns[:-3]:\n", "    print(col, X_train_categorical[col].unique())\n", "    print('---'*10)\n", "    print(X_train_categorical[col].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "constant_features = ['examide', 'citoglipton', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone',\n", "                     'acetohexamide', \n", "                     'tolbutamide', 'miglitol', 'troglitazone', 'tolazamide', 'glipizide-metformin']\n", "for col in constant_features:\n", "    print(col)\n", "    print(X_train[col].value_counts())\n", "    print(X_test[col].value_counts())\n", "    print('---'*10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First of all, we can drop these columns: they are almost constant (`No`), with only 1-2 values equal to `Steady`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train.drop(columns=constant_features, inplace=True)\n", "X_train_categorical.drop(columns=constant_features, inplace=True)\n", "X_test.drop(columns=constant_features, inplace=True)\n", "\n", "features_categorical = [el for el in features_categorical if el not in constant_features]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2. Bi-variate Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1. 
Continuous & Continuous" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# this can take about a minute\n", "sns.pairplot( X_train[features_numeric].assign(target=y_target.values) );" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(X_train[features_numeric].assign(target=y_target.values).corr(), annot=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.2. Categorical & Categorical" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = X_train_categorical[features_categorical].assign(target=y_train.values)\n", "df.head()\n", "#df.set_index('target').T.plot(kind='bar', stacked=True)\n", "#df.plot(x='target',kind='bar', stacked=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "chi_table_prob = pd.DataFrame(np.zeros((len(df.columns), len(df.columns))),index=df.columns, columns=df.columns)\n", "chi_table_val =pd.DataFrame(np.zeros((len(df.columns), len(df.columns))),index=df.columns, columns=df.columns)\n", "for col_ind, col in tqdm_notebook(enumerate(df.columns)):\n", "    for row_ind, row in enumerate(df.columns):\n", "        # the test expects observed counts only, so the crosstab is built without margin totals\n", "        chi = chi2_contingency(pd.crosstab(df[row], df[col]))\n", "        chi_table_prob.iloc[row_ind, col_ind] = chi[1]\n", "        chi_table_val.iloc[row_ind, col_ind] = chi[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set(rc={'figure.figsize':(15,15)})\n", "sns.heatmap(chi_table_prob.round(2), annot=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This heatmap of chi-squared p-values shows dependencies between categorical features. There are also more informative measures of association, such as Cramér's V. 
This can be explored in further work.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.3. Numeric & Categorical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are tests for exploring dependencies between numeric and categorical data, such as the Z-test or ANOVA. But all of them have important assumptions that must be satisfied. One of them is that each data sample comes from a normally distributed population. Since much of our data is not normally distributed, we can't use these tests directly. \n", "Maybe we could apply them after transforming the data toward normality, but that is a plan for further research." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4. Patterns, insights, peculiarities of data \n", "\n", "Let's sum up the conclusions about the data, based on the previous parts.\n", "\n", "The fields `encounter_id` and `patient_nbr` must be dropped: they are just identifiers, and algorithms can overfit on them.\n", "There are a lot of missing values in the fields `weight`, `payer_code`, `medical_specialty`. So we need to deal with them: use boosting (which can work with NaNs and '?'), drop them, or fill them.\n", "\n", "Many features are not normally distributed, so 'classic' methods will lose accuracy here. A good idea for further research is to transform them toward normality (maybe using log).\n", "\n", "There is no significant correlation between the target and any of the numeric features, but there is for some categorical ones. 
We need to choose categorical features carefully!\n", "\n", "There are high correlations between some numeric/categorical features, so we can drop a lot of them without losing accuracy.\n", "\n", "We need to pay special attention to the `diag_` categories: they have a lot of values, so one-hot encoding greatly increases the number of features.\n", "\n", "**This is a multiclass problem, so we need to choose suitable methods.\n", "Good candidates are KNN and trees.**\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5. Data preprocessing " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1. Prepare data for KNN method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Drop features we will not use for the baseline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_knn = X_train.drop(['encounter_id', 'patient_nbr', 'race', 'weight', 'payer_code', 'medical_specialty', \n", "                            'diag_1', 'diag_2', 'diag_3'], axis=1)\n", "X_test_knn = X_test.drop(['encounter_id', 'patient_nbr', 'race', 'weight', 'payer_code', 'medical_specialty', \n", "                          'diag_1', 'diag_2', 'diag_3'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert string types to numeric: age, gender" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_knn['age_num'] = X_train_knn['age'].apply(lambda x: int(x[1]) )\n", "X_train_knn.drop('age', axis=1, inplace=True)\n", "\n", "X_test_knn['age_num'] = X_test_knn['age'].apply(lambda x: int(x[1]) )\n", "X_test_knn.drop('age', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# gender outlier:\n", "# we have 1 outlier here 
gender==Unknown/Invalid. A single value is not important, \n", "# but to avoid creating an additional dimension let's change it to Male, as this patient's features are a bit closer to the Male mean than the Female mean\n", "X_train_knn.loc[X_train_knn['gender'] == 'Unknown/Invalid', 'gender'] = 'Male'\n", "X_train_knn['gender_num'] = X_train_knn['gender'].apply(lambda x: 0 if x=='Male' else 1)\n", "X_train_knn.drop('gender', axis=1, inplace=True)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_test_knn.loc[X_test_knn.gender == 'Unknown/Invalid', 'gender'] = 'Male'\n", "X_test_knn['gender_num'] = X_test_knn['gender'].apply(lambda x: 0 if x=='Male' else 1)\n", "X_test_knn.drop('gender', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_knn_tmp = X_train_knn.iloc[:,0:11].assign(age_num=X_train_knn['age_num'], \n", "                                                  gender_num=X_train_knn['gender_num'])\n", "\n", "X_test_knn = X_test_knn.iloc[:,0:11].assign(age_num=X_test_knn['age_num'], \n", "                                            gender_num=X_test_knn['gender_num'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select medications that affect accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# list of features for adding them to X_train\n", "medicals = X_train_knn.columns[11:25]\n", "print(medicals)\n", "medicals_list = []\n", "for ind, med in enumerate(medicals):\n", "    medicals_list.append( pd.get_dummies(X_train_knn[med], prefix=med) )\n", "    \n", "medicals_list_test = []\n", "for ind, med in enumerate(medicals):\n", "    medicals_list_test.append( pd.get_dummies(X_test[med], prefix=med) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's start with the numerical features plus age and gender only.\n", "The resulting accuracy for KNN is 0.48.\n", "\n", "Tuning the number of neighbors (optimum at 70) gives us the baseline: **accuracy 0.57**\n", "\n", 
"Let's also test what happens if we delete some of the numerical features. Removing them one by one shows that every feature improves accuracy **except gender!** So maybe we will drop this feature.\n", "We also checked whether binary features need scaling: it doesn't affect the results, as expected.\n", "After that, let's add features one by one, checking whether accuracy increases or decreases.\n", "\n", "Using KNN on all features gives a worse result, so let's repeat the procedure, adding categorical features one by one.\n", "\n", "**Not all results are shown here, because the code was rewritten for every experiment and to save space.**\n", "\n", "Only the final results are shown." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# this cell takes 35 mins to find the optimal set of medication features\n", "\n", "# medicals_optimal = [0,3,5,6,11,13]\n", "# subsets = []\n", "# for el in range(2, len(medicals_optimal)+1):\n", "#     for subset in itertools.combinations(medicals_optimal, el):\n", "#         subsets.append(subset)\n", "\n", "# for subset in subsets:\n", "#     print(subset)\n", "#     X_train_knn_tmp_cat = X_train_knn_tmp\n", "#     for el in subset:\n", "#         X_train_knn_tmp_cat = pd.concat( [X_train_knn_tmp_cat, medicals_list[el]], axis=1)\n", "#     print(X_train_knn_tmp_cat.columns)\n", "\n", "#     scaler = StandardScaler()\n", "#     X_train_knn_scaled = scaler.fit_transform(X_train_knn_tmp_cat.drop('gender_num', axis=1))\n", "\n", "#     print(X_train_knn_scaled.shape)\n", "\n", "#     X_tr, X_holdout, y_tr, y_holdout = train_test_split(X_train_knn_scaled, y_target, test_size=0.3,\n", "#                                                        random_state=17)\n", "\n", "#     n_neib = 70\n", "#     neigh = KNeighborsClassifier(n_neighbors=n_neib)\n", "#     neigh.fit(X_tr, y_tr) \n", "#     knn_pred = neigh.predict(X_holdout)\n", "#     res = accuracy_score(y_holdout, knn_pred)\n", "#     print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The optimal set of medications:\n", "medicals_optimal = [3,11,13]\n", 
"for el in medicals_optimal:\n", "    X_train_knn_tmp = pd.concat( [X_train_knn_tmp, medicals_list[el]], axis=1)\n", "X_train_knn_tmp.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Check other features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using all of the diagnoses as dummy variables is a bad idea. It takes a lot of computation time and brings no benefit: accuracy is lower than without diagnoses.\n", "So the idea is to use only the top N most frequent diagnoses and set an \"Other\" category for all the others. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now we need to check all the other categorical features:\n", "non_medicals = ['race_mf', 'diag_1_mf', 'diag_2_mf', 'diag_3_mf', 'change', 'diabetesMed']\n", "non_medicals_list = []\n", "for ind, el in enumerate(non_medicals):\n", "    non_medicals_list.append( pd.get_dummies(X_train_knn[el], prefix=el) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_diag1 = X_train_knn['diag_1_mf'].value_counts()[0:20]\n", "top_diag1.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %%time\n", "\n", "# This cell takes a lot of time again - it's just an example of how to test features with a lot of categories\n", "\n", "# for num_feat in [125, 225]:\n", "#     top_diag1 = X_train_knn['diag_1_mf'].value_counts()[0:num_feat]\n", "#     dum = X_train_knn['diag_1_mf'].apply(lambda x: x if x in top_diag1.index else 'other')\n", "#     dum = pd.get_dummies(dum, prefix='diag_1') \n", "#     X_train_knn_tmp_cat = pd.concat( [X_train_knn_tmp, dum], axis=1)\n", "#     print(X_train_knn_tmp_cat.columns)\n", "\n", "#     scaler = StandardScaler()\n", "#     X_train_knn_scaled = scaler.fit_transform(X_train_knn_tmp_cat.drop('gender_num', axis=1))\n", "\n", "#     print(X_train_knn_scaled.shape)\n", "\n", "#     X_tr, X_holdout, y_tr, y_holdout = 
train_test_split(X_train_knn_scaled, y_target, test_size=0.3,\n", "# random_state=17)\n", "\n", "# n_neib = 70\n", "# neigh = KNeighborsClassifier(n_neighbors=n_neib)\n", "# neigh.fit(X_tr, y_tr) \n", "# knn_pred = neigh.predict(X_holdout)\n", "# res = accuracy_score(y_holdout, knn_pred)\n", "# print(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like KNN is not a good fit for this task. We have many categorical features, some of them with a very large number of categories, which produces high-dimensional data after dummy encoding. Although we believe some of these features can still be helpful, KNN loses accuracy when they are added.\n", "\n", "The best result achieved with KNN: **accuracy = 0.575**\n", "\n", "Let's try another algorithm that handles categorical features and high-dimensional data well.\n", "I propose to use boosting, starting with **CatBoost**.\n", "\n", "Optimal solution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "medicals_optimal = [3,11,13]\n", "for el in medicals_optimal:\n", "    X_train_knn_tmp = pd.concat( [X_train_knn_tmp, medicals_list[el]], axis=1)\n", " \n", "print(X_train_knn_tmp.columns)\n", "print(X_train_knn_tmp.shape)\n", "\n", "scaler = StandardScaler()\n", "X_train_knn_scaled = scaler.fit_transform(X_train_knn_tmp.drop('gender_num', axis=1))\n", "\n", "X_tr_knn, X_holdout_knn, y_tr, y_holdout = train_test_split(X_train_knn_scaled, y_target, test_size=0.3,\n", "                                                            random_state=17)\n", "\n", "n_neib = 70\n", "neigh = KNeighborsClassifier(n_neighbors=n_neib)\n", "neigh.fit(X_tr_knn, y_tr) \n", "knn_pred = neigh.predict(X_holdout_knn)\n", "res_knn = accuracy_score(y_holdout, knn_pred)\n", "print(res_knn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prepare test data for final check\n", "medicals_optimal = [3,11,13]\n", "for el in 
medicals_optimal:\n", "    X_test_knn = pd.concat( [X_test_knn, medicals_list_test[el]], axis=1)\n", "X_test_knn = scaler.transform(X_test_knn.drop('gender_num', axis=1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Use for CV\n", "# res_knn_cv = []\n", "\n", "# kf = KFold(n_splits = 5, random_state = 17, shuffle = True)\n", "# for i, (train_index, test_index) in enumerate(kf.split(X_train_knn_scaled)):\n", " \n", "# # Create data for this fold\n", "# y_train_knn_cv, y_valid_knn_cv = y_target.iloc[train_index], y_target.iloc[test_index]\n", "# X_train_knn_cv, X_valid_knn_cv = X_train_knn_scaled[train_index,:], X_train_knn_scaled[test_index,:]\n", "# print( \"\\nFold \", i)\n", " \n", "# # Run model for this fold\n", "# neigh.fit(X_train_knn_cv, y_train_knn_cv) \n", "# pred = neigh.predict(X_valid_knn_cv)\n", "# res_knn_cv.append(accuracy_score(pred,y_valid_knn_cv)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# np.mean(res_knn_cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2. Prepare data for CatBoost method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from catboost import CatBoostClassifier, Pool, cv\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature importance shows which features can be dropped without loss of accuracy." 
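, "\n", "\n", "As a minimal sketch of how such candidates could be picked (an illustration, not the exact procedure used in this notebook), the helper below lists the features whose importance score falls under a cut-off. The function name `low_importance`, the threshold and the toy scores are all assumptions made for the example. In the notebook one would presumably pass the column names and the scores returned by `model.get_feature_importance(train_pool)`, and then confirm on the holdout set that accuracy does not drop.\n", "\n", "```python\n", "def low_importance(names, scores, threshold):\n", "    # Sort features from weakest to strongest and keep those below the cut-off.\n", "    ranked = sorted(zip(names, scores), key=lambda pair: pair[1])\n", "    return [name for name, score in ranked if score < threshold]\n", "\n", "\n", "# Invented names and scores, for illustration only (not results from this dataset).\n", "toy_names = ['f1', 'f2', 'f3', 'f4']\n", "toy_scores = [12.5, 0.3, 40.0, 0.9]\n", "print(low_importance(toy_names, toy_scores, threshold=1.0))\n", "```"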
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize data\n", "X_train_catboost = X_train.drop(['race', 'diag_1', 'diag_2', 'diag_3', 'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin',\n", " 'admission_source_id', 'num_procedures', 'weight'], axis=1)\n", "X_test_catboost = X_test.drop(['race', 'diag_1', 'diag_2', 'diag_3','encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin',\n", " 'admission_source_id', 'num_procedures', 'weight'], axis=1)\n", "\n", "\n", "X_tr_catboost, X_holdout_catboost, y_tr, y_holdout = train_test_split(X_train_catboost, y_target, test_size=0.3,\n", " random_state=17)\n", "\n", "\n", "# save indeces of categorical data\n", "cat_features_catboost = []\n", "for ind, el in enumerate(X_tr_catboost.columns):\n", " if el in X_tr_catboost.select_dtypes(include='object').columns:\n", " cat_features_catboost.append(ind)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = CatBoostClassifier(iterations=50, depth=6, learning_rate=0.9, loss_function='MultiClass', \n", " verbose=10, random_seed=17, custom_loss='Accuracy')\n", "\n", "#train the model\n", "train_pool = Pool(X_tr_catboost, y_tr, cat_features=cat_features_catboost)\n", "model.fit(train_pool)\n", "\n", "# make the prediction using the resulting model\n", "catboost_pred = model.predict(X_holdout_catboost)\n", "# preds_proba = model.predict_proba(X_holdout)\n", "res = accuracy_score(y_holdout, catboost_pred)\n", "print(res)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Use for CV\n", "# params = {'iterations':50,\n", "# 'depth':6,\n", "# 'learning_rate':0.9,\n", "# 'loss_function':'MultiClass',\n", "# 'random_seed':17}\n", "\n", "# res_catboost_cv = []\n", "\n", "# kf = KFold(n_splits = 5, random_state = 17, shuffle = True)\n", "# for i, (train_index, test_index) in enumerate(kf.split(X_train_catboost)):\n", " \n", "# # Create data for this fold\n", "# y_train_cb_cv, y_valid_cb_cv = y_target.iloc[train_index], y_target.iloc[test_index]\n", "# X_train_cb_cv, X_valid_cb_cv = X_train_catboost.iloc[train_index,:], X_train_catboost.iloc[test_index,:]\n", "# print( \"\\nFold \", i)\n", " \n", "# # Run model for this fold\n", "# fit_model = model.fit( X_train_cb_cv, y_train_cb_cv, \n", "# cat_features=cat_features_catboost\n", "# )\n", " \n", "# # Generate validation predictions for this fold\n", "# pred_cb_cv = fit_model.predict(X_valid_cb_cv)\n", "# res_catboost_cv.append(accuracy_score(pred_cb_cv,y_valid_cb_cv)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#np.mean(res_catboost_cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Feature importance for CatBoost\n", "\n", "def plot_feature_importances_catboost(data : pd.DataFrame, model, train_pool):\n", " feature_scores = pd.DataFrame(list(zip(data.dtypes.index, model.get_feature_importance(train_pool))),\n", " columns=['Feature','Score'])\n", " feature_scores = feature_scores.sort_values(by='Score', ascending=False)\n", "\n", " plt.rcParams[\"figure.figsize\"] = (15,7)\n", " ax = feature_scores.plot('Feature', 'Score', kind='bar', color='c')\n", " ax.set_title(\"Catboost Feature Importance Ranking\", fontsize = 14)\n", " ax.set_xlabel('')\n", "\n", " rects = ax.patches\n", "\n", " # get feature 
score as labels round to 2 decimal\n", " labels = feature_scores['Score'].round(2)\n", "\n", " for rect, label in zip(rects, labels):\n", " height = rect.get_height()\n", " ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom')\n", "\n", " plt.show()\n", " \n", "plot_feature_importances_catboost(X_tr_catboost, model, train_pool)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conf_matrix = pd.DataFrame({'true':y_holdout.values, 'pred':catboost_pred.flatten()})\n", "\n", "pd.crosstab(conf_matrix['true'], conf_matrix['pred'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3. Prepare data for LightGBM method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lightgbm as lgb\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "X_train_lgb = pd.DataFrame.copy(X_train)\n", "X_train_lgb.select_dtypes(include='object').columns\n", "\n", "X_train_lgb = X_train_lgb.drop(['race', 'diag_1', 'diag_2', 'diag_3', 'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin'], axis=1)\n", "X_test_lgb = X_test.drop(['race', 'diag_1', 'diag_2', 'diag_3', 'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "encoder = LabelEncoder()\n", "cat_features_lgb = []\n", "for ind, el in enumerate(X_train_lgb.columns):\n", " if el in X_train_lgb.select_dtypes(include='object').columns:\n", " cat_features_lgb.append(el)\n", "\n", "\n", "for el in cat_features_lgb:\n", " encoder = encoder.fit(X_train_lgb[el])\n", " X_train_lgb[el] = encoder.transform(X_train_lgb[el])\n", "\n", "\n", "X_tr_lgb, X_holdout_lgb, y_tr, y_holdout = train_test_split(X_train_lgb, y_target, test_size=0.3,\n", " random_state=17)\n", " \n", "for ind, el in enumerate(X_tr_lgb.columns):\n", " if el in X_tr_lgb.select_dtypes(include='object').columns:\n", " X_tr_lgb[el] = X_tr_lgb[el].astype('category')\n", " X_holdout_lgb[el] = X_holdout_lgb[el].astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = {'objective': 'multiclass',\n", " 'num_class': 3,\n", " 'num_leaves':20,\n", " 'num_trees':200,\n", " 'metric': ['multi_error']}\n", "\n", "lgb_train = lgb.Dataset(X_tr_lgb, label=y_tr)\n", "lgb_val = lgb.Dataset(X_holdout_lgb, label=y_holdout, reference=lgb_train)\n", "\n", "\n", "# model = lgb.train(params, lgb_train,\n", "# valid_sets=[lgb_val], \n", "# verbose_eval=True)\n", "lgb_model = lgb.train(param, lgb_train, 10000, valid_sets=[lgb_train], verbose_eval=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgb_pred = lgb_model.predict(X_holdout)\n", "lgb_pred = np.argmax(lgb_pred,axis=1)\n", "accuracy_score(y_holdout, lgb_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Used for CV\n", "# for ind, el in enumerate(X_train_lgb.columns):\n", "# if el in X_train_lgb.select_dtypes(include='object').columns:\n", "# X_train_lgb[el] = X_train_lgb[el].astype('category')\n", "\n", "# lgb_cv = lgb.Dataset(X_train_lgb, label=y_tr)\n", "# # Use for CV\n", "\n", "# res_lgb_cv = []\n", "\n", "# kf = KFold(n_splits = 5, 
random_state = 17, shuffle = True)\n", "# for i, (train_index, test_index) in enumerate(kf.split(X_train_lgb)):\n", " \n", "# # Create data for this fold\n", "# y_train_lgb_cv, y_valid_lgb_cv = y_target.iloc[train_index], y_target.iloc[test_index]\n", "# X_train_lgb_cv, X_valid_lgb_cv = X_train_lgb.iloc[train_index,:], X_train_lgb.iloc[test_index,:]\n", "# print( \"\\nFold \", i)\n", "# lgb_cv = lgb.Dataset(X_train_lgb_cv, label=y_train_lgb_cv)\n", "# lgb_cv_val = lgb.Dataset(X_valid_lgb_cv, label=y_valid_lgb_cv)\n", "# # Run model for this fold\n", "# fit_model = None\n", "# fit_model = lgb.train(param, lgb_cv, valid_sets=[lgb_cv_val], verbose_eval=20)\n", " \n", "# # Generate validation predictions for this fold\n", "# pred_lgb_cv = fit_model.predict(X_valid_lgb_cv)\n", "# pred_lgb_cv = np.argmax(pred,axis=1)\n", "# res_lgb_cv.append(accuracy_score(pred,y_valid_lgb_cv)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# np.mean(res_lgb_cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prepare test data for final check\n", "\n", "for el in cat_features_lgb:\n", " encoder = encoder.fit(X_test_lgb[el])\n", " X_test_lgb[el] = encoder.transform(X_test_lgb[el])\n", "\n", "for ind, el in enumerate(X_test_lgb.columns):\n", " if el in X_test_lgb.select_dtypes(include='object').columns:\n", " X_test_lgb[el] = X_test_lgb[el].astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6. 
Feature engineering and description " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we have reached a plateau: all three algorithms give about the same accuracy (0.57-0.60).\n", "Since adding or deleting features and tuning hyperparameters give no positive effect, let's try to create new features.\n", "\n", "A simple way to do this is to create useful combinations of the existing features.\n", "\n", "\n", "P.S. The code is not shown for every use of the newly generated features, only the results, to keep the project readable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First idea: since we drop many medication features, let's create their sum, i.e. how many medications the patient used.\n", "We can also try binary features indicating whether the patient used a given medication (0 or 1); perhaps some drugs are prescribed at particular stages of diabetes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sum of medications\n", "medicals = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',\n", "            'glimepiride', 'glipizide', 'glyburide', 'pioglitazone',\n", "            'rosiglitazone', 'acarbose', 'insulin', 'glyburide-metformin']\n", "X_med = pd.DataFrame()\n", "for el in medicals:\n", "    X_med[el+'_usage'] = X_train[el].map({'No':0, 'Steady':1, 'Up':1, 'Down':1})\n", "\n", "X_med['med_sum'] = X_med.sum(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second idea is more brute force: let's create polynomial features from the existing ones. We didn't find linear dependencies, but there may be non-linear ones. Polynomial features can help us check this hypothesis." 
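, "\n", "\n", "One practical difficulty is that the expanded frame only has integer column labels, so it is hard to tell which product a given column represents. The sketch below rebuilds the names in pure Python, assuming the ordering that scikit-learn documents for `PolynomialFeatures`: the constant term first, then the original features, then the degree-2 products in combinations-with-replacement order. The helper `poly_feature_names` and the column names `a`, `b`, `c` are hypothetical; in recent scikit-learn versions `poly.get_feature_names_out()` should give the authoritative answer.\n", "\n", "```python\n", "from itertools import combinations_with_replacement\n", "\n", "\n", "def poly_feature_names(names, degree=2):\n", "    # Column 0 is the constant (bias) term.\n", "    out = ['1']\n", "    for d in range(1, degree + 1):\n", "        for combo in combinations_with_replacement(range(len(names)), d):\n", "            parts = []\n", "            for i in sorted(set(combo)):\n", "                power = combo.count(i)\n", "                parts.append(names[i] if power == 1 else names[i] + '^' + str(power))\n", "            out.append(' '.join(parts))\n", "    return out\n", "\n", "\n", "# Hypothetical column names, for illustration only.\n", "cols = poly_feature_names(['a', 'b', 'c'])\n", "for index, name in enumerate(cols):\n", "    print(index, name)\n", "```\n", "\n", "With such a mapping, an integer label picked from the expanded frame can be traced back to the pair of original features it multiplies."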
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating polynomial features\n", "from sklearn.preprocessing import PolynomialFeatures\n", "poly = PolynomialFeatures(2)\n", "features_numeric[2:]\n", "X_poly = pd.DataFrame(poly.fit_transform(X_train[features_numeric[2:]]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_poly_best = X_poly[[23, 48]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Third idea: let's try to change the feature distributions. Taking the logarithm can reduce their skewness." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_log = X_train[features_numeric].drop(['encounter_id', 'patient_nbr'], axis=1).applymap(lambda x: np.log(x+1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Checking CatBoost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize data\n", "\n", "X_train_catboost = X_train.drop(['race', 'diag_1', 'diag_2', 'diag_3', 'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin'], axis=1)\n", "X_train_catboost = pd.concat([X_train_catboost, X_log], axis=1)\n", "\n", "X_test_catboost = X_test.drop(['race', 'diag_1', 'diag_2', 'diag_3','encounter_id', 'patient_nbr'], axis=1)\n", "# TODO: add the same action for test\n", "\n", "X_tr, X_holdout, y_tr, y_holdout = train_test_split(X_train_catboost, y_target, test_size=0.3,\n", " random_state=17)\n", "\n", "\n", "# save indices of categorical data\n", "X_tr.select_dtypes(include='object').columns\n", "cat_features_catboost = []\n", "for ind, el in 
enumerate(X_tr.columns):\n", " if el in X_tr.select_dtypes(include='object').columns:\n", " cat_features_catboost.append(ind)\n", "X_train_catboost.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_catboost = CatBoostClassifier(iterations=50, depth=6, learning_rate=0.9, loss_function='MultiClass', \n", " verbose=10, random_seed=17, custom_loss='Accuracy')\n", "\n", "#train the model\n", "train_pool = Pool(X_tr, y_tr, cat_features=cat_features_catboost)\n", "model_catboost.fit(train_pool)\n", "\n", "# make the prediction using the resulting model\n", "catboost_pred = model_catboost.predict(X_holdout)\n", "# preds_proba = model.predict_proba(X_holdout)\n", "res = accuracy_score(y_holdout, catboost_pred)\n", "print(res)\n", "#print(\"proba = \", preds_proba)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_feature_importances_catboost(X_tr, model_catboost, train_pool)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Checking LightGBM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_lgb = pd.DataFrame.copy(X_train)\n", "X_train_lgb.select_dtypes(include='object').columns\n", "\n", "X_train_lgb = X_train_lgb.drop(['race', 'diag_1', 'diag_2', 'diag_3', 'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin'], axis=1)\n", "X_test_lgb = X_test.drop(['race', 'diag_1', 'diag_2', 'diag_3', 
'encounter_id', 'patient_nbr', 'change',\n", " 'glyburide-metformin', 'acarbose', 'chlorpropamide',\n", " 'pioglitazone', 'rosiglitazone', 'nateglinide', 'glyburide',\n", " 'metformin', 'diag_3_mf', 'A1Cresult', 'gender',\n", " 'admission_type_id', 'num_medications', 'insulin'], axis=1)\n", "\n", "X_train_lgb = pd.concat([X_train_lgb, X_log], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoder = LabelEncoder()\n", "cat_features_lgb = []\n", "for ind, el in enumerate(X_train_lgb.columns):\n", " if el in X_train_lgb.select_dtypes(include='object').columns:\n", " cat_features_lgb.append(el)\n", "\n", "\n", "for el in cat_features_lgb:\n", " encoder = encoder.fit(X_train_lgb[el])\n", " X_train_lgb[el] = encoder.transform(X_train_lgb[el])\n", "\n", "\n", "X_tr, X_holdout, y_tr, y_holdout = train_test_split(X_train_lgb, y_target, test_size=0.3,\n", " random_state=17)\n", " \n", "# save indeces of categorical data\n", "X_tr.select_dtypes(include='object').columns\n", "\n", "for ind, el in enumerate(X_tr.columns):\n", " if el in X_tr.select_dtypes(include='object').columns:\n", " X_tr[el] = X_tr[el].astype('category')\n", " X_holdout[el] = X_holdout[el].astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = {'objective': 'multiclass',\n", " 'num_class': 3,\n", " 'num_leaves':20,\n", " 'num_trees':200,\n", " 'metric': ['multi_error']}\n", "lgb_train = lgb.Dataset(X_tr, label=y_tr)\n", "lgb_val = lgb.Dataset(X_holdout, label=y_holdout, reference=lgb_train)\n", "\n", "lgb_model = lgb.train(param, lgb_train, 1000, valid_sets=[lgb_train], verbose_eval=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgb_pred = lgb_model.predict(X_holdout)\n", "lgb_pred = np.argmax(lgb_pred,axis=1)\n", "accuracy_score(y_holdout, lgb_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Conclusion\n", "So none of the attempts to find useful features really succeeded. There are two possible reasons:\n", "1. We need to try more approaches, for example grouping by a categorical feature and encoding each group with a numeric statistic. This would let us drop the categorical feature and use a numeric one instead.\n", "2. The data is too poor, and it is really difficult to increase accuracy without additional data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 7. Cross-validation, hyperparameter tuning " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we already did hyperparameter tuning (and cross-validation) at previous stages, let's try blending now. We will use the predicted probabilities of CatBoost and LightGBM. The idea is that if one algorithm predicts a sample wrongly, the other may get it right, so taking the mean (or another weighted proportion) of their answers can increase the probability of a correct answer." 
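, "\n", "\n", "A minimal sketch of this idea, under stated assumptions: two models each output a row of class probabilities per sample, the blend is the convex combination `w * p1 + (1 - w) * p2`, and `w` is chosen by a simple grid search on a holdout set. The helper names (`blend`, `argmax_rows`, `accuracy`, `best_weight`) and the toy probabilities are invented for illustration and are not results from this dataset.\n", "\n", "```python\n", "def blend(p1, p2, w):\n", "    # Convex combination of two probability tables, row by row.\n", "    return [[w * a + (1 - w) * b for a, b in zip(r1, r2)] for r1, r2 in zip(p1, p2)]\n", "\n", "\n", "def argmax_rows(p):\n", "    # Index of the most probable class in every row.\n", "    return [max(range(len(row)), key=row.__getitem__) for row in p]\n", "\n", "\n", "def accuracy(y_true, y_pred):\n", "    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)\n", "\n", "\n", "def best_weight(p1, p2, y_true, steps=100):\n", "    # Try w = 0, 1/steps, ..., 1 and keep the first weight with the best accuracy.\n", "    best_w, best_acc = 0.0, -1.0\n", "    for k in range(steps + 1):\n", "        w = k / steps\n", "        acc = accuracy(y_true, argmax_rows(blend(p1, p2, w)))\n", "        if acc > best_acc:\n", "            best_w, best_acc = w, acc\n", "    return best_w, best_acc\n", "\n", "\n", "# Toy probabilities, invented for illustration only.\n", "p_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]\n", "p_b = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]\n", "y = [0, 2, 0]\n", "w, acc = best_weight(p_a, p_b, y)\n", "print(w, acc)\n", "```\n", "\n", "Because the two weights sum to 1, every blended row is still a valid probability distribution. A weight tuned on the holdout set can be optimistic, so it is worth checking it on separate folds."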
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "preds_proba_catboost = model.predict_proba(X_holdout_catboost)\n", "lgb_pred = lgb_model.predict(X_holdout_lgb)\n", "pred_knn = neigh.predict(X_holdout_knn)\n", "pred_knn_dummy = pd.get_dummies(pred_knn).values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_a = 0\n", "best_b = 0\n", "best_c = 0\n", "best_acc = 0\n", "for c in tqdm_notebook(range(0,100,5)):\n", " for a in range(0,101):\n", " for b in range(0,101):\n", " blended_proba = (a/100*preds_proba_catboost + b/100*lgb_pred + c/100*pred_knn_dummy) / 3\n", "\n", " blend_pred = np.argmax(blended_proba,axis=1)\n", " acc = accuracy_score(y_holdout, blend_pred)\n", " if acc > best_acc:\n", " best_a = a\n", " best_b = b\n", " best_acc = acc\n", " best_c = c\n", "print(best_a, best_b, best_c, best_acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means using knn doesn't improve results. 
So let's check once more using only CatBoost and LightGBM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_a = 0\n", "best_acc = 0\n", "for a in range(0,101):\n", "    blended_proba = (a/100*preds_proba_catboost + (100-a)/100*lgb_pred) / 2\n", "    blend_pred = np.argmax(blended_proba,axis=1)\n", "    acc = accuracy_score(y_holdout, blend_pred)\n", "    if acc > best_acc:\n", "        best_a = a\n", "        best_acc = acc\n", "print(best_a, best_acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the best parameters from the three-model search are:\n", "\n", "**prediction = (0.47\*CatBoost + 0.82\*LGB + c\*KNN) / 3**\n", "\n", "But this is a bit strange: the weights are not constrained to sum to 1, so the blend is no longer a proper probability and, before the division, can exceed 1.0 (the algorithm is more than absolutely confident in its decision :) ).\n", "\n", "So let's drop this hack and use weights that sum to 1:\n", "\n", "**prediction = (0.31\*CatBoost + 0.69\*LGB) / 2**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Use for CV\n", "# res_blend_cv = []\n", "\n", "\n", "# kf = KFold(n_splits = 5, random_state = 17, shuffle = True)\n", "# for i, (train_index, test_index) in enumerate(kf.split(X_train)):\n", " \n", "# # Create data for this fold\n", "# y_train_bl, y_valid_bl = y_target.iloc[train_index], y_target.iloc[test_index]\n", "# X_train_cb, X_valid_cb = X_train_catboost.iloc[train_index,:], X_train_catboost.iloc[test_index,:]\n", "# X_train_bl_cv, X_valid_bl_cv = X_train_lgb.iloc[train_index,:], X_train_lgb.iloc[test_index,:]\n", " \n", "# lgb_cv = lgb.Dataset(X_train_bl_cv, label=y_train_bl)\n", "# lgb_cv_val = lgb.Dataset(X_valid_bl_cv, label=y_valid_bl)\n", "# print( \"\\nFold \", i)\n", " \n", "\n", "# fit_model_cb = model.fit( X_train_cb, y_train_bl, \n", "# cat_features=cat_features_catboost\n", "# )\n", "# fit_model_lgb = lgb.train(param, lgb_cv, valid_sets=[lgb_cv_val], verbose_eval=20)\n", " \n", "# # Generate validation predictions for this fold\n", "# 
pred_cb = fit_model_cb.predict_proba(X_valid_cb)\n", "# pred_lgb = fit_model_lgb.predict(X_valid_bl_cv)\n", " \n", "# res_blend_cv.append(accuracy_score(np.argmax(0.31*pred_cb + 0.69*pred_lgb ,axis=1),y_valid_bl))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# np.mean(res_blend_cv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8. Validation and learning curves " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Curves for KNN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_sizes, train_scores, valid_scores = learning_curve(neigh, X_tr_knn, y_tr, cv=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "\n", "plt.xlabel(\"Training examples\")\n", "plt.ylabel(\"Score\")\n", "train_scores_mean = np.mean(train_scores, axis=1)\n", "train_scores_std = np.std(train_scores, axis=1)\n", "valid_scores_mean = np.mean(valid_scores, axis=1)\n", "valid_scores_std = np.std(valid_scores, axis=1)\n", "plt.grid()\n", "\n", "plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", " train_scores_mean + train_scores_std, alpha=0.1,\n", " color=\"r\")\n", "plt.fill_between(train_sizes, valid_scores_mean - valid_scores_std,\n", " valid_scores_mean + valid_scores_std, alpha=0.1, color=\"g\")\n", "plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", " label=\"Training score\")\n", "plt.plot(train_sizes, valid_scores_mean, 'o-', color=\"g\",\n", " label=\"Cross-validation score\")\n", "\n", "plt.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Curves for CatBoost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_pool = Pool(X_holdout_catboost, y_holdout, cat_features=cat_features_catboost)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.fit(train_pool, eval_set=test_pool, plot=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Curves for LightGBM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval_result = {}\n", "param['metric'] = {'multi_error'}\n", "lgb_model = lgb.train(param, lgb_train, 10000, valid_sets=[lgb_train, lgb_val], verbose_eval=20, evals_result=eval_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print curves\n", "print('Plot metrics during training...')\n", "ax = lgb.plot_metric(eval_result, metric='multi_error')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 9. 
Prediction for test samples " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_knn = accuracy_score(y_test, neigh.predict(X_test_knn))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res_catboost = model.score(X_test_catboost, y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lgb_proba = lgb_model.predict(X_test_lgb)\n", "lgb_pred = np.argmax(lgb_proba,axis=1)\n", "res_lgb = accuracy_score(y_test, lgb_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# prediction = (0.31*CatBoost + 0.69*LGB) / 2\n", "catboost_proba = model.predict_proba(X_test_catboost)\n", "blender_proba = 0.31*catboost_proba + 0.69*lgb_proba\n", "blender_pred = np.argmax(blender_proba,axis=1)\n", "res_blender = accuracy_score(y_test, blender_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.DataFrame({'model':['KNN', 'CatBoost', 'LGB', 'Blender'], \n", " 'Test Accuracy':[res_knn, res_catboost, res_lgb, res_blender],\n", " 'CV Accuracy':[np.mean(res_knn_cv), np.mean(res_catboost_cv), \n", " np.mean(res_lgb_cv), np.mean(res_blend_cv)]})\n", "result.set_index('model')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 10. Model evaluation with metrics description " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can see accuracy for all our models and compare it for CV and for test data. All accuracies are smaller for test dataset. 
This usually indicates overfitting, but in this case it is more likely due to differences between the train and test datasets: for example, `payer_code` has more missing values in the train data.\n", "\n", "It may also show that we still do not have enough data: with more data in the training set (a 70/30 split instead of 50/50), the model might show a better result.\n", "\n", "The metric we used is accuracy. We chose it because it is a classic metric for such solutions.\n", "\n", "Of all the models, the **blender model** shows the best result on both the CV and test datasets. But the difference between the blender and LGB is too small to justify using the blender as a production model: **LGB is simpler and faster**.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 11. Conclusions " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Admittedly, our solution is far from perfect. An accuracy of 0.55 for multi-class classification is not enough to use the model in real conditions. It is worth paying more attention to feature engineering. 
It may also be necessary to enlarge the data sample and add new features.\n", "\n", "Pluses of the solution:\n", " - we made a fast and light model that uses a small number of features\n", " - we went for a multi-class solution instead of binary classification, as is done in the Kaggle solutions for this dataset\n", " - we stacked models\n", " - we conducted a broad study that allows us to outline ways for the further development of the project.\n", " \n", "Minuses:\n", " - accuracy is not high enough\n", " - feature engineering had no effect\n", "\n", "Further research:\n", " - more attention to features: we can transform them to be normally distributed, encode categorical features with the mean/median/etc. of numeric groups, and generate more new features (for example, split a feature in two when its distribution has 2 peaks)\n", " - there are still many statistics that we didn't implement (such as Cramér's V) that could reveal new dependencies in the data\n", " - new methods can always be applied (NN?), but it looks like features come first here\n", "\n", "For me:\n", "I learned a lot from this project, from markdown to blending models. I learned many new statistical tests, data approaches, etc. Working on a large project in a short time does not allow you to relax. And although I did not manage to do everything I wanted, a result was obtained. I express my gratitude to the creators of the course!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }