{ "cells": [ { "cell_type": "markdown", "metadata": { "_cell_guid": "83708667-4fdc-1563-7b3a-06b6575d2865" }, "source": [ "\n", "\n", "# Classification-Master Template\n", "\n", "How do you work through a predictive modeling problem, whether classification or regression, end-to-end? \n", "In this Jupyter notebook you will work through a classification case study in Python,\n", "including each step of the applied machine learning process. \n", "The notebook also applies to regression case studies, although the models, grid search, and evaluation metrics would need to change.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [1. Introduction](#0)\n", "* [2. Getting Started - Load Libraries and Dataset](#1)\n", "    * [2.1. Load Libraries](#1.1) \n", "    * [2.2. Load Dataset](#1.2)\n", "* [3. Exploratory Data Analysis](#2)\n", "    * [3.1. Descriptive Statistics](#2.1) \n", "    * [3.2. Data Visualization](#2.2)\n", "* [4. Data Preparation](#3)\n", "    * [4.1. Data Cleaning](#3.1)\n", "    * [4.2. Handling Categorical Data](#3.2)\n", "    * [4.3. Feature Selection](#3.3)\n", "    * [4.4. Data Transformation](#3.4) \n", "        * [4.4.1. Rescaling](#3.4.1)\n", "        * [4.4.2. Standardization](#3.4.2)\n", "        * [4.4.3. Normalization](#3.4.3) \n", "* [5. Evaluate Algorithms and Models](#4) \n", "    * [5.1. Train/Test Split](#4.1)\n", "    * [5.2. Test Options and Evaluation Metrics](#4.2)\n", "    * [5.3. Compare Models and Algorithms](#4.3)\n", "        * [5.3.1. Common Classification Models](#4.3.1)\n", "        * [5.3.2. Ensemble Models](#4.3.2)\n", "        * [5.3.3. Deep Learning Models](#4.3.3) \n", "* [6. Model Tuning and Grid Search](#5) \n", "* [7. Finalize the Model](#6) \n", "    * [7.1. Results on test dataset](#6.1)\n", "    * [7.2. Variable Intuition/Feature Selection](#6.2) \n", "    * [7.3. 
Save model for later use](#6.3)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal in this Jupyter notebook is to understand the following:\n", "- How to work through a predictive modeling problem end-to-end. This notebook is applicable to both regression and classification problems.\n", "- How to use data transforms to improve model performance.\n", "- How to use algorithm tuning to improve model performance.\n", "- How to use ensemble methods, and tuning of ensemble methods, to improve model performance.\n", "- How to use deep learning methods.\n", "\n", "The data is a subset of the German Default data (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) with the following attributes: Age, Sex, Job, Housing, SavingAccounts, CheckingAccount, CreditAmount, Duration, and Purpose. The target variable is Risk.\n", "- The following models are implemented and checked: \n", "\n", "    * Logistic Regression\n", "    * Linear Discriminant Analysis\n", "    * K Nearest Neighbors \n", "    * Decision Tree (CART)\n", "    * Gaussian Naive Bayes\n", "    * Support Vector Machine \n", "    * Ada Boost\n", "    * Gradient Boosting Method\n", "    * Random Forest\n", "    * Extra Trees\n", "    * Neural Network - Shallow \n", "    * Deep Neural Network " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 2. Getting Started - Loading the Data and Python Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2.1. 
Loading the python packages" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "_cell_guid": "5d8fee34-f454-2642-8b06-ed719f0317e1" }, "outputs": [], "source": [ "# Load libraries\n", "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot\n", "from pandas import read_csv, set_option\n", "from pandas.plotting import scatter_matrix\n", "import seaborn as sns\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.svm import SVC\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n", "\n", "#Libraries for Deep Learning Models\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "from keras.optimizers import SGD\n", "\n", "#Libraries for Saving the Model\n", "from pickle import dump\n", "from pickle import load" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2.2. 
Loading the Data" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "_cell_guid": "787e35f7-bf9e-0969-8d13-a54fa87f3519" }, "outputs": [], "source": [ "# load dataset\n", "dataset = read_csv('german_credit_data.csv')" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "# Disable the warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(dataset)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "df6a4523-b385-69ee-c933-592826d81431" }, "source": [ "\n", "# 3. Exploratory Data Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.1. Descriptive Statistics" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "_cell_guid": "52f85dc2-0f91-3c50-400e-ddc38bea966b" }, "outputs": [ { "data": { "text/plain": [ "(1000, 10)" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# shape\n", "dataset.shape" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Age Sex Job Housing SavingAccounts CheckingAccount CreditAmount Duration Purpose Risk\n", "0 67 male 2 own NaN little 1169 6 radio/TV good\n", "1 22 female 2 own little moderate 5951 48 radio/TV bad" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# peek at data\n", "set_option('display.width', 100)\n", "dataset.head(2)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "_cell_guid": "f36dd804-0c16-f0c9-05c9-d22b85a79e75" }, "outputs": [ { "data": { "text/plain": [ "Age int64\n", "Sex object\n", "Job int64\n", "Housing object\n", "SavingAccounts object\n", "CheckingAccount object\n", "CreditAmount int64\n", "Duration int64\n", "Purpose object\n", "Risk object\n", "dtype: object" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# types\n", "set_option('display.max_rows', 500)\n", "dataset.dtypes" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "_cell_guid": "7bffeec0-5bbc-fffb-18f2-3da56b862ca3" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Age Job CreditAmount Duration\n", "count 1000.000 1000.000 1000.000 1000.000\n", "mean 35.546 1.904 3271.258 20.903\n", "std 11.375 0.654 2822.737 12.059\n", "min 19.000 0.000 250.000 4.000\n", "25% 27.000 2.000 1365.500 12.000\n", "50% 33.000 2.000 2319.500 18.000\n", "75% 42.000 2.000 3972.250 24.000\n", "max 75.000 3.000 18424.000 72.000" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# describe data\n", "set_option('display.precision', 3)\n", "dataset.describe()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "_cell_guid": "565b36d1-0abc-e91c-47f5-c3153d54e265" }, "outputs": [ { "data": { "text/plain": [ "Housing\n", "free 108\n", "own 713\n", "rent 179\n", "dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# distribution of the Housing categories\n", "dataset.groupby('Housing').size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3.2. Data Visualization" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "_cell_guid": "16d50177-f93e-9d26-af7a-313d7ebe9fcf" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# histograms\n", "dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(12,12))\n", "pyplot.show()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "_cell_guid": "ca420570-fce1-e2ff-8511-50691e099d69" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# density\n", "dataset.plot(kind='density', subplots=True, layout=(3,3), sharex=False, legend=True, fontsize=1, figsize=(15,15))\n", "pyplot.show()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# box and whisker plots\n", "dataset.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(15,15))\n", "pyplot.show()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# correlation\n", "correlation = dataset.corr()\n", "pyplot.figure(figsize=(15,15))\n", "pyplot.title('Correlation Matrix')\n", "sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Scatterplot Matrix\n", "from pandas.plotting import scatter_matrix\n", "pyplot.figure(figsize=(15,15))\n", "scatter_matrix(dataset,figsize=(12,12))\n", "pyplot.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 4. Data Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4.1. Data Cleaning\n", "Check the rows for NAs, then either drop those rows or fill the missing values with the mean of the column." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Null Values = True\n" ] } ], "source": [ "# Check whether the dataset contains any null values\n", "print('Null Values =',dataset.isnull().values.any())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given that there are null values, drop the rows containing them." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "# Drop the rows containing NA\n", "dataset = dataset.dropna(axis=0)\n", "# Fill na with 0\n", "#dataset.fillna('0')\n", "\n", "#Filling the NAs with the mean of the column.\n", "#dataset['col'] = dataset['col'].fillna(dataset['col'].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4.2. Handling Categorical Data" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sex Sex_Code Housing Housing_Code Risk_Code Risk\n", "1 female 0 own 1 0 bad\n", "3 male 1 free 0 1 good\n", "4 male 1 free 0 0 bad\n", "7 male 1 rent 2 1 good\n", "9 male 1 own 1 0 bad\n", "10 female 0 rent 2 0 bad\n", "11 female 0 rent 2 0 bad\n", "12 female 0 own 1 1 good\n", "13 male 1 own 1 0 bad\n", "14 female 0 rent 2 1 good" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "lb_make = LabelEncoder()\n", "dataset[\"Sex_Code\"] = lb_make.fit_transform(dataset[\"Sex\"])\n", "dataset[\"Housing_Code\"] = lb_make.fit_transform(dataset[\"Housing\"])\n", "dataset[\"SavingAccount_Code\"] = lb_make.fit_transform(dataset[\"SavingAccounts\"].fillna('0'))\n", "dataset[\"CheckingAccount_Code\"] = lb_make.fit_transform(dataset[\"CheckingAccount\"].fillna('0'))\n", "dataset[\"Purpose_Code\"] = lb_make.fit_transform(dataset[\"Purpose\"])\n", "dataset[\"Risk_Code\"] = lb_make.fit_transform(dataset[\"Risk\"])\n", "dataset[[\"Sex\", \"Sex_Code\",\"Housing\",\"Housing_Code\",\"Risk_Code\",\"Risk\"]].head(10)\n" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "#dropping the old features\n", "dataset.drop(['Sex','Housing','SavingAccounts','CheckingAccount','Purpose','Risk'],axis=1,inplace=True)\n" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Age Job CreditAmount Duration Sex_Code Housing_Code SavingAccount_Code \\\n", "1 22 2 5951 48 0 1 0 \n", "3 45 2 7882 42 1 0 0 \n", "4 53 2 4870 24 1 0 0 \n", "7 35 3 6948 36 1 2 0 \n", "9 28 3 5234 30 1 1 0 \n", "\n", " CheckingAccount_Code Purpose_Code Risk_Code \n", "1 1 5 0 \n", "3 0 4 1 \n", "4 0 1 0 \n", "7 1 1 1 \n", "9 1 1 0 " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4.3. Feature Selection\n", "Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.\n", "The example below uses the chi-squared (chi²) statistical test, which requires non-negative features, with k=5; the scores of all features are then printed in descending order." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SelectKBest(k=5, score_func=)" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.feature_selection import chi2\n", "\n", "bestfeatures = SelectKBest(score_func=chi2, k=5)\n", "bestfeatures" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Specs Score\n", "2 CreditAmount 45853.601\n", "3 Duration 327.508\n", "6 SavingAccount_Code 14.395\n", "7 CheckingAccount_Code 7.096\n", "0 Age 6.534\n", "8 Purpose_Code 1.902\n", "4 Sex_Code 0.671\n", "1 Job 0.318\n", "5 Housing_Code 0.007\n" ] } ], "source": [ "Y= dataset[\"Risk_Code\"]\n", "X = dataset.loc[:, dataset.columns != 'Risk_Code']\n", "fit = bestfeatures.fit(X,Y)\n", "dfscores = pd.DataFrame(fit.scores_)\n", "dfcolumns = 
pd.DataFrame(X.columns)\n", "#concat two dataframes for better visualization \n", "featureScores = pd.concat([dfcolumns,dfscores],axis=1)\n", "featureScores.columns = ['Specs','Score'] #naming the dataframe columns\n", "print(featureScores.nlargest(10,'Score')) # print the top features by score (up to 10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen from the scores above, CreditAmount has the highest chi-squared score, followed by Duration. Because the chi-squared statistic is computed on the raw feature values, features with large magnitudes such as CreditAmount tend to receive large scores, so the ranking is best read as a rough guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4.4. Data Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 4.4.1. Rescale Data\n", "When your data comprises attributes with varying scales, many machine learning algorithms\n", "can benefit from rescaling the attributes so that they all have the same scale. This is often referred to\n", "as normalization, and attributes are typically rescaled into the range between 0 and 1." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8\n", "0 0.054 0.667 0.313 0.636 0.0 0.5 0.0 0.5 0.714\n", "1 0.464 0.667 0.419 0.545 1.0 0.0 0.0 0.0 0.571\n", "2 0.607 0.667 0.253 0.273 1.0 0.0 0.0 0.0 0.143\n", "3 0.286 1.000 0.368 0.455 1.0 1.0 0.0 0.5 0.143\n", "4 0.161 1.000 0.273 0.364 1.0 0.5 0.0 0.5 0.143" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "X = dataset.loc[:, dataset.columns != 'Risk_Code']\n", "scaler = MinMaxScaler(feature_range=(0, 1))\n", "rescaledX = pd.DataFrame(scaler.fit_transform(X))\n", "# summarize transformed data\n", "rescaledX.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 4.4.2. Standardize Data\n", "Standardization is a useful technique to transform attributes with a Gaussian distribution and\n", "differing means and standard deviations to a standard Gaussian distribution with a mean of\n", "0 and a standard deviation of 1." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8\n", "0 -1.094 0.183 0.913 2.139 -1.452 -0.145 -0.451 0.557 1.063\n", "1 0.859 0.183 1.573 1.658 0.689 -1.900 -0.451 -0.958 0.561\n", "2 1.538 0.183 0.544 0.214 0.689 -1.900 -0.451 -0.958 -0.944\n", "3 0.009 1.648 1.254 1.176 0.689 1.611 -0.451 0.557 -0.944\n", "4 -0.585 1.648 0.668 0.695 0.689 -0.145 -0.451 0.557 -0.944" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "X = dataset.loc[:, dataset.columns != 'Risk_Code']\n", "scaler = StandardScaler().fit(X)\n", "StandardisedX = pd.DataFrame(scaler.transform(X))\n", "# summarize transformed data\n", "StandardisedX.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 4.4.3. Normalize Data\n", "Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called\n", "a unit norm or a vector with the length of 1 in linear algebra)." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8\n", "0 0.004 3.361e-04 1.0 0.008 0.000e+00 1.680e-04 0.0 1.680e-04 8.402e-04\n", "1 0.006 2.537e-04 1.0 0.005 1.269e-04 0.000e+00 0.0 0.000e+00 5.075e-04\n", "2 0.011 4.106e-04 1.0 0.005 2.053e-04 0.000e+00 0.0 0.000e+00 2.053e-04\n", "3 0.005 4.318e-04 1.0 0.005 1.439e-04 2.878e-04 0.0 1.439e-04 1.439e-04\n", "4 0.005 5.732e-04 1.0 0.006 1.911e-04 1.911e-04 0.0 1.911e-04 1.911e-04" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import Normalizer\n", "X = dataset.loc[:, dataset.columns != 'Risk_Code']\n", "scaler = Normalizer().fit(X)\n", "NormalizedX = pd.DataFrame(scaler.transform(X))\n", "# summarize transformed data\n", "NormalizedX.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 5. Evaluate Algorithms and Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5.1. Train/Test Split" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "# split out validation dataset for the end\n", "Y= dataset[\"Risk_Code\"]\n", "X = dataset.loc[:, dataset.columns != 'Risk_Code']\n", "scaler = StandardScaler().fit(X)\n", "StandardisedX = pd.DataFrame(scaler.transform(X))\n", "validation_size = 0.2\n", "seed = 7\n", "# Note: StandardisedX is computed above, but the split below uses the unscaled X\n", "X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5.2. Test Options and Evaluation Metrics\n" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "_cell_guid": "5702bc31-06bf-8b6a-42de-366a6b3311a8" }, "outputs": [], "source": [ "# test options for classification\n", "num_folds = 10\n", "seed = 7\n", "scoring = 'accuracy'\n", "#scoring ='neg_log_loss'\n", "#scoring = 'roc_auc'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5.3. 
Compare Models and Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 5.3.1. Common Models" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "_cell_guid": "772802f7-f4e4-84ee-6377-6464ab2e5da4" }, "outputs": [], "source": [ "# spot check the algorithms\n", "models = []\n", "models.append(('LR', LogisticRegression()))\n", "models.append(('LDA', LinearDiscriminantAnalysis()))\n", "models.append(('KNN', KNeighborsClassifier()))\n", "models.append(('CART', DecisionTreeClassifier()))\n", "models.append(('NB', GaussianNB()))\n", "models.append(('SVM', SVC()))\n", "# Neural network\n", "models.append(('NN', MLPClassifier()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 5.3.2. Ensemble Models" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "# Ensemble models\n", "# Boosting methods\n", "models.append(('AB', AdaBoostClassifier()))\n", "models.append(('GBM', GradientBoostingClassifier()))\n", "# Bagging methods\n", "models.append(('RF', RandomForestClassifier()))\n", "models.append(('ET', ExtraTreesClassifier()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 5.3.3. 
Deep Learning Model" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "# Build the deep learning classifier only when the deep learning flag is set\n", "# Set the flag to 1 to include the deep learning model, or to 0 to skip it\n", "EnableDLModelsFlag = 1\n", "if EnableDLModelsFlag == 1:\n", "    # Function to create model, required for KerasClassifier\n", "    def create_model(neurons=12, activation='relu', learn_rate = 0.01, momentum=0):\n", "        # create model\n", "        model = Sequential()\n", "        model.add(Dense(neurons, input_dim=X_train.shape[1], activation=activation))\n", "        model.add(Dense(2, activation=activation))\n", "        model.add(Dense(1, activation='sigmoid'))\n", "        # Compile model\n", "        optimizer = SGD(lr=learn_rate, momentum=momentum)\n", "        # Note: the SGD optimizer above is created but not used; the model is compiled with 'adam'\n", "        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", "        return model\n", "    models.append(('DNN', KerasClassifier(build_fn=create_model, epochs=10, batch_size=10, verbose=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-fold cross-validation" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "_cell_guid": "a784ab4a-eb59-98cc-76cf-b55f382d057a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LR: 0.626074 (0.064426)\n", "LDA: 0.611614 (0.055923)\n", "KNN: 0.529791 (0.063048)\n", "CART: 0.563763 (0.097660)\n", "NB: 0.611324 (0.061465)\n", "SVM: 0.592102 (0.077275)\n", "NN: 0.503775 (0.059635)\n", "AB: 0.621138 (0.045846)\n", "GBM: 0.633159 (0.076016)\n", "RF: 0.618815 (0.077372)\n", "ET: 0.582753 (0.074896)\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 136us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 128us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 4/10\n", "375/375 [==============================] 
- 0s 152us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 147us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 156us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 146us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 161us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 144us/step - loss: 9.0691 - acc: 0.4373\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 142us/step - loss: 9.0691 - acc: 0.4373\n", "42/42 [==============================] - 1s 16ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 109us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 113us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 126us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 115us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 119us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 109us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 112us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 109us/step - loss: 6.8871 - acc: 0.5680\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 113us/step - loss: 6.8871 - acc: 0.5680\n", "42/42 [==============================] - 1s 15ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 2s 4ms/step - loss: 0.6925 - acc: 0.5733\n", "Epoch 2/10\n", 
"375/375 [==============================] - 0s 108us/step - loss: 0.6914 - acc: 0.5787\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 115us/step - loss: 0.6902 - acc: 0.5787\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 120us/step - loss: 0.6892 - acc: 0.5787\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 125us/step - loss: 0.6883 - acc: 0.5787\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 151us/step - loss: 0.6875 - acc: 0.5787\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 200us/step - loss: 0.6868 - acc: 0.5787\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 223us/step - loss: 0.6862 - acc: 0.5787\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 122us/step - loss: 0.6856 - acc: 0.5787\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 133us/step - loss: 0.6851 - acc: 0.5787\n", "42/42 [==============================] - 1s 12ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 103us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 114us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 110us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 107us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 104us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 106us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 103us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 106us/step - loss: 7.0997 - acc: 0.5547\n", "Epoch 10/10\n", "375/375 
[==============================] - 0s 105us/step - loss: 7.0997 - acc: 0.5547\n", "42/42 [==============================] - 1s 12ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 4.6803 - acc: 0.4880\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 112us/step - loss: 1.5742 - acc: 0.4533\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 104us/step - loss: 1.2508 - acc: 0.4507\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 109us/step - loss: 1.1772 - acc: 0.4373\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 106us/step - loss: 1.2157 - acc: 0.4613\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 112us/step - loss: 0.8980 - acc: 0.4533\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 105us/step - loss: 1.0351 - acc: 0.5147\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 101us/step - loss: 0.9598 - acc: 0.4853\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 101us/step - loss: 0.9366 - acc: 0.5067\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 105us/step - loss: 0.8666 - acc: 0.5387\n", "42/42 [==============================] - 1s 12ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 0.6928 - acc: 0.5520\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 157us/step - loss: 0.6917 - acc: 0.5733\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 119us/step - loss: 0.6907 - acc: 0.5733\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 103us/step - loss: 0.6898 - acc: 0.5733\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 108us/step - loss: 0.6891 - acc: 0.5733\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 110us/step - loss: 0.6884 - acc: 0.5733\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 110us/step - loss: 0.6877 - 
acc: 0.5733\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 102us/step - loss: 0.6871 - acc: 0.5733\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 106us/step - loss: 0.6867 - acc: 0.5733\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 101us/step - loss: 0.6863 - acc: 0.5733\n", "42/42 [==============================] - 1s 13ms/step\n", "Epoch 1/10\n", "375/375 [==============================] - 1s 4ms/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 2/10\n", "375/375 [==============================] - 0s 109us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 3/10\n", "375/375 [==============================] - 0s 103us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 4/10\n", "375/375 [==============================] - 0s 109us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 5/10\n", "375/375 [==============================] - 0s 103us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 6/10\n", "375/375 [==============================] - 0s 105us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 7/10\n", "375/375 [==============================] - 0s 112us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 8/10\n", "375/375 [==============================] - 0s 104us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 9/10\n", "375/375 [==============================] - 0s 107us/step - loss: 9.1981 - acc: 0.4293\n", "Epoch 10/10\n", "375/375 [==============================] - 0s 106us/step - loss: 9.1981 - acc: 0.4293\n", "42/42 [==============================] - 1s 13ms/step\n", "Epoch 1/10\n", "376/376 [==============================] - 2s 4ms/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 2/10\n", "376/376 [==============================] - 0s 110us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 3/10\n", "376/376 [==============================] - 0s 107us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 4/10\n", "376/376 [==============================] - 0s 113us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 5/10\n", "376/376 
[==============================] - 0s 111us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 6/10\n", "376/376 [==============================] - 0s 113us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 7/10\n", "376/376 [==============================] - 0s 109us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 8/10\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "376/376 [==============================] - 0s 108us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 9/10\n", "376/376 [==============================] - 0s 106us/step - loss: 9.2165 - acc: 0.4282\n", "Epoch 10/10\n", "376/376 [==============================] - 0s 108us/step - loss: 9.2165 - acc: 0.4282\n", "41/41 [==============================] - 1s 15ms/step\n", "Epoch 1/10\n", "376/376 [==============================] - 2s 4ms/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 2/10\n", "376/376 [==============================] - 0s 109us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 3/10\n", "376/376 [==============================] - 0s 112us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 4/10\n", "376/376 [==============================] - 0s 110us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 5/10\n", "376/376 [==============================] - 0s 107us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 6/10\n", "376/376 [==============================] - 0s 108us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 7/10\n", "376/376 [==============================] - 0s 107us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 8/10\n", "376/376 [==============================] - 0s 107us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 9/10\n", "376/376 [==============================] - 0s 110us/step - loss: 6.7416 - acc: 0.5771\n", "Epoch 10/10\n", "376/376 [==============================] - 0s 106us/step - loss: 6.7416 - acc: 0.5771\n", "41/41 [==============================] - 1s 14ms/step\n", "Epoch 1/10\n", "376/376 [==============================] - 2s 4ms/step - loss: 5.4531 - acc: 0.5346\n", "Epoch 2/10\n", "376/376 
[==============================] - 0s 113us/step - loss: 3.4579 - acc: 0.5665\n", "Epoch 3/10\n", "376/376 [==============================] - 0s 108us/step - loss: 3.3328 - acc: 0.5452\n", "Epoch 4/10\n", "376/376 [==============================] - 0s 106us/step - loss: 2.5059 - acc: 0.5000\n", "Epoch 5/10\n", "376/376 [==============================] - 0s 108us/step - loss: 2.8887 - acc: 0.5771\n", "Epoch 6/10\n", "376/376 [==============================] - 0s 110us/step - loss: 2.0510 - acc: 0.5532\n", "Epoch 7/10\n", "376/376 [==============================] - 0s 107us/step - loss: 1.8155 - acc: 0.5904\n", "Epoch 8/10\n", "376/376 [==============================] - 0s 111us/step - loss: 1.4380 - acc: 0.6144\n", "Epoch 9/10\n", "376/376 [==============================] - 0s 110us/step - loss: 1.5659 - acc: 0.6250\n", "Epoch 10/10\n", "376/376 [==============================] - 0s 110us/step - loss: 1.5057 - acc: 0.6117\n", "41/41 [==============================] - 1s 15ms/step\n", "DNN: 0.522648 (0.095039)\n" ] } ], "source": [ "results = []\n", "names = []\n", "for name, model in models:\n", " kfold = KFold(n_splits=num_folds, random_state=seed)\n", " cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)\n", " results.append(cv_results)\n", " names.append(name)\n", " msg = \"%s: %f (%f)\" % (name, cv_results.mean(), cv_results.std())\n", " print(msg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Algorithm comparison" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "_cell_guid": "67873e9d-bc9b-6963-f594-805f1efbfbb3" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# compare algorithms\n", "fig = pyplot.figure()\n", "fig.suptitle('Algorithm Comparison')\n", "ax = fig.add_subplot(111)\n", "pyplot.boxplot(results)\n", "ax.set_xticklabels(names)\n", "fig.set_size_inches(15,8)\n", "pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 6. Model Tuning and Grid Search" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "848ca488-b0fd-8e93-2e68-23d32c71d89c" }, "source": [ "Algorithm Tuning: Although some of the models show the most promising options. the grid search for Gradient Bossting Classifier is shown below." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.616376 using {'C': 1.0, 'penalty': 'l2'}\n", "#8 nan (nan) with: {'C': 0.001, 'penalty': 'l1'}\n", "#7 0.572880 (0.067966) with: {'C': 0.001, 'penalty': 'l2'}\n", "#9 nan (nan) with: {'C': 0.01, 'penalty': 'l1'}\n", "#6 0.611324 (0.055957) with: {'C': 0.01, 'penalty': 'l2'}\n", "#10 nan (nan) with: {'C': 0.1, 'penalty': 'l1'}\n", "#5 0.611440 (0.040460) with: {'C': 0.1, 'penalty': 'l2'}\n", "#11 nan (nan) with: {'C': 1.0, 'penalty': 'l1'}\n", "#1 0.616376 (0.056352) with: {'C': 1.0, 'penalty': 'l2'}\n", "#12 nan (nan) with: {'C': 10.0, 'penalty': 'l1'}\n", "#1 0.616376 (0.056352) with: {'C': 10.0, 'penalty': 'l2'}\n", "#13 nan (nan) with: {'C': 100.0, 'penalty': 'l1'}\n", "#1 0.616376 (0.056352) with: {'C': 100.0, 'penalty': 'l2'}\n", "#14 nan (nan) with: {'C': 1000.0, 'penalty': 'l1'}\n", "#1 0.616376 (0.056352) with: {'C': 1000.0, 'penalty': 'l2'}\n" ] } ], "source": [ "# 1. 
Grid search : Logistic Regression Algorithm \n", "'''\n", "penalty : str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional (default=’l2’)\n", "\n", "C : float, optional (default=1.0)\n", "Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "C= np.logspace(-3,3,7)\n", "penalty = [\"l1\",\"l2\"]# l1 lasso l2 ridge\n", "param_grid = dict(C=C,penalty=penalty )\n", "# Note: the default lbfgs solver does not support the l1 penalty, so the l1 candidates score nan;\n", "# pass solver='liblinear' or solver='saga' to LogisticRegression to evaluate them.\n", "model = LogisticRegression()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", "    print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.611614 using {'n_components': 1}\n", "#1 0.611614 (0.055923) with: {'n_components': 1}\n", "#1 0.611614 (0.055923) with: {'n_components': 3}\n", "#1 0.611614 (0.055923) with: {'n_components': 5}\n", "#1 0.611614 (0.055923) with: {'n_components': 7}\n", "#1 0.611614 (0.055923) with: {'n_components': 9}\n", "#1 0.611614 (0.055923) with: {'n_components': 11}\n", "#1 0.611614 (0.055923) with: {'n_components': 13}\n", "#1 0.611614 (0.055923) with: {'n_components': 15}\n", "#1 0.611614 (0.055923) with: {'n_components': 17}\n", "#1 0.611614 (0.055923) with: 
{'n_components': 19}\n", "#1 0.611614 (0.055923) with: {'n_components': 600}\n" ] } ], "source": [ "# Grid Search : LDA Algorithm \n", "'''\n", "n_components : int, optional (default=None)\n", "Number of components for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features).\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "components = [1,3,5,7,9,11,13,15,17,19,600]\n", "param_grid = dict(n_components=components)\n", "model = LinearDiscriminantAnalysis()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.633275 using {'n_neighbors': 21, 'weights': 'distance'}\n", "#20 0.575436 (0.053977) with: {'n_neighbors': 1, 'weights': 'uniform'}\n", "#20 0.575436 (0.053977) with: {'n_neighbors': 1, 'weights': 'distance'}\n", "#22 0.573403 (0.072922) with: {'n_neighbors': 3, 'weights': 'uniform'}\n", "#18 0.585250 (0.069232) with: {'n_neighbors': 3, 'weights': 'distance'}\n", "#17 0.587979 (0.076811) with: {'n_neighbors': 5, 'weights': 'uniform'}\n", "#9 0.597271 (0.055041) with: {'n_neighbors': 5, 'weights': 'distance'}\n", "#19 0.580778 (0.082174) with: {'n_neighbors': 7, 'weights': 'uniform'}\n", "#15 0.590302 (0.083559) with: {'n_neighbors': 7, 'weights': 
'distance'}\n", "#16 0.590302 (0.062168) with: {'n_neighbors': 9, 'weights': 'uniform'}\n", "#7 0.604530 (0.046160) with: {'n_neighbors': 9, 'weights': 'distance'}\n", "#11 0.592451 (0.053386) with: {'n_neighbors': 11, 'weights': 'uniform'}\n", "#5 0.611731 (0.044295) with: {'n_neighbors': 11, 'weights': 'distance'}\n", "#14 0.592393 (0.067668) with: {'n_neighbors': 13, 'weights': 'uniform'}\n", "#11 0.592451 (0.058359) with: {'n_neighbors': 13, 'weights': 'distance'}\n", "#13 0.592451 (0.059463) with: {'n_neighbors': 15, 'weights': 'uniform'}\n", "#10 0.597271 (0.059064) with: {'n_neighbors': 15, 'weights': 'distance'}\n", "#8 0.604413 (0.050579) with: {'n_neighbors': 17, 'weights': 'uniform'}\n", "#6 0.609292 (0.049731) with: {'n_neighbors': 17, 'weights': 'distance'}\n", "#4 0.616492 (0.054053) with: {'n_neighbors': 19, 'weights': 'uniform'}\n", "#3 0.626132 (0.042168) with: {'n_neighbors': 19, 'weights': 'distance'}\n", "#2 0.628397 (0.060939) with: {'n_neighbors': 21, 'weights': 'uniform'}\n", "#1 0.633275 (0.055367) with: {'n_neighbors': 21, 'weights': 'distance'}\n" ] } ], "source": [ "# Grid Search KNN algorithm tuning\n", "'''\n", "n_neighbors : int, optional (default = 5)\n", " Number of neighbors to use by default for kneighbors queries.\n", "\n", "weights : str or callable, optional (default = ‘uniform’)\n", " weight function used in prediction. 
Possible values: ‘uniform’, ‘distance’\n", "\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "\n", "neighbors = [1,3,5,7,9,11,13,15,17,19,21]\n", "weights = ['uniform', 'distance']\n", "param_grid = dict(n_neighbors=neighbors, weights = weights )\n", "model = KNeighborsClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.625900 using {'max_depth': 5}\n", "#8 0.589663 (0.073560) with: {'max_depth': 2}\n", "#4 0.609001 (0.054688) with: {'max_depth': 3}\n", "#2 0.618931 (0.072490) with: {'max_depth': 4}\n", "#1 0.625900 (0.050793) with: {'max_depth': 5}\n", "#4 0.609001 (0.058113) with: {'max_depth': 6}\n", "#7 0.594890 (0.087547) with: {'max_depth': 7}\n", "#6 0.606678 (0.067640) with: {'max_depth': 8}\n", "#3 0.614402 (0.079824) with: {'max_depth': 9}\n", "#23 0.570848 (0.079580) with: {'max_depth': 10}\n", "#21 0.573403 (0.072913) with: {'max_depth': 11}\n", "#10 0.587340 (0.079431) with: {'max_depth': 12}\n", "#17 0.575784 (0.076352) with: {'max_depth': 13}\n", "#11 0.585308 (0.072910) with: {'max_depth': 14}\n", "#12 0.582927 (0.058242) with: {'max_depth': 15}\n", "#24 0.568409 (0.081411) with: {'max_depth': 16}\n", "#19 0.575610 (0.070155) with: 
{'max_depth': 17}\n", "#18 0.575668 (0.086685) with: {'max_depth': 18}\n", "#22 0.570964 (0.063675) with: {'max_depth': 19}\n", "#28 0.558943 (0.087051) with: {'max_depth': 20}\n", "#9 0.587573 (0.070178) with: {'max_depth': 21}\n", "#26 0.563705 (0.087570) with: {'max_depth': 22}\n", "#13 0.582753 (0.065708) with: {'max_depth': 23}\n", "#20 0.575610 (0.059003) with: {'max_depth': 24}\n", "#14 0.580546 (0.073619) with: {'max_depth': 25}\n", "#25 0.565970 (0.065811) with: {'max_depth': 26}\n", "#27 0.561208 (0.080136) with: {'max_depth': 27}\n", "#15 0.580314 (0.086072) with: {'max_depth': 28}\n", "#16 0.577991 (0.069566) with: {'max_depth': 29}\n" ] } ], "source": [ "# Grid Search : CART Algorithm \n", "'''\n", "max_depth : int or None, optional (default=None)\n", " The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure \n", " or until all leaves contain less than min_samples_split samples.\n", "\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "max_depth = np.arange(2, 30)\n", "param_grid = dict(max_depth=max_depth)\n", "model = DecisionTreeClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "# Grid Search : NB algorithm tuning\n", "#GaussianNB only accepts priors as an argument 
so unless you have some priors to set for your model ahead of time \n", "#you will have nothing to grid search over.\n" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.657143 using {'C': 1.0, 'kernel': 'rbf'}\n", "#8 0.613705 (0.033500) with: {'C': 0.1, 'kernel': 'linear'}\n", "#23 0.587515 (0.076731) with: {'C': 0.1, 'kernel': 'poly'}\n", "#24 0.570499 (0.062319) with: {'C': 0.1, 'kernel': 'rbf'}\n", "#18 0.608943 (0.044223) with: {'C': 0.3, 'kernel': 'linear'}\n", "#22 0.601800 (0.066519) with: {'C': 0.3, 'kernel': 'poly'}\n", "#7 0.628281 (0.060724) with: {'C': 0.3, 'kernel': 'rbf'}\n", "#11 0.611324 (0.046564) with: {'C': 0.5, 'kernel': 'linear'}\n", "#18 0.608943 (0.062315) with: {'C': 0.5, 'kernel': 'poly'}\n", "#2 0.656969 (0.068917) with: {'C': 0.5, 'kernel': 'rbf'}\n", "#8 0.613705 (0.048677) with: {'C': 0.7, 'kernel': 'linear'}\n", "#8 0.613705 (0.061995) with: {'C': 0.7, 'kernel': 'poly'}\n", "#6 0.645006 (0.062413) with: {'C': 0.7, 'kernel': 'rbf'}\n", "#11 0.611324 (0.046564) with: {'C': 0.9, 'kernel': 'linear'}\n", "#16 0.611208 (0.068144) with: {'C': 0.9, 'kernel': 'poly'}\n", "#3 0.654704 (0.064995) with: {'C': 0.9, 'kernel': 'rbf'}\n", "#11 0.611324 (0.046564) with: {'C': 1.0, 'kernel': 'linear'}\n", "#20 0.608827 (0.066562) with: {'C': 1.0, 'kernel': 'poly'}\n", "#1 0.657143 (0.064634) with: {'C': 1.0, 'kernel': 'rbf'}\n", "#11 0.611324 (0.046564) with: {'C': 1.3, 'kernel': 'linear'}\n", "#21 0.604123 (0.073433) with: {'C': 1.3, 'kernel': 'poly'}\n", "#4 0.650058 (0.065888) with: {'C': 1.3, 'kernel': 'rbf'}\n", "#11 0.611324 (0.046564) with: {'C': 1.5, 'kernel': 'linear'}\n", "#17 0.609001 (0.074297) with: {'C': 1.5, 'kernel': 'poly'}\n", "#5 0.645296 (0.075887) with: {'C': 1.5, 'kernel': 'rbf'}\n" ] } ], "source": [ "# Grid Search: SVM algorithm tuning\n", "'''\n", "C : float, optional (default=1.0)\n", "Penalty parameter C of the error term.\n", "\n", 
"kernel : string, optional (default=’rbf’)\n", "Specifies the kernel type to be used in the algorithm. \n", "It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. \n", "Parameters of SVM are C and kernel. \n", "Try a number of kernels with various values of C with less bias and more bias (less than and greater than 1.0 respectively\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5]\n", "kernel_values = ['linear', 'poly', 'rbf']\n", "param_grid = dict(C=c_values, kernel=kernel_values)\n", "model = SVC()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.614053 using {'n_estimators': 100}\n", "#2 0.609350 (0.062495) with: {'n_estimators': 10}\n", "#1 0.614053 (0.058883) with: {'n_estimators': 100}\n" ] } ], "source": [ "# Grid Search: Ada boost Algorithm Tuning \n", "'''\n", "n_estimators : integer, optional (default=50)\n", " The maximum number of estimators at which boosting is terminated. 
\n", " In case of perfect fit, the learning procedure is stopped early.\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "n_estimators = [10, 100]\n", "param_grid = dict(n_estimators=n_estimators)\n", "model = AdaBoostClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.632811 using {'max_depth': 3, 'n_estimators': 180}\n", "#4 0.613937 (0.068854) with: {'max_depth': 3, 'n_estimators': 20}\n", "#1 0.632811 (0.094400) with: {'max_depth': 3, 'n_estimators': 180}\n", "#2 0.628339 (0.084035) with: {'max_depth': 5, 'n_estimators': 20}\n", "#3 0.625900 (0.068561) with: {'max_depth': 5, 'n_estimators': 180}\n" ] } ], "source": [ "# Grid Search: GradientBoosting Tuning\n", "'''\n", "n_estimators : int (default=100)\n", " The number of boosting stages to perform. \n", " Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.\n", "max_depth : integer, optional (default=3)\n", " maximum depth of the individual regression estimators. \n", " The maximum depth limits the number of nodes in the tree. 
\n", " Tune this parameter for best performance; the best value depends on the interaction of the input variables.\n", "\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "n_estimators = [20,180]\n", "max_depth= [3,5]\n", "param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)\n", "model = GradientBoostingClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.649710 using {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 20}\n", "#1 0.649710 (0.093241) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 20}\n", "#6 0.626016 (0.079640) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 80}\n", "#8 0.606911 (0.063889) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 20}\n", "#4 0.628455 (0.069711) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}\n", "#7 0.614053 (0.076060) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 20}\n", "#2 0.630720 (0.057585) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 80}\n", "#5 0.626074 (0.071196) with: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 20}\n", "#3 0.628513 (0.068331) with: {'criterion': 'entropy', 'max_depth': 10, 
'n_estimators': 80}\n" ] } ], "source": [ "# Grid Search: Random Forest Classifier\n", "'''\n", "n_estimators : int (default=100)\n", " The number of trees in the forest. \n", "max_depth : int or None, optional (default=None)\n", " The maximum depth of each tree. If None, then nodes are expanded until all leaves are pure \n", " or until all leaves contain less than min_samples_split samples. \n", "criterion : string, optional (default=”gini”)\n", " The function to measure the quality of a split. \n", " Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. \n", " \n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "n_estimators = [20,80]\n", "max_depth= [5,10]\n", "criterion = [\"gini\",\"entropy\"]\n", "param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, criterion = criterion )\n", "model = RandomForestClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", "    print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.642451 using {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 20}\n", 
"#4 0.611672 (0.089702) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 20}\n", "#3 0.632985 (0.053067) with: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 80}\n", "#6 0.597735 (0.096033) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 20}\n", "#8 0.597387 (0.095569) with: {'criterion': 'gini', 'max_depth': 10, 'n_estimators': 80}\n", "#1 0.642451 (0.077588) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 20}\n", "#2 0.633101 (0.062141) with: {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 80}\n", "#5 0.604297 (0.067871) with: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 20}\n", "#7 0.597561 (0.096830) with: {'criterion': 'entropy', 'max_depth': 10, 'n_estimators': 80}\n" ] } ], "source": [ "# Grid Search: ExtraTreesClassifier()\n", "'''\n", "n_estimators : int (default=100)\n", " The number of boosting stages to perform. \n", " Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.\n", "max_depth : integer, optional (default=3)\n", " maximum depth of the individual regression estimators. \n", " The maximum depth limits the number of nodes in the tree. \n", " Tune this parameter for best performance; the best value depends on the interaction of the input variables \n", "criterion : string, optional (default=”gini”)\n", " The function to measure the quality of a split. \n", " Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. 
\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "n_estimators = [20,80]\n", "max_depth= [5,10]\n", "criterion = [\"gini\",\"entropy\"]\n", "param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, criterion = criterion )\n", "model = ExtraTreesClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", " print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.635366 using {'hidden_layer_sizes': (20,)}\n", "#1 0.635366 (0.052710) with: {'hidden_layer_sizes': (20,)}\n", "#4 0.604413 (0.050579) with: {'hidden_layer_sizes': (50,)}\n", "#3 0.609059 (0.043019) with: {'hidden_layer_sizes': (20, 20)}\n", "#2 0.633217 (0.066650) with: {'hidden_layer_sizes': (20, 30, 20)}\n" ] } ], "source": [ "# Grid Search : NN algorithm tuning\n", "'''\n", "hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)\n", " The ith element represents the number of neurons in the ith hidden layer.\n", "Other Parameters that can be tuned\n", " learning_rate_init : double, optional, default 0.001\n", " The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.\n", " max_iter : int, optional, default 200\n", " Maximum number of iterations. 
The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "hidden_layer_sizes=[(20,), (50,), (20,20), (20, 30, 20)]\n", "param_grid = dict(hidden_layer_sizes=hidden_layer_sizes)\n", "model = MLPClassifier()\n", "kfold = KFold(n_splits=num_folds, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", "    print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.625726 using {'neurons': 15}\n", "#4 0.590128 (0.042692) with: {'neurons': 1}\n", "#3 0.604065 (0.039938) with: {'neurons': 5}\n", "#2 0.613879 (0.055881) with: {'neurons': 10}\n", "#1 0.625726 (0.069088) with: {'neurons': 15}\n" ] } ], "source": [ "# Grid Search : Deep Neural Network algorithm tuning\n", "'''\n", "neurons: int\n", " Number of neurons (units) in a hidden layer of the network. 
\n", "batch_size: int\n", " Number of observations shown to the network before the weights are updated.\n", "epochs: int\n", " Number of times that the entire training dataset is shown to the network during training.\n", "activation: str\n", " The activation function controls the non-linearity of individual neurons and when they fire.\n", "learn_rate: float\n", " Controls how much the weights are updated at the end of each batch.\n", "momentum: float\n", " Controls how much the previous update influences the current weight update.\n", "''' \n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "#Hyperparameters that can be modified\n", "neurons = [1, 5, 10, 15]\n", "batch_size = [10, 20, 40, 60, 80, 100]\n", "epochs = [10, 50, 100]\n", "activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']\n", "learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]\n", "momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]\n", "\n", "#Changing only Neurons for the sake of simplicity\n", "param_grid = dict(neurons=neurons)\n", "model = KerasClassifier(build_fn=create_model, epochs=50, batch_size=10, verbose=0)\n", "# recent scikit-learn versions require shuffle=True whenever random_state is set\n", "kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)\n", "grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)\n", "grid_result = grid.fit(rescaledX, Y_train)\n", "\n", "#Print Results\n", "print(\"Best: %f using %s\" % (grid_result.best_score_, grid_result.best_params_))\n", "means = grid_result.cv_results_['mean_test_score']\n", "stds = grid_result.cv_results_['std_test_score']\n", "params = grid_result.cv_results_['params']\n", "ranks = grid_result.cv_results_['rank_test_score']\n", "for mean, stdev, param, rank in zip(means, stds, params, ranks):\n", "    print(\"#%d %f (%f) with: %r\" % (rank, mean, stdev, param))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# 7. 
Finalize the Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the results above, GBM is carried forward as the final model for this problem.\n", "\n", "Finalize the model with the best parameters found during the tuning step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7.1. Results on the Test Dataset" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,\n", " learning_rate=0.1, loss='deviance', max_depth=5,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=20,\n", " n_iter_no_change=None, presort='deprecated',\n", " random_state=None, subsample=1.0, tol=0.0001,\n", " validation_fraction=0.1, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prepare model\n", "scaler = StandardScaler().fit(X_train)\n", "rescaledX = scaler.transform(X_train)\n", "model = GradientBoostingClassifier(n_estimators=20, max_depth=5) # parameters selected in the tuning step\n", "# GBM is tree-based and insensitive to feature scaling, so it is fit on the unscaled features\n", "model.fit(X_train, Y_train)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "_cell_guid": "f9725666-3c21-69d1-ddf6-45e47d982444" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.6666666666666666\n", "[[30 22]\n", " [13 40]]\n", " precision recall f1-score support\n", "\n", " 0 0.70 0.58 0.63 52\n", " 1 0.65 0.75 0.70 53\n", "\n", " accuracy 0.67 105\n", " macro avg 0.67 0.67 0.66 105\n", "weighted avg 0.67 0.67 0.66 105\n", "\n" ] } ], "source": [ "# estimate accuracy on validation set\n", "rescaledValidationX = scaler.transform(X_validation)\n", "# predict on the unscaled features to match how the model was fit\n", "predictions = model.predict(X_validation)\n", 
"print(accuracy_score(Y_validation, predictions))\n", "print(confusion_matrix(Y_validation, predictions))\n", "print(classification_report(Y_validation, predictions))" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,\n", " 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,\n", " 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,\n", " 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,\n", " 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0])" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "998 0\n", "989 1\n", "664 1\n", "474 0\n", "601 0\n", "918 0\n", "114 1\n", "7 1\n", "593 0\n", "201 1\n", "946 0\n", "156 1\n", "375 0\n", "513 1\n", "177 1\n", "89 0\n", "466 0\n", "537 1\n", "634 0\n", "927 0\n", "454 0\n", "648 0\n", "938 0\n", "530 1\n", "818 1\n", "498 1\n", "197 0\n", "961 1\n", "405 0\n", "432 1\n", "806 1\n", "35 0\n", "531 0\n", "334 0\n", "652 0\n", "22 1\n", "677 0\n", "605 1\n", "515 1\n", "51 1\n", "145 1\n", "729 1\n", "475 0\n", "313 0\n", "252 0\n", "97 1\n", "969 1\n", "88 1\n", "501 1\n", "38 1\n", "273 0\n", "793 1\n", "576 1\n", "479 1\n", "442 1\n", "320 0\n", "212 0\n", "172 0\n", "917 0\n", "812 0\n", "207 1\n", "72 1\n", "727 0\n", "491 0\n", "849 0\n", "919 0\n", "328 1\n", "834 0\n", "835 0\n", "721 0\n", "711 0\n", "347 1\n", "896 1\n", "831 0\n", "521 0\n", "930 1\n", "832 0\n", "623 1\n", "684 1\n", "666 1\n", "458 1\n", "157 1\n", "602 0\n", "284 1\n", "714 0\n", "107 1\n", "422 1\n", "653 0\n", "730 1\n", "416 0\n", "293 1\n", "923 1\n", "876 1\n", "191 0\n", "892 1\n", "709 1\n", "814 0\n", "471 0\n", "398 0\n", "506 1\n", "597 0\n", "44 0\n", "34 1\n", "840 0\n", "47 1\n", "Name: Risk_Code, 
dtype: int32" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Y_validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7.2. Variable Intuition/Feature Importance\n", "Let us look at the feature importances of the GBM model." ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.14559042 0.02828504 0.45990366 0.23325303 0.00326138 0.02257884\n", " 0.03420548 0.02710298 0.04581917]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "model = GradientBoostingClassifier()\n", "model.fit(rescaledX,Y_train)\n", "print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers\n", "#plot graph of feature importances for better visualization\n", "feat_importances = pd.Series(model.feature_importances_, index=X.columns)\n", "feat_importances.nlargest(10).plot(kind='barh')\n", "pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7.3. Save Model for Later Use" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "# Save Model Using Pickle\n", "from pickle import dump\n", "from pickle import load\n", "\n", "# save the model to disk\n", "filename = 'finalized_model.sav'\n", "dump(model, open(filename, 'wb'))" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7047619047619048\n" ] } ], "source": [ "# some time later...\n", "# load the model from disk\n", "loaded_model = load(open(filename, 'rb'))\n", "# estimate accuracy on validation set\n", "rescaledValidationX = scaler.transform(X_validation)\n", "predictions = model.predict(rescaledValidationX)\n", "result = accuracy_score(Y_validation, predictions)\n", "print(result)" ] } ], "metadata": { "_change_revision": 206, "_is_fork": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }