{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import pandas as pd\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn import metrics\n", "from keras.models import Sequential\n", "from keras.layers import Dense, Dropout, BatchNormalization\n", "from keras.utils import to_categorical\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Data and Question of Interest\n", "\n", "Let's take a look at the [UCI Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult). This data set was extracted from Census data with the goal of predicting who makes over $50,000.\n", "\n", "I would like to use these data to explore various machine learning algorithms of increasing complexity and to see how they compare on various evaluation metrics. Additionally, it will be interesting to see how much there is to gain by spending some time fine-tuning these algorithms.\n", "\n", "We will look at the following algorithms:\n", "1. [Logistic Regression](http://learningwithdata.com/logistic-regression-and-optimization.html#logistic-regression-and-optimization)\n", "2. [Gradient Boosted Trees](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)\n", "3. [Deep Learning](https://blog.algorithmia.com/introduction-to-deep-learning-2016/)\n", "\n", "And evaluate them with the following metrics:\n", "1. [F1 Score](https://en.wikipedia.org/wiki/F1_score)\n", "2. [Area Under ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n", "3. [Accuracy](https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf)\n", "\n", "Let's go ahead and read in the data and take a look." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',\n", "         'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']\n", "train_df = pd.read_csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",\n", "                       header=None, names=names)\n", "test_df = pd.read_csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",\n", "                      header=None, names=names, skiprows=[0])\n", "all_df = pd.concat([train_df, test_df])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age workclass fnlwgt education educationnum \\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "\n", " maritalstatus occupation relationship race sex \\n", "0 Never-married Adm-clerical Not-in-family White Male \n", "1 Married-civ-spouse Exec-managerial Husband White Male \n", "2 Divorced Handlers-cleaners Not-in-family White Male \n", "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", "4 Married-civ-spouse Prof-specialty Wife Black Female \n", "\n", " capitalgain capitalloss hoursperweek nativecountry label \n", "0 2174 0 40 United-States <=50K \n", "1 0 0 13 United-States <=50K \n", "2 0 0 40 United-States <=50K \n", "3 0 0 40 United-States <=50K \n", "4 0 0 40 Cuba <=50K " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we have 14 columns to help us predict our classification. We will drop fnlwgt and education and then convert our categorical features to dummy variables. We will also convert our label to 0 and 1, where 1 means the person made more than $50k." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(48842, 15)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_df.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "drop_columns = ['fnlwgt', 'education']\n", "continuous_features = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']\n", "cat_features = ['educationnum', 'workclass', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_df_dummies = pd.get_dummies(all_df, columns=cat_features)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_df_dummies.drop(drop_columns, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = all_df_dummies['label'].apply(lambda x: 0 if '<' in x else 1)\n", "X = all_df_dummies.drop(['label'], axis=1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.760718\n", "1 0.239282\n", "Name: label, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like we don't have balanced classes, so it's a good thing we are looking at metrics other than accuracy. Now let's split into training and testing, with 1/3 held out for testing."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32724, 106)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Pipeline\n", "\n", "The goal of this project is not to focus on cleaning, data exploration, or feature engineering, so we will define a very simple cleaning pipeline that fills any missing values with the median and then scales every column." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "clean_pipeline = Pipeline([('imputer', preprocessing.Imputer(strategy=\"median\")),\n", "                           ('std_scaler', preprocessing.StandardScaler())])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train_clean = clean_pipeline.fit_transform(X_train)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_test_clean = clean_pipeline.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Metrics\n", "\n", "A simple function to calculate our metrics of interest." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def evaluate(true, pred):\n", "    f1 = metrics.f1_score(true, pred)\n", "    roc_auc = metrics.roc_auc_score(true, pred)\n", "    accuracy = metrics.accuracy_score(true, pred)\n", "    print(\"F1: {0}\\nROC_AUC: {1}\\nACCURACY: {2}\".format(f1, roc_auc, accuracy))\n", "    return f1, roc_auc, accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression\n", "\n", "The first model up is a simple logistic regression with the default hyperparameters." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LogisticRegression()\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lr_predictions = clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.6507094739859539\n", "ROC_AUC: 0.7574953226590644\n", "ACCURACY: 0.8488025809653803\n" ] } ], "source": [ "lr_f1, lr_roc_auc, lr_acc = evaluate(y_test, lr_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuned Logistic Regression\n", "\n", "Now let's spend a bit of time tuning our regularization."
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=None, error_score='raise',\n", " estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False),\n", " fit_params={}, iid=True, n_jobs=10,\n", " param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n", " scoring='f1', verbose=0)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}\n", "tuned_lr = GridSearchCV(LogisticRegression(), lr_grid, scoring='f1', n_jobs=10)\n", "tuned_lr.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are our best parameters:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': 1, 'penalty': 'l1'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuned_lr.best_params_" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.6512027491408934\n", "ROC_AUC: 0.7578833983412963\n", "ACCURACY: 0.8488646234024072\n" ] } ], "source": [ "tuned_lr_predictions = tuned_lr.predict(X_test)\n", "tuned_lr_f1, tuned_lr_roc_auc, tuned_lr_acc = evaluate(y_test, tuned_lr_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient Boosted Trees\n", "\n", "Now an out-of-the-box boosted tree." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.6507094739859539\n", "ROC_AUC: 0.7574953226590644\n", "ACCURACY: 0.8488025809653803\n" ] } ], "source": [ "gbt = GradientBoostingClassifier()\n", "gbt.fit(X_train, y_train)\n", "gbt_predictions = gbt.predict(X_test)\n", "gbt_f1, gbt_roc_auc, gbt_acc = evaluate(y_test, gbt_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GBT Tuned\n", "\n", "And now a tuned boosted tree. I ran the grid shown below to get my final parameters, but for speed's sake I now just fit the best configuration."
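, "\n", "For reference, the search itself would look something like this sketch (assuming the same `GridSearchCV` settings used for the logistic regression above; it is slow, which is why only the winning configuration is fit in the next cell):\n", "\n", "```python\n", "# Sketch of the grid search described above; not re-run here for speed\n", "gbt_grid = {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}\n", "gbt_search = GridSearchCV(GradientBoostingClassifier(), gbt_grid, scoring='f1', n_jobs=10)\n", "gbt_search.fit(X_train, y_train)\n", "gbt_search.best_params_  # reported best: learning_rate=.01, n_estimators=1000, max_depth=5\n", "```"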
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", " learning_rate=0.01, loss='deviance', max_depth=5,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_split=1e-07, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=1000, presort='auto', random_state=None,\n", " subsample=1.0, verbose=0, warm_start=False)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#gbt_grid = {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}\n", "gbt_tuned = GradientBoostingClassifier(learning_rate=.01, n_estimators=1000, max_depth=5)\n", "gbt_tuned.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.7042577675489067\n", "ROC_AUC: 0.7885511539729889\n", "ACCURACY: 0.8724407494726393\n" ] } ], "source": [ "gbt_tuned_predictions = gbt_tuned.predict(X_test)\n", "gbt_tuned_f1, gbt_tunded_roc_auc, gbt_tuned_acc = evaluate(y_test, gbt_tuned_predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Learning Simple\n", "\n", "Now, we have all heard about the amazing power of deep learning, so let's take a look at how well it fares on our task. There are a fair number of hyperparameters with deep nets, but I will pick some reasonable values as our starting point." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model_simple = Sequential()\n", "model_simple.add(Dense(1024, activation='relu', input_dim=X_train.shape[1]))\n", "model_simple.add(Dropout(0.5))\n", "model_simple.add(Dense(2, activation='softmax', name='softmax'))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_train_cat = to_categorical(y_train.values, 2)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model_simple.compile(loss='categorical_crossentropy',\n", "                     optimizer='adam',\n", "                     metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/25\n", "32724/32724 [==============================] - 2s - loss: 1.3490 - acc: 0.7843 \n", "Epoch 2/25\n", "32724/32724 [==============================] - 1s - loss: 1.3434 - acc: 0.7985 \n", "Epoch 3/25\n", "32724/32724 [==============================] - 1s - loss: 1.4843 - acc: 0.7937 \n", "Epoch 4/25\n", "32724/32724 [==============================] - 1s - loss: 1.4806 - acc: 0.7947 \n", "Epoch 5/25\n", "32724/32724 [==============================] - 1s - loss: 1.4770 - acc: 0.7953 \n", "Epoch 6/25\n", "32724/32724 [==============================] - 1s - loss: 1.4761 - acc: 0.7964 \n", "Epoch 7/25\n", "32724/32724 [==============================] - 1s - loss: 1.4755 - acc: 0.7977 \n", "Epoch 8/25\n", "32724/32724 [==============================] - 1s - loss: 1.4750 - acc: 0.7977 \n", "Epoch 9/25\n", "32724/32724 [==============================] - 1s - loss: 1.4744 - acc: 0.7968 \n", "Epoch 10/25\n", "32724/32724 [==============================] - 1s - loss: 1.4738 - acc: 0.7981 \n", "Epoch 11/25\n", "32724/32724 [==============================] - 1s - loss: 1.4722 - acc: 0.7975 \n", "Epoch 12/25\n", "32724/32724 
[==============================] - 1s - loss: 1.4725 - acc: 0.7995 \n", "Epoch 13/25\n", "32724/32724 [==============================] - 1s - loss: 1.4714 - acc: 0.7986 \n", "Epoch 14/25\n", "32724/32724 [==============================] - 1s - loss: 1.4709 - acc: 0.7988 \n", "Epoch 15/25\n", "32724/32724 [==============================] - 1s - loss: 1.4713 - acc: 0.7979 \n", "Epoch 16/25\n", "32724/32724 [==============================] - 1s - loss: 1.4698 - acc: 0.7982 \n", "Epoch 17/25\n", "32724/32724 [==============================] - 1s - loss: 1.4704 - acc: 0.7988 \n", "Epoch 18/25\n", "32724/32724 [==============================] - 1s - loss: 1.4705 - acc: 0.7994 \n", "Epoch 19/25\n", "32724/32724 [==============================] - 1s - loss: 1.4707 - acc: 0.7981 \n", "Epoch 20/25\n", "32724/32724 [==============================] - 1s - loss: 1.4695 - acc: 0.8000 \n", "Epoch 21/25\n", "32724/32724 [==============================] - 1s - loss: 1.4700 - acc: 0.8002 \n", "Epoch 22/25\n", "32724/32724 [==============================] - 1s - loss: 1.4687 - acc: 0.8006 \n", "Epoch 23/25\n", "32724/32724 [==============================] - 1s - loss: 1.4689 - acc: 0.8006 \n", "Epoch 24/25\n", "32724/32724 [==============================] - 1s - loss: 1.4692 - acc: 0.7994 \n", "Epoch 25/25\n", "32724/32724 [==============================] - 1s - loss: 1.4672 - acc: 0.8003 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_simple.fit(X_train.values, y_train_cat, batch_size=32, epochs=25)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.4076755973931933\n", "ROC_AUC: 0.753604522328225\n", "ACCURACY: 0.7969971460478967\n" ] } ], "source": [ "deep_predictions_simple = model_simple.predict(X_test.values)\n", "deep_simple_f1, deep_simple_roc_auc, deep_simple_acc = evaluate(np.argmax(deep_predictions_simple, 1), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Learning Tuned A Bit\n", "\n", "Then I spent about 30 minutes playing with different architectures to see how far I could push a deep net, and this is what I got. Note: this is not to say that there isn't a better or even much better architecture, but after trying a fair number of standard options, nothing better appeared."
] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = Sequential()\n", "model.add(Dense(1024, activation='elu', kernel_initializer='glorot_normal', input_dim = X_train.shape[1]))\n", "model.add(BatchNormalization())\n", "model.add(Dense(128, activation='elu', kernel_initializer='glorot_normal'))\n", "model.add(BatchNormalization())\n", "model.add(Dense(64, activation='elu', kernel_initializer='glorot_normal'))\n", "model.add(Dropout(0.2))\n", "model.add(Dense(2, activation='softmax', name='softmax'))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model.compile(loss='categorical_crossentropy',\n", " optimizer='adam',\n", " metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/40\n", "32724/32724 [==============================] - 0s - loss: 0.3969 - acc: 0.8129 \n", "Epoch 2/40\n", "32724/32724 [==============================] - 0s - loss: 0.3418 - acc: 0.8409 \n", "Epoch 3/40\n", "32724/32724 [==============================] - 0s - loss: 0.3360 - acc: 0.8401 \n", "Epoch 4/40\n", "32724/32724 [==============================] - 0s - loss: 0.3305 - acc: 0.8430 \n", "Epoch 5/40\n", "32724/32724 [==============================] - 0s - loss: 0.3270 - acc: 0.8458 \n", "Epoch 6/40\n", "32724/32724 [==============================] - 0s - loss: 0.3274 - acc: 0.8465 \n", "Epoch 7/40\n", "32724/32724 [==============================] - 0s - loss: 0.3223 - acc: 0.8495 \n", "Epoch 8/40\n", "32724/32724 [==============================] - 0s - loss: 0.3165 - acc: 0.8524 \n", "Epoch 9/40\n", "32724/32724 [==============================] - 0s - loss: 0.3231 - acc: 0.8477 \n", "Epoch 10/40\n", "32724/32724 [==============================] - 0s - loss: 0.3177 - acc: 0.8527 \n", "Epoch 11/40\n", "32724/32724 [==============================] - 0s - loss: 0.3158 - acc: 0.8522 \n", "Epoch 12/40\n", "32724/32724 [==============================] - 0s - loss: 0.3174 - acc: 0.8505 \n", "Epoch 13/40\n", "32724/32724 [==============================] - 0s - loss: 0.3148 - acc: 0.8527 \n", "Epoch 14/40\n", "32724/32724 [==============================] - 0s - loss: 0.3114 - acc: 0.8539 \n", "Epoch 15/40\n", "32724/32724 [==============================] - 0s - loss: 0.3106 - acc: 0.8552 \n", "Epoch 16/40\n", "32724/32724 [==============================] - 0s - loss: 0.3094 - acc: 0.8555 \n", "Epoch 17/40\n", "32724/32724 [==============================] - 0s - loss: 0.3074 - acc: 0.8548 \n", "Epoch 18/40\n", "32724/32724 [==============================] - 0s - loss: 0.3089 - acc: 0.8559 \n", "Epoch 19/40\n", "32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8555 \n", "Epoch 20/40\n", "32724/32724 [==============================] - 0s - loss: 0.3097 - acc: 0.8564 \n", "Epoch 21/40\n", "32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8554 \n", "Epoch 22/40\n", "32724/32724 [==============================] - 0s - loss: 0.3085 - acc: 0.8547 \n", "Epoch 23/40\n", "32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 0.8589 \n", "Epoch 24/40\n", "32724/32724 [==============================] - 0s - loss: 0.3076 - acc: 0.8555 \n", "Epoch 25/40\n", "32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8581 \n", "Epoch 26/40\n", "32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 
0.8587 \n", "Epoch 27/40\n", "32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8567 \n", "Epoch 28/40\n", "32724/32724 [==============================] - 0s - loss: 0.3021 - acc: 0.8588 \n", "Epoch 29/40\n", "32724/32724 [==============================] - 0s - loss: 0.3026 - acc: 0.8584 \n", "Epoch 30/40\n", "32724/32724 [==============================] - 0s - loss: 0.3033 - acc: 0.8600 \n", "Epoch 31/40\n", "32724/32724 [==============================] - 0s - loss: 0.3027 - acc: 0.8574 \n", "Epoch 32/40\n", "32724/32724 [==============================] - 0s - loss: 0.3019 - acc: 0.8585 \n", "Epoch 33/40\n", "32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8593 \n", "Epoch 34/40\n", "32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8603 \n", "Epoch 35/40\n", "32724/32724 [==============================] - 0s - loss: 0.2968 - acc: 0.8619 \n", "Epoch 36/40\n", "32724/32724 [==============================] - 0s - loss: 0.3010 - acc: 0.8598 \n", "Epoch 37/40\n", "32724/32724 [==============================] - 0s - loss: 0.2998 - acc: 0.8609 \n", "Epoch 38/40\n", "32724/32724 [==============================] - 0s - loss: 0.2980 - acc: 0.8619 \n", "Epoch 39/40\n", "32724/32724 [==============================] - 0s - loss: 0.2974 - acc: 0.8616 \n", "Epoch 40/40\n", "32724/32724 [==============================] - 0s - loss: 0.2970 - acc: 0.8623 \n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train.values, y_train_cat, batch_size=512, epochs=40)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "deep_predictions = model.predict(X_test.values)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1: 0.6730386300278773\n", "ROC_AUC: 0.795070458358415\n", "ACCURACY: 0.8471894776026803\n" ] } ], "source": [ "deep_f1, deep_roc_auc, deep_acc = evaluate(np.argmax(deep_predictions, 1), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Results\n", "\n", "So what did we end up with and what did we learn?" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model_names = [\"LR\", \"Tuned LR\", \"GBT\", \"Tuned GBT\", \"Deep\", \"Deep Tuned\"]\n", "metrics_of_interest = [\"F1\", \"ROC_AUC\", \"ACCURACY\"]\n", "f1s = [lr_f1, tuned_lr_f1, gbt_f1, gbt_tuned_f1, deep_simple_f1, deep_f1]\n", "roc_aucs = [lr_roc_auc, tuned_lr_roc_auc, gbt_roc_auc, gbt_tunded_roc_auc, deep_simple_roc_auc, deep_roc_auc]\n", "accuracy = [lr_acc, tuned_lr_acc, gbt_acc, gbt_tuned_acc, deep_simple_acc, deep_acc]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "results_df = pd.DataFrame(columns=metrics_of_interest, index=model_names, data=np.array([f1s, roc_aucs, accuracy]).T)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " F1 ROC_AUC ACCURACY\n", "LR 0.650709 0.757495 0.848803\n", "Tuned LR 0.651203 0.757883 0.848865\n", "GBT 0.650709 0.757495 0.848803\n", "Tuned GBT 0.704258 0.788551 0.872441\n", "Deep 0.407676 0.753605 0.796997\n", "Deep Tuned 0.673039 0.795070 0.847189" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "First off, the out-of-the-box logistic regression does basically as well as the tuned version: tuning helped a bit, but didn't make much of a difference. The out-of-the-box GBT did slightly worse, but basically as well as the tuned logistic regression, which for some might seem surprising given the successes of XGBoost on Kaggle. That being said, once you spend a bit of time tuning, GBTs do significantly better, with a jump across the board and about a 7.5% increase in F1.\n", "\n", "The deep networks are interesting indeed. The first naive pass does very poorly. The ROC_AUC and accuracy look okay, but the F1 score points to the issue: it learned that most things are a 0 and mostly predicts that class, as we can see below:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import Counter" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({0: 14508, 1: 1610})" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(np.argmax(deep_predictions_simple, 1))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({0: 12204, 1: 3914})" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That being said, after spending some time tuning, we are able to boost the deep net's performance a lot, even getting to the best ROC_AUC score and a competitive F1 and accuracy. So what are the main takeaways?\n", "\n", "1. Logistic regression is a nice baseline that may not require a lot of tuning, and even if it does need some, it is very fast to train.\n", "2. GBTs are powerful algorithms, but without tuning they may not beat a baseline by much. That being said, with a fairly standard grid search across a few values one can see good improvements. This grid search can take some time, though, as GBTs are slower to train than logistic regression.\n", "3. Deep nets can achieve competitive results even outside of the text, image, and audio domains. Training a \"standard deep net\" without any tuning, though, can lead to very poor results. To really get the most out of a deep network, time needs to be spent experimenting with architectures: how deep? How wide? What regularization, normalization, and initialization? There are tons of options, and perhaps the path to tuning is less clear than it is for GBTs. In addition, deep nets can be slow to train, so all of this iteration takes time.\n", "\n", "In conclusion, there really doesn't seem to be a free lunch. You can get better results with more complex models, but those models take time and understanding to tune, and even then they might not provide significant improvements. Lastly, this is clearly just one data set, so these findings may not generalize at all. It would be interesting to run similar tests on other data sets to see if there is a trend."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }