In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, MaxPool1D, Flatten, Conv1D
from keras.utils import to_categorical
import numpy as np

Using TensorFlow backend.


In [2]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## The Data and Question of Interest

Let's take a look at the [UCI Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult). This data set was extracted from Census data with the goal of predicting who makes over $50,000.

I would like to use these data as a means of exploring various machine learning algorithms of increasing complexity to see how they compare on various evaluation metrics. Additionally, it will be interesting to see how much there is to gain by spending some time fine-tuning these algorithms.

We will look at the following algorithms:
1. [Logistic Regression](http://learningwithdata.com/logistic-regression-and-optimization.html#logistic-regression-and-optimization)
2. [Gradient Boosting Trees](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
3. [Deep Learning](https://blog.algorithmia.com/introduction-to-deep-learning-2016/)

And evaluate them with the following metrics:
1. [F1 Score](https://en.wikipedia.org/wiki/F1_score)
2. [Area Under ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
3. [Accuracy](https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf)

Let's go ahead and read in the data and take a look.

In [4]:
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, names=names)
test_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                      header=None, names=names, skiprows=[0])
all_df = pd.concat([train_df, test_df])

In [5]:
all_df.head()

| | age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


It looks like we have 14 columns to help us predict our classification. We will drop fnlwgt (a census sampling weight) and education (educationnum already encodes the same information as a number), and then convert our categorical features to dummy variables. We will also convert our label to 0 and 1, where 1 means the person made more than $50k.

In [6]:
all_df.shape

(48842, 15)

In [7]:
drop_columns = ['fnlwgt', 'education']
continuous_features = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']
cat_features =['educationnum', 'workclass', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry']

In [8]:
all_df_dummies = pd.get_dummies(all_df, columns=cat_features)

In [9]:
all_df_dummies.drop(drop_columns, axis=1, inplace=True)

In [10]:
y = all_df_dummies['label'].apply(lambda x: 0 if '<' in x else 1)
X = all_df_dummies.drop(['label'], axis=1)

In [11]:
y.value_counts(normalize=True)

0    0.760718
1    0.239282
Name: label, dtype: float64

Looks like we don't have balanced classes, so it is a good thing we are looking at metrics other than accuracy. Now let's split into training and testing sets, with roughly 1/3 held out for testing.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [13]:
X_train.shape

(32724, 106)
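
With imbalanced classes, a plain random split can leave the two halves with slightly different positive rates. The notebook does not do this, but a minimal sketch of a stratified split looks like the following. The synthetic `X_demo` and `y_demo` arrays are stand-ins invented for illustration, not the Adult data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (roughly 24% positives).
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.uniform(size=1000) < 0.24).astype(int)

# stratify=y_demo keeps the positive rate nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```

Whether stratifying changes the final scores on a data set of this size is not something the notebook tests; it mainly removes one source of noise from the comparison.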

## Cleaning Pipeline

The goal of this project is not to focus on cleaning, data exploration, or feature engineering. So we will define a very simple cleaning pipeline that fills any missing values with the median and then scales every column.

In [14]:
clean_pipeline = Pipeline([('imputer', preprocessing.Imputer(strategy="median")),
                           ('std_scaler', preprocessing.StandardScaler()),])

In [15]:
X_train_clean = clean_pipeline.fit_transform(X_train)

In [16]:
X_test_clean = clean_pipeline.transform(X_test)
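
Note that the model cells in this notebook are fit on `X_train` and `X_test`, not on `X_train_clean` and `X_test_clean`, so the scaling above never reaches the models. One way to make that harder to miss is to put the estimator in the same pipeline as the cleaning steps. The sketch below assumes scikit-learn 0.20 or later, where `SimpleImputer` lives in `sklearn.impute` (it replaces the older `preprocessing.Imputer` used above), and it runs on made-up data rather than the Adult set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: three unscaled numeric columns, label driven by the first one.
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 50.0).astype(int)
X_demo[rng.uniform(size=X_demo.shape) < 0.05] = np.nan  # knock out ~5% of values

# Imputation, scaling and the classifier are fit together, so the model
# always sees the cleaned features at both fit and predict time.
model_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
model_pipeline.fit(X_demo, y_demo)
demo_predictions = model_pipeline.predict(X_demo)
print("training accuracy:", (demo_predictions == y_demo).mean())
```

Scaling tends to matter most for the neural networks and for regularized linear models; the recorded results in this notebook do not show what difference it would make here.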

## Metrics

A simple function to calculate our metrics of interest:

In [17]:
def evaluate(true, pred):
    f1 = metrics.f1_score(true, pred)
    roc_auc = metrics.roc_auc_score(true, pred)
    accuracy = metrics.accuracy_score(true, pred)
    print("F1: {0}\nROC_AUC: {1}\nACCURACY: {2}".format(f1, roc_auc, accuracy))
    return f1, roc_auc, accuracy
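
One thing to keep in mind about `evaluate`: it passes hard 0/1 predictions to `roc_auc_score`. That is valid, but it scores a single operating point rather than the full ranking. A small sketch with hand-made numbers (the `scores` list is hypothetical, standing in for something like `clf.predict_proba(X)[:, 1]`) shows the two can differ:

```python
from sklearn import metrics

y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]  # hypothetical predicted probabilities
hard_labels = [int(s >= 0.5) for s in scores]  # what .predict() would return

# AUC over the full ranking of scores.
auc_from_scores = metrics.roc_auc_score(y_true, scores)
# AUC from thresholded labels, which equals (recall + specificity) / 2.
auc_from_labels = metrics.roc_auc_score(y_true, hard_labels)

print("AUC from scores:", auc_from_scores)
print("AUC from labels:", auc_from_labels)
```

The ROC_AUC figures reported in this notebook are all of the second kind, so they are comparable with each other but will generally differ from probability-based AUCs quoted elsewhere for this data set.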

## Logistic Regression

The first model up is a simple logistic regression with the default hyperparameters.

In [18]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
lr_predictions = clf.predict(X_test)

In [20]:
lr_f1, lr_roc_auc, lr_acc = evaluate(y_test, lr_predictions)

F1: 0.6507094739859539
ROC_AUC: 0.7574953226590644
ACCURACY: 0.8488025809653803


## Tuned Logistic Regression

Now let's spend a bit of time tuning our regularization.

In [21]:
lr_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
tuned_lr = GridSearchCV(LogisticRegression(), lr_grid, scoring='f1', n_jobs=10)
tuned_lr.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=10,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

Here are our best parameters:

In [25]:
tuned_lr.best_params_

{'C': 1, 'penalty': 'l1'}

In [26]:
tuned_lr_predictions = tuned_lr.predict(X_test)
tuned_lr_f1, tuned_lr_roc_auc, tuned_lr_acc = evaluate(y_test, tuned_lr_predictions)

F1: 0.6512027491408934
ROC_AUC: 0.7578833983412963
ACCURACY: 0.8488646234024072


## Gradient Boosted Trees

Now an out-of-the-box boosted tree.

In [27]:
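gbt = GradientBoostingClassifier()
gbt.fit(X_train, y_train)
# Predict with the boosted model itself. The recorded run called clf.predict here,
# so the scores printed below are the logistic regression's, not the GBT's.
gbt_predictions = gbt.predict(X_test)
gbt_f1, gbt_roc_auc, gbt_acc = evaluate(y_test, gbt_predictions)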

F1: 0.6507094739859539
ROC_AUC: 0.7574953226590644
ACCURACY: 0.8488025809653803


## GBT Tuned

And now a tuned boosted tree. I ran the grid shown below to get my final parameters, but for speed's sake I now just show the best.

In [28]:
#gbt_grid = {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}
gbt_tuned = GradientBoostingClassifier(learning_rate=.01, n_estimators=1000, max_depth=5)
gbt_tuned.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=5,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=1000, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
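
For reference, here is a sketch of how the commented-out grid could be run with `GridSearchCV`. To keep it fast it uses synthetic data and a much smaller grid than the one in the comment; the `scoring="f1"` and `cv=3` settings are assumptions, since the notebook does not record how the original GBT search was scored.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The grid from the notebook was:
# {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}
# A scaled-down grid is used here so the sketch runs in seconds.
small_gbt_grid = {"learning_rate": [0.1], "n_estimators": [20, 40], "max_depth": [2, 3]}

gbt_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), small_gbt_grid, scoring="f1", cv=3
)
gbt_search.fit(X_demo, y_demo)
print(gbt_search.best_params_)
```

Swapping `small_gbt_grid` for the full grid and `X_demo`, `y_demo` for `X_train`, `y_train` gives the slow version; expect it to take considerably longer than the logistic regression search.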

In [29]:
gbt_tuned_predictions = gbt_tuned.predict(X_test)
gbt_tuned_f1, gbt_tunded_roc_auc, gbt_tuned_acc = evaluate(y_test, gbt_tuned_predictions)

F1: 0.7042577675489067
ROC_AUC: 0.7885511539729889
ACCURACY: 0.8724407494726393


## Deep Learning Simple

Now we have all heard of the amazing power of deep learning. So let's take a look at how well it fares with our task. There are a fair number of hyperparameters with deep nets, but I will pick some reasonable values as our starting point.

In [30]:
model_simple = Sequential()
model_simple.add(Dense(1024, activation='relu' , input_dim = X_train.shape[1]))
model_simple.add(Dropout(0.5))
model_simple.add(Dense(2, activation='softmax', name='softmax'))

In [31]:
y_train_cat = to_categorical(y_train.values, 2)

In [32]:
model_simple.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [33]:
model_simple.fit(X_train.values, y_train_cat, batch_size=32, epochs=25)

Epoch 1/25
...
Epoch 25/25


<keras.callbacks.History at 0x7f92117bfe10>

In [34]:
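deep_predictions_simple = model_simple.predict(X_test.values)
# evaluate expects (true, pred). The scores recorded below came from the swapped order,
# which leaves F1 and accuracy unchanged but alters ROC_AUC.
deep_simple_f1, deep_simple_roc_auc, deep_simple_acc = evaluate(y_test, np.argmax(deep_predictions_simple, 1))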

F1: 0.4076755973931933
ROC_AUC: 0.753604522328225
ACCURACY: 0.7969971460478967


## Deep Learning Tuned A Bit

Then I spent about 30 minutes playing with different architectures to see how far I could push a deep net, and this is what I got. Note: this is not to say that there isn't a better or even much better architecture, but after trying a fair number of normal options, nothing better appeared.

In [35]:
model = Sequential()
model.add(Dense(1024, activation='elu', kernel_initializer='glorot_normal', input_dim = X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dense(128, activation='elu', kernel_initializer='glorot_normal'))
model.add(BatchNormalization())
model.add(Dense(64, activation='elu', kernel_initializer='glorot_normal'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax', name='softmax'))

In [36]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [37]:
model.fit(X_train.values, y_train_cat, batch_size=512, epochs=40)

Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0x7f91f4692780>

In [38]:
deep_predictions = model.predict(X_test.values)

In [39]:
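# evaluate expects (true, pred). The scores recorded below came from the swapped order,
# which leaves F1 and accuracy unchanged but alters ROC_AUC.
deep_f1, deep_roc_auc, deep_acc = evaluate(y_test, np.argmax(deep_predictions, 1))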

F1: 0.6730386300278773
ROC_AUC: 0.795070458358415
ACCURACY: 0.8471894776026803


## Final Results

So what did we end up with and what did we learn?

In [43]:
model_names = ["LR", "Tuned LR", "GBT", "Tuned GBT", "Deep", "Deep Tuned"]
metrics_of_interest = ["F1", "ROC_AUC", "ACCURACY"]
f1s = [lr_f1, tuned_lr_f1, gbt_f1, gbt_tuned_f1, deep_simple_f1, deep_f1]
roc_aucs = [lr_roc_auc, tuned_lr_roc_auc, gbt_roc_auc, gbt_tunded_roc_auc, deep_simple_roc_auc, deep_roc_auc]
accuracy = [lr_acc, tuned_lr_acc, gbt_acc, gbt_tuned_acc, deep_simple_acc, deep_acc]

In [44]:
results_df = pd.DataFrame(columns=metrics_of_interest, index=model_names, data=np.array([f1s, roc_aucs, accuracy]).T)

In [45]:
results_df

| | F1 | ROC_AUC | ACCURACY |
|---|---|---|---|
| LR | 0.650709 | 0.757495 | 0.848803 |
| Tuned LR | 0.651203 | 0.757883 | 0.848865 |
| GBT | 0.650709 | 0.757495 | 0.848803 |
| Tuned GBT | 0.704258 | 0.788551 | 0.872441 |
| Deep | 0.407676 | 0.753605 | 0.796997 |
| Deep Tuned | 0.673039 | 0.79507 | 0.847189 |


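First off, the out-of-the-box logistic regression does basically as well as the tuned version. Tuning helped a bit, but didn't make much of a difference. The GBT row is identical to the untuned logistic regression row in every digit. That is because the recorded predictions for that row came from `clf` (the logistic regression) rather than from `gbt`, so the row says nothing about how an untuned GBT performs. The tuned GBT, on the other hand, does significantly better, with a jump across the board and an F1 increase of about 0.05 (roughly 8% in relative terms) over logistic regression.

The deep networks are interesting indeed. The first naive pass does very poorly. The accuracy looks okay, but the F1 score points to the issue: the net learned that most things are a 0 and predicts 0 almost everywhere. The ROC_AUC for the two deep nets needs care, because the recorded scores were computed with the predictions and labels passed to `evaluate` in swapped order, and ROC_AUC, unlike F1 and accuracy, changes when the two are swapped. The prediction counts make the imbalance clear, as we can see below: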

In [50]:
from collections import Counter

In [51]:
Counter(np.argmax(deep_predictions_simple, 1))

Counter({0: 14508, 1: 1610})

In [52]:
Counter(y_test)

Counter({0: 12204, 1: 3914})
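
The two counters and the reported F1 are enough to rebuild the simple net's confusion matrix by hand. The sketch below does that arithmetic using only numbers printed in this notebook; the variable names are mine. It also compares the two possible argument orders for ROC_AUC on hard labels.

```python
reported_f1 = 0.4076755973931933
pred_pos, pred_neg = 1610, 14508              # Counter of the simple net's predictions
true_pos_total, true_neg_total = 3914, 12204  # Counter of y_test

# F1 = 2*TP / (predicted positives + actual positives), so solve for TP.
tp = round(reported_f1 * (pred_pos + true_pos_total) / 2)
fp = pred_pos - tp
fn = true_pos_total - tp
tn = true_neg_total - fp

recall = tp / true_pos_total
specificity = tn / true_neg_total
precision = tp / pred_pos
npv = tn / pred_neg  # negative predictive value
accuracy = (tp + tn) / (pred_pos + pred_neg)

# With hard 0/1 predictions, roc_auc_score(true, pred) is (recall + specificity) / 2.
auc_true_first = (recall + specificity) / 2
# Swapping the arguments gives (precision + negative predictive value) / 2 instead.
auc_pred_first = (precision + npv) / 2

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("recall:", recall, "precision:", precision, "accuracy:", accuracy)
print("ROC_AUC with (true, pred):", auc_true_first)
print("ROC_AUC with (pred, true):", auc_pred_first)
```

The `(pred, true)` value reproduces the 0.7536 shown in the results table, while the `(true, pred)` order used for the other models comes out noticeably lower. The same check cannot be repeated for the tuned net from what is printed here, since its prediction counts are not shown.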

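That being said, after spending some time tuning, we are able to boost the deep net's performance a lot, reaching a competitive F1 and accuracy. It also posts the highest ROC_AUC in the table, with the caveat that the deep-net ROC_AUC values were computed with predictions and labels passed to `evaluate` in swapped order, so they are not directly comparable with the other rows. So what are the main takeaways?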

1. Logistic regression is a nice baseline that may not require a lot of tuning, and even if it does need some, it is very fast to train.
2. GBTs are powerful algorithms, but without tuning may not beat a baseline by much. That being said, with a fairly standard grid search across a few values one can see good improvements. This grid search can take some time, though, as GBTs are slower to train than logistic regression.
3. Deep nets can achieve competitive results even outside of text, image, and audio fields. Training a "standard deep net" without any tuning, though, can lead to very poor results. To really maximize the value of a deep network, time needs to be spent experimenting with architectures. For example, how deep? How wide? Regularization? Normalization? What kind of initialization? There are tons of options, and perhaps the path to tuning is less clear than with GBTs. In addition, deep nets can be slow to train, so all of this iteration takes time.

In conclusion, there really doesn't seem to be a free lunch. You can get better results with more complex models, but those models do take time and understanding to tune, and even then might not provide significant improvements. Lastly, this is clearly just one data set and the results may not generalize at all. It would be interesting to run similar tests on other data sets to see if there is a trend.