# Interpretable or Accurate? Why not both?

## Case Study: Predicting Employee Attrition Using Machine Learning

This notebook contains the code for the accompanying blog post, [Interpretable or Accurate? Why not both?](https://towardsdatascience.com/interpretable-or-accurate-why-not-both-4d9c73512192?sk=2f44377541a2f49939c921e54eb3cde7)

## Installation

Interpret is supported on Windows, Mac and Linux with Python 3.5+. Please refer to the [documentation](https://interpret.ml/docs/getting-started.html) for more details.

### pip
pip install interpret

### conda
conda install -c interpretml interpret

### source
git clone https://github.com/interpretml/interpret.git && cd interpret/scripts && make install



## Importing necessary libraries


In [1]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

from interpret import show
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
from interpret.data import ClassHistogram
set_visualize_provider(InlineProvider())
from interpret.glassbox import (
    LogisticRegression,
    ClassificationTree,
    ExplainableBoostingClassifier,
)


seed = 42

## Importing the Dataset

In [2]:

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [3]:
# Encode the target variable (Attrition) as 1 for 'Yes' and 0 for 'No'

target_map = {'Yes': 1, 'No': 0}
target = df["Attrition"].apply(lambda x: target_map[x])
print(target[:10])

0    1
1    0
2    1
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: Attrition, dtype: int64


In [4]:
# Drop columns that are not useful for the predictions, plus the target column itself

df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours','Attrition'], axis="columns", inplace=True)

In [5]:
# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(df, 
                                                    target, 
                                                    test_size=0.2,
                                                    random_state=seed,
                                                    stratify=target)
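
The `stratify=target` argument above keeps the share of positive labels the same in the train and test sets, which matters when one class is rare. A minimal sketch of that effect, using made-up labels (the 16% positive rate and the array sizes are assumptions for illustration, not values taken from the HR dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 160 positives out of 1000 rows.
y = np.array([1] * 160 + [0] * 840)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification both splits keep the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which makes scores on a small test set harder to compare.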


## Exploring the Dataset with histogram visualizations

In [6]:
hist = ClassHistogram().explain_data(X_train, y_train, name = 'Train Data')
show(hist)

## Training GlassBox Models

### 1. Explainable Boosting Machine (EBM)

In [7]:
ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1, inner_bags=100, outer_bags=100)
ebm.fit(X_train, y_train)


ExplainableBoostingClassifier(feature_names=['Age', 'BusinessTravel',
                                             'DailyRate', 'Department',
                                             'DistanceFromHome', 'Education',
                                             'EducationField',
                                             'EnvironmentSatisfaction',
                                             'Gender', 'HourlyRate',
                                             'JobInvolvement', 'JobLevel',
                                             'JobRole', 'JobSatisfaction',
                                             'MaritalStatus', 'MonthlyIncome',
                                             'MonthlyRate',
                                             'NumCompaniesWorked', 'OverTime',
                                             'PercentSalaryHike',
                                             'Perfor...
                                             'categorical', 'continuous',
            

## Global Explanations

Global explanations describe the model's overall behavior: which features matter most and how each one affects the prediction across the whole dataset.

In [8]:
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)
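
An EBM is an additive model, so each feature has its own score curve and a global view can be built by summarizing those curves. The sketch below illustrates the idea with hand-written lookup tables. The bin edges, scores, and the `contribution` helper are all hypothetical, and ranking by mean absolute contribution is one common summary, not necessarily the exact computation behind the dashboard.

```python
import numpy as np

# Hypothetical shape functions: per-feature bin edges and log-odds scores.
# These numbers are made up for illustration, not learned from the HR data.
bin_edges = {"Age": np.array([30, 40, 50]), "OverTime": np.array([0.5])}
scores = {
    "Age": np.array([0.6, 0.1, -0.2, -0.4]),
    "OverTime": np.array([-0.3, 0.8]),
}

def contribution(feature, value):
    """Look up the score of the bin that `value` falls into."""
    return float(scores[feature][np.digitize(value, bin_edges[feature])])

# A tiny hypothetical dataset to average over.
rows = [
    {"Age": 25, "OverTime": 1},
    {"Age": 35, "OverTime": 0},
    {"Age": 45, "OverTime": 0},
    {"Age": 55, "OverTime": 1},
]

# Global importance: mean absolute contribution of each feature.
importance = {
    feature: float(np.mean([abs(contribution(feature, row[feature])) for row in rows]))
    for feature in scores
}
for feature, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {value:.3f}")
```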

## Local Explanations

Local explanations help us understand how and why an individual prediction was made.

In [9]:
ebm_local = ebm.explain_local(X_test[:5], y_test[:5], name='EBM')
show(ebm_local)
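
For an additive model, a local explanation amounts to listing each feature's contribution to one prediction and adding them up. A minimal sketch, assuming contributions are expressed in log-odds and using made-up numbers for a single hypothetical employee:

```python
import math

# Hypothetical per-feature log-odds contributions for one employee.
intercept = -1.5
contribs = {"OverTime": 0.8, "Age": 0.6, "JobSatisfaction": -0.2}

# The prediction is the intercept plus the sum of the contributions,
# mapped to a probability with the logistic function.
logit = intercept + sum(contribs.values())
prob = 1.0 / (1.0 + math.exp(-logit))

# List the contributions from largest to smallest magnitude.
for name, value in sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>16}: {value:+.2f}")
print(f"predicted probability: {prob:.3f}")
```

Reading the list top to bottom shows which features pushed this particular prediction up or down, which is roughly what each bar in the local explanation plot conveys.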

## Evaluating EBM performance

In [10]:

from interpret.perf import ROC

ebm_perf = ROC(ebm.predict_proba).explain_perf(X_test, y_test, name='EBM')
show(ebm_perf)
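
The ROC plot can be complemented with numeric scores. The sketch below computes AUC, accuracy and F1 with scikit-learn on small hypothetical arrays. In the notebook, the probabilities would presumably come from something like `ebm.predict_proba(X_test)[:, 1]`, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.7, 0.3])

# AUC works on the probabilities directly; accuracy and F1 need hard labels.
y_pred = (y_prob >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_prob)
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"AUC={auc:.2f}  accuracy={acc:.2f}  F1={f1:.2f}")
```

On imbalanced data such as attrition labels, accuracy alone can look good for a model that rarely predicts the positive class, so AUC and F1 are worth reporting alongside it.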

## Comparing the performance with other GlassBox models 



### 2. Logistic Regression and Decision Tree

In [11]:
# Logistic Regression and Decision Tree need numeric inputs, so one-hot encode the categorical variables
X_enc = pd.get_dummies(df, prefix_sep='.')
feature_names = list(X_enc.columns)
# Stratify with the same seed as in In [5] so all models are scored on the same test rows
X_train_enc, X_test_enc, y_train, y_test = train_test_split(X_enc, target, test_size=0.20, random_state=seed, stratify=target)

lr = LogisticRegression(random_state=seed, feature_names=feature_names, penalty='l1', solver='liblinear')
lr.fit(X_train_enc, y_train)

tree = ClassificationTree()
tree.fit(X_train_enc, y_train)

<interpret.glassbox.decisiontree.ClassificationTree at 0x7fc4adbc37b8>
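
To see what `pd.get_dummies(..., prefix_sep='.')` does to a categorical column, here is a small sketch on a toy frame. The three rows are hypothetical, with column names borrowed from the dataset:

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
})

# Each category becomes its own indicator column named "<column>.<category>";
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, prefix_sep=".")
print(list(encoded.columns))
print(encoded)
```

The `.` separator makes it easy to tell which original column an indicator came from when reading the feature names in the explanation plots.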

## Comparing the performance of all the models

In [12]:

lr_perf = ROC(lr.predict_proba).explain_perf(X_test_enc, y_test, name='Logistic Regression')
tree_perf = ROC(tree.predict_proba).explain_perf(X_test_enc, y_test, name='Classification Tree')

show(lr_perf)
show(tree_perf)
show(ebm_perf)

## Training Blackbox Models


### 1. Random Forest Classifier

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# A blackbox system can include preprocessing, not just a classifier
pca = PCA()
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

X_enc = pd.get_dummies(df, prefix_sep='.')
feature_names = list(X_enc.columns)
# Same seed and stratification as in In [5], so the test rows match across models
X_train_enc, X_test_enc, y_train, y_test = train_test_split(X_enc, target, test_size=0.20, random_state=seed, stratify=target)



blackbox_model = Pipeline([('pca', pca), ('rf', rf)])
blackbox_model.fit(X_train_enc, y_train)

Pipeline(steps=[('pca', PCA()), ('rf', RandomForestClassifier(n_jobs=-1))])

### Evaluating BlackBox models

In [14]:
from interpret import show
from interpret.perf import ROC

blackbox_perf = ROC(blackbox_model.predict_proba).explain_perf(X_test_enc, y_test, name='Blackbox')
show(blackbox_perf)

### Explaining local BlackBox predictions with [LIME](https://arxiv.org/abs/1602.04938v3)

In [15]:
from interpret.blackbox import LimeTabular
from interpret import show

# Blackbox explainers need a predict function and, optionally, a dataset
lime = LimeTabular(predict_fn=blackbox_model.predict_proba, data=X_train_enc, random_state=1)

# Pick the instances to explain; optionally pass in labels if you have them
lime_local = lime.explain_local(X_test_enc[:5], y_test[:5], name='LIME')

show(lime_local)
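
The idea behind LIME is to fit a simple model around one instance: perturb it, query the blackbox, weight the samples by proximity, and fit a weighted linear model. The sketch below is a simplified illustration of that idea, not the `lime` library's actual algorithm. The `black_box` function, the Gaussian perturbation, and the kernel width are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical stand-in for a model's probability of the positive class.
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - X[:, 1] ** 2)))

x0 = np.array([0.5, 1.0])  # the instance to explain

# 1. Perturb the instance with Gaussian noise.
samples = x0 + rng.normal(scale=0.3, size=(500, 2))

# 2. Weight each sample by its proximity to x0 (Gaussian kernel).
dist = np.linalg.norm(samples - x0, axis=1)
weights = np.exp(-(dist ** 2) / (2 * 0.3 ** 2))

# 3. Fit a weighted linear surrogate to the blackbox outputs.
surrogate = Ridge(alpha=1e-3).fit(samples, black_box(samples), sample_weight=weights)
print("local coefficients:", surrogate.coef_)
```

The signs of the coefficients show the local direction of each feature's effect: here the first feature pushes the prediction up near `x0` and the second pushes it down.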

### Explaining global BlackBox predictions with [PDP](https://christophm.github.io/interpretable-ml-book/pdp.html)

In [16]:
from interpret.blackbox import PartialDependence

pdp = PartialDependence(predict_fn=blackbox_model.predict_proba, data=X_train_enc)
pdp_global = pdp.explain_global(name='Partial Dependence')

show(pdp_global)
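
A partial dependence curve can be computed by hand: fix one feature at a grid value for every row, average the model's predictions, and repeat across the grid. A minimal sketch, where `predict`, the data, and the `partial_dependence` helper are hypothetical and do not come from the `interpret` API:

```python
import numpy as np

def predict(X):
    # Hypothetical model: the score rises with both features.
    return 0.1 * X[:, 0] + 0.5 * X[:, 1]

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])

def partial_dependence(predict_fn, X, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(predict_fn(X_mod).mean())
    return np.array(averages)

pd_curve = partial_dependence(predict, X, feature=0, grid=[1.0, 2.0, 3.0])
print(pd_curve)
```

Because the other features keep their observed values, the curve shows the average effect of the chosen feature, which can hide interactions between features.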
