<img src="https://github.com/dc-aihub/dc-aihub.github.io/blob/master/img/ai-logo-transparent-banner.png?raw=true" 
alt="Ai/Hub Logo"/>

<h1 style="text-align:center;color:#0B8261;"><center>Artificial Intelligence</center></h1>
<h1 style="text-align:center;"><center>Lesson 6</center></h1>
<h1 style="text-align:center;"><center>Decision Trees</center></h1>

<hr />

<center><a href="#Decision-Tree">Decision Trees</a></center>

<center><a href="#Data-Pre-Processing">Data Pre-Processing</a></center>

<center><a href="#Create-the-Model">Create the Model</a></center>

<center><a href="#Train-the-Model">Train the Model</a></center>

<center><a href="#Test-the-Model">Test the Model</a></center>

<center><a href="#Optional">Graphviz (Optional)</a></center>

<hr/>

- ref: http://benalexkeen.com/decision-tree-classifier-in-python-using-scikit-learn/
- dataset: https://www.kaggle.com/c/titanic/data

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;">
OVERVIEW
</div>

<center style="color:#0B8261;">
Decision trees can be used as classification or regression models.
<br/><br/>
In general, a tree structure is built that breaks the dataset down into progressively smaller subsets, eventually resulting in a prediction. Decision nodes partition the data and leaf nodes provide the prediction, which can be reached by following simple IF..AND..AND..THEN logic down the nodes.
</center>

<br/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Decision-Tree">
DECISION TREES
</div>

<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png"/>

<hr />

The root node (a.k.a. the first decision node) partitions the data based on the most influential feature.

There are two common measures for choosing this feature: <b>Entropy</b> and <b>Gini Impurity</b>.

<hr />

### Entropy
The root node (the first decision node) partitions the data using the feature that provides the most information gain.

Information gain tells us how much splitting on a given attribute of the feature vectors reduces the impurity of the target classes.

<b>It is calculated as:</b>

>$$\text{Information Gain} = \text{entropy(parent)} - \text{[weighted average entropy(children)]}$$

<b>Where entropy is a common measure of target class impurity, given as:</b>

>$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

><i>Where i is each of the target classes and $p_i$ is the proportion of samples in class i.</i>
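
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the routine scikit-learn uses internally: the function names `entropy` and `information_gain` and the label lists are made up for this example, and the children's entropies are averaged with weights proportional to their sizes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical labels: 1 = survived, 0 = did not survive
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # samples sent to the left child by some split
right = [1, 0, 0, 0]   # samples sent to the right child

print(entropy(parent))                          # 1.0 (a 50/50 mix is maximally impure)
print(information_gain(parent, [left, right]))  # roughly 0.19
```

A split that separated the classes perfectly would have a gain equal to the parent's entropy.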

<hr />

### Gini Impurity
Gini Impurity is another measure of impurity and is calculated as follows:

>$$\text{Gini} = 1 - \sum_i p_i^2$$

><i>Where i is each of the target classes and $p_i$ is the proportion of samples in class i.</i>

Gini impurity is slightly cheaper to compute because it does not require calculating logarithms.
In practice, though, the choice between the two measures rarely makes much of a difference.
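
As a quick check of the Gini formula, here is a small sketch in plain Python. The function name `gini` and the label lists are assumptions for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

# Hypothetical labels: 1 = survived, 0 = did not survive
print(gini([0, 0, 0, 0]))  # 0.0 (a pure node)
print(gini([0, 0, 1, 1]))  # 0.5 (the most impure a two-class node can be)
```

For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).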

### Predicting Survival in the Titanic Data Set
We’ll be using a decision tree to make predictions about the Titanic data set from Kaggle. This data set provides information on the Titanic passengers and can be used to predict whether a passenger survived or not.

<hr />

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Data-Pre-Processing">
DATA PRE-PROCESSING
</div>

>**Recall:** Data pre-processing is the step that comes before creating, training, and testing our model. It cleans the data and prepares it for consumption by our model. No model can be a good model without good data!

In [None]:
import pandas as pd

df = pd.read_csv('train.csv', index_col='PassengerId')

<b>Let's take a look at the DataFrame we just created so that we can select the attributes we would like to use for our refined classification model (a Decision Tree).</b>

In [None]:
df.head()

<hr />

<b>We will be using Pclass, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), and Fare to predict whether a passenger survived.</b>

In [None]:
# go ahead and re-assign the dataframe we created to only include the features listed above,
# plus the 'Survived' column, which we will need later as our target
# e.g. if we wanted only Sex, we'd do something like df = df[['Sex']]

# type your code here
df = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]

# print out the first 5 items of the dataframe to make sure we're on the right track!
df.head()

<hr />

<b>We need to convert ‘Sex’ into an integer value of 0 or 1.</b>

In [None]:
# let's use pandas' built-in map function to turn all the 'male' instances into 0 and all the 'female' instances into 1
# e.g. if you were to do this for handedness, it would look something like:
#      df['handedness'] = df['handedness'].map({ 'right': 0, 'left': 1 })

# type your code here
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# print out the first 5 items of the dataframe to make sure we're on the right track!
df.head()

<hr />

<b>We will also drop any rows with missing values.</b>

>Missing values are bad as they tend to screw up our classifier. We don't want anything getting in the way of our super talented model!

The data (X) holds the feature columns, and the target (y) is the corresponding Survived label for each row of data.

In [None]:
df = df.dropna()

X = df.drop('Survived', axis=1)
y = df['Survived']

<hr />

Now, we're going to want to split our dataset into training and testing instances. Remember, for both training and testing, we need data and labels. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
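
To see what `train_test_split` hands back, here is a small sketch on a made-up DataFrame. The toy data and variable names are hypothetical. `test_size=0.25` is written out explicitly even though it matches scikit-learn's default, and `stratify` is an optional extra, not used in this lesson, that keeps the class balance the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, one feature, a balanced binary label
toy = pd.DataFrame({'feature': range(20), 'label': [0, 1] * 10})
X_toy = toy.drop('label', axis=1)
y_toy = toy['label']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print(X_tr.shape, X_te.shape)              # (15, 1) (5, 1)
print(list(X_tr.index) == list(y_tr.index))  # True: data and labels stay aligned row by row
```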

<hr/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Create-the-Model">
CREATE THE MODEL
</div>

Let's initialize our model so we can take a look at its attributes.

In [None]:
from sklearn import tree

model = tree.DecisionTreeClassifier()

In [None]:
# Displays the model attributes
model

<br/>
<hr/>
<br/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Train-the-Model">
TRAIN THE MODEL
</div>

Defining some of the attributes, like *max_depth*, *max_leaf_nodes*, *min_impurity_decrease* (which replaced the older *min_impurity_split* in recent scikit-learn releases), and *min_samples_leaf*, can help prevent overfitting the model to the training data.
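
As a rough illustration of how these attributes limit tree growth, the sketch below trains an unconstrained and a constrained tree on synthetic data. The dataset from `make_classification`, the variable names, and the specific values `max_depth=3` and `min_samples_leaf=5` are assumptions chosen for the example, not recommendations for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (flip_y randomly reassigns a share of the labels)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

# Compare tree size and train/test accuracy for both models
for name, clf in [('unconstrained', unconstrained), ('constrained', constrained)]:
    print(name,
          'depth:', clf.get_depth(),
          'leaves:', clf.get_n_leaves(),
          'train acc:', round(clf.score(X_tr, y_tr), 3),
          'test acc:', round(clf.score(X_te, y_te), 3))
```

A large gap between training and test accuracy is the usual sign of overfitting. The constrained tree is much smaller, which typically narrows that gap.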

<i>First we fit our model using our training data.</i>

In [None]:
# Fit the training data to the model
model.fit(X_train, y_train)


<hr />

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Test-the-Model">
TEST THE MODEL
</div>

<b>Then we score the predicted output from the model on our test data against our ground truth test data.</b>

In [None]:
y_predict = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

<hr />

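<b>We see an accuracy score of ~81.01%, which is well above 50/50 guessing. Your exact score may differ slightly, since the tree can break ties between equally good splits at random unless random_state is set on the model.

Let’s also take a look at our confusion matrix:</b>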

In [None]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Not Survival', 'Predicted Survival'],
    index=['True Not Survival', 'True Survival']
)
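
To make the layout of the confusion matrix concrete, here is a small sketch with hand-made labels. The label lists are hypothetical. For binary labels 0 and 1, scikit-learn puts the true classes on the rows and the predicted classes on the columns, so flattening the matrix yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(tn, fp, fn, tp)                    # 3 1 2 2
print((tn + tp) / cm.sum())              # 0.625, accuracy from the matrix diagonal
print(accuracy_score(y_true, y_pred))    # 0.625, the same value
print(tp / (tp + fn))                    # 0.5, share of actual survivors found (recall)
```

Accuracy alone hides which kind of mistake the model makes. The off-diagonal cells show whether it errs more by predicting survival wrongly or by missing survivors.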

<hr />

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Optional">
OPTIONAL
</div>
<br/>

Download the Graphviz tool to visualize the tree:
https://graphviz.gitlab.io/download/

In [None]:
# Uncomment to export the fitted tree; pass the fitted model itself, not model.tree_
#tree.export_graphviz(model, out_file='tree.dot', feature_names=list(X.columns))

We can then convert this dot file to a png file.

In [None]:
#from subprocess import call

#call(['dot', '-T', 'png', 'tree.dot', '-o', 'tree.png'])

We can then view our tree, which will look something similar to this:

<img src="http://benalexkeen.com/wp-content/uploads/2017/05/tree.png" alt="Decision Tree" />

<hr />

The root node, which is the split that reduces impurity the most, tells us that the biggest factor in determining survival is Sex.

If we zoom in on some of the leaf nodes, we can follow some of the decisions down.

We have already zoomed into the part of the decision tree that describes males with a ticket lower than first class who are under the age of 10.

<img src="http://benalexkeen.com/wp-content/uploads/2017/05/tree_leaf_node.png" alt="Leaf Node" />

The impurity is the Gini value shown at the top of each node, samples is the number of observations remaining to classify, and value gives how many of those samples are in class 0 (did not survive) and how many are in class 1 (survived).

<hr />

<b>Let’s follow this part of the tree down. The nodes to the left are True and the nodes to the right are False:</b>

1. We have 19 observations left to classify: 9 did not survive and 10 did.
2. From this point, the split with the most information gain is on how many siblings (SibSp) were aboard.
    - 9 out of the 10 samples with fewer than 2.5 siblings survived.
    - This leaves 10 observations: 9 did not survive and 1 did.
3. 6 of these children who had only one parent (Parch) aboard did not survive.
4. None of the children aged > 3.5 survived.
5. Of the 2 remaining children, the one with > 4.5 siblings did not survive.
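
If Graphviz is not available, the same kind of IF..THEN walk can be printed as plain text with scikit-learn's `export_text`. The sketch below uses a tiny made-up dataset with two columns named after the lesson's features. The values are hypothetical and are not taken from the Titanic data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows: [Sex, Age], with Sex encoded as 0 = male, 1 = female
X_toy = [[0, 30], [0, 40], [0, 8], [1, 25], [1, 35], [1, 6]]
y_toy = [0, 0, 1, 1, 1, 1]  # 1 = survived, 0 = did not survive

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Each indented line is one decision; lines starting with "class:" are leaf predictions
rules = export_text(clf, feature_names=['Sex', 'Age'])
print(rules)
```

With the lesson's own model, the call would take `model` and the list of column names in `X` instead of the toy values.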