<img src="https://www.kaunokolegija.lt/kk_wp_content/uploads/sites/5/2020/05/kaunas-university-of-applied-sciences.png" width="300"/> 

------

# Artificial Neural Networks 

### Practical Session

Prof. Dr. Georgios K. Ouzounis
<br/>email: [georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)

Last update: 20th June, 2021

-----

## Contents

1. [Challenge](#challenge)
2. [Getting the dataset](#getting-the-dataset)
3. [Load and explore the data](#load-and-explore-the-data) 
4. [Preprocess the data](#preprocess-the-data)
5. [Compile the ANN](#compile-the-ann)
6. [Train and deploy the ANN](#train-and-deploy-the-ann)
7. [Testing individual cases](#testing-individual-cases)
8. [Improving the model](#improving-the-model)

## Challenge <a name="challenge"></a>

<img src="https://ca.res.keymedia.com/files/image/BankTeller(1).jpg" width="300"/>

A sample dataset of customers of a financial institution is given. It consists of 14 features and a total of 10000 records. 

One of the features, **Exited**, takes binary values: if true, the given customer rejected a product; if false, he/she retained it.

The goal of this exercise is to train a model that can predict, as accurately as possible, the outcome for new customers.




## Getting the dataset <a name="getting-the-dataset"></a>

The dataset is a comma-separated values (CSV) file that can be found at the [Kaggle.com website](https://www.kaggle.com/aakash50897/churn-modellingcsv) or at the instructor's GitHub account.


In [None]:
!wget https://raw.githubusercontent.com/georgiosouzounis/deep-learning-lectures/main/data/Churn_Modelling.csv

## Load and explore the data <a name="load-and-explore-the-data"></a>


### Import the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

[numpy](http://www.numpy.org): the fundamental package for scientific computing with Python. Among other things, it contains a powerful N-dimensional array object that can be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined.

[matplotlib](https://matplotlib.org): a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

[pandas](https://pandas.pydata.org): a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

### Import & explore the dataset

The variable `dataset` is a pandas DataFrame holding the contents of the opened file. To inspect its contents use the **info()** and **head()** functions.

In [None]:
#importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')

In [None]:
# view the features
dataset.info()

In [None]:
# view the head of the file (10 top lines)
dataset.head(10)

## Preprocess the data <a name="preprocess-the-data"></a>

### Correlation between independent variables

Let us first visually inspect if any two independent variables are highly correlated. 

To customize your color maps below read [more here](https://seaborn.pydata.org/tutorial/color_palettes.html)

In [None]:
import seaborn as sns

# get the correlation table of the numeric features only;
# string columns (Surname, Geography, Gender) cannot be correlated
corrmat = dataset.select_dtypes(include="number").corr()

# create a large figure so that the annotations remain readable
plt.figure(figsize=(20,20))

# creating a colormap
colormap = sns.color_palette("Blues", as_cmap=True)

# plot the correlation table
g = sns.heatmap(corrmat, annot=True, cmap=colormap)
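
To list the most strongly correlated feature pairs explicitly, rather than reading them off the heatmap, something along the following lines could be used. This is only a sketch on a synthetic DataFrame: the column names `a`, `b`, `c` and the data are made up for illustration and are not part of the churn dataset.

```python
import numpy as np
import pandas as pd

# synthetic data: "b" is almost a linear copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# keep only the upper triangle so that each pair appears once
# and the trivial self-correlations on the diagonal are dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

# feature pairs ordered by absolute correlation, strongest first
print(pairs)
```

Applied to the numeric columns of `dataset`, the same idea would rank the real feature pairs.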

### Data Cleaning/ Splitting

The **independent variables** are to be stored in matrix X. Evidently, neither the row ID (column 0), nor the customer number (column 1), nor the surname (column 2) can influence the decision of the customer, so we read all the other features and leave these three out.

The **dependent variable**, i.e. the one we want to predict, is to be stored in a separate vector y that holds the contents of column 13 alone.

In [None]:
# all the independent variables stored in columns 3 to 12 
# are stored in X 
X = dataset.iloc[:, 3:13].values 
X[0,:]

In [None]:
# column index 13 : the dependent variables
y = dataset.iloc[:, 13].values 
y[0]

### Encoding categorical data

The independent variables **Geography** and **Gender** are **strings** that need to be encoded into discrete variables as previously discussed in the **Features** session.

**LabelEncoder** takes as argument the values of a column and converts all categorical entries to integer labels.


In [None]:
# counting unique Geographies
n = len(pd.unique(dataset["Geography"]))
print("Number of unique countries: ", n)

# counting unique Genders, in case more than two are provided
n = len(pd.unique(dataset["Gender"]))
print("Number of unique genders: ", n)

Label encoding assigns a unique integer to each category of our categorical data:

In [None]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
# geography column: enumerate countries
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) 

In [None]:
# gender column: enumerate female/male
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

In [None]:
# view the transformed matrix - 1st row 
X[0,:]
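
To see which integer each category receives, the fitted encoder can be inspected. The snippet below is a small sketch on a made-up list of countries (not the dataset itself); it relies on the `classes_` attribute, which holds the categories in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# toy input, for illustration only
countries = ["Spain", "France", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# categories are sorted alphabetically; the position is the assigned label
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

The same inspection on `labelencoder_X_1.classes_` and `labelencoder_X_2.classes_` would show the mapping used for the real columns.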

This works well for **Gender** as the variable is binary. In the case of **Geography** though, label encoding on its own is problematic. The LabelEncoder has replaced France with 0, Germany with 1 and Spain with 2, but Germany is not greater than France and France is not smaller than Spain! Labeling of this kind introduces an artificial ordering, so we need to create [dummy variables](https://en.wikiversity.org/wiki/Dummy_variable_%28statistics%29) for this column.

The scikit-learn library provides two separate classes, the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), to do just that.

ColumnTransformer() takes as input a list of (name, transformer, columns) tuples: a label for the step ("Geography"), the transformer to apply (OneHotEncoder in this case), and the indices of the columns to be transformed this way, i.e. replaced with unique combinations of 0s and 1s. The option remainder = 'passthrough' keeps all other columns unchanged. [Read more here](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

In [None]:
from sklearn.compose import ColumnTransformer

# use the column transformer function to apply the OneHotEncoder
ct = ColumnTransformer([("Geography", OneHotEncoder(), [1])], remainder = 'passthrough')
# apply the transform to update X
X = ct.fit_transform(X)

Let us inspect the rows in which each country first appears:

In [None]:
X[0,:]

In [None]:
X[1,:]

In [None]:
X[7,:]

It can be seen that 3 new columns were inserted to the left of the **CreditScore** column, replacing the previously label-encoded **Geography** column. This is all fine, except that one would have expected 2 columns: 2^2 = 4, i.e. 2 columns of 0s and 1s can provide up to 4 unique combinations.

Redundancies are suspicious in data science! In this case we are facing the [dummy variable trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html), a scenario in which the independent variables are multicollinear - i.e. two or more variables are highly correlated. In simple terms one variable can be predicted from the others. 

If we remove the first column we are left with [0.0, 0.0] for France, [0.0, 1.0] for Spain and [1.0, 0.0] for Germany. This prevents the dummy variable trap!   

In [None]:
# remove the first column to avoid the dummy variable trap
X = X[:, 1:] 
X[0,:]
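
The redundancy behind the dummy variable trap can be checked numerically. The following sketch uses a made-up sequence of country codes (0 = France, 1 = Germany, 2 = Spain) and plain NumPy. It shows that the three dummy columns always sum to 1, so together with a constant (bias) column they are linearly dependent, and that dropping one dummy removes the dependency.

```python
import numpy as np

# toy label-encoded countries: France, Spain, Germany, France, Germany
codes = np.array([0, 2, 1, 0, 1])

# one-hot encode: row i is the unit vector for codes[i]
one_hot = np.eye(3)[codes]

# every row of the three dummy columns sums to exactly 1
row_sums = one_hot.sum(axis=1)

# add a constant column, as an intercept/bias term would
intercept = np.ones((len(codes), 1))
full = np.hstack([intercept, one_hot])            # 4 columns
reduced = np.hstack([intercept, one_hot[:, 1:]])  # first dummy dropped, 3 columns

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # more columns than rank: redundant
print(reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```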

### Split the dataset to training and testing sets

Next, we need to divide our data set into two subsets, one for training and one for testing.
The scikit-learn library provides the function [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

**sklearn.model_selection.train_test_split()**

that splits arrays or matrices into random train and test subsets.


In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
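
As a quick sanity check of what `test_size = 0.2` and `random_state = 0` do, the split can be tried on synthetic arrays of the same shape as our data. The arrays `X_demo` and `y_demo` below are placeholders made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same shape as the churn features: 10000 rows, 11 columns
X_demo = np.arange(10000 * 11, dtype=float).reshape(10000, 11)
y_demo = np.arange(10000) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 80% of the rows go to training, 20% to testing
print(X_tr.shape, X_te.shape)

# repeating the call with the same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_tr, X_tr_again)
print(same_split)
```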

### Feature Scaling

Feature scaling is essential, as discussed in the **Features** lecture, and needs to be applied to both the training and test sets.

That is simply because some variables have values in the thousands while others have values in the tens or ones. It is very important to ensure that none of our variables dominates the others.

It is computed using the scikit-learn [StandardScaler()](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which is fitted on the training set and applied to both the training and test sets.

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
X_train = sc.fit_transform(X_train)

In [None]:
X_test = sc.transform(X_test)
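
What standardization does can be written out by hand. The sketch below uses synthetic data (two made-up features on very different scales) and plain NumPy. The mean and standard deviation are computed on the training part only and then reused for the test part, which mirrors the fit/transform pattern above.

```python
import numpy as np

# synthetic features on different scales (illustrative values only)
rng = np.random.default_rng(0)
train = rng.normal(loc=[650.0, 40.0], scale=[100.0, 10.0], size=(800, 2))
test = rng.normal(loc=[650.0, 40.0], scale=[100.0, 10.0], size=(200, 2))

# "fit": statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# training columns now have zero mean and unit standard deviation;
# test columns are close to that, but not exactly
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.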

## Compile the ANN <a name="compile-the-ann"></a>

### Import the keras libraries

<a href="https://keras.io"><img src="https://s3.amazonaws.com/keras.io/img/keras-logo-2018-large-1200.png" width="400" align="left"/></a>

- Import the sequential model from the Keras API to initialize our ANN;
- Import the Dense layer template from the Keras API to add hidden layers;
- Create an instance of the sequential model called classifier since our job is in the classification domain.

The Dense layer is a layer in which all inputs are connected to all outputs!


In [None]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense


In [None]:
# Initialising the ANN
classifier = Sequential()

### Add First Hidden Layer

The first Dense layer added to our classifier:

- consists of 6 units (neurons), thus generating 6 outputs;
- has a uniform kernel initialization (weight matrix);
- applies a ReLU activation function on the output of each unit;
- takes 11 inputs.


In [None]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

### Add Second Hidden Layer


The second Dense layer added to our classifier:

- consists of 6 units (neurons), thus generating 6 outputs;
- has a uniform kernel initialization (weight matrix);
- applies a ReLU activation function on the output of each unit;
- takes as input the outputs of the previous layer; 


In [None]:
# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

### Add Output Layer

The output Dense layer added to our classifier:

- consists of 1 unit (neuron), thus generating a binary output;
- has a uniform kernel initialization (weight matrix);
- applies a Sigmoid activation function on the output of the single unit;
- takes as input the outputs of the previous layer; 

If the number of categories in the output layer is more than 2 we then need to use the SoftMax activation function.


In [None]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

Before we compile the ANN it is good practice to check which layers we have put together:

In [None]:
print(classifier.summary())
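
The parameter counts reported in the summary can be verified by hand: a Dense layer with n inputs and m units holds n × m weights plus m biases. The sketch below does this arithmetic for the 11-6-6-1 architecture and also runs an illustrative forward pass in plain NumPy. The random weights and the uniform range used here are assumptions made for the sketch, not the weights of the Keras model.

```python
import numpy as np

# inputs, two hidden layers, output
layer_sizes = [11, 6, 6, 1]

# weights (n_in * n_out) plus biases (n_out) per Dense layer
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative random weights; the range is an assumption for this sketch
rng = np.random.default_rng(0)
weights = [
    rng.uniform(-0.05, 0.05, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """ReLU on the hidden layers, sigmoid on the single output unit."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return sigmoid(a @ weights[-1] + biases[-1])

# four made-up, already scaled customers with 11 features each
probs = forward(rng.normal(size=(4, 11)))
print(probs.shape)
```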

For a more user-friendly view one can use the plot_model() function as shown below:

In [None]:
from keras.utils import plot_model
plot_model(classifier, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

### Compile the ANN

In the model compilation we customize the:
    
- [Optimizer](https://keras.io/optimizers/): the algorithm used to find an optimal set of weights. Adam is an adaptive variant of Stochastic Gradient Descent (SGD)!
- [Loss function](https://keras.io/losses/#available-loss-functions): SGD requires a loss function. With binary outputs we use a logarithmic loss function called binary_crossentropy. If the dependent variable were categorical, i.e. taking more than 2 values, we would have used categorical_crossentropy.
- [Metric](https://keras.io/metrics/): the metric reported during training and evaluation to judge the model; we use accuracy!

In [None]:
#compile the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
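
To make the loss function concrete, binary cross-entropy can be written in a few lines of NumPy. This is a sketch of the standard formula, not the Keras implementation; the function name and the clipping constant `eps` are choices made here for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # clip the probabilities to avoid log(0)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

# confident and correct predictions give a small loss
loss_right = binary_crossentropy([1, 0], [0.9, 0.1])

# confident but wrong predictions are penalized heavily
loss_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(loss_right, loss_wrong)
```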

## Train and deploy the ANN <a name="train-and-deploy-the-ann"></a>

### Fit the ANN to the training set

We can now train our ANN using the data in our training set X_train and our class labels (dependent variable) in y_train. Parameters that can be specified are the:

- **Batch size**: specifies the number of observations fed into the model after which the weights are updated.
- **Number of epochs**: the number of complete passes over the whole training set!

[more here](https://keras.io/models/model/#fit)


In [None]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
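
The batch size and the number of epochs together determine how many weight updates take place. The arithmetic for the settings above, assuming the 8000 training rows that result from the 80/20 split of 10000 records, is:

```python
import math

n_train = 8000    # 80% of the 10000 records
batch_size = 10
epochs = 100

# one weight update per batch
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)
```

A larger batch size gives fewer but smoother updates per epoch; a smaller one gives more, noisier updates.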

### Predicting the Test set results

Objective: using the ANN trained on our training set, let's see how well it performs on our test set, for which we have ground truth, i.e. we know the results.

For each probability returned we generate a categorical outcome (true/false) by thresholding it at a value of 50%.


In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
# threshold the probabilities into True > 0.5 or False
y_pred = (y_pred > 0.5) 

In [None]:
y_pred[0]

### Evaluating the model

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Use the scikit-learn [confusion_matrix()](https://en.wikipedia.org/wiki/Confusion_matrix) function to compute it and display it.

<img src="https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png" align="left"/>

In [None]:
# compute the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
#  visualize the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
class_names = ["remained", "exited"]

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot()

Some more classification quality metrics:

In [None]:
# accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
# precision (for each class)
# average=None returns one score per class instead of a single averaged score
from sklearn.metrics import precision_score
precision_score(y_test, y_pred, average=None)

In [None]:
# recall (for each class)
# average=None returns one score per class instead of a single averaged score
from sklearn.metrics import recall_score
recall_score(y_test, y_pred, average=None)

In [None]:
# f1 score (for each class)
# average=None returns one score per class instead of a single averaged score
from sklearn.metrics import f1_score
f1_score(y_test, y_pred, average=None)
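
All of the above metrics derive from the four cells of the confusion matrix. The sketch below recomputes them by hand for the positive class ("exited"). The counts in `cm_demo` are invented for illustration and are not results of this model. It follows the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# made-up confusion matrix: [[TN, FP], [FN, TP]]
cm_demo = np.array([[90, 10],
                    [20, 30]])
tn, fp, fn, tp = cm_demo.ravel()

# share of all predictions that are correct
accuracy = (tp + tn) / cm_demo.sum()

# of the customers predicted to exit, how many really did
precision = tp / (tp + fp)

# of the customers who really exited, how many were found
recall = tp / (tp + fn)

# harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```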

## Testing individual cases <a name="testing-individual-cases"></a>

In this lecture we will learn how to predict the behaviour of a new data sample outside our training and test data sets.

A new observation (data entry) is given. Given the model we trained, can we predict if this new customer is likely to stay or to go?

<img src="https://catalystforbusiness.com/wp-content/uploads/2017/12/customer-care.jpg" align="left" width="400"/>

New customer data

| Geography | Credit Score | Gender | Age | Tenure | Balance | Number of Products | Has Credit Card | Is Active Member | Estimated Salary | 
|---|---|---|---|---|---|---|---|---|---|
| France | 600 | Male | 40 | 3 | 60000 | 2 | Yes | Yes | 50000 |


### Predicting new observations

The new data needs to be placed in the same order/format as in the case of the training/test sets.

1. Create a new NumPy array and populate it accordingly.
2. Use sc.transform to scale the vector with the statistics of the training set.
3. Request a prediction and threshold it as before.


In [None]:
# create the new customer row
new_customer = np.array([[0.0, 0.0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])

In [None]:
# scale the data using the previously defined scaler for our training data
new_customer_scaled = sc.transform(new_customer)

In [None]:
# request a prediction from the ANN using the new data formatted as needed;
new_prediction = classifier.predict(new_customer_scaled)

In [None]:
new_prediction = (new_prediction > 0.5)
new_prediction
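
Writing the encoded row by hand is error-prone, so the encoding could be wrapped in a small helper. The function `encode_customer` and the two lookup tables below are hypothetical names introduced for this sketch. They hard-code the encoding derived earlier: two Geography dummies after dropping the first column, and Female = 0, Male = 1.

```python
import numpy as np

# encoding derived in the preprocessing section
GEOGRAPHY_DUMMIES = {
    "France": [0.0, 0.0],
    "Germany": [1.0, 0.0],
    "Spain": [0.0, 1.0],
}
GENDER_CODES = {"Female": 0, "Male": 1}

def encode_customer(customer):
    """Turn a dict of raw customer fields into a 1 x 11 feature row."""
    row = GEOGRAPHY_DUMMIES[customer["Geography"]] + [
        customer["CreditScore"],
        GENDER_CODES[customer["Gender"]],
        customer["Age"],
        customer["Tenure"],
        customer["Balance"],
        customer["NumOfProducts"],
        int(customer["HasCrCard"]),
        int(customer["IsActiveMember"]),
        customer["EstimatedSalary"],
    ]
    return np.array([row], dtype=float)

# the customer from the table above
row = encode_customer({
    "Geography": "France", "CreditScore": 600, "Gender": "Male", "Age": 40,
    "Tenure": 3, "Balance": 60000, "NumOfProducts": 2, "HasCrCard": True,
    "IsActiveMember": True, "EstimatedSalary": 50000,
})
print(row)
```

The resulting row still has to be scaled with `sc.transform` before it is passed to the classifier.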

## Improving the model <a name="improving-the-model"></a>

In this lecture we will learn how to evaluate, improve and tune the ANN 

### Evaluate the ANN

You can use Sequential Keras models (single-input only) as part of your scikit-learn workflow via the wrappers found in the keras.wrappers.scikit_learn module.

There are [two wrappers available](https://keras.io/scikit-learn-api/). Consider the first: keras.wrappers.scikit_learn.KerasClassifier(build_fn=None, \**sk_params), which implements the Scikit-Learn classifier interface.

In [None]:
# Evaluating the ANN

# load the libraries
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense

We can use the Keras scikit_learn wrapper to compute some statistics about our ANN

1. Create the equivalent scikit-learn compatible classifier.
2. Parameterize it as before and run k-fold cross validation.
3. Obtain the metrics.

Define a function to configure your classifier as requested:

In [None]:
#define our classifier function

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier


We need to wrap a Keras classifier so that the scikit-learn library can compute the k-fold cross validation. The latter produces one accuracy value per fold, of which we are interested in the mean.

If overfitting is observed, Dropout Regularization can be used to reduce it; see:

- [Dropout Regularization in Deep Learning Models With Keras](https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)
- [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/)



In [None]:
# Run k-fold cross validation

# configure the classifier as needed; set the building function, the batch size and the number of epochs, as before
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)


In [None]:
# Run the k-fold cross validation; n_jobs = number of cpus, when set to -1 it means use all
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1) 

In [None]:
mean = accuracies.mean()
mean

In [None]:
# standard deviation of the fold accuracies (note: not the variance)
std = accuracies.std()
std
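
How k-fold cross validation partitions the training data can be sketched with plain NumPy. This is a simplified illustration with consecutive, unshuffled folds; scikit-learn may additionally stratify the folds by class, which is ignored here. The sample count of 8000 assumes the 80/20 split used above.

```python
import numpy as np

n_samples = 8000
k = 10

indices = np.arange(n_samples)
folds = np.array_split(indices, k)

splits = []
for i in range(k):
    # fold i is held out for validation, the other k-1 folds are used for training
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, val_idx))

# number of runs, training rows per run, validation rows per run
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample is used for validation exactly once, which is why the mean over the folds is a more stable estimate than a single train/test split.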

The mean accuracy over the folds changes insignificantly compared to the accuracy obtained before, and the spread across the folds is small! This suggests that **no overfitting** occurs!

### Improving the ANN

If overfitting were to be observed, one way to counter it and make the model more general is by using dropout regularization.

Dropout randomly switches off a fraction of the neurons in a layer at each training step. The dropout rate (the `rate` argument of the Keras `Dropout` layer, called `p` in Keras 1) specifies the fraction of neurons to be switched off in each layer.

We do not need to run this since no overfitting is observed in our case.


In [None]:
# add this library
from keras.layers import Dropout

In [None]:
# re-initialising the ANN
classifier = Sequential()

In [None]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
# randomly switch off 10% of the layer outputs during training
classifier.add(Dropout(rate = 0.1))

In [None]:
# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate = 0.1))

In [None]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

In [None]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

### Tuning the ANN

We can use the Keras scikit_learn wrapper to compute some statistics about our ANN:

1. Create the equivalent scikit-learn compatible classifier.
2. Parameterize it as before, add more options and run k-fold cross validation for each parameter set.
3. Obtain global metrics and get the best settings/accuracy.

In [None]:
# load the libraries; note the Grid Search Cross Validation lib
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

In [None]:
#define our classifier
def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier


Create the Keras classifier wrapper without fixing any training parameters.

Create a separate dictionary of parameters, each with a number of different settings.

Run GridSearchCV using the classifier as estimator, the parameters dictionary, and by specifying the number of k-folds and the scoring metric.


In [None]:
# configure the classifier as needed; set the building function
classifier = KerasClassifier(build_fn = build_classifier)


In [None]:
# enter different options for the batch size, the number of epochs and the optimizer:
parameters = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}

In [None]:
# Customize the Grid Search CV
grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10)
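
Before launching the search it is worth estimating its cost: every combination of parameter values is trained once per fold. The sketch below enumerates the combinations of the grid above with `itertools.product`; the variable names are chosen for this illustration.

```python
from itertools import product

param_grid_demo = {
    "batch_size": [25, 32],
    "epochs": [100, 500],
    "optimizer": ["adam", "rmsprop"],
}
cv = 10

# every combination of one value per parameter
names = list(param_grid_demo)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid_demo[name] for name in names))
]

# each combination is trained once per fold
n_fits = len(combinations) * cv

print(len(combinations), n_fits)
print(combinations[0])
```

With the default `refit=True`, GridSearchCV adds one final fit of the best combination on the whole training set, so runs with 500 epochs can make the search take a long time.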

After fitting the grid search to the training data, we know which parameter setting scores the highest accuracy.

We can then print out the best parameters and the best accuracy:


In [None]:
# fit the grid_search model to our training data
grid_search = grid_search.fit(X_train, y_train)

In [None]:
# obtain the best parameters and best accuracy
best_parameters = grid_search.best_params_
best_parameters

In [None]:
best_accuracy = grid_search.best_score_
best_accuracy

<img src="https://drive.google.com/uc?id=1ssIjY7LC98PSTGfU9RlWpig-5pEjpD-r" align="left" width="400"/>