{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "colab": { "name": "artificial_neural_networks.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "qcdwj3EQx_p5" }, "source": [ " \n", "\n", "------" ] }, { "cell_type": "markdown", "metadata": { "id": "KHKxnECC7Sk-" }, "source": [ "# Artificial Neural Networks \n", "\n", "### Practical Session\n", "\n", "Prof. Dr. Georgios K. Ouzounis\n", "
email: [georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)\n", "\n", "Last update: 20th June, 2021\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": { "id": "PkqMo0sZ7SlC" }, "source": [ "## Contents\n", "\n", "1. [Challenge](#challenge)\n", "2. [Getting the dataset](#getting-the-dataset)\n", "3. [Load and explore the data](#load-and-explore-the-data) \n", "4. [Preprocess the data](#preprocess-the-data)\n", "5. [Compile the ANN](#compile-the-ann)\n", "6. [Train and deploy the ANN](#train-and-deploy-the-ann)\n", "7. [Testing individual cases](#testing-individual-cases)\n", "8. [Improving the model](#improving-the-model)" ] }, { "cell_type": "markdown", "metadata": { "id": "DKjMyCqB7SlD" }, "source": [ "## Challenge \n", "\n", "\n", "\n", "A sample dataset of customers of a financial institution is given. It consists of 14 features and a total of 10000 records. \n", "\n", "Among the features there is one tagged as **Exited** that takes binary values: if true, the given customer rejected a product; if false, he/she retained it.\n", "\n", "The goal of this exercise is to train a model that can predict, as accurately as possible, the future outcome of new customers. \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "sWWQu_f87SlD" }, "source": [ "## Getting the dataset \n", "\n", "The dataset is a comma-separated values (CSV) file that can be found at the [Kaggle.com website](https://www.kaggle.com/aakash50897/churn-modellingcsv) or at the instructor's GitHub account.\n" ] }, { "cell_type": "code", "metadata": { "id": "KsPrX1NBYboZ" }, "source": [ "!wget https://raw.githubusercontent.com/georgiosouzounis/deep-learning-lectures/main/data/Churn_Modelling.csv" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Hi1tXVAV7SlF" }, "source": [ "## Load and explore the data " ] }, { "cell_type": "markdown", "metadata": { "id": "HfCCteKv7SlF" }, "source": [ "\n", "### Import the libraries" ] }, { "cell_type": "code", "metadata": { "id": "ylo6E_937SlG" }, "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "i_8jY7n27SlI" }, "source": [ "[numpy](http://www.numpy.org): the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object that can be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined. \n", "\n", "[matplotlib](https://matplotlib.org): a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.\n", "\n", "[pandas](https://pandas.pydata.org): a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series." ] }, { "cell_type": "markdown", "metadata": { "id": "ViI6AHf_7SlI" }, "source": [ "### Import & explore the dataset\n", "\n", "The variable dataset is a pandas DataFrame holding the contents of the opened file. To scout its contents use the **info()** and **head()** functions."
] }, { "cell_type": "code", "metadata": { "id": "CYaJl7mZ7SlJ" }, "source": [ "#importing the dataset\n", "dataset = pd.read_csv('Churn_Modelling.csv')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "zpOW73sp7SlJ" }, "source": [ "# view the features\n", "dataset.info()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "OSPOQVJ67SlL" }, "source": [ "# view the head of the file (10 top lines)\n", "dataset.head(10)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ptI29Y8vSFIb" }, "source": [ "## Preprocess the data " ] }, { "cell_type": "markdown", "metadata": { "id": "UxnSdqt5Hvge" }, "source": [ "### Correlation between independent valiables\n", "\n", "Let us first visually inspect if any two independent variables are highly correlated. \n", "\n", "To customize your color maps below read [more here](https://seaborn.pydata.org/tutorial/color_palettes.html)" ] }, { "cell_type": "code", "metadata": { "id": "aDszANsDH4tK" }, "source": [ "import seaborn as sns\n", "\n", "# get the correlation table\n", "corrmat=dataset.corr()\n", "\n", "# get the top correlated feature combinations\n", "top_corr_features = corrmat.index\n", "\n", "# create a dummy figure to strech the plot\n", "plt.figure(figsize=(20,20))\n", "\n", "# creating a colormap\n", "colormap = sns.color_palette(\"Blues\", as_cmap=True)\n", "\n", "# plot the correlation table\n", "g=sns.heatmap(dataset[top_corr_features].corr(), annot=True, cmap=colormap)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "E0KZc3CU7SlM" }, "source": [ "### Data Cleaning/ Splitting\n", "\n", "The **independent variables** are to be stored in matrix X. Evidently, neither the row ID (column 0), the customer number (column 1) or the surname (column 2) can influence the decision of the customer thus we can read the all other features leaving these three out.\n", "\n", "The **dependent variable**, i.e. the one we want to predict, is to be stored on a separate matrix (vector) y and contains the contents of column 13 alone." 
] }, { "cell_type": "code", "metadata": { "id": "o1hWCuME7SlN" }, "source": [ "# all the independent variables stored in columns 3 to 12 \n", "# are stored in X \n", "X = dataset.iloc[:, 3:13].values \n", "X[0,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Jm2xOt3v7SlN" }, "source": [ "# column index 13 : the dependent variables\n", "y = dataset.iloc[:, 13].values \n", "y[0]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Ow72Cd4k7SlO" }, "source": [ "### Encoding categorical data\n", "\n", "The independent variables **Geography** and **Gender** are **strings** that need to be encoded into discrete variables as previously discussed in the **Features** session.\n", "\n", "**LabelEncoder** takes in as argument the column index and converts all categorical entries to integer labels.\n" ] }, { "cell_type": "code", "metadata": { "id": "bv7Jbikpzi4j" }, "source": [ "# counting unique Geographies\n", "n = len(pd.unique(dataset[\"Geography\"]))\n", "print(\"Number of unique countries: \", n)\n", "\n", "# counting unique Genders, in case more than two are provided\n", "n = len(pd.unique(dataset[\"Gender\"]))\n", "print(\"Number of unique genders: \", n)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "fC5MOqwwz87s" }, "source": [ "Label encoding assigns a unique number on each category for our categorical data:" ] }, { "cell_type": "code", "metadata": { "id": "Zb42bVqY7SlO" }, "source": [ "# Encoding categorical data\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "BBJ0yCu_7SlP" }, "source": [ "# geography column: enumerate countries\n", "labelencoder_X_1 = LabelEncoder()\n", "X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) " ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ZA3utgFa7SlP" }, "source": [ "# gender column: enumerate female/male\n", "labelencoder_X_2 = LabelEncoder()\n", "X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nLshCAnJ7SlP" }, "source": [ "# view the transformed matrix - 1st row \n", "X[0,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "u6GupQ2o7SlQ" }, "source": [ "This works well for **Gender** as the variable is binary. In the case of **Geographies** though, label encoding in its own is problematic. The LabelEncoder has replaced France with 0, Germany with 1 and Spain with 2 but Germany is not greater than France and France is not smaller than Spain! Labeling of the kind introcuces implications, so we need to create a [dummy variable]((https://en.wikiversity.org/wiki/Dummy_variable_%28statistics%29) for this column. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "9Hc8Pl8q7SlQ" }, "source": [ "\n", "ScikitLearn library provides two seperate functions, the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to do just that. \n", "\n", "ColumnTransformer() implements the transform function and takes as input the column name, the transformer (OneHotEncoder in this case), and the number of columns to be transformed this way; i.e. with unique combinations of 0s and 1s. 
{ "cell_type": "code", "metadata": { "id": "Z2nfoYqM7SlR" }, "source": [ "from sklearn.compose import ColumnTransformer\n", "\n", "# use the column transformer function to apply the OneHotEncoder\n", "ct = ColumnTransformer([(\"Geography\", OneHotEncoder(), [1])], remainder = 'passthrough')\n", "# apply the transform to update X\n", "X = ct.fit_transform(X)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cMb1BILxxBk1" }, "source": [ "Let us inspect the first row in which each country appears: " ] }, { "cell_type": "code", "metadata": { "id": "LZvQgKDqVADm" }, "source": [ "X[0,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "18SuY_lMvybw" }, "source": [ "X[1,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "PjV0T0aUxgl7" }, "source": [ "X[7,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WtN1z2uz3hb4" }, "source": [ "It can be seen that 3 new columns were inserted to the left of the **CreditScore** column, replacing the previously label-encoded **Geography** column. This is all fine, except one would have expected 2 columns: 2^2 = 4, i.e. 2 columns of 0s and 1s can encode up to 4 unique combinations.\n", "\n", "Redundancies are suspicious in data science! In this case we are facing the [dummy variable trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html), a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated. In simple terms, one variable can be predicted from the others. \n", "\n", "If we remove the first column we are left with [0.0, 0.0] for France, [0.0, 1.0] for Spain and [1.0, 0.0] for Germany. This prevents the dummy variable trap! " ] }, { "cell_type": "code", "metadata": { "id": "Ec9OwoZS5v2d" }, "source": [ "# remove the first column to avoid the dummy variable trap\n", "X = X[:, 1:] \n", "X[0,:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Fb5xB0oU7SlT" }, "source": [ "### Split the dataset to training and testing sets\n", "\n", "Next, we need to divide our dataset into two subsets, one for training and one for testing. \n", "The ScikitLearn library provides the function [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):\n", "\n", "**sklearn.model_selection.train_test_split()**\n", "\n", "that splits arrays or matrices into random train and test subsets.\n" ] }, { "cell_type": "code", "metadata": { "id": "AADOSSbY7SlT" }, "source": [ "# Splitting the dataset into the Training set and Test set\n", "from sklearn.model_selection import train_test_split" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5uRbIWhO7SlU" }, "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0215ItX97SlU" }, "source": [ "### Feature Scaling\n", "\n", "Feature scaling is essential, as discussed in the **Features** lecture, and needs to be applied to both the training and test sets.\n", "\n", "That is simply because some variables have values in the thousands while some others have values in the tens or ones. It is very important to ensure that none of our variables dominates the others." ] },
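{ "cell_type": "markdown", "metadata": {}, "source": [ "Concretely, standardization (the scheme StandardScaler implements) rescales each feature using that feature's training-set mean $\\mu$ and standard deviation $\\sigma$:\n", "\n", "$$x' = \\frac{x - \\mu}{\\sigma}$$\n", "\n", "so that every scaled feature has zero mean and unit variance." ] },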
{ "cell_type": "markdown", "metadata": {}, "source": [ "Scaling is computed using the ScikitLearn library [StandardScaler()](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which is fitted on the training set and applied to both the training and test sets." ] }, { "cell_type": "code", "metadata": { "id": "gAf4BBly7SlU" }, "source": [ "# Feature Scaling\n", "from sklearn.preprocessing import StandardScaler\n", "sc = StandardScaler()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "MJ5ynwSY7SlV" }, "source": [ "X_train = sc.fit_transform(X_train)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "NomF2GSE7SlV" }, "source": [ "X_test = sc.transform(X_test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IAbC-UP-7SlV" }, "source": [ "## Compile the ANN " ] }, { "cell_type": "markdown", "metadata": { "id": "WiOE2K-27SlW" }, "source": [ "### Import the keras libraries" ] }, { "cell_type": "markdown", "metadata": { "id": "HRGjnoKZ7SlW" }, "source": [ "- Import the sequential model from the Keras API to initialize our ANN;\n", "- Import the Dense layer template from the Keras API to add hidden layers;\n", "- Create an instance of the sequential model called classifier, since our job is in the classification domain.\n", "\n", "The Dense layer is a layer in which all inputs are connected to all outputs!\n" ] }, { "cell_type": "code", "metadata": { "id": "mwMJzAaB7SlX" }, "source": [ "# Importing the Keras libraries and packages\n", "import keras\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "u5wyyNCV7SlX" }, "source": [ "# Initialising the ANN\n", "classifier = Sequential()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "EAe4wH_W7SlX" }, "source": [ "### Add First Hidden Layer" ] }, { "cell_type": "markdown", "metadata": { "id": "jL0dNBKC7SlX" }, "source": [ "The first Dense layer added to our classifier:\n", "\n", "- consists of 6 units (neurons), thus generating 6 outputs;\n", "- has a uniform kernel initialization (weight matrix);\n", "- applies a ReLU activation function on the output of each unit;\n", "- takes 11 inputs. \n" ] }, { "cell_type": "code", "metadata": { "id": "mZZlYswg7SlX" }, "source": [ "# Adding the input layer and the first hidden layer\n", "classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IIntE-447SlY" }, "source": [ "### Add Second Hidden Layer\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wnZjzoai7SlY" }, "source": [ "The second Dense layer added to our classifier:\n", "\n", "- consists of 6 units (neurons), thus generating 6 outputs;\n", "- has a uniform kernel initialization (weight matrix);\n", "- applies a ReLU activation function on the output of each unit;\n", "- takes as input the outputs of the previous layer. \n" ] }, { "cell_type": "code", "metadata": { "id": "NM7Q_ggU7SlY" }, "source": [ "# Adding the second hidden layer\n", "classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))" ], "execution_count": null, "outputs": [] },
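{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check on the sizes: a Dense layer with $n$ inputs and $m$ units holds $n \\times m$ weights plus $m$ biases, i.e. $n \\times m + m$ trainable parameters. The first hidden layer therefore has $11 \\times 6 + 6 = 72$ parameters and the second $6 \\times 6 + 6 = 42$; the model summary printed further below should confirm these counts." ] },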
"cell_type": "markdown", "metadata": { "id": "llcBRP507SlZ" }, "source": [ "### Add Output Layer" ] }, { "cell_type": "markdown", "metadata": { "id": "Zl_LmDrQ7SlZ" }, "source": [ "The output Dense layer added to our classifier:\n", "\n", "- consists of 1 unit (neuron), thus generating a binary output;\n", "- has a uniform kernel initialization (weight matrix);\n", "- applies a Sigmoid activation function on the output of the single unit;\n", "- takes as input the outputs of the previous layer; \n", "\n", "If the number of categories in the output layer is more than 2 we then need to use the SoftMax activation function.\n" ] }, { "cell_type": "code", "metadata": { "id": "yO0z8yqz7SlZ" }, "source": [ "# Adding the output layer\n", "classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "eTYX6Z-u6lSp" }, "source": [ "Before we compile the ANN it is a good practice to check what layers we put together for confirmation" ] }, { "cell_type": "code", "metadata": { "id": "LpiGiMDw6tor" }, "source": [ "print(classifier.summary())" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tZKdQjzr61qz" }, "source": [ "For a more user-friendly view one can use the plot_model() function as shown below:" ] }, { "cell_type": "code", "metadata": { "id": "PmSTsfcc61UD" }, "source": [ "from keras.utils.vis_utils import plot_model\n", "plot_model(classifier, to_file='model_plot.png', show_shapes=True, show_layer_names=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mbgMW2Df7Sla" }, "source": [ "### Compile the ANN" ] }, { "cell_type": "markdown", "metadata": { "id": "dl-yXvVI7Sla" }, "source": [ "In the model compilation we customize the:\n", " \n", "- [Optimizer](https://keras.io/optimizers/): is the algorithm used to find optimal set of weights. Adam employs Stochastic Gradient Descent (SGD)!\n", "- [Loss function](https://keras.io/losses/#available-loss-functions): SGD requires a loss function. With binary outputs we use a logarithmic loss function called the binary_crossentropy. If the dependent variable was categorical, i.e. taking more than 2 values, we would have used the categorical_crossentropy.\n", "- [Metric](https://keras.io/metrics/): this is the metric used for model improvement; we use accuracy!" ] }, { "cell_type": "code", "metadata": { "id": "kPM8OTUJ7Sla" }, "source": [ "#compile the ANN\n", "classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "CQh_h1Xz7Sla" }, "source": [ "## Train and deploy the ANN " ] }, { "cell_type": "markdown", "metadata": { "id": "EvJvAvrv7Slb" }, "source": [ "### Fit the ANN to the training set" ] }, { "cell_type": "markdown", "metadata": { "id": "WeKkGJ2K7Slb" }, "source": [ "We can now train our ANN using the data in our training set X and our class labels (dependent variables) in y. Parameters that can be specified are the:\n", "\n", "- **Batch size**: specifies the number of observations fed into the model after which the weight matrix is updated. 
\n", "- **Number of epochs**: number of iterations of the whole process!\n", "\n", "[more here](https://keras.io/models/model/#fit)\n" ] }, { "cell_type": "code", "metadata": { "id": "h135Gg2W7Slb" }, "source": [ "# Fitting the ANN to the Training set\n", "classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "-OvgIqpv7Slc" }, "source": [ "### Predicting the Test set results" ] }, { "cell_type": "markdown", "metadata": { "id": "JoRvCo-P7Slc" }, "source": [ "Objective: using the trained ANN on our Training set X, lets see how well it performs on our Test set for which we have ground truth, i.e. we know the results.\n", "\n", "For each probability returned we generate a categorical outcome (true/false) by thresholding it at a value of 50% \n" ] }, { "cell_type": "code", "metadata": { "id": "nkiLi0Wo7Slc" }, "source": [ "# Predicting the Test set results\n", "y_pred = classifier.predict(X_test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "wYwQhAH47Slc" }, "source": [ "# threshold the probabilities into True > 0.5 or False\n", "y_pred = (y_pred > 0.5) " ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0OMEqzo07Sld" }, "source": [ "y_pred[0]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "T66iGsmN7Sld" }, "source": [ "### Evaluating the model" ] }, { "cell_type": "markdown", "metadata": { "id": "4FClDOQF7Sld" }, "source": [ "A confusion matrix is a table that is often used to describe the performance of a classification model (or \"classifier\") on a set of test data for which the true values are known. Use the ScikitLearn library [confucion_matrix()](https://en.wikipedia.org/wiki/Confusion_matrix) function to compute it and display it." 
] }, { "cell_type": "markdown", "metadata": { "id": "XwYGpNE67Sle" }, "source": [ "" ] }, { "cell_type": "code", "metadata": { "id": "4AIY3rww7Sle" }, "source": [ "# compute the Confusion Matrix\n", "from sklearn.metrics import confusion_matrix\n", "cm = confusion_matrix(y_test, y_pred)\n", "cm" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "_y3-4qXz7Sle" }, "source": [ "# visualize the confusion matrix\n", "from sklearn.metrics import ConfusionMatrixDisplay\n", "class_names = [\"remained\", \"exited\"]\n", "\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)\n", "disp.plot()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "DVK6cfsjAzY4" }, "source": [ "Some more classification quality metrics:" ] }, { "cell_type": "code", "metadata": { "id": "W_tnBYi1ASaF" }, "source": [ "# accuracy\n", "from sklearn.metrics import accuracy_score\n", "accuracy_score(y_test, y_pred)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3J5Y8y6lA7sT" }, "source": [ "# precision (for each class)\n", "# average=None; its a binary classification\n", "from sklearn.metrics import precision_score\n", "precision_score(y_test, y_pred, average=None)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0i4SXh9dBX-5" }, "source": [ "# recall (for each class)\n", "# average=None; its a binary classification\n", "from sklearn.metrics import recall_score\n", "recall_score(y_test, y_pred, average=None)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "oHuk-vc4CGD-" }, "source": [ "# f1 score (for each class)\n", "# average=None; its a binary classification\n", "from sklearn.metrics import f1_score\n", "f1_score(y_test, y_pred, average=None)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "vGSUIF4x7Slf" }, "source": [ "## Testing individual cases \n", "\n", "In this lecture we will learn how to predict the behaviour of an new data sample outside our training and test data sets. " ] }, { "cell_type": "markdown", "metadata": { "id": "rT5xplbn7Slf" }, "source": [ "A new observation (data entry) is given. Given the model we trained can we predict if this new customer is likely to stay or to go?" ] }, { "cell_type": "markdown", "metadata": { "id": "rMBvryFg7Slg" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "a7YSPIX87Slg" }, "source": [ "New customer data\n", "\n", "| Geography | Credit Score | Gender | Age | Tenure | Balance | Number of Products | Has Credit Card | Is Active Member | Estimated Salary | \n", "|---|---|---|---|---|---|---|---|---|---|\n", "| France | 600 | Male | 40 | 3 | 60000 | 2 | Yes | Yes | 50000 |\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vQteefUI7Slg" }, "source": [ "### Predicting new observations" ] }, { "cell_type": "markdown", "metadata": { "id": "-E6yn66b7Slg" }, "source": [ "The new data needs to be placed in the same order/format as in the case of the training/test sets.\n", "\n", "1. Create a new NP array and populate it accordingly.\n", "2. Use sc.transform to transform the vector to the desired format.\n", "3. 
{ "cell_type": "code", "metadata": { "id": "nXbtX4SeCtIx" }, "source": [ "# create the new customer row\n", "new_customer = np.array([[0.0, 0.0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "NvP_3A8oDRRp" }, "source": [ "# scale the data using the scaler previously fitted on our training data\n", "new_customer_scaled = sc.transform(new_customer)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "_u94HdkQ7Slh" }, "source": [ "# request a prediction from the ANN using the new data formatted as needed\n", "new_prediction = classifier.predict(new_customer_scaled)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "WIepR41j7Slh" }, "source": [ "new_prediction = (new_prediction > 0.5)\n", "new_prediction" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lOzFN0-H7Slh" }, "source": [ "## Improving the model \n", "\n", "In this lecture we will learn how to evaluate, improve and tune the ANN. " ] }, { "cell_type": "markdown", "metadata": { "id": "Y3zcc48w7Slh" }, "source": [ "### Evaluate the ANN" ] }, { "cell_type": "markdown", "metadata": { "id": "NJyVE2j67Slh" }, "source": [ "You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers found in keras.wrappers.scikit_learn.py.\n", "\n", "There are [two wrappers available](https://keras.io/scikit-learn-api/). Consider the first: keras.wrappers.scikit_learn.KerasClassifier(build_fn=None, \\**sk_params), which implements the Scikit-Learn classifier interface." ] }, { "cell_type": "code", "metadata": { "id": "4R6ifIPS7Slh" }, "source": [ "# Evaluating the ANN\n", "\n", "# load the libraries\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "from sklearn.model_selection import cross_val_score\n", "from keras.models import Sequential\n", "from keras.layers import Dense" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "vhC5fEB67Sli" }, "source": [ "We can use the Keras scikit_learn wrapper to compute some statistics about our ANN:\n", "\n", "1. Create the equivalent scikit-learn compatible classifier.\n", "2. Parameterize it as before and run k-fold cross validation.\n", "3. Obtain the metrics.\n", "\n", "Define a function to configure your classifier as requested:" ] }, { "cell_type": "code", "metadata": { "id": "3t8oOsKs7Sli" }, "source": [ "# define our classifier function\n", "\n", "def build_classifier():\n", "    classifier = Sequential()\n", "    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))\n", "    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))\n", "    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))\n", "    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])\n", "    return classifier\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6pFnzXaE7Sli" }, "source": [ "We need to compile a Keras classifier for the scikit-learn library to compute the k-fold cross validation. The latter will produce one accuracy score per fold, from which we take the mean." ] },
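{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the cost concrete: with cv = 10, cross_val_score splits the 8,000 training rows into 10 folds of 800 and trains 10 separate models, each fitted on 7,200 rows and validated on the held-out 800." ] },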
{ "cell_type": "markdown", "metadata": {}, "source": [ "Use these resources to set up Dropout Regularization to reduce overfitting, if necessary:\n", "\n", "- [Dropout Regularization in Deep Learning Models With Keras](https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)\n", "- [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/)\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "dxEV9PT_7Sli" }, "source": [ "# Run k-fold cross validation\n", "\n", "# configure the classifier as needed; set the building function, the batch size and the number of epochs, as before\n", "classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4SZDVJdU7Sli" }, "source": [ "# Run the k-fold cross validation; n_jobs = number of cpus, when set to -1 it means use all\n", "accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1) " ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0psfG0787Slj" }, "source": [ "mean = accuracies.mean()\n", "mean" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "l8iBVYZT7Slj" }, "source": [ "# standard deviation of the fold accuracies (std() returns the standard deviation, not the variance)\n", "std = accuracies.std()\n", "std" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xiDOJpss7Slj" }, "source": [ "We observe an insignificant change in our mean accuracy! This means that **no overfitting** occurs!" ] }, { "cell_type": "markdown", "metadata": { "id": "O7J00V-c7Slj" }, "source": [ "### Improving the ANN" ] }, { "cell_type": "markdown", "metadata": { "id": "sil1zzn17Slk" }, "source": [ "If overfitting were to be observed, one way to counter it and make the model more general is by using dropout regularization. \n", "\n", "Dropout constrains the number of neurons that get activated in an arbitrary manner. 
The rate parameter specifies the fraction of neurons to be switched off in each layer.\n", "\n", "We do not need to run this since no overfitting is observed in our case.\n" ] }, { "cell_type": "code", "metadata": { "id": "PGDbCbsw7Slk" }, "source": [ "# add this library\n", "from keras.layers import Dropout" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "zBCSdrU27Slk" }, "source": [ "# re-initialising the ANN\n", "classifier = Sequential()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "DYBV9JTU7Slk" }, "source": [ "# Adding the input layer and the first hidden layer\n", "classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11)) \n", "# drop 10% of this layer's units during training (the argument is 'rate' in Keras 2, not 'p')\n", "classifier.add(Dropout(rate = 0.1))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "X3KXym_-7Sll" }, "source": [ "# Adding the second hidden layer\n", "classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))\n", "classifier.add(Dropout(rate = 0.1))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "aEPqMaZC7Sll" }, "source": [ "# Adding the output layer\n", "classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "WDqwOE9F7Sln" }, "source": [ "# Compiling the ANN\n", "classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lapUHB-f7Sln" }, "source": [ "### Tuning the ANN" ] }, { "cell_type": "markdown", "metadata": { "id": "wOslsjkL7Slo" }, "source": [ "We can use the Keras scikit_learn wrapper to tune the hyper-parameters of our ANN:\n", "\n", "1. Create the equivalent scikit-learn compatible classifier.\n", "2. Parameterize it as before, add more options and run k-fold cross validation for each parameter set.\n", "3. Obtain global metrics and get the best settings/accuracy.\n" ] }, { "cell_type": "code", "metadata": { "id": "tgtCgdoD7Slp" }, "source": [ "# load the libraries; note the Grid Search Cross Validation lib\n", "from keras.wrappers.scikit_learn import KerasClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from keras.models import Sequential\n", "from keras.layers import Dense" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "mDeY2Lfe7Slp" }, "source": [ "# define our classifier\n", "def build_classifier(optimizer):\n", "    classifier = Sequential()\n", "    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))\n", "    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))\n", "    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))\n", "    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])\n", "    return classifier\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "PFxXX-tC7Slq" }, "source": [ "Compile the Keras classifier with no parameters this time.\n", "\n", "Create a separate dictionary of parameters, each with a number of different settings.\n", "\n", "Run GridSearchCV using the classifier as estimator, the parameter dictionary, and by specifying the number of k-folds and the scoring metric; the note below quantifies how many fits this triggers.\n" ] },
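{ "cell_type": "markdown", "metadata": {}, "source": [ "On the cost of the search below: $2 \\times 2 \\times 2 = 8$ parameter combinations (2 batch sizes, 2 epoch settings, 2 optimizers), each cross-validated on 10 folds, i.e. $8 \\times 10 = 80$ model fits (half of them for 500 epochs), plus a final refit of the best configuration, so expect a long run time." ] },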
\n", "\n", "Create a separate vector of parameters, each with a number of different settings.\n", "\n", "Run GridSearchCV using the classifier as estimator, the parameters vector, and by specifying the number of k-folds and the scoring metric.\n" ] }, { "cell_type": "code", "metadata": { "id": "nF0AJWsB7Slq" }, "source": [ "# configure the classifier as needed; set the building function\n", "classifier = KerasClassifier(build_fn = build_classifier)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RW8E-Yhu7Slq" }, "source": [ "# enter different options for the batch size, the number of epochs and the optimizer:\n", "parameters = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Vo6VlxkY7Slq" }, "source": [ "# Customize the Grid Search CV\n", "grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "bn2AlCFX7Slq" }, "source": [ "We now know which parameter setting from them all scores the highest accuracy.\n", "\n", "Printing out the best parameters we observe the following:\n" ] }, { "cell_type": "code", "metadata": { "id": "WHPbvR877Slq" }, "source": [ "# fit the grid_search model to our training data\n", "grid_search = grid_search.fit(X_train, y_train)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "QmEGcOD57Slr" }, "source": [ "# obtain the best parameters and best accuracy\n", "best_parameters = grid_search.best_params_\n", "best_parameters" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "mX7JeTLs7Slr" }, "source": [ "best_accuracy = grid_search.best_score_\n", "best_accuracy" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "PgUiHzso7Slr" }, "source": [ "" ] }, { "cell_type": "code", "metadata": { "id": "Ha6gIBS87Slr" }, "source": [ "" ], "execution_count": null, "outputs": [] } ] }