{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Artificial Intelligence

\n", "

Lesson 5-B

\n", "

Gaussian Naive Bayes & Pre-Processing Continued

\n", "\n", "
\n", "\n", "
Data Pre-Processing
\n", "\n", "
Data Imputation
\n", "\n", "
One Hot Encoder
\n", "\n", "
Standardization of Data
\n", "\n", "
Data Splitting
\n", "\n", "
Gaussian Naive Bayes Implementation
\n", "\n", "
Accuracy of our GNB Model
\n", "\n", "
\n", "\n", "- ref.: http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/\n", "- data: https://archive.ics.uci.edu/ml/datasets/Adult" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "\n", "
\n", "This lesson continues Lesson 11a: we finish preparing our data for the Gaussian Naive Bayes model included in scikit-learn and cover several other common techniques for pre-processing data before training.\n", "
\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This will include the required code snippets from the previous lesson. \n", "# 'requirements.py' should be provided in the Lesson10 folder.\n", "%run requirements.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "DATA PRE-PROCESSING\n", "
\n", "For preprocessing, we are going to work on a duplicate of our original dataframe: we copy adult_df to adult_df_rev so that imputing missing values does not modify the original." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# .copy() creates an independent dataframe; plain assignment would only alias adult_df\n", "adult_df_rev = adult_df.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before doing that, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.\n", "\n", "We pass the include parameter with the value 'all' to request summary statistics for all the attributes, not only the numeric ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dataset basic statistics\n", "adult_df_rev.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "


\n", "\n", "After running the above command, spend some time looking at the statistics it reports for each attribute.\n", "\n", "
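To make the copy and describe() steps concrete, here is a minimal sketch on a made-up three-row frame (the name toy_df and its values are assumptions for illustration, not part of the Adult dataset). It shows that plain assignment only creates a second name for the same dataframe, whereas .copy() creates an independent one, and that the top row of describe(include='all') holds the most frequent value of a text column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Adult dataset
toy_df = pd.DataFrame({'workclass': ['Private', 'Private', '?'],
                       'age': [39, 50, 38]})

alias = toy_df               # second name for the same object
independent = toy_df.copy()  # separate dataframe with its own data

# Change a value through the original name
toy_df.loc[2, 'age'] = 99

alias_age = int(alias.loc[2, 'age'])              # follows the change
independent_age = int(independent.loc[2, 'age'])  # keeps the old value

# describe(include='all') reports count, unique, top and freq for text columns
summary = toy_df.describe(include='all')
most_frequent = summary.loc['top', 'workclass']
print(alias_age, independent_age, most_frequent)
```

Working on the copy means the original frame stays available for comparison after imputation and encoding.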
\n", "\n", "
\n", "DATA IMPUTATION\n", "
\n", "\n", "Now it’s time to impute the missing values. Some of our categorical attributes contain missing values, marked as “?”. We are going to replace each “?” with the value in the top row of the describe() output above, that is, the most frequent value of that attribute. For example, we are going to replace the “?” values of the workclass attribute with the “Private” value." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# describe() is computed once; its 'top' row holds each column's most frequent value\n", "summary = adult_df_rev.describe(include='all')\n", "for value in ['workclass', 'education',\n", "              'marital_status', 'occupation',\n", "              'relationship', 'race', 'sex',\n", "              'native_country', 'income']:\n", "    # assign the result back instead of relying on a chained inplace replace\n", "    adult_df_rev[value] = adult_df_rev[value].replace('?', summary.loc['top', value])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "You have successfully performed the data imputation step. 🙂" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# display and verify the changes in the adult_df_rev dataframe\n", "#print(adult_df_rev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
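As a small illustration of the imputation step, the sketch below replaces '?' with the most frequent value of each column on a made-up frame. The frame toy_df and the helper impute_with_mode are hypothetical names introduced here. The approach mirrors the loop above but is not the lesson's exact code: it ignores the '?' placeholder itself when looking for the most frequent value.

```python
import pandas as pd

def impute_with_mode(df, columns, missing='?'):
    '''Return a copy of df where the missing marker is replaced by each column's most frequent real value.'''
    result = df.copy()
    for col in columns:
        # Ignore the placeholder itself when looking for the most frequent value
        known = result.loc[result[col] != missing, col]
        most_frequent = known.mode().iloc[0]
        result[col] = result[col].replace(missing, most_frequent)
    return result

# Hypothetical toy data, not taken from the Adult dataset
toy_df = pd.DataFrame({
    'workclass': ['Private', '?', 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', '?', 'Tech-support'],
})
imputed = impute_with_mode(toy_df, ['workclass', 'occupation'])
print(imputed)
```

Because the helper returns a new frame, toy_df still contains its '?' entries afterwards, which makes it easy to compare before and after.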


\n", "\n", "For Gaussian Naive Bayes, we need to convert all the data values into one numeric format. We are going to encode each categorical label as an integer between 0 and n_classes-1.\n", "\n", "
\n", "\n", "
\n", "ONE-HOT ENCODER\n", "
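\n", "\n", "We are going to use LabelEncoder from the 'scikit-learn' library to implement the encoding. LabelEncoder maps each category of a feature to a single integer.\n", "\n", "A one-hot encoder, by contrast, encodes the data in a binary format, with one 0/1 column per category. Note that the code in this lesson uses label encoding, not one-hot encoding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "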
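The difference between label encoding and one-hot encoding is easier to see on a tiny example. The sketch below uses a made-up colour column (an assumption for illustration): LabelEncoder maps each category to one integer, while pandas get_dummies, used here as a simple stand-in for a one-hot encoder, produces one 0/1 column per category.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy column, not part of the Adult dataset
colours = pd.Series(['red', 'green', 'blue', 'green'])

# Label encoding: one integer per category, assigned in sorted order
le = preprocessing.LabelEncoder()
label_codes = le.fit_transform(colours)
print(list(le.classes_))   # categories in the order used for the codes
print(list(label_codes))

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(colours).astype(int)
print(one_hot)
```

Label encoding keeps a single column per feature, but the integers imply an order that the categories may not really have.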
\n", "\n", "The first thing we need to do is initialize the label encoder.\n", "\n", "Then we use *fit_transform()* to convert each feature's categories into integer labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# initialize label encoder\n", "le = preprocessing.LabelEncoder()\n", "\n", "# create one encoded array per categorical column, named to match\n", "# the variables used in the cell below\n", "# example: label_cat = le.fit_transform(dataframe.column_label)\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**We have now created the encoded labels for our new columns.**\n", "\n", "It is time to add the new columns to the copied dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initialize the encoded categorical columns\n", "# by assigning the newly created columns to the copied dataframe\n", "adult_df_rev['workclass_cat'] = workclass_cat\n", "adult_df_rev['education_cat'] = education_cat\n", "adult_df_rev['marital_cat'] = marital_cat\n", "adult_df_rev['occupation_cat'] = occupation_cat\n", "adult_df_rev['relationship_cat'] = relationship_cat\n", "adult_df_rev['race_cat'] = race_cat\n", "adult_df_rev['sex_cat'] = sex_cat\n", "adult_df_rev['native_country_cat'] = native_country_cat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**We now have the updated data in the copied dataframe.**\n", "\n", "We need to drop the old features so that they do not conflict with the encoded ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# determine the features you want to drop by label\n", "dummy_fields = ['workclass', 'education', 'marital_status',\n", "                'occupation', 'relationship', 'race',\n", "                'sex', 'native_country']\n", "# drop the old categorical columns from the copied dataframe\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using the above code snippets, we have created multiple categorical columns such as “marital_cat” and “race_cat”. You can see the top of the dataframe using adult_df_rev.head().**\n", "\n", "By printing adult_df_rev, you will see that all the encoded columns are present.\n", "\n", "
\n", "\n", "**However, they are not in the proper order.**\n", "\n", "*For reindexing the columns, you can use the code snippet provided below:*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reindex_axis() was removed in pandas 1.0; reindex(columns=...) reorders the columns\n", "adult_df_rev = adult_df_rev.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',\n", "                                             'education_num', 'marital_cat', 'occupation_cat',\n", "                                             'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',\n", "                                             'capital_loss', 'hours_per_week', 'native_country_cat',\n", "                                             'income'])\n", "\n", "adult_df_rev.head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The output of the above code snippet shows that all the columns are now in the proper order. The list of column names is passed as the columns parameter of reindex(), which reorders the columns.\n", "\n", "**You should also notice that the columns we did not drop from the original dataframe are still present.**\n", "\n", "*In addition, every value is now an integer aside from the target value.*\n", "\n", "


\n", "\n", "
\n", "STANDARDIZATION OF DATA\n", "
\n", "\n", "Now that all of the data values in our copied dataframe are numeric, we need to bring them onto a single scale. In other words, we need to standardize the values.\n", "\n", "We can use the below formula for standardization:\n", "\n", "$$z = \\frac{x - \\mu}{\\sigma}$$\n", "\n", "where $\\mu$ is the mean and $\\sigma$ is the standard deviation of the feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Standardization of Data\n", "# list the column headers/feature names to standardize\n", "num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',\n", "                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',\n", "                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',\n", "                'native_country_cat']\n", "\n", "# dictionary that stores each feature's mean and std for later inspection\n", "scaled_features = {}\n", "\n", "# for each feature/label/column\n", "for each in num_features:\n", "    # compute the mean and standard deviation of the feature\n", "    mean, std = adult_df_rev[each].mean(), adult_df_rev[each].std()\n", "    # store the mean and std for viewing\n", "    scaled_features[each] = [mean, std]\n", "    # replace the feature with its standardized values,\n", "    # using the standardization formula displayed above\n", "    adult_df_rev[each] = (adult_df_rev[each] - mean) / std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have converted our data values into standardized values.\n", "You can print and check the output of the *adult_df_rev* dataframe to verify.\n", "\n", "
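A quick way to convince yourself that standardization works is to subtract the mean and divide by the standard deviation on a few made-up numbers, then check that the result has mean 0 and standard deviation 1. The values below are hypothetical. Note that pandas std() uses the sample standard deviation (ddof=1) by default, which is what the loop above relies on.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0])

mean, std = ages.mean(), ages.std()   # std() defaults to ddof=1
standardized = (ages - mean) / std

print(round(standardized.mean(), 6), round(standardized.std(), 6))
```

After this transformation, features measured in very different units (for example age and capital_gain) end up on a comparable scale.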


\n", "\n", "
\n", "DATA SPLITTING\n", "
\n", "\n", "Let’s split the dataset into a training set and a test set. We can easily perform this step using sklearn’s train_test_split() function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# select the 14 feature columns\n", "features = adult_df_rev.values[:, :14]\n", "# select the target column (income)\n", "target = adult_df_rev.values[:, 14]\n", "# use train_test_split to split the data into training and testing sets\n", "features_train, features_test, target_train, target_test = train_test_split(\n", "    features, target, test_size=0.33, random_state=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of 14 columns, i.e. the predictor variables, and the target set consists of 1 column with the class values.\n", "\n", "features_train and target_train contain the training data, and features_test and target_test contain the testing data.\n", "\n", "
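To see what train_test_split() returns, here is a minimal sketch on a made-up array of 100 rows (the array names and sizes are assumptions for illustration). With test_size=0.33 about a third of the rows go to the test set, and a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 14 feature columns plus a 0/1 target
rng = np.random.RandomState(0)
features = rng.normal(size=(100, 14))
target = rng.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=10)

print(X_train.shape, X_test.shape)

# The same random_state gives the same split again
X_train_again, _, _, _ = train_test_split(
    features, target, test_size=0.33, random_state=10)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```

Keeping random_state fixed is what lets two runs of the notebook report the same accuracy.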


\n", "\n", "
\n", "GAUSSIAN NAIVE BAYES IMPLEMENTATION\n", "
\n", "\n", "After completing the data preprocessing, it’s time to apply a machine learning algorithm to the data. We are going to use sklearn’s GaussianNB class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# initialize the model\n", "clf = GaussianNB()\n", "# fit the model to the training data\n", "clf.fit(features_train, target_train)\n", "# predict the class of each row in the test features\n", "target_pred = clf.predict(features_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have built a GaussianNB classifier. The classifier is trained on the training data using the fit() method. Once trained, the model is ready to make predictions: we call the predict() method with the test set features as its argument.\n", "\n", "
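To make fit() and predict() less of a black box, the sketch below trains GaussianNB on a tiny made-up dataset with one feature and two well separated classes (all values are hypothetical). After fitting, theta_ holds the per-class feature means that the model uses in its Gaussian likelihoods.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical one-feature data: class 0 clusters near 1, class 1 near 10
X = np.array([[0.8], [1.0], [1.2], [9.8], [10.0], [10.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# Per-class mean of each feature, one row per class
class_means = clf.theta_
print(class_means)

# New points are assigned to the class whose Gaussian fits them best
predictions = clf.predict(np.array([[1.1], [9.5]]))
print(predictions)
```

On the Adult data the same thing happens with 14 features at once: the model estimates one mean and one variance per feature and class, and treats the features as independent given the class.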



\n", "\n", "
\n", "ACCURACY OF GNB MODEL\n", "
\n", "It’s time to test the quality of our model. We have made some predictions, so let’s compare the model’s predictions with the actual target values for the test set. The fraction of predictions that match is the accuracy of our model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare the predicted labels with the true labels of the test set;\n", "# normalize=True returns the fraction of correct predictions\n", "accuracy_score(target_test, target_pred, normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awesome! Our model is returning an accuracy of ~80%.\n", "\n", "This is not bad for a simple implementation, and could easily be made more effective.\n", "\n", "You can create random test datasets and test the model to get to know how well the trained Gaussian Naive Bayes model is performing.\n", "\n", "### Continue to the next Lesson" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }