{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Artificial Intelligence

\n", "

Lesson 7

\n", "

Random Forest

\n", "\n", "
\n", "\n", "
Random Forest
\n", "\n", "
Data Pre-Processing
\n", "\n", "
Create the Model
\n", "\n", "
Train the Model
\n", "\n", "
Test the Model
\n", "\n", "
Summary
\n", "\n", "
Challenge
\n", "\n", "
\n", "\n", "- ref: http://benalexkeen.com/random-forests-in-python-using-scikit-learn/\n", "- dataset: https://www.kaggle.com/c/titanic/data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "
\n", "Decision Trees are a great tool but they can often overfit the training set of data unless pruned effectively, hindering their predictive capabilities.\n", "
\n", "\n", ">*Random forests are an **ensemble model** of many decision trees, in which each tree will specialize its focus on a particular feature, while maintaining an overview of all features.*\n", "\n", "
\n", "Each tree in the random forest will do its own random train/test split of the data, known as bootstrap aggregation and the samples not included are known as the ‘out-of-bag’ samples. \n", "

\n", "Additionally each tree will do feature bagging at each node-branch-split, in order to lessen the effects of a feature that is highly correlated with the response. Which minimizes individual feature importance and allows for more randomness in the variety of decision trees used to obtain a result.\n", "

\n", "While an individual tree might be sensitive to outliers, the ensemble model will likely not be.\n", "

\n", "The ensemble model predicts new labels by taking a majority vote from each of its trees given a new observation.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "RANDOM FOREST\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "The root node (the first decision node) partitions the data using the feature that provides the most information gain. This root node clustering can be seen in the image below, where the red peaks give way to a Decision Tree, with similar root nodes being clustered close together.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "Information gain tells us how important a given attribute of the feature vectors is in regards to the end prediction.\n", "\n", "For a more in depth overview of feature importance and it's relation to information gain - specifically related to Random Forest - See the following link:\n", "Selecting Features by Importance\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "NOTE\n", "
\n", "\n", ">- It is vital that you use your understanding from the previous lessons to analyze each section of code before continuing.\n", ">- This lesson assumes you are starting to see patterns in how data is handled and processed before being used to train or test the model.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "DATA PRE-PROCESSING\n", "
\n", "\n", "For this lesson we are going to create our own extracted dataframe from the provided Titanic dataset in order to make feature importance more evident. We will do this by converting text values to numbers which will also increase the efficiency of processing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# remove warnings\n", "import warnings\n", "#warnings.filterwarnings('ignore')\n", "\n", "%matplotlib inline\n", "\n", "import pandas as pd\n", "pd.options.display.max_columns = 100\n", "\n", "from matplotlib import pyplot as plt\n", "import matplotlib\n", "matplotlib.style.use('ggplot')\n", "\n", "import numpy as np\n", "pd.options.display.max_rows = 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "\n", "
\n", "\n", "**Since we will be doing a lot of pre-processing for this dataset, we will want to know when cells have completed execution - let's make that extremely obvious by writing a simple function.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# print function for determining when feature processing is complete\n", "def status(feature):\n", " print('Processing', feature, ':OK')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Now that we have that, let's create another function to quickly load, split, and combine our training and test data so it can be easily extracted when we need it.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_combined_data():\n", " train = pd.read_csv('train.csv')\n", " test = pd.read_csv('test.csv')\n", " targets = train.Survived # extracting and removing the targets from training data\n", " train.drop(['Survived'], 1, inplace=True)\n", " \n", " combined = train.append(test)\n", " combined.reset_index(inplace=True)\n", " combined.drop('index', inplace=True, axis=1)\n", " return combined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**With our new get_combine_data function let's create a new dataframe and verify it is working by attempting to extract the data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "combined = get_combined_data()\n", "combined.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Now that we have the dataset extracted.\n", "\n", "Let's extract the passenger titles from the dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_titles():\n", " global combined\n", " combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())\n", " Title_Dictionary = {\n", " 'Capt': 'Officer',\n", " 'Col': 'Officer',\n", " 'Major': 'Officer',\n", " 'Jonkheer':'Royalty',\n", " 'Don': 'Royalty',\n", " 'Sir': 'Royalty',\n", " 'Dr': 'Officer',\n", " 'Rev': 'Officer',\n", " 'the Countess':'Royalty',\n", " 'Dona': 'Royalty',\n", " 'Mme':'Mrs',\n", " 'Mlle':'Miss',\n", " 'Ms':'Mrs',\n", " 'Mr':'Mr',\n", " 'Mrs':'Mrs',\n", " 'Miss':'Miss',\n", " 'Master':'Master',\n", " 'Lady':'Royalty'\n", " }\n", " combined['Title'] = combined.Title.map(Title_Dictionary)\n", " combined.drop('Name', 1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**With our function in place, let's test it and verify the changes in our combined dataframe**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_titles()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "*Group* the passenger *Ages* by *Title*, *Pclass*, and *Gender*\n", "\n", ">Use the 'median()' method to display the output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = combined.groupby(['Sex','Pclass','Title'])\n", "grouped.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "With the mean data for each 'Title' relative to each 'Sex' and 'Pclass' we can now do our own imputation while processing.\n", "
\n", "However, we can assume age is fairly important value in terms of determining a persons ability to survive a disaster - especially without assistance.\n", "
\n", "\n", ">Keeping the above in mind!\n", "\n", ">Let's define defaults for the 'Age' values using the mean data above to fill in any blanks or indiscernible values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def process_age():\n", " \n", " global combined\n", " \n", " # a function that fills the missing values of the Age variable\n", " \n", " def fillAges(row):\n", " if row['Sex']=='female' and row['Pclass'] == 1:\n", " if row['Title'] == 'Miss':\n", " return 30\n", " elif row['Title'] == 'Mrs':\n", " return 45\n", " elif row['Title'] == 'Officer':\n", " return 49\n", " elif row['Title'] == 'Royalty':\n", " return 39\n", "\n", " elif row['Sex']=='female' and row['Pclass'] == 2:\n", " if row['Title'] == 'Miss':\n", " return 20\n", " elif row['Title'] == 'Mrs':\n", " return 30\n", "\n", " elif row['Sex']=='female' and row['Pclass'] == 3:\n", " if row['Title'] == 'Miss':\n", " return 18\n", " elif row['Title'] == 'Mrs':\n", " return 31\n", "\n", " elif row['Sex']=='male' and row['Pclass'] == 1:\n", " if row['Title'] == 'Master':\n", " return 6\n", " elif row['Title'] == 'Mr':\n", " return 41.5\n", " elif row['Title'] == 'Officer':\n", " return 52\n", " elif row['Title'] == 'Royalty':\n", " return 40\n", "\n", " elif row['Sex']=='male' and row['Pclass'] == 2:\n", " if row['Title'] == 'Master':\n", " return 2\n", " elif row['Title'] == 'Mr':\n", " return 30\n", " elif row['Title'] == 'Officer':\n", " return 41.5\n", "\n", " elif row['Sex']=='male' and row['Pclass'] == 3:\n", " if row['Title'] == 'Master':\n", " return 6\n", " elif row['Title'] == 'Mr':\n", " return 26\n", " combined.Age = combined.apply(lambda r: fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)\n", " status('age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "NOTE\n", "
\n", "\n", ">- Ensure you understand the full scope of what the function above is doing before continuing!\n", "\n", "
\n", "\n", "\n", "Test the function and view the changes to our dataframe\n", "\n", "\n", ">After executing the function below, note there are still several features which have a number of rows that are either null, NaN or indiscernible." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# execute the function\n", "process_age()\n", "combined.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "
\n", "\n", "#### Define function for creating boolean passenger 'Title' classification \n", ">This will result in faster and clearer results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_names():\n", " global combined\n", " titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')\n", " combined = pd.concat([combined, titles_dummies], axis=1)\n", " combined.drop('Title', axis=1, inplace=True)\n", " status('Name')\n", " \n", "# execute the function\n", "process_names()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passengers Fare" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_fare():\n", " global combined\n", " combined.Fare.fillna(combined.Fare.mean(), inplace=True)\n", " status('fare')\n", " \n", "# execute the function\n", "process_fare()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passengers Embarked status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_embarked():\n", " global combined\n", " combined.Embarked.fillna('S', inplace=True)\n", " # dummy encoding\n", " embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')\n", " combined = pd.concat([combined, embarked_dummies], axis=1)\n", " combined.drop('Embarked', axis=1, inplace=True)\n", " status('Embarked')\n", " \n", "# execute the function \n", "process_embarked()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passenger's cabin" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_cabin():\n", " global combined\n", " combined.Cabin.fillna('U', inplace=True)\n", " # mapping each \n", " combined['Cabin'] = combined['Cabin'].map(lambda c : c[0])\n", " # dummy encoding\n", " cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')\n", " combined = pd.concat([combined, cabin_dummies], axis=1)\n", " combined.drop('Cabin', axis=1, inplace=True)\n", " status('Cabin')\n", "\n", "# execute the function\n", "process_cabin()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "\n", "
\n", "\n", "#### Process the passenger genders into binary values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_sex():\n", " global combined\n", " combined['Sex'] = combined['Sex'].map({'male':0, 'female':1})\n", " status('Sex')\n", "\n", "# execute the function \n", "process_sex()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passenger Class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_pclass():\n", " global combined\n", " pclass_dummies = pd.get_dummies(combined['Pclass'], prefix='Pclass')\n", " combined = pd.concat([combined, pclass_dummies], axis=1)\n", " combined.drop('Pclass', axis=1, inplace=True)\n", " status('pclass')\n", "\n", "# execute the function\n", "process_pclass()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passengers Ticket" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_ticket():\n", " global combined\n", " # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix\n", " def cleanTicket(ticket):\n", " ticket = ticket.replace('.','')\n", " ticket = ticket.replace('/','')\n", " ticket = map(lambda t : t.strip(), ticket)\n", " # print(type(ticket))\n", " ticket = list(filter(lambda t : not t.isdigit(), ticket))\n", " if len(ticket) > 0:\n", " return ticket[0]\n", " else:\n", " return 'XXX'\n", " # extracting dummy variables from tickets\n", " combined['Ticket'] = combined['Ticket'].map(cleanTicket)\n", " tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')\n", " combined = pd.concat([combined, tickets_dummies], axis=1)\n", " combined.drop('Ticket', inplace=True, axis=1)\n", " status('Ticket')\n", "\n", "# execute the function\n", "process_ticket()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Process the passengers Family" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_family():\n", " global combined\n", " #\n", " combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1\n", " #\n", " combined['Singleton'] = combined['FamilySize'].map(lambda s : 1 if s == 1 else 0)\n", " #\n", " combined['SmallFamily'] = combined['FamilySize'].map(lambda s : 1 if 2<=s<=4 else 0)\n", " #\n", " combined['BigFamily'] = combined['FamilySize'].map(lambda s : 1 if s > 4 else 0)\n", " #\n", " status('family')\n", "\n", "# execute the function\n", "process_family()\n", "combined.shape\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "\n", "
\n", "\n", "#### Now it is time to Scale all of the Features we Selected" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def scale_all_features():\n", " global combined\n", " features = list(combined.columns)\n", " features.remove('PassengerId')\n", " combined[features] = combined[features].apply(lambda x : x/x.max(), axis=0)\n", " print('Features scaled successfully!')\n", " \n", "scale_all_features()\n", "combined.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "\n", "
\n", "CREATE THE MODEL\n", "
\n", "\n", "#### Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.feature_selection import SelectKBest\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier\n", "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Define function for determining score of classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def compute_score(clf, X, y, scoring='accuracy'):\n", " # determine the cross validation scores\n", " xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)\n", " # return the mean of the cross val scores\n", " return np.mean(xval)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Define function for assigning testing and training data pools" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def recover_train_test_target():\n", " global combined\n", " # get the target values\n", " train0 = pd.read_csv('train.csv')\n", " targets = train0.Survived\n", " \n", " # split the data\n", " train = combined.loc[0:890]\n", " test = combined.loc[891:]\n", " return train, test, targets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Execute the function you just made for assigning data to train/test pools" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train, test, targets = recover_train_test_target()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "\n", "
\n", "\n", "
\n", "TRAIN THE MODEL\n", "
\n", "\n", "#### Import libraries for ExtraTreesClassifier and SelectFromModel feature selection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.feature_selection import SelectFromModel\n", "\n", "# initialize the model ( set the number of estimators )\n", "clf = ExtraTreesClassifier(n_estimators=400)\n", "\n", "# fit the training data and targets as features/column names\n", "clf = clf.fit(train, targets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Create features data frame to observe feature importance\n", "\n", ">This step is enabled by the built in '**feature\\_importances\\_**' attribute in the *ExtraTreesClassifier()* model we initialized earlier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = pd.DataFrame()\n", "# add the training data to the 'feature' column in the 'features' dataframe\n", "features['feature'] = train.columns\n", "# add the feature_importance_ attribute data to the 'importance' column in the 'features' dataframe\n", "features['importance'] = clf.feature_importances_\n", "\n", "# display the list of features and their importance\n", "# adjust the .head() to see the entire list of features\n", "features.sort_values(['importance'], ascending=False).head(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### Select the Features of most Importance\n", "\n", "*Use the selected features to create new training data*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use SelectFromModel to load our pre-existing classifier model into the new dataframe\n", "# prefit = true because we already initialized the data with features and target values\n", "model = SelectFromModel(clf, prefit=True)\n", "# create new training data from the pre-existing model\n", "train_new = model.transform(train)\n", "# observe the training dataframe shape\n", "train_new.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "*Let's do the same for our testing data*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create new testing data\n", "test_new = model.transform(test)\n", "# observe the testing dataframe shape\n", "test_new.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "#### Initialize the Random Forest model and Define a parameter grid" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "forest = RandomForestClassifier(max_features='sqrt')\n", "parameter_grid = {\n", " 'max_depth' : [4,5,6,7,8],\n", " 'n_estimators' : [200, 300, 400],\n", " 'criterion' : ['gini', 'entropy']\n", " }\n", "# get cross validation table for GridSearch (n_splits=5)\n", "cross_validation = StratifiedKFold(n_splits=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "TEST THE MODEL\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "#### Use GridSearchCV to select the best Random Forest results in the paramater grid as well as the best parameters used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_search = GridSearchCV(forest, param_grid=parameter_grid, cv=cross_validation)\n", "grid_search.fit(train_new, targets)\n", "\n", "print('Best score : {}'.format(grid_search.best_score_))\n", "print('Best parameters : {}'.format(grid_search.best_params_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "SUMMARY\n", "
\n", "\n", "Random forests are a popular method for feature ranking, since they are so easy to apply.\n", "\n", "In general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. \n", "\n", "But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories.\n", "\n", "As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.\n", "\n", "
\n", "\n", "
\n", "CHALLENGE\n", "
\n", "#### Store the results and classification overview in a CSV file locally\n", "This is an example of a submission being created for a competition on Kaggle.com\n", "
\n", "

You are not required to complete this section, it is mainly for observation.

\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create output dataframe\n", "# df_output = pd.DataFrame()\n", "\n", "# get grid_search prediction results\n", "# pipeline = grid_search\n", "# output = pipeline.predict(test_new).astype(int)\n", "\n", "# store the predictions in the 'survived' column\n", "# df_output['Survived'] = output\n", "\n", "# store the passenger ids in the corresponding table\n", "# df_output['PassengerId'] = test['PassengerId']\n", "\n", "# execute the to_csv() method on the output dataframe - hold the index\n", "# df_output[['PassengerId', 'Survived']].to_csv('output.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }