{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bagging and Random Forests\n", "> A Summary of lecture \"Machine Learning with Tree-Based Models in Python\n", "\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/feature_importances.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bagging\n", "- Ensemble Methods\n", " - Voting Classifier\n", " - same training set,\n", " - $\\neq$ algortihms\n", " - Bagging\n", " - One algorithm\n", " - $\\neq$ subsets of the training set\n", "- Bagging\n", " - Bootstrap Aggregation\n", " - Uses a technique known as the bootstrap\n", " - Reduces variance of individual models in the ensemble\n", "_ Bootstrap\n", "![bootstrap](image/bootstrap.png)\n", "- Bootstrap-training\n", "![training](image/bs_training.png)\n", "- Bootstrap-predict\n", "![predict](image/bs_predict.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define the bagging classifier\n", "In the following exercises you'll work with the [Indian Liver Patient dataset](https://www.kaggle.com/uciml/indian-liver-patient-records) from the UCI machine learning repository. Your task is to predict whether a patient suffers from a liver disease using 10 features including Albumin, age and gender. You'll do so using a Bagging Classifier.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age_stdTotal_Bilirubin_stdDirect_Bilirubin_stdAlkaline_Phosphotase_stdAlamine_Aminotransferase_stdAspartate_Aminotransferase_stdTotal_Protiens_stdAlbumin_stdAlbumin_and_Globulin_Ratio_stdIs_male_stdLiver_disease
01.247403-0.420320-0.495414-0.428870-0.355832-0.3191110.2937220.203446-0.14739001
11.0623061.2189361.4235181.675083-0.093573-0.0359620.9396550.077462-0.64846111
21.0623060.6403750.9260170.816243-0.115428-0.1464590.4782740.203446-0.17870711
30.815511-0.372106-0.388807-0.449416-0.366760-0.3122050.2937220.3294310.16578011
41.6792940.0939560.179766-0.395996-0.295731-0.1775370.755102-0.930414-1.71323711
\n", "
" ], "text/plain": [ " Age_std Total_Bilirubin_std Direct_Bilirubin_std \\\n", "0 1.247403 -0.420320 -0.495414 \n", "1 1.062306 1.218936 1.423518 \n", "2 1.062306 0.640375 0.926017 \n", "3 0.815511 -0.372106 -0.388807 \n", "4 1.679294 0.093956 0.179766 \n", "\n", " Alkaline_Phosphotase_std Alamine_Aminotransferase_std \\\n", "0 -0.428870 -0.355832 \n", "1 1.675083 -0.093573 \n", "2 0.816243 -0.115428 \n", "3 -0.449416 -0.366760 \n", "4 -0.395996 -0.295731 \n", "\n", " Aspartate_Aminotransferase_std Total_Protiens_std Albumin_std \\\n", "0 -0.319111 0.293722 0.203446 \n", "1 -0.035962 0.939655 0.077462 \n", "2 -0.146459 0.478274 0.203446 \n", "3 -0.312205 0.293722 0.329431 \n", "4 -0.177537 0.755102 -0.930414 \n", "\n", " Albumin_and_Globulin_Ratio_std Is_male_std Liver_disease \n", "0 -0.147390 0 1 \n", "1 -0.648461 1 1 \n", "2 -0.178707 1 1 \n", "3 0.165780 1 1 \n", "4 -1.713237 1 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indian = pd.read_csv('./dataset/indian_liver_patient_preprocessed.csv', index_col=0)\n", "indian.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = indian.drop('Liver_disease', axis='columns')\n", "y = indian['Liver_disease']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import BaggingClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(random_state=1)\n", "\n", "# Instantiate bc\n", "bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate Bagging performance\n", "Now that you instantiated the bagging classifier, it's time to train it and evaluate its test set accuracy.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import 
train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy of bc: 0.71\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Fit bc to the training set\n", "bc.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = bc.predict(X_test)\n", "\n", "# Evaluate acc_test\n", "acc_test = accuracy_score(y_test, y_pred)\n", "print('Test set accuracy of bc: {:.2f}'.format(acc_test))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy of dt: 0.63\n" ] } ], "source": [ "dt.fit(X_train, y_train)\n", "\n", "y_pred_dt = dt.predict(X_test)\n", "\n", "acc_test_dt = accuracy_score(y_test, y_pred_dt)\n", "print('Test set accuracy of dt: {:.2f}'.format(acc_test_dt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out of Bag Evaluation\n", "- Bagging\n", " - Some instances may be sampled several times for one model, other instances may not be sampled at all.\n", "- Out Of Bag (OOB) instances\n", " - On average, for each model, 63% of the training instances are sampled\n", " - The remaining 37% constitute the OOB instances\n", "- OOB Evaluation\n", "![oob](image/oob.png)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare the ground\n", "In the following exercises, you'll compare the OOB accuracy to the test set accuracy of a bagging classifier trained on the Indian Liver Patient dataset.\n", "\n", "In sklearn, you can evaluate the OOB accuracy of an ensemble classifier by setting the parameter ```oob_score``` to ```True``` during instantiation. 
After training the classifier, the OOB accuracy can be obtained by accessing the ```.oob_score_``` attribute of the fitted instance.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import BaggingClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)\n", "\n", "# Instantiate bc; dt is passed positionally because this parameter was\n", "# renamed from base_estimator to estimator in scikit-learn 1.2\n", "bc = BaggingClassifier(dt, n_estimators=50, oob_score=True, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OOB Score vs Test Set Score\n", "Now that you have instantiated bc, you will fit it to the training set and evaluate its test set and OOB accuracies.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy: 0.698, OOB accuracy: 0.700\n" ] } ], "source": [ "# Fit bc to the training set\n", "bc.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = bc.predict(X_test)\n", "\n", "# Evaluate test set accuracy\n", "acc_test = accuracy_score(y_test, y_pred)\n", "\n", "# Evaluate OOB accuracy\n", "acc_oob = bc.oob_score_\n", "\n", "# Print acc_test and acc_oob\n", "print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests (RF)\n", "- Bagging\n", "  - Base estimator: Decision Tree, Logistic Regression, Neural Network, ...\n", "  - Each estimator is trained on a distinct bootstrap sample of the training set\n", "  - Estimators use all features for training and prediction\n", "- Further Diversity with Random Forest\n", "  - Base estimator: Decision Tree\n", "  - Each estimator is trained on a different bootstrap sample having the same size as the training set\n", "  - RF introduces further randomization in the training of individual trees\n", 
" - $d$ features are sampled at each node without replacement\n", " $$ d < \\text{total number of features} $$\n", "- Random Forest: Training\n", "![rf_training](image/rf_training.png)\n", "- Random Forest: Prediction\n", "![rf_predict](image/rf_prediction.png)\n", "- Feature importance\n", " - Tree based methods: enable measuring the importance of each feature in prediction\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train an RF regressor\n", "In the following exercises you'll predict bike rental demand in the Capital Bikeshare program in Washington, D.C using historical weather data from the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) dataset available through Kaggle. For this purpose, you will be using the random forests algorithm. As a first step, you'll define a random forests regressor and fit it to the training set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
hrholidayworkingdaytemphumwindspeedcntinstantmnthyrClear to partly cloudyLight PrecipitationMisty
00000.760.660.00001491300471100
11000.740.700.1343931300571100
22000.720.740.0896901300671100
33000.720.840.1343331300771100
44000.700.790.194041300871100
\n", "
" ], "text/plain": [ " hr holiday workingday temp hum windspeed cnt instant mnth yr \\\n", "0 0 0 0 0.76 0.66 0.0000 149 13004 7 1 \n", "1 1 0 0 0.74 0.70 0.1343 93 13005 7 1 \n", "2 2 0 0 0.72 0.74 0.0896 90 13006 7 1 \n", "3 3 0 0 0.72 0.84 0.1343 33 13007 7 1 \n", "4 4 0 0 0.70 0.79 0.1940 4 13008 7 1 \n", "\n", " Clear to partly cloudy Light Precipitation Misty \n", "0 1 0 0 \n", "1 1 0 0 \n", "2 1 0 0 \n", "3 1 0 0 \n", "4 1 0 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bike = pd.read_csv('./dataset/bikes.csv')\n", "bike.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "X = bike.drop('cnt', axis='columns')\n", "y = bike['cnt']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=25, n_jobs=None, oob_score=False,\n", " random_state=2, verbose=0, warm_start=False)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Instantiate rf\n", "rf = RandomForestRegressor(n_estimators=25, random_state=2)\n", "\n", "# Fit rf to the training set\n", "rf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the RF regressor\n", "You'll now evaluate the test set RMSE of the random forests regressor ```rf``` that you trained in the previous exercise." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set RMSE of rf: 54.49\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Predict the test set labels\n", "y_pred = rf.predict(X_test)\n", "\n", "# Evaluate the test set RMSE\n", "rmse_test = MSE(y_test, y_pred) ** 0.5\n", "\n", "# Print rmse_test\n", "print('Test set RMSE of rf: {:.2f}'.format(rmse_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing features importances\n", "In this exercise, you'll determine which features were the most predictive according to the random forests regressor ```rf``` that you trained in a previous exercise.\n", "\n", "For this purpose, you'll draw a horizontal barplot of the feature importance as assessed by ```rf```. Fortunately, this can be done easily thanks to plotting capabilities of ```pandas```." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Create a pd.Series of features importances\n", "importances = pd.Series(data=rf.feature_importances_, index=X_train.columns)\n", "\n", "# Sort importances\n", "importances_sorted = importances.sort_values()\n", "\n", "# Draw a horizontal barplot of importances_sorted\n", "importances_sorted.plot(kind='barh', color='lightgreen')\n", "plt.title('Features Importances')\n", "plt.savefig('../images/feature_importances.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, ```hr``` and ```workingday``` are the most important features according to ```rf```. The importances of these two features add up to more than 90%!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }