{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bagging and Random Forests\n", "> A summary of the lecture \"Machine Learning with Tree-Based Models in Python\", via DataCamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/feature_importances.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bagging\n", "- Ensemble Methods\n", " - Voting Classifier\n", " - Same training set\n", " - $\\neq$ algorithms\n", " - Bagging\n", " - One algorithm\n", " - $\\neq$ subsets of the training set\n", "- Bagging\n", " - Bootstrap Aggregation\n", " - Uses a technique known as the bootstrap\n", " - Reduces variance of individual models in the ensemble\n", "- Bootstrap\n", "![bootstrap](image/bootstrap.png)\n", "- Bootstrap-training\n", "![training](image/bs_training.png)\n", "- Bootstrap-predict\n", "![predict](image/bs_predict.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define the bagging classifier\n", "In the following exercises you'll work with the [Indian Liver Patient dataset](https://www.kaggle.com/uciml/indian-liver-patient-records) from the UCI machine learning repository. Your task is to predict whether a patient suffers from liver disease using 10 features, including albumin, age and gender. You'll do so using a Bagging Classifier.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Age_std Total_Bilirubin_std Direct_Bilirubin_std \\\n", "0 1.247403 -0.420320 -0.495414 \n", "1 1.062306 1.218936 1.423518 \n", "2 1.062306 0.640375 0.926017 \n", "3 0.815511 -0.372106 -0.388807 \n", "4 1.679294 0.093956 0.179766 \n", "\n", " Alkaline_Phosphotase_std Alamine_Aminotransferase_std \\\n", "0 -0.428870 -0.355832 \n", "1 1.675083 -0.093573 \n", "2 0.816243 -0.115428 \n", "3 -0.449416 -0.366760 \n", "4 -0.395996 -0.295731 \n", "\n", " Aspartate_Aminotransferase_std Total_Protiens_std Albumin_std \\\n", "0 -0.319111 0.293722 0.203446 \n", "1 -0.035962 0.939655 0.077462 \n", "2 -0.146459 0.478274 0.203446 \n", "3 -0.312205 0.293722 0.329431 \n", "4 -0.177537 0.755102 -0.930414 \n", "\n", " Albumin_and_Globulin_Ratio_std Is_male_std Liver_disease \n", "0 -0.147390 0 1 \n", "1 -0.648461 1 1 \n", "2 -0.178707 1 1 \n", "3 0.165780 1 1 \n", "4 -1.713237 1 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "indian = pd.read_csv('./dataset/indian_liver_patient_preprocessed.csv', index_col=0)\n", "indian.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = indian.drop('Liver_disease', axis='columns')\n", "y = indian['Liver_disease']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import BaggingClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(random_state=1)\n", "\n", "# Instantiate bc\n", "bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate Bagging performance\n", "Now that you instantiated the bagging classifier, it's time to train it and evaluate its test set accuracy.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import 
train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy of bc: 0.71\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Fit bc to the training set\n", "bc.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = bc.predict(X_test)\n", "\n", "# Evaluate acc_test\n", "acc_test = accuracy_score(y_test, y_pred)\n", "print('Test set accuracy of bc: {:.2f}'.format(acc_test))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy of dt: 0.63\n" ] } ], "source": [ "dt.fit(X_train, y_train)\n", "\n", "y_pred_dt = dt.predict(X_test)\n", "\n", "acc_test_dt = accuracy_score(y_test, y_pred_dt)\n", "print('Test set accuracy of dt: {:.2f}'.format(acc_test_dt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out of Bag Evaluation\n", "- Bagging\n", " - Some instances may be sampled several times for one model, other instances may not be sampled at all.\n", "- Out Of Bag (OOB) instances\n", " - On average, for each model, 63% of the training instances are sampled\n", " - The remaining 37% constitute the OOB instances\n", "- OOB Evaluation\n", "![oob](image/oob.png)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare the ground\n", "In the following exercises, you'll compare the OOB accuracy to the test set accuracy of a bagging classifier trained on the Indian Liver Patient dataset.\n", "\n", "In sklearn, you can evaluate the OOB accuracy of an ensemble classifier by setting the parameter ```oob_score``` to ```True``` during instantiation. 
After training the classifier, the OOB accuracy can be obtained by accessing the ```.oob_score_``` attribute of the fitted instance.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import BaggingClassifier\n", "\n", "# Instantiate dt\n", "dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)\n", "\n", "# Instantiate bc\n", "bc = BaggingClassifier(base_estimator=dt, n_estimators=50, oob_score=True, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OOB Score vs Test Set Score\n", "Now that you've instantiated ```bc```, you will fit it to the training set and evaluate its test set and OOB accuracies.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy: 0.698, OOB accuracy: 0.700\n" ] } ], "source": [ "# Fit bc to the training set\n", "bc.fit(X_train, y_train)\n", "\n", "# Predict test set labels\n", "y_pred = bc.predict(X_test)\n", "\n", "# Evaluate test set accuracy\n", "acc_test = accuracy_score(y_test, y_pred)\n", "\n", "# Evaluate OOB accuracy\n", "acc_oob = bc.oob_score_\n", "\n", "# Print acc_test and acc_oob\n", "print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests (RF)\n", "- Bagging\n", " - Base estimator: Decision Tree, Logistic Regression, Neural Network, ...\n", " - Each estimator is trained on a distinct bootstrap sample of the training set\n", " - Estimators use all features for training and prediction\n", "- Further Diversity with Random Forest\n", " - Base estimator: Decision Tree\n", " - Each estimator is trained on a different bootstrap sample having the same size as the training set\n", " - RF introduces further randomization in the training of individual trees\n", 
" - $d$ features are sampled at each node without replacement\n", " $$ d < \\text{total number of features} $$\n", "- Random Forest: Training\n", "![rf_training](image/rf_training.png)\n", "- Random Forest: Prediction\n", "![rf_predict](image/rf_prediction.png)\n", "- Feature importance\n", " - Tree-based methods enable measuring the importance of each feature in prediction\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train an RF regressor\n", "In the following exercises you'll predict bike rental demand in the Capital Bikeshare program in Washington, D.C., using historical weather data from the [Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand) dataset available through Kaggle. For this purpose, you will be using the random forests algorithm. As a first step, you'll define a random forests regressor and fit it to the training set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Preprocess" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " hr holiday workingday temp hum windspeed cnt instant mnth yr \\\n", "0 0 0 0 0.76 0.66 0.0000 149 13004 7 1 \n", "1 1 0 0 0.74 0.70 0.1343 93 13005 7 1 \n", "2 2 0 0 0.72 0.74 0.0896 90 13006 7 1 \n", "3 3 0 0 0.72 0.84 0.1343 33 13007 7 1 \n", "4 4 0 0 0.70 0.79 0.1940 4 13008 7 1 \n", "\n", " Clear to partly cloudy Light Precipitation Misty \n", "0 1 0 0 \n", "1 1 0 0 \n", "2 1 0 0 \n", "3 1 0 0 \n", "4 1 0 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bike = pd.read_csv('./dataset/bikes.csv')\n", "bike.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "X = bike.drop('cnt', axis='columns')\n", "y = bike['cnt']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=25, n_jobs=None, oob_score=False,\n", " random_state=2, verbose=0, warm_start=False)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Instantiate rf\n", "rf = RandomForestRegressor(n_estimators=25, random_state=2)\n", "\n", "# Fit rf to the training set\n", "rf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the RF regressor\n", "You'll now evaluate the test set RMSE of the random forests regressor ```rf``` that you trained in the previous exercise." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set RMSE of rf: 54.49\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error as MSE\n", "\n", "# Predict the test set labels\n", "y_pred = rf.predict(X_test)\n", "\n", "# Evaluate the test set RMSE\n", "rmse_test = MSE(y_test, y_pred) ** 0.5\n", "\n", "# Print rmse_test\n", "print('Test set RMSE of rf: {:.2f}'.format(rmse_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing feature importances\n", "In this exercise, you'll determine which features were the most predictive according to the random forests regressor ```rf``` that you trained in a previous exercise.\n", "\n", "For this purpose, you'll draw a horizontal barplot of the feature importances as assessed by ```rf```. Fortunately, this can be done easily thanks to the plotting capabilities of ```pandas```." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAdIAAAEICAYAAADrxXV/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deZwdVZ338c+XwAgYIEgigiCNiDKAGOSCGyAoKuICKj6CK4jgNoLO44KOj+I2wjAjEwRHA4MIoqAwCu6oEPbFDmQhiIIkDAiSBAEJAkLyff6o03K5ud19k+q+tzv9fb9e/epzT5069avqJL+cU9V1ZJuIiIhYPWv1OoCIiIjxLIk0IiKihiTSiIiIGpJIIyIiakgijYiIqCGJNCIiooYk0oiIiBqSSCN6SNIiSQ9JWtb0tXnNPveSdMdIxVjHGIulT5Ilrd3rWGLNkkQa0Xuvsz256evOXgazJiaaNfGcYuxIIo0YoyS9UNKVku6TNFfSXk3bDpX0W0kPSLpV0ntL/ZOBnwGbN49wJZ0u6YtN+z9hpFhGxp+QNA94UNLaZb/zJC2RtFDSkU3td5PUL+kvku6W9JUOz2mWpC+W81om6UeSNpF0VunrN5L6mtpb0pHlHJdKOl7SWmXbWpI+Lek2SYslnSFpo7JtYPR5mKT/BS4CLi3d3leO/SJJ20i6SNI9pf+zJE1puS4flTRP0v2SzpG0btP2/SXNKbH/QdK+pX4jSf8t6S5JfyznPKlse5akS0p/SyWd08m1i7EriTRiDJL0dOAnwBeBpwAfBc6TNK00WQy8FtgQOBQ4QdLzbT8IvBq4czVGuAcDrwGmACuAHwFzgacDLwc+LOlVpe0MYIbtDYFtgO+twukdBLyj9LsNcBXwzXKevwU+29L+DUADeD6wP/DuUn9I+dobeCYwGTipZd+XAv8IvArYs9RNKdflKkDAl4HNS7stgWNa+vg/wL7A1sBO5ZhI2g04A/gY1TXbE1hU9vkW8BjwLGBn4JXAe8q2LwAXAhsDWwBfbXeRYvxIIo3ovR+WUed9kn5Y6t4O/NT2T22vsP1LoB/YD8D2T2z/wZVLqP5h3qNmHCfavt32Q8CuwDTbn7f9N9u3AqdQJUGAR4FnSZpqe5ntq1fhON8ssd9PNXr+g+1f2X4M+D5V4ml2nO0/2/5f4D+pEj7A24Cv2L7V9jLgk8BBLdO4x9h+sJzTSmzfYvuXth+xvQT4ClXybb0ud9r+M9V/LqaX+sOA08r+K2z/0fZNkjal+s/Mh8uxFwMntFy7rYDNbT9s+/LOL12MRUmkEb13gO0p5euAUrcV8OamBHsfsDuwGYCkV0u6WtKfy7b9gKk147i9qbwV1fRw8/E/BWxath8GPBu4qUzHvnYVjnN3U/mhNp8nDxHXbVSjR8r321q2rd0UY+u+K5H0VElnl+nXvwDfZuXr+Kem8l+b4tsS+EObbrcC1gHuarp23wCeWrZ/nGokfK2kBZLe3aaPGEdyAz5ibLodONP24a0bJD0JOA94J3C+7UfLSFalSbslnR4E1m/6/LQ2bZr3ux1YaHvbdsHZvhk4uNyvfCNwrqRNytTySNsSWFDKzwAGpqrvpEpaNG17jCoxbzEQanPYbfr+cqnfyfY9kg5g5enhwdxONTXdrv4RYGoZZT+B7T8BhwNI2h34laRLbd/S4XFjjMmINGJs+jbwOkmvkjRJ0rrlAaEtgH8AngQsAR6T9Gqqe3AD7gY2GXjwppgD7CfpKZKeBnx4mONfC/ylPIC0XolhR0m7Akh6u6RptlcA95V9ltc+6/Y+JmljSVsCRwEDD+d8F/iIpK0lTQb+FTinXfIqllDd+31mU90GwDKqB5CeTnW/s1P/DRwq6eXlwaenS9rO9l1UU+3/IWnDsm0bSS8FkPTm8nMEuJcqkY/WtYsuSCKNGINs3071YM2nqBLA7VT/yK9l+wHgSKoHfO4F3gpc0LTvTVRJ5tY
ytbg5cCbVg0OLqP6RH/JJUdvLgddR3Q9cCCwFTgUGkvO+wAJJy6gePDrI9sO1T7y984HZVP8Z+AlVAgM4jeq8Li0xPgx8aLBObP8V+BJwRbkuLwQ+R/UQ0/2l7//pNCjb11Ie9Cr7X8LjI+R3Uv2H50aqn9G5lGl5qvvP15RrdwFwlO2FnR43xh5lYe+IGKskGdg2054xlmVEGhERUUMSaURERA2Z2o2IiKghI9KIiIga8nukE8TUqVPd19fX6zAiIsaV2bNnL7U9bag2SaQTRF9fH/39/b0OIyJiXJF023BtMrUbERFRQxJpREREDUmkERERNeQe6QSxePliZtw7o9dhRER01VEbHzXqx8iIdByQ1Cfphl7HERERK0siXUO0LGYcERFdkkQ6fkySdEpZCPjCsrTVLEn/KukSquWlIiKiy5JIx49tgZNt70C1/uObSv0U2y+1/R+tO0g6QlK/pP5lS5d1M9aIiAkjiXT8WGh7TinPBvpKedB1JW3PtN2w3Zg8dfJoxxcRMSElkY4fjzSVl/P4E9cP9iCWiIgokkgjIiJqSCKNiIioIeuRThCNRsN5aX1ExKqRNNt2Y6g2GZFGRETUkEQaERFRQxJpREREDUmkERERNSSRRkRE1JBEGhERUUMSaURERA1JpBERETVkDcsJYvHyxcy4d0avw+iKozbOinIR0T0Zka4CSYskTW1Tf+VoHyMiIsamJNIOSZo02DbbL+5mLBERMXZMiEQq6eOSjizlEyRdVMovl/RtSQdLmi/pBknHNe23TNLnJV0DvKipfj1JP5d0+EC78n0vSbMknSvpJklnSVLZtl+pu1zSiZJ+XOo3kXShpOslfQNQ03F+KGm2pAWSjih1h0k6oanN4ZK+MnpXLyIihjIhEilwKbBHKTeAyZLWAXYHbgaOA14GTAd2lXRAaftk4AbbL7B9eambDPwI+I7tU9oca2fgw8D2wDOBl0haF/gG8GrbuwPTmtp/Frjc9s7ABcAzmra92/YuJeYjJW0CnA28vsQPcCjwzVW+IhERMSImSiKdDewiaQOqBbKvokpOewD3AbNsL7H9GHAWsGfZbzlwXktf5wPftH3GIMe61vYdtlcAc4A+YDvgVtsLS5vvNrXfE/g2gO2fAPc2bTtS0lzgamBLYFvbDwIXAa+VtB2wju357QKRdISkfkn9y5YuG+zaREREDRMikdp+FFhENXq7ErgM2BvYBvjfIXZ92PbylrorgFcPTNm28UhTeTnVk9GDtf17iK0VkvYC9gFeZPt5wPXAumXzqcAhDDMatT3TdsN2Y/LUycOEEBERq2NCJNLiUuCj5ftlwPuoRoxXAy+VNLU8UHQwcMkQ/XwGuAf42ioc+ybgmZL6yue3tMT1NgBJrwY2LvUbAffa/msZeb5wYAfb11CNUN/KE0e3ERHRZRMpkV4GbAZcZftu4GHgMtt3AZ8ELgbmAtfZPn+Yvj4MrCvp3zo5sO2HgA8AP5d0OXA3cH/Z/DlgT0nXAa/k8RHyz4G1Jc0DvkCV8Jt9D7jC9r1ERETPyF5pVjFGgaTJtpeVKeGTgZttnzDcfkP092PgBNu/7qR9o9Fwf3//6h4uImJCkjTbdmOoNhNpRNprh0uaAyygmrb9xup0ImmKpN8DD3WaRCMiYvTkFYFdUkafqz0CbernPuDZ9SOKiIiRkBFpREREDUmkERERNSSRRkRE1JBEGhERUUMSaURERA1JpBERETXk118miMXLFzPj3hlDtjlq46O6FE1ExJojI9IekNQn6YZexxEREfUlkUZERNSQRNo7kySdImmBpAslrSdplqQGQFmNZlEpHyLph5J+JGmhpH+S9M+Srpd0taSn9PRMIiImsCTS3tkWONn2DlSLi79pmPY7Ui2bthvwJeCvtnemWqT8naMZaEREDC6JtHcW2p5TyrOBvmHaX2z7AdtLqJZg+1Gpnz/YvpKOkNQvqX/Z0mUjEHJERLRKIu2dR5rKy6meoH6Mx38
m6w7RfkXT5xUM8vS17Zm2G7Ybk6dOrh9xRESsJIl0bFkE7FLKB/YwjoiI6FAS6djy78D7JV0JTO11MBERMTzZ7nUM0QWNRsP9/f29DiMiYlyRNNt2Y6g2GZFGRETUkEQaERFRQxJpREREDUmkERERNSSRRkRE1JBEGhERUUMSaURERA1JpBERETW0fUdrrHkWL1/MjHtnDNnmqI2P6lI0ERFrjoxIIyIiakgi7QJJUyR9oNdxRETEyEsi7Y4pQBJpRMQaKIm0O44FtpE0R9Lxkj4m6TeS5kn6HICkPkk3STpV0g2SzpK0j6QrJN0sabfS7hhJZ0q6qNQf3tMzi4iY4JJIu+No4A+2pwO/BLYFdgOmA7tI2rO0exYwA9gJ2A54K7A78FHgU0397QS8BngR8BlJm7c7qKQjJPVL6l+2dNnIn1VERCSR9sAry9f1wHVUCXPbsm2h7fm2VwALgF+7WuduPtDX1Mf5th+yvRS4mCopr8T2TNsN243JUyePztlERExw+fWX7hPwZdvfeEKl1Ac80lS1ounzCp74s2pdRDaLykZE9EhGpN3xALBBKf8CeLekyQCSni7pqavY3/6S1pW0CbAX8JsRizQiIlZJRqRdYPue8tDQDcDPgO8AV0kCWAa8HVi+Cl1eC/wEeAbwBdt3DrfDUyc9NS9ciIgYBUmkXWL7rS1V7V4ztGNT+0OayouatwG/t33ESMYXERGrJ1O7ERERNWREOs7YPqbXMURExOMyIo2IiKghiTQiIqKGJNKIiIgakkgjIiJqSCKNiIioIU/tThCLly9mxr3tfnX1cXlhQ0TEqsuINCIiooYk0hEg6crV3O8ASdvXOG6fpNY3JkVERBclkY4A2y9ezV0PAFY7kVItrZZEGhHRQ0mkI0DSsvJ9L0mzJJ0r6SZJZ6m8mV7SsZJulDRP0r9LejHweuB4SXMkbSPpcEm/kTRX0nmS1i/7ni7pRElXSrpV0oHl0McCe5T9P9KLc4+ImOjysNHI2xnYAbgTuAJ4iaQbgTcA29m2pCm275N0AfBj2+cCSLrP9iml/EXgMOCrpd/NgN2pFgK/ADgXOBr4qO3XtgtE0hHAEQAbb7HxqJxsRMRElxHpyLvW9h22VwBzqKZf/wI8DJwq6Y3AXwfZd0dJl0maD7yNKiEP+KHtFbZvBDbtJBDbM203bDcmT528uucTERFDSCIdeY80lZcDa9t+DNgNOI/qvujPB9n3dOCfbD8X+Byw7iD9asSijYiIWjK12wWSJgPr2/6ppKuBW8qmB4ANmppuANwlaR2qEekfh+m6df+IiOiyJNLu2AA4X9K6VKPJgQeDzgZOkXQkcCDw/4BrgNuA+QyfJOcBj0maC5xu+4TBGj510lPzwoWIiFEg272OIbqg0Wi4v7+/12FERIwrkmbbbgzVJvdIIyIiakgijYiIqCGJNCIiooYk0oiIiBqSSCMiImpIIo2IiKghiTQiIqKGJNIJYvHyxcy4dwYz7p3R61AiItYoSaQRERE1JJG2IemnkqasQvs+STeMZkxDHHtZL44bERGVvGu3Ddv79TqGiIgYHybkiFTSx8uL4pF0gqSLSvnlkr4taZGkqWWk+VtJp0haIOlCSeuVtrtImivpKuCDTX3vIOlaSXMkzZO0bennJknfKnXnSlq/qZ9LJM2W9AtJm5X6bST9vNRfJmm7Ur+1pKsk/UbSF7p86SIiosWETKTApcAepdwAJpely3YHLmtpuy1wsu0dgPuAN5X6bwJH2n5RS/v3ATNsTy9931HqnwPMtL0T1ULfHyjH/CpwoO1dgNOAL5X2M4EPlfqPAl8r9TOA/7K9K/CnoU5S0hGS+iX1L1uaGeCIiNEwURPpbGAXSRtQLZh9FVXS24OVE+lC23Oa9uuTtBEwxfYlpf7MpvZXAZ+S9AlgK9sPlfrbbV9Ryt+mStrPAXYEfilpDvBpYIuyfumLge+X+m8Am5V9XwJ8t81xV2J7pu2G7cbkqZOHuSQ
REbE6JuQ9UtuPSloEHApcSbWu597ANsBvW5o/0lReDqxHtaZo2/XnbH9H0jXAa4BfSHoPcGub9i79LGgd1UraELivjGrbHmbIE4yIiK6ZqCNSqKZ3P1q+X0Y1JTvHHSzQavs+4H5Ju5eqtw1sk/RM4FbbJwIXADuVTc+QNJAwDwYuB34HTBuol7SOpB1s/wVYKOnNpV6Snlf2vQI4qPW4ERHRGxM5kV5GNV16le27gYdZeVp3KIcCJ5eHjR5qqn8LcEOZkt0OOKPU/xZ4l6R5wFOo7nP+DTgQOE7SXGAO1ZQuVEnysFK/ANi/1B8FfFDSb4CNVuWEIyJi5KmDAVjUJKkP+LHtHXsVQ6PRcH9/f68OHxExLkmabbsxVJuJPCKNiIiobUI+bNRtthdRPZ0bERFrmIxIIyIiakgijYiIqCGJNCIiooYk0oiIiBqSSCMiImpIIo2IiKghiXSCWLx8ca9DiIhYI3UlkUpaaQ0vSe+T9M5h9jtE0kmDbPvUEPstkjS/rBd6oaSnrXrUK/W5uaRzO2h3ZfneJ+mtHbR/QjtJDUkn1os2IiK6pWcjUttft33G8C0HNWgiLfa2/Tygv11bSZNW5WC277R9YAftBt6V2wcMm0hb29nut33kqsQWERG907NEKukYSR8t5V0lzZN0laTjJd3Q1HRzST+XdLOkfyvtjwXWkzRH0lnDHOpS4Fllv2WSPl+WOXuRpF0kXSJptqRfSNqstHuWpF+VEe11krYpI8cbyvZDJJ1f4vqdpM82ndfA6PtYYI8S40fK/peV/q6T9OJB2u0l6celr6dI+mG5NldL2qnp2p0maZakWyUl8UZE9MhYuUf6TeB9ZV3O5S3bplOtqPJc4C2StrR9NPCQ7em2h1tK7LXA/FJ+MnCD7RcA1wBfBQ60vQtwGvCl0u4s4OQyon0xcFebfnejWqFlOvBmSa0vNT4auKzEeAKwGHiF7eeX8zlxkHbNPgdcb3snqlF18wh+O+BVJY7PSlqnNUBJR0jql9S/bOlKs+sRETECev6uXUlTgA1sX1mqvkOV/Ab82vb9pe2NwFbA7R10fbGk5VSLdn+61C0Hzivl51C9//aXkgAmAXdJ2gB4uu0fANh+uBy7tf9f2r6nbPsfYHeqaeTBrAOcJGl6iePZHZzD7sCbShwXSdpE0sDSaT+x/QjwiKTFwKbAHc07254JzAR4xs7PyDI/ERGjoOeJFFgpQ7V4pKm8nM5j3tv20pa6h20PjHgFLCij4MeDkTbssP/WxDRcovoIcDfwPKqZgIc7OEa7azNwnNW9LhERMYJ6PrVr+17gAUkvLFUHdbjro+2mM1fB74Bpkl4EIGkdSTvY/gtwh6QDSv2TJK3fZv9XlHuY6wEHAFe0bH8A2KDp80bAXbZXAO+gGgG3a9fsUqrpYyTtBSwt8UVExBjRrUS6vqQ7mr7+uWX7YcBMSVdRjcLu76DPmcC8Dh42asv234ADgeMkzQXmUN0PhSrRHSlpHnAl0O7XZy4Hziz7nWe7dVp3HvBYeWDpI8DXgHdJuppqWvfBQdo1OwZolDiOBd61OucaERGjR3bvb51Jmmx7WSkfDWxm+6gehzUoSYcADdv/1OtYOtVoNNzfP9Qt3IiIaCVptu3Wh0mfYKzcV3uNpE9SxXMbcEhvw4mIiOjMmEikts8Bzul1HJ2yfTpweo/DiIiIMaDnDxtFRESMZ0mkERERNSSRRkRE1JBEGhERUUMSaURERA1JpBERETUkkU4Qi5cv7nUIERFrpCTSGprXKO2w/emSDizlUyVt36bNIZJOGsk4IyJi9IyJFzJMRLbf0+sYIiKivoxI65sk6RRJCyRdKGk9SdMlXS1pnqQfSNq4dSdJswYWA5d0qKTfS7oEeElTm9dJukbS9ZJ+JWlTSWtJulnStNJmLUm3SJratTOOiIi/SyKtb1vgZNs7APdRLcR9BvAJ2zsB84HPDrazpM2Az1El0FcAzdO9lwMvtL0zcDbw8bI
M27cpy6sB+wBz26y9iqQjJPVL6l+2dFnN04yIiHaSSOtbaHtOKc8GtgGm2L6k1H0L2HOI/V8AzLK9pCzt1vzO4S2AX0iaD3wM2KHUnwa8s5TfDXyzXce2Z9pu2G5Mnjp5Vc8rIiI6kERa3yNN5eXAlNXoY7C17L4KnGT7ucB7gXUBbN8O3C3pZVSJ+GerccyIiBgBSaQj737gXkl7lM/vAC4Zov01wF6SNpG0DvDmpm0bAX8s5dZFvU+lmuL9nu3l9cOOiIjVkad2R8e7gK9LWh+4FTh0sIa275J0DHAVcBdwHTCpbD4G+L6kPwJXA1s37XoB1ZRu22ndiIjoDtmDzSrGWFae+D3B9h7DNgYajYb7+/tHOaqIiDWLpNm2G0O1yYh0HJJ0NPB+Hn9yNyIieiT3SMch28fa3sr25b2OJSJioksijYiIqCGJNCIiooYk0oiIiBqSSCMiImpIIo2IiKghiTQiIqKGJNKIiIgakki7QJIlndn0eW1JSyT9uHx+fXnJwmD7T5e0XzdijYiIVZNE2h0PAjtKWq98fgWPv4we2xfYPnaI/acDSaQREWNQEmn3/Ax4TSkfDHx3YIOkQySdVMpvlnSDpLmSLpX0D8DngbdImiPpLZJuljSttF9L0i2Spnb5fCIigiTSbjobOEjSusBOVMuntfMZ4FW2nwe8viz2/RngHNvTbZ9DtXzawHt29wHm2l7a2pGkIyT1S+pfsmTJSJ9PRESQRNo1tucBfVSj0Z8O0fQK4HRJh/P4cmqtTgPeWcrvZpCl1GzPtN2w3Zg2bdpqxR0REUNLIu2uC4B/p2lat5Xt9wGfBrYE5kjapE2b24G7Jb0MeAHVtHFERPRAllHrrtOA+23Pl7RXuwaStrF9DXCNpNdRJdQHgA1amp5KNcV7pu3loxhzREQMISPSLrJ9h+0ZwzQ7XtJ8STcAlwJzgYuB7QceNirtLgAmM8i0bkREdEdGpF1ge3KbulnArFI+HTi9lN/Ypos/A7u21D2P6iGjm0Yu0oiIWFVJpONQeXnD+3n8yd2IiOiRTO2OQ7aPtb2V7ct7HUtExESXRBoREVFDEmlEREQNSaQRERE1JJFGRETUkEQaERFRQxJpREREDUmkERERNQybSCU9TdLZkv4g6UZJP5X0bEl95TV2I07ShyWtPxp9D3HM6ZL2a/r89zVCa/a7rG4fpZ+9JP14JPqKiIiRM2QilSTgB8As29vY3h74FLDpSAWgSmscHwa6lkglrQ1MB/Ybrm1ERESz4UakewOP2v76QIXtObYva24kaZKk4yX9RtI8Se8t9ZMl/VrSdeVF7PuX+j5Jv5X0NeA6qhVOBvo6EtgcuFjSxaXu4IEXuUs6rl2gkhZJOk7SteXrWaX+dZKukXS9pF9J2rTUHyNppqQLgTOAzwNvaXkxPJI2kLRQ0jrl84blWOu0HH9TST+QNLd8vbhlu8o1uqGcy1tK/RNGmpJOknRIKe8r6SZJlwNvLHVrSbpZ0rSmz7dImjrUDzIiIkbHcIl0R2B2B/0cRrU82K5UL1c/XNLWwMPAG2w/nyop/0cZ5QI8BzjD9s62bxvoyPaJwJ3A3rb3lrQ5cBzwMqpR466SDhgkjr/Y3g04CfjPUnc58ELbOwNnAx9var8LsL/ttwKfAc6xPd32OU3xPED1cvnXlKqDgPNsP9py7BOBS2w/D3g+sKBl+xtL/M8D9qFa5WWzQc4DSesCpwCvA/YAnlbiWUG1fNrAe3b3oXp5/dLB+oqIiNEzUg8bvRJ4p6Q5wDXAJsC2gIB/lTQP+BXwdB6fFr7N9tUd9L0r1dTyEtuPAWcBew7S9rtN319UylsAv5A0H/gYsENT+wtsP9RBDKcCh5byobRfuuxlwH8B2F5u+/6W7bsD3y3b7gYuYeUVXZptByy0fbNtUyXPAacB7yzldw8SD5KOkNQvqX/JkiVDHCoiIlbXcIl0AdWobTgCPlRGc9Ntb237QqpR0zRgF9vTgbuBdcs
+D3YYo4Zv8nduU/4qcJLt5wLvbTp+xzHYvgLok/RSYJLt1XnIarDzeIwn/hya4zNt2L4duFvSy4AXAD8bpN1M2w3bjWnTpq1GyBERMZzhEulFwJMkHT5QIWnXklCa/QJ4f9N9xGdLejKwEbDY9qOS9ga26jCuB4ANSvka4KWSpkqaBBxMNZpr5y1N368q5Y2AP5byuzo8ZjtnUI10B1tI+9dUS5sN3DPesGX7pVT3YCeV+5t7AtcCt1Et2v0kSRsBLy/tbwK2lrRN+XxwS3+nUo1Sv2d7+RBxR0TEKBoykZYpxTcAryi//rIAOIbqHmazU4EbgevKr8R8g2qt07OAhqR+qtFpp4tQzwR+Juli23cBnwQuBuYC19k+f5D9niTpGuAo4COl7hjg+5IuA4a6j3gxVUJ7wsNGTc4CNubx6eNWRwF7lynk2TxxChmqp5/nlXO4CPi47T+V0eX3yrazgOsBbD8MHAH8pDxsdFtLfxcAkxk8sUdERBeoypXjn6RFQGO0HrqRdCDVg0nvGI3+V5WkBnCC7T06ad9oNNzf3z/KUUVErFkkzbbdGKrN2t0KZjyT9FXg1YyR3zOVdDTVNPLbhmsbERGja41JpLb7RrHvD41W36vD9rHAsb2OIyIi8q7diIiIWpJIIyIiakgijYiIqCGJNCIiooYk0oiIiBqSSCMiImpIIo2IiKghiXSMkzRF0geaPj9h/dKIiOitJNKxbwrwgWFbRURETySRdoGkPkk3STpV0g2SzpK0j6QrJN0saTdJx0g6TdIsSbdKOrLsfiywTXmZ/vGlbrKkc0ufZzUtlh4REV22xrwicBx4FvBmqhVdfgO8lWqx79cDnwLmUC3mvTfVcm6/k/RfwNHAjmU9VyTtBexMtbrMncAVwEuAy7t4LhERUWRE2j0Lbc+3vYJqwfRfl2Xq5gN9pc1PbD9SVrBZDGw6SF/X2r6j9DWnaf8nkHSEpH5J/UuWLBnJc4mIiCKJtHseaSqvaPq8gsdnBprbLGfwGYOO2tmeabthuzFt2rRVjzgiIoaVRDr2PUA11RsREWNQEukYZ/se4IrykNLxw+4QERFdpeo2XazpGo2G+/v7ex1GRMS4Imm27cZQbfqRyIsAAAXfSURBVDIijYiIqCGJNCIiooYk0oiIiBqSSCMiImpIIo2IiKghiTQiIqKGJNKIiIgakkgjIiJqSCKNiIioIYk0IiKihiTSiIiIGpJI1wCSJvU6hoiIiWqw9S5jDJH0BWCp7Rnl85eAu4E3AHcB04HtexdhRMTElRHp+PDfwLsAJK0FHAT8EdgN+BfbbZOopCMk9UvqX7JkSdeCjYiYSJJIxwHbi4B7JO0MvBK4HrgHuNb2wiH2m2m7Ybsxbdq07gQbETHBZGp3/DgVOAR4GnBaqXuwZ9FERASQEel48gNgX2BX4Bc9jiUiIoqMSMcJ23+TdDFwn+3lknodUkREkEQ6bpSHjF4IvBnA9ixgVg9DiogIMrU7LkjaHrgF+LXtm3sdT0REPC4j0nHA9o3AM3sdR0RErCwj0oiIiBpku9cxRBdIegD4Xa/jGMZUYGmvgxjGeIgRxkeciXFkjIcYYXzE2S7GrWwP+Yv4mdqdOH5nu9HrIIYiqT8xjozxEGdiHBnjIUYYH3GuboyZ2o2IiKghiTQiIqKGJNKJY2avA+hAYhw54yHOxDgyxkOMMD7iXK0Y87BRREREDRmRRkRE1JBEGhERUUMS6RpE0r6SfifpFklHt9n+JEnnlO3XSOrrfpQdxbmnpOskPSbpwDEa4z9LulHSPEm/lrTVGIzxfZLmS5oj6fLyqsmuGy7OpnYHSrKkrv+KRAfX8hBJS8q1nCPpPWMtxtLm/5Q/lwskfWesxSjphKZr+HtJ93U7xg7jfIakiyVdX/6O7zdkh7bztQZ8AZOAP1C9SvAfgLnA9i1tPgB8vZQPAs4Zo3H2ATsBZwAHjtEY9wbWL+X3d/tadhjjhk3l1wM/H4vXsrTbALg
UuBpojLUYqdYCPqnb128VY9wWuB7YuHx+6liLsaX9h4DTxui1nAm8v5S3BxYN1WdGpGuO3YBbbN9q+2/A2cD+LW32B75VyucCL1f312MbNk7bi2zPA1Z0ObYBncR4se2/lo9XA1uMwRj/0vTxyUAvnizs5M8lwBeAfwMe7mZwRacx9lInMR4OnGz7XgDbi8dgjM0OBr7blcieqJM4DWxYyhsBdw7VYRLpmuPpwO1Nn+8odW3b2H4MuB/YpCvRtYmhaBdnr61qjIcBPxvViFbWUYySPijpD1RJ6sguxdZs2Dgl7QxsafvH3QysSac/7zeVab5zJW3ZndD+rpMYnw08W9IVkq6WtG/Xoqt0/Pem3ArZGrioC3G16iTOY4C3S7oD+CnV6HlQSaRrjnYjy9YRSCdtRttYiGE4Hcco6e1AAzh+VCNqc+g2dSvFaPtk29sAnwA+PepRrWzIOMs6uycA/7drEa2sk2v5I6DP9k7Ar3h8ZqdbOolxbarp3b2oRnunSpoyynE1W5W/2wcB59pePorxDKaTOA8GTre9BbAfcGb5s9pWEuma4w6g+X/JW7DydMTf20ham2rK4s9dia5NDEW7OHutoxgl7QP8C/B62490KbYBq3odzwYOGNWI2hsuzg2AHYFZkhZRLV5/QZcfOBr2Wtq+p+lnfAqwS5diG9Dp3+/zbT9qeyHVIhXbdim+geN3+mfyIHozrQudxXkY8D0A21cB61K90L69bt/ozdeo3UBfG7iVarpk4Ab6Di1tPsgTHzb63liMs6nt6fTmYaNOruXOVA8sbDuGf97bNpVfB/SPxThb2s+i+w8bdXItN2sqvwG4egzGuC/wrVKeSjV9uclYirG0ew6wiPJCoLH4Z5LqVs0hpfyPVIl20Hi7fhL5GtU/IPsBvy//wP9Lqfs81YgJqv9VfR+4BbgWeOYYjXNXqv81PgjcAywYgzH+CrgbmFO+LhiDMc4AFpT4Lh4qgfUyzpa2XU+kHV7LL5drObdcy+3GYIwCvgLcCMwHDhprMZbPxwDH9uLP4ipcy+2BK8rPew7wyqH6yysCIyIiasg90oiIiBqSSCMiImpIIo2IiKghiTQiIqKGJNKIiIgakkgjIiJqSCKNiIio4f8Dm1NlIVWunIUAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Create a pd.Series of features importances\n", "importances = pd.Series(data=rf.feature_importances_, index=X_train.columns)\n", "\n", "# Sort importances\n", "importances_sorted = importances.sort_values()\n", "\n", "# Draw a horizontal barplot of importances_sorted\n", "importances_sorted.plot(kind='barh', color='lightgreen')\n", "plt.title('Features Importances')\n", "plt.savefig('../images/feature_importances.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, ```hr``` and ```workingday``` are the most important features according to ```rf```. The importances of these two features add up to more than 90%!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }