{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 13 - Ensemble Methods - Random Forest\n", "\n", "\n", "by [Alejandro Correa Bahnsen](albahnsen.com/) and [Jesus Solano](https://github.com/jesugome)\n", "\n", "version 1.5, February 2019\n", "\n", "## Part of the class [Practical Machine Learning](https://github.com/albahnsen/PracticalMachineLearningClass)\n", "\n", "\n", "\n", "This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks go to [Kevin Markham](https://github.com/justmarkham)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why are we learning about ensembling?\n", "\n", "- Very popular method for improving the predictive performance of machine learning models\n", "- Provides a foundation for understanding more sophisticated models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson objectives\n", "\n", "Students will be able to:\n", "\n", "- Explain the difference between bagged trees and Random Forests\n", "- Build and tune a Random Forest model in scikit-learn\n", "- Decide whether a decision tree or a Random Forest is a better model for a given problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Building and tuning decision trees\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns \\\n", "1 315 81 7 24 38 39 14 3449 835 69 321 \n", "2 479 130 18 66 72 76 3 1624 457 63 224 \n", "3 496 141 20 65 78 37 11 5628 1575 225 828 \n", "4 321 87 10 39 42 30 2 396 101 12 48 \n", "5 594 169 4 74 51 35 11 4408 1133 19 501 \n", "\n", " CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague \n", "1 414 375 N W 632 43 10 475.0 N \n", "2 266 263 A W 880 82 14 480.0 A \n", "3 838 354 N E 200 11 3 500.0 N \n", "4 46 33 N E 805 40 4 91.5 N \n", "5 336 194 A W 282 421 25 750.0 A " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# read in the data\n", "url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/hitters.csv'\n", "hitters = pd.read_csv(url)\n", "\n", "# remove rows with missing values\n", "hitters.dropna(inplace=True)\n", "hitters.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns \\\n", "1 315 81 7 24 38 39 14 3449 835 69 321 \n", "2 479 130 18 66 72 76 3 1624 457 63 224 \n", "3 496 141 20 65 78 37 11 5628 1575 225 828 \n", "4 321 87 10 39 42 30 2 396 101 12 48 \n", "5 594 169 4 74 51 35 11 4408 1133 19 501 \n", "\n", " CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague \n", "1 414 375 0 0 632 43 10 475.0 0 \n", "2 266 263 1 0 880 82 14 480.0 1 \n", "3 838 354 0 1 200 11 3 500.0 0 \n", "4 46 33 0 1 805 40 4 91.5 0 \n", "5 336 194 1 0 282 421 25 750.0 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# encode categorical variables as integers\n", "hitters['League'] = pd.factorize(hitters.League)[0]\n", "hitters['Division'] = pd.factorize(hitters.Division)[0]\n", "hitters['NewLeague'] = pd.factorize(hitters.NewLeague)[0]\n", "hitters.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# allow plots to appear in the notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# scatter plot of Years versus Hits colored by Salary\n", "hitters.plot(kind='scatter', x='Years', y='Hits', c='Salary', colormap='jet', xlim=(0, 25), ylim=(0, 250))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'League',\n", " 'Division', 'PutOuts', 'Assists', 'Errors', 'NewLeague'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define features: exclude career statistics (which start with \"C\") and the response (Salary)\n", "feature_cols = hitters.columns[hitters.columns.str.startswith('C') == False].drop('Salary')\n", "feature_cols" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 263.000000\n", "mean 535.925882\n", "std 451.118681\n", "min 67.500000\n", "25% 190.000000\n", "50% 425.000000\n", "75% 750.000000\n", "max 2460.000000\n", "Name: Salary, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hitters.Salary.describe()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# define X and y\n", "X = hitters[feature_cols]\n", "y = (hitters.Salary > 425).astype(int)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'League',\n", " 'Division', 'PutOuts', 'Assists', 'Errors', 'NewLeague'],\n", " dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting if salary is high with a decision tree\n" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "# Review - Building a Decision Tree by hand" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "max_depth = None\n", "num_pct = 10\n", "max_features = None\n", "min_gain = 0.001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For feature 1 (`Hits`), calculate the candidate splitting points" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hits\n" ] } ], "source": [ "j = 1\n", "print(X.columns[j])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Split the variable at num_pct percentile points\n", "splits = np.percentile(X.iloc[:, j], np.arange(0, 100, 100.0 / num_pct).tolist())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Keep only unique values, which handles binary features and features with few unique values\n", "splits = np.unique(splits)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1. , 52. , 66.8, 77. , 92. , 103. , 120. , 136. , 148.6,\n", " 168. ])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the data at split index 5 (`Hits < 103`)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "k = 5" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "filter_l = X.iloc[:, j] < splits[k]\n", "\n", "y_l = y.loc[filter_l]\n", "y_r = y.loc[~filter_l]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Gini \n", "\n", "The Gini impurity of a node is the probability that a randomly chosen sample in the node would be incorrectly labeled if it were labeled at random according to the distribution of labels in the node."
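, "\n", "\n", "As a sketch, assuming a binary target whose positive-class fraction in the node is $p$ (a symbol introduced here for illustration, not used elsewhere in this notebook), the definition above reduces to\n", "\n", "$$G = 1 - \\left(p^2 + (1 - p)^2\\right) = 2p(1 - p)$$\n", "\n", "For instance, the full data set has $p \\approx 0.4905$ (the mean of `y`), which gives $G \\approx 0.4998$, close to the maximum of 0.5 reached at $p = 0.5$."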
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each node" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def gini(y):\n", "    if y.shape[0] == 0:\n", "        return 0\n", "    else:\n", "        return 1 - (y.mean()**2 + (1 - y.mean())**2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.39928079856159704" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gini_l = gini(y_l)\n", "gini_l" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.42690311418685123" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gini_r = gini(y_r)\n", "gini_r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Gini impurity of a split is the sum of the Gini impurities of the child nodes, each weighted by the fraction of the parent node's points that fall in that node. The gain of the split is the parent node's Gini impurity minus this weighted value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Putting it all in a function" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def gini_impurity(X_col, y, split):\n", "    \"Calculate the gain of a split on a feature column\"\n", " \n", "    filter_l = X_col < split\n", "    y_l = y.loc[filter_l]\n", "    y_r = y.loc[~filter_l]\n", " \n", "    n_l = y_l.shape[0]\n", "    n_r = y_r.shape[0]\n", " \n", "    gini_y = gini(y)\n", "    gini_l = gini(y_l)\n", "    gini_r = gini(y_r)\n", " \n", "    gini_impurity_ = gini_y - (n_l / (n_l + n_r) * gini_l + n_r / (n_l + n_r) * gini_r)\n", " \n", "    return gini_impurity_" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0862547016583845" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gini_impurity(X.iloc[:, j], y, splits[k])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test all splits on all features" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def best_split(X, y, num_pct=10):\n", " \n", "    features = range(X.shape[1])\n", " \n", "    best_split = [0, 0, 0] # j, split, gain\n", " \n", "    # For all features\n", "    for j in features:\n", " \n", "        splits = np.percentile(X.iloc[:, j], np.arange(0, 100, 100.0 / (num_pct+1)).tolist())\n", "        splits = np.unique(splits)[1:]\n", " \n", "        # For all splits\n", "        for split in splits:\n", "            gain = gini_impurity(X.iloc[:, j], y, split)\n", " \n", "            if gain > best_split[2]:\n", "                best_split = [j, split, gain]\n", " \n", "    return best_split" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 6.0, 0.1428365268140297)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "j, split, gain = best_split(X, y, 5)\n", "j, split, gain" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "filter_l = X.iloc[:, j] < split\n", "\n", "y_l = y.loc[filter_l]\n", "y_r = y.loc[~filter_l]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(263, 116, 147)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.shape[0], y_l.shape[0], y_r.shape[0]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.49049429657794674, 0.1896551724137931, 0.7278911564625851)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.mean(), y_l.mean(), y_r.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursively grow the tree " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def tree_grow(X, y, level=0, 
min_gain=0.001, max_depth=None, num_pct=10):\n", " \n", "    # If only one observation\n", "    if X.shape[0] == 1:\n", "        tree = dict(y_pred=y.iloc[:1].values[0], y_prob=0.5, level=level, split=-1, n_samples=1, gain=0)\n", "        return tree\n", " \n", "    # Calculate the best split\n", "    j, split, gain = best_split(X, y, num_pct)\n", " \n", "    # Save the node and estimate its prediction\n", "    y_pred = int(y.mean() >= 0.5) \n", "    y_prob = (y.sum() + 1.0) / (y.shape[0] + 2.0) # Laplace correction\n", " \n", "    tree = dict(y_pred=y_pred, y_prob=y_prob, level=level, split=-1, n_samples=X.shape[0], gain=gain)\n", " \n", "    # Check stopping criteria\n", "    if gain < min_gain:\n", "        return tree\n", "    if max_depth is not None:\n", "        if level >= max_depth:\n", "            return tree \n", " \n", "    # No stopping criterion was met, so continue and create the partition\n", "    filter_l = X.iloc[:, j] < split\n", "    X_l, y_l = X.loc[filter_l], y.loc[filter_l]\n", "    X_r, y_r = X.loc[~filter_l], y.loc[~filter_l]\n", "    tree['split'] = [j, split]\n", "\n", "    # Recurse into each side of the split\n", " \n", "    tree['sl'] = tree_grow(X_l, y_l, level + 1, min_gain=min_gain, max_depth=max_depth, num_pct=num_pct)\n", "    tree['sr'] = tree_grow(X_r, y_r, level + 1, min_gain=min_gain, max_depth=max_depth, num_pct=num_pct)\n", " \n", "    return tree" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'y_pred': 0,\n", " 'y_prob': 0.49056603773584906,\n", " 'level': 0,\n", " 'split': [6, 5.0],\n", " 'n_samples': 263,\n", " 'gain': 0.15865574114903452,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.10869565217391304,\n", " 'level': 1,\n", " 'split': -1,\n", " 'n_samples': 90,\n", " 'gain': 0.01935558112773289},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.6914285714285714,\n", " 'level': 1,\n", " 'split': -1,\n", " 'n_samples': 173,\n", " 'gain': 0.1127122881295256}}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_grow(X, y, level=0, 
min_gain=0.001, max_depth=1, num_pct=10)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": false }, "outputs": [], "source": [ "tree = tree_grow(X, y, level=0, min_gain=0.001, max_depth=3, num_pct=10)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'y_pred': 0,\n", " 'y_prob': 0.49056603773584906,\n", " 'level': 0,\n", " 'split': [6, 5.0],\n", " 'n_samples': 263,\n", " 'gain': 0.15865574114903452,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.10869565217391304,\n", " 'level': 1,\n", " 'split': [5, 65.0],\n", " 'n_samples': 90,\n", " 'gain': 0.01935558112773289,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.07407407407407407,\n", " 'level': 2,\n", " 'split': [0, 185.0],\n", " 'n_samples': 79,\n", " 'gain': 0.009619566461418955,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.3333333333333333,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 7,\n", " 'gain': 0.40816326530612246},\n", " 'sr': {'y_pred': 0,\n", " 'y_prob': 0.05405405405405406,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 72,\n", " 'gain': 0.009027777777777565}},\n", " 'sr': {'y_pred': 0,\n", " 'y_prob': 0.38461538461538464,\n", " 'level': 2,\n", " 'split': [0, 470.90909090909093],\n", " 'n_samples': 11,\n", " 'gain': 0.2203856749311295,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.14285714285714285,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 5,\n", " 'gain': 0},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.625,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 6,\n", " 'gain': 0.4444444444444444}}},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.6914285714285714,\n", " 'level': 1,\n", " 'split': [1, 103.0],\n", " 'n_samples': 173,\n", " 'gain': 0.1127122881295256,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 0.43037974683544306,\n", " 'level': 2,\n", " 'split': [5, 22.0],\n", " 'n_samples': 77,\n", " 'gain': 0.07695385846646363,\n", " 'sl': {'y_pred': 0,\n", " 'y_prob': 
0.17857142857142858,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 26,\n", " 'gain': 0.06860475087899842},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.5660377358490566,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 51,\n", " 'gain': 0.09501691508611931}},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.8979591836734694,\n", " 'level': 2,\n", " 'split': [2, 6.0],\n", " 'n_samples': 96,\n", " 'gain': 0.01107413837448551,\n", " 'sl': {'y_pred': 1,\n", " 'y_prob': 0.7058823529411765,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 15,\n", " 'gain': 0.16547008547008554},\n", " 'sr': {'y_pred': 1,\n", " 'y_prob': 0.927710843373494,\n", " 'level': 3,\n", " 'split': -1,\n", " 'n_samples': 81,\n", " 'gain': 0.006994315787586275}}}}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def tree_predict(X, tree, proba=False):\n", " \n", "    predicted = np.ones(X.shape[0])\n", "\n", "    # Check if final node\n", "    if tree['split'] == -1:\n", "        if not proba:\n", "            predicted = predicted * tree['y_pred']\n", "        else:\n", "            predicted = predicted * tree['y_prob']\n", " \n", "    else:\n", " \n", "        j, split = tree['split']\n", "        filter_l = (X.iloc[:, j] < split)\n", "        X_l = X.loc[filter_l]\n", "        X_r = X.loc[~filter_l]\n", "\n", "        if X_l.shape[0] == 0: # If left node is empty only continue with right\n", "            predicted[~filter_l] = tree_predict(X_r, tree['sr'], proba)\n", "        elif X_r.shape[0] == 0: # If right node is empty only continue with left\n", "            predicted[filter_l] = tree_predict(X_l, tree['sl'], proba)\n", "        else:\n", "            predicted[filter_l] = tree_predict(X_l, tree['sl'], proba)\n", "            predicted[~filter_l] = tree_predict(X_r, tree['sr'], proba)\n", "\n", "    return predicted " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "array([1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 1., 1., 0., 0.,\n", " 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1.,\n", " 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,\n", " 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.,\n", " 0., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0.,\n", " 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,\n", " 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0.,\n", " 0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 1.,\n", " 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1.,\n", " 0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1.,\n", " 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0.,\n", " 1., 1., 0., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 0.,\n", " 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0.,\n", " 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 1., 1.,\n", " 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1.,\n", " 1., 0., 0., 1., 1., 1., 1., 1.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_predict(X, tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using sklearn" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# list of values to try for max_depth\n", "max_depth_range = range(1, 21)\n", "\n", "# list to store the average accuracy for each value of max_depth\n", "accuracy_scores = []\n", "\n", "# use 10-fold cross-validation with each value of max_depth\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "for depth in max_depth_range:\n", "    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)\n", "    accuracy_scores.append(cross_val_score(clf, X, y, cv=10, 
scoring='accuracy').mean())\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Accuracy')" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot max_depth (x-axis) versus accuracy (y-axis)\n", "plt.plot(max_depth_range, accuracy_scores)\n", "plt.xlabel('max_depth')\n", "plt.ylabel('Accuracy')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.8205754985754986, 4)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the best accuracy and the corresponding max_depth\n", "sorted(zip(accuracy_scores, max_depth_range))[::-1][0]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=1,\n", " splitter='best')" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# max_depth=4 was best, so fit a tree using that parameter\n", "clf = DecisionTreeClassifier(max_depth=4, random_state=1)\n", "clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " feature importance\n", "0 AtBat 0.000000\n", "7 League 0.000000\n", "8 Division 0.000000\n", "10 Assists 0.000000\n", "11 Errors 0.000000\n", "12 NewLeague 0.000000\n", "9 PutOuts 0.006048\n", "2 HmRun 0.010841\n", "4 RBI 0.012073\n", "3 Runs 0.021020\n", "5 Walks 0.103473\n", "1 Hits 0.298269\n", "6 Years 0.548277" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute feature importances\n", "pd.DataFrame({'feature':feature_cols, 'importance':clf.feature_importances_}).sort_values('importance')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 10.000000\n", "mean 0.820575\n", "std 0.083007\n", "min 0.692308\n", "25% 0.751781\n", "50% 0.830484\n", "75% 0.879630\n", "max 0.923077\n", "dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cross_val_score(clf, X, y, cv=10)).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Random Forests\n", "\n", "Random Forests is a **slight variation of bagged trees** that has even better performance:\n", "\n", "- Exactly like bagging, we create an ensemble of decision trees using bootstrapped samples of the training set.\n", "- However, when building each tree, each time a split is considered, a **random sample of m features** is chosen as split candidates from the **full set of p features**. The split is only allowed to use **one of those m features**.\n", " - A new random sample of features is chosen for **every single tree at every single split**.\n", " - For **classification**, m is typically chosen to be the square root of p.\n", " - For **regression**, m is typically chosen to be somewhere between p/3 and p.\n", "\n", "What's the point?\n", "\n", "- Suppose there is **one very strong feature** in the data set. 
When using bagged trees, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are **highly correlated**.\n", "- Averaging highly correlated quantities does not significantly reduce variance (which is the entire goal of bagging).\n", "- By randomly leaving out candidate features from each split, **Random Forests \"decorrelates\" the trees**, such that the averaging process can reduce the variance of the resulting model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting salary with a Random Forest" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\albah\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n" ] }, { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "clf = RandomForestClassifier()\n", "clf" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 10.000000\n", "mean 0.817749\n", "std 0.062496\n", "min 0.703704\n", "25% 0.783333\n", "50% 0.811254\n", "75% 0.846154\n", "max 0.923077\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"pd.Series(cross_val_score(clf, X, y, cv=10)).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning n_estimators\n", "\n", "One important tuning parameter is **n_estimators**, which is the number of trees that should be grown. It should be a large enough value that the error seems to have \"stabilized\"." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# list of values to try for n_estimators\n", "estimator_range = range(10, 310, 10)\n", "\n", "# list to store the average Accuracy for each value of n_estimators\n", "accuracy_scores = []\n", "\n", "# use 5-fold cross-validation with each value of n_estimators (WARNING: SLOW!)\n", "for estimator in estimator_range:\n", " clf = RandomForestClassifier(n_estimators=estimator, random_state=1, n_jobs=-1)\n", " accuracy_scores.append(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Accuracy')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(estimator_range, accuracy_scores)\n", "plt.xlabel('n_estimators')\n", "plt.ylabel('Accuracy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuning max_features\n", "\n", "The other important tuning parameter is **max_features**, which is the number of features that should be considered at each split." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# list of values to try for max_features\n", "feature_range = range(1, len(feature_cols)+1)\n", "\n", "# list to store the average accuracy for each value of max_features\n", "accuracy_scores = []\n", "\n", "# use 5-fold cross-validation with each value of max_features (WARNING: SLOW!)\n", "for feature in feature_range:\n", "    clf = RandomForestClassifier(n_estimators=200, max_features=feature, random_state=1, n_jobs=-1)\n", "    accuracy_scores.append(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Accuracy')" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(feature_range, accuracy_scores)\n", "plt.xlabel('max_features')\n", "plt.ylabel('Accuracy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting a Random Forest with the best parameters" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features=6, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,\n", " oob_score=False, random_state=1, verbose=0, warm_start=False)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# max_features=6 is best and n_estimators=200 is sufficiently large\n", "clf = RandomForestClassifier(n_estimators=200, max_features=6, random_state=1, n_jobs=-1)\n", "clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " feature importance\n", "8 Division 0.006081\n", "7 League 0.008834\n", "12 NewLeague 0.009709\n", "11 Errors 0.032638\n", "10 Assists 0.040503\n", "2 HmRun 0.047118\n", "9 PutOuts 0.051506\n", "0 AtBat 0.078822\n", "3 Runs 0.080185\n", "5 Walks 0.082160\n", "4 RBI 0.091048\n", "1 Hits 0.132156\n", "6 Years 0.339239" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute feature importances\n", "pd.DataFrame({'feature':feature_cols, 'importance':clf.feature_importances_}).sort_values('importance')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Random Forests with decision trees\n", "\n", "**Advantages of Random Forests:**\n", "\n", "- Performance is competitive with the best supervised learning methods\n", "- Provides a more reliable estimate of feature importance\n", "- Allows you to estimate out-of-sample error without using train/test split or cross-validation\n", "\n", "**Disadvantages of Random Forests:**\n", "\n", "- Less interpretable\n", "- Slower to train\n", "- Slower to predict" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 1 }