{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 5 - Random forest deep dive\n", "\n", "> A deep dive into how Random Forests work and some tricks for making them more accurate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson05_random-forest-deep-dive.ipynb) \n", "[![slides](https://img.shields.io/static/v1?label=slides&message=lesson05_random-forest-deep-dive.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=1xRpRXFY_wMzHn2QgW3nDzULh8Y7G5IVt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Understand how to go from simple to complex models\n", "* Understand the concepts of bagging and out-of-bag score\n", "* Gain an introduction to hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This lesson is adapted from Jeremy Howard's fantastic online course [_Introduction to Machine Learning for Coders_](https://course18.fast.ai/ml), in particular:\n", "\n", "* [2 - Random Forest Deep Dive](https://course18.fast.ai/lessonsml1/lesson2.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework\n", "\n", "* Solve the exercises included in this notebook\n", "* Read chapter 7 of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lesson we will analyse the preprocessed table of clean housing data and their addresses that we prepared in lesson 3:\n", "\n", "* `housing_processed.csv`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does a random forest work for regression tasks?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we use a Random Forest to solve a regression task, the basic idea is that the _leaf nodes_ of each Decision Tree predict a numerical value that is the average of the training instances associated with that node. These predictions are then _averaged_ to obtain the final prediction from the forest of trees. In this lesson we will look at how we can build predictions from a single tree and get better predictions by adding progressively more trees to the forest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reload modules before executing user code\n", "%load_ext autoreload\n", "# reload all modules every time before executing Python code\n", "%autoreload 2\n", "# render plots in notebook\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uncomment to update the library if working locally\n", "# !pip install dslectures --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data wrangling\n", "import pandas as pd\n", "import numpy as np\n", "from numpy.testing import assert_array_equal\n", "from dslectures.core import *\n", "from dslectures.structured import *\n", "from pathlib import Path\n", "import pickle\n", "\n", "# data viz\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.tree import plot_tree\n", "\n", "sns.set(color_codes=True)\n", "sns.set_palette(sns.color_palette(\"muted\"))\n", "\n", "# ml magic\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import r2_score\n", "from sklearn.ensemble import RandomForestRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset already exists at '../data/housing_processed.csv' and is not downloaded again.\n" ] } ], "source": [ "get_dataset('housing_processed.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also make use of the `pathlib` library to handle our filepaths:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "housing.csv keep_cols.npy\n", "housing_addresses.csv submission.csv\n", "housing_columns_to_keep.npy submission_mean.csv\n", "housing_gmaps_data_raw.csv test.csv\n", "housing_merged.csv train.csv\n", "housing_model.pkl uc\n", "housing_processed.csv word2vec-google-news-300.pkl\n", "imdb.csv\n" ] } ], "source": [ "DATA = Path('../data/')\n", "!ls {DATA}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value city \\\n", "0 322.0 126.0 8.3252 452600.0 69 \n", "1 2401.0 1138.0 8.3014 358500.0 620 \n", "2 496.0 177.0 7.2574 352100.0 620 \n", "3 558.0 219.0 5.6431 341300.0 620 \n", "4 565.0 259.0 3.8462 342200.0 620 \n", "\n", " postal_code rooms_per_household bedrooms_per_household \\\n", "0 94705 6.984127 1.023810 \n", "1 94611 6.238137 0.971880 \n", "2 94618 8.288136 1.073446 \n", "3 94618 5.817352 1.073059 \n", "4 94618 6.281853 1.081081 \n", "\n", " bedrooms_per_room population_per_household ocean_proximity_INLAND \\\n", "0 0.146591 2.555556 0 \n", "1 0.155797 2.109842 0 \n", "2 0.129516 2.802260 0 \n", "3 0.184458 2.547945 0 \n", "4 0.172096 2.181467 0 \n", "\n", " ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY \\\n", "0 0 1 \n", "1 0 1 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", "\n", " ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND \n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_data = pd.read_csv(DATA/'housing_processed.csv'); housing_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data loaded, we can recreate our feature matrix $X$ and target vector $y$ and train/validation splits as before." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #1\n", "\n", "* Create the feature matrix $X$ and target vector $y$ from `housing_data`\n", "* Create training and validation sets with a split of 80:20\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, let's see how our baseline model performs on the validation set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simplify the evaluation of our models, let's write a simple function to keep track of the scores:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def print_rf_scores(fitted_model):\n", "    \"\"\"Generates RMSE and R^2 scores from fitted Random Forest model.\"\"\"\n", "\n", "    yhat_train = fitted_model.predict(X_train)\n", "    R2_train = fitted_model.score(X_train, y_train)\n", "    yhat_valid = fitted_model.predict(X_valid)\n", "    R2_valid = fitted_model.score(X_valid, y_valid)\n", "\n", "    scores = {\n", "        \"RMSE on train:\": rmse(y_train, yhat_train),\n", "        \"R^2 on train:\": R2_train,\n", "        \"RMSE on valid:\": rmse(y_valid, yhat_valid),\n", "        \"R^2 on valid:\": R2_valid,\n", "    }\n", "    # the OOB score only exists if the model was fitted with oob_score=True\n", "    if hasattr(fitted_model, \"oob_score_\"):\n", "        scores[\"OOB R^2:\"] = fitted_model.oob_score_\n", "\n", "    for score_name, score_value in scores.items():\n", "        print(score_name, round(score_value, 3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE on train: 19173.912\n", "R^2 on train: 0.96\n", "RMSE on valid: 46106.026\n", "R^2 on valid: 0.779\n" ] } ], "source": [ "print_rf_scores(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The simplest model: a single tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's build a model that is so simple that we can actually take a look at it. As we saw in lesson 4, a Random Forest is simply a forest of decision trees, so let's begin by looking at a single tree (individual trees are called _estimators_ in scikit-learn):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',\n", " max_depth=3, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=1, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1, random_state=42)\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above we have fixed the following hyperparameters:\n", "\n", "* `n_estimators = 1`: create a forest with one tree, i.e. a _decision tree_\n", "* `max_depth = 3`: limit the depth of the tree to three \"levels\" of splits\n", "* `bootstrap=False`: build the tree on the whole training set instead of a bootstrap sample\n", "\n", "Let's see how this simple model performs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE on train: 65040.247\n", "R^2 on train: 0.545\n", "RMSE on valid: 67579.11\n", "R^2 on valid: 0.525\n" ] } ], "source": [ "print_rf_scores(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, this single tree yields worse predictions ($R^2 = 0.53$) than our baseline with 10 trees ($R^2 = 0.78$). Nevertheless, we can visualise the tree by accessing the `estimators_` attribute and making use of scikit-learn's plotting API ([link](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# get column names\n", "feature_names = X_train.columns\n", "# we need to specify the background color because of a bug in sklearn\n", "fig, ax = plt.subplots(figsize=(30,10), facecolor='k')\n", "# generate tree plot\n", "plot_tree(model.estimators_[0], filled=True, feature_names=feature_names, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the figure we observe that a tree consists of a sequence of binary decisions, where each box includes information about:\n", "\n", "* The split condition (e.g. `median_income <= 4.068`) and the value of the split criterion, which is the mean squared error (`mse`) in this case.\n", "* The number of `samples` in each node. Note we start with the full training set in the root node and get successively smaller values at each split.\n", "* The `value` of each node, which is the _average_ of the prices of its samples. Darker colours indicate a higher `value`.\n", "\n", "The best single binary split is for `median_income <= 4.068`, which improves the mean squared error from 9.3 billion to 6.0 billion." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A slightly better model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right now our simple tree model has $R^2 = 0.53$ on the validation set. Let's make it better by removing the `max_depth=3` restriction (the cell below also grows 10 trees instead of one, still without bootstrapping):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = RandomForestRegressor(n_estimators=10, bootstrap=False, n_jobs=-1, random_state=42)\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE on train: 0.0\n", "R^2 on train: 1.0\n", "RMSE on valid: 58570.323\n", "R^2 on valid: 0.643\n" ] } ], "source": [ "print_rf_scores(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that removing the `max_depth` constraint has yielded a _perfect_ $R^2=1$ on the training set! That is because in this case each leaf node has exactly one element, so the trees perfectly segment (in other words, memorise) the training data. On the validation set we get $R^2 = 0.64$, which is better than our shallow tree, but the large gap to the training score is a sign of overfitting and we can do better." 
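, "

The memorisation effect described above can be reproduced with a minimal from-scratch sketch. It is not how scikit-learn builds its trees internally: the helper names (`mse`, `best_split`, `grow`, `predict`, `training_mse`) and the toy numbers are made up for illustration, and only a single feature is used. The sketch picks each split by minimising the sample-weighted mean squared error of the two children, and compares a depth-limited tree with one that is allowed to grow until every leaf is pure.

```python
# From-scratch regression tree on a single feature (illustrative only).

def mse(values):
    # Mean squared error of predicting the mean of `values`.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    # Try every midpoint between sorted neighbours and keep the threshold
    # with the lowest sample-weighted MSE of the two children.
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


def grow(xs, ys, max_depth=None, depth=0):
    # A leaf stores the mean target; an internal node stores a threshold.
    depth_reached = max_depth is not None and depth >= max_depth
    split = None if depth_reached or mse(ys) == 0 else best_split(xs, ys)
    if split is None:
        return {'value': sum(ys) / len(ys)}
    threshold = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        'threshold': threshold,
        'left': grow([x for x, _ in left], [y for _, y in left], max_depth, depth + 1),
        'right': grow([x for x, _ in right], [y for _, y in right], max_depth, depth + 1),
    }


def predict(tree, x):
    # Walk down the tree until a leaf is reached.
    while 'value' not in tree:
        tree = tree['left'] if x <= tree['threshold'] else tree['right']
    return tree['value']


def training_mse(tree, xs, ys):
    return sum((predict(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up toy data: one feature (think median income) and a price target.
income = [1.5, 2.0, 3.1, 4.0, 5.2, 6.8, 8.3]
price = [90.0, 110.0, 150.0, 170.0, 260.0, 330.0, 450.0]

shallow = grow(income, price, max_depth=1)
full = grow(income, price)

shallow_error = training_mse(shallow, income, price)
full_error = training_mse(full, income, price)
print('depth-1 training MSE:', round(shallow_error, 1))
print('unrestricted training MSE:', full_error)
```

Because every toy row has a distinct feature value, the unrestricted tree ends up with one row per leaf and reproduces the training targets exactly, which is the same mechanism behind the perfect training score above. It says nothing about how well the tree would do on unseen rows."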
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to bagging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bagging is a technique that can be used to improve the ability of models to generalise to new data.\n", "\n", "The basic idea behind bagging is to train _several_ models, each of which is only partially predictive but, crucially, makes errors that are uncorrelated with those of the other models. Since these models are effectively gaining different insights into the data, by averaging their predictions we can create an _ensemble_ that is more predictive!\n", "\n", "As shown in the figure, bagging is a two-step process:\n", "\n", "1. Bootstrapping, i.e. sampling the training set\n", "2. Aggregation, i.e. averaging predictions from multiple models\n", "\n", "This gives us the acronym Bootstrap AGGregatING, or bagging for short 🤓.\n", "\n", "The key for this to work is to keep the errors of the individual models as uncorrelated as possible. The way we do that with trees is to _**sample with replacement**_ from the data: each model in the ensemble is trained on its own bootstrap sample, so the models see different versions of the training set.\n", "
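
The two steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper `bootstrap_indices`, the synthetic targets and the trivial stand-in model (each model just predicts the mean target of its bootstrap sample) are all hypothetical and are not part of scikit-learn or of this lesson's dataset.

```python
import random


def bootstrap_indices(n_rows, rng):
    # Step 1 (bootstrapping): draw n_rows row indices WITH replacement,
    # so some rows appear several times and others not at all.
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(42)
n_rows = 1000

# Synthetic targets standing in for house prices (made-up numbers).
targets = [rng.gauss(200.0, 50.0) for _ in range(n_rows)]

sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows
print('fraction of distinct rows in one bootstrap sample:', unique_fraction)

# Step 2 (aggregation): fit one model per bootstrap sample and average
# their predictions. The stand-in model is simply the sample mean.
model_predictions = []
for _ in range(10):
    rows = bootstrap_indices(n_rows, rng)
    model_predictions.append(sum(targets[i] for i in rows) / n_rows)

ensemble_prediction = sum(model_predictions) / len(model_predictions)
print('individual predictions:', [round(p, 1) for p in model_predictions])
print('ensemble prediction:', round(ensemble_prediction, 1))
```

For large datasets a bootstrap sample contains roughly 63% of the distinct rows, because the chance that a given row is never drawn approaches 1/e. The rows that are left out are what the out-of-bag score later in this lesson relies on.
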
\n", "Figure reference: https://bit.ly/33tGPVT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall from our very first model that the number of trees (`n_estimators`) is one of the hyperparameters we can tune when building our Random Forest. Let's look at this more closely and see how the performance of the forest improves as we add trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = RandomForestRegressor(n_estimators=10, bootstrap=True, n_jobs=-1, random_state=42)\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After training, each tree is stored in an attribute called `estimators_`. For each tree, we can call `predict()` on the validation set and use the `numpy.stack()` function to stack the predictions into a single array with one row per tree:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds = np.stack([tree.predict(X_valid) for tree in model.estimators_])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we have 10 trees and 3889 samples in our validation set, we expect that `preds` will have shape $(n_\\mathrm{trees}, n_\\mathrm{samples})$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10, 3889)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** To calculate `preds` we made use of Python's list comprehension. An alternative way to do this would be to use a for-loop as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_list = []\n", "\n", "for tree in model.estimators_:\n", "    preds_list.append(tree.predict(X_valid))\n", "\n", "# stack list of predictions into a single 2D array\n", "preds_v2 = np.stack(preds_list)\n", "\n", "# test that arrays are equal\n", "assert_array_equal(preds, preds_v2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now look at a plot of the $R^2$ values as we increase the number of trees." 
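, "

Before plotting, it may help to see the underlying calculation on a tiny made-up example. The sketch below uses plain Python lists instead of NumPy: `tree_preds` plays the role of `preds` (one row per tree, one column per validation sample), `y_valid_toy` plays the role of the validation targets, and the `r2` helper is a hand-written version of the usual coefficient of determination. All numbers are invented for illustration.

```python
def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Rows are trees, columns are validation samples (made-up numbers).
tree_preds = [
    [110.0, 190.0, 320.0, 390.0],
    [90.0, 230.0, 280.0, 420.0],
    [105.0, 185.0, 310.0, 380.0],
]
y_valid_toy = [100.0, 200.0, 300.0, 400.0]

scores = []
for n_trees in range(1, len(tree_preds) + 1):
    used = tree_preds[:n_trees]
    # Average the first n_trees predictions column by column.
    averaged = [sum(column) / n_trees for column in zip(*used)]
    scores.append(r2(y_valid_toy, averaged))

for n_trees, score in enumerate(scores, start=1):
    print(n_trees, 'tree(s): R^2 =', round(score, 4))
```

In this toy example the score rises with every added tree because the individual errors partly cancel. With real data the curve is not guaranteed to increase at every step, which is why it is worth plotting."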
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_r2_vs_trees(preds, y_valid):\n", "    \"\"\"Generate a plot of R^2 score on validation set vs number of trees in Random Forest\"\"\"\n", "    n_trees = range(1, len(preds) + 1)\n", "    # R^2 of the averaged predictions of the first n trees\n", "    scores = [r2_score(y_valid, np.mean(preds[:n], axis=0)) for n in n_trees]\n", "    fig, ax = plt.subplots()\n", "    ax.plot(n_trees, scores)\n", "    ax.set_ylabel(\"$R^2$ on validation set\")\n", "    ax.set_xlabel(\"Number of trees\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_r2_vs_trees(preds, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we add more trees, $R^2$ improves but appears to flatten out. Let's test this numerically." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #2\n", "\n", "* Increase the number of trees in the forest by changing `n_estimators` from 10 to 20, 40, and 80 and print their metrics.\n", "* For the forest with 80 trees, generate the array of predictions for each tree and plot the $R^2$ value against the number of trees.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out-Of-Bag (OOB) score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we've been using a validation set to examine the effect of tuning hyperparameters like the number of trees. But what if the dataset is so small that setting aside a validation set would leave too little data to build a good model? Random Forests have a nice feature called the _**Out-Of-Bag (OOB) score**_ which is designed for just this case!\n", "\n", "The key idea is to observe that each tree of our ensemble was trained on a bootstrap sample of the training set, so some rows were never seen by that tree. If we evaluate each tree on the rows it did _not_ see, we have effectively created a _validation set per tree._ To generate OOB predictions, we average, for every row, the predictions of the trees that did not train on it, and then calculate RMSE, $R^2$, or whatever metric we are interested in.\n", "\n", "To enable this behaviour in scikit-learn, one sets the `oob_score=True` flag, which adds an `oob_score_` attribute to our fitted model that we can print out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE on train: 16755.861\n", "R^2 on train: 0.97\n", "RMSE on valid: 44145.602\n", "R^2 on valid: 0.797\n", "OOB R^2: 0.786\n" ] } ], "source": [ "model = RandomForestRegressor(n_estimators=40, bootstrap=True, n_jobs=-1, oob_score=True, random_state=42)\n", "model.fit(X_train, y_train)\n", "print_rf_scores(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OOB score is handy when you want to find the best set of hyperparameters in some automated way. For example, scikit-learn provides [grid search](https://scikit-learn.org/stable/modules/grid_search.html) tools that allow you to pass a list of hyperparameters and a range of values to scan through. Using the OOB score to evaluate which combination of parameters is best is a good strategy in practice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a few more hyperparameters that can be tuned in a Random Forest. From our earlier analysis, we saw that 40 trees gave good performance, so let's pick that as a baseline to compare against." 
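, "

Before tuning further, the out-of-bag idea from the previous section can be made concrete with a small standard-library sketch. Everything in it is an assumption made for illustration: the data are synthetic, and each ensemble member is a 1-nearest-neighbour lookup (`fit_nearest_neighbour`, a hypothetical helper) standing in for a fully grown tree. It is not scikit-learn's implementation; it only shows the bookkeeping of scoring each row with the models that never saw it.

```python
import random

rng = random.Random(0)
n_rows, n_models = 60, 25

# Synthetic data: one feature and a noisy linear target (made-up numbers).
xs = [float(i) for i in range(n_rows)]
ys = [3.0 * x + rng.gauss(0.0, 5.0) for x in xs]


def fit_nearest_neighbour(rows):
    # Stand-in for a fully grown tree: remember the sampled rows and
    # predict the target of the closest remembered feature value.
    train = [(xs[i], ys[i]) for i in rows]

    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]

    return predict


oob_sums = [0.0] * n_rows
oob_counts = [0] * n_rows

for _ in range(n_models):
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap sample
    in_bag = set(rows)
    model = fit_nearest_neighbour(rows)
    for i in range(n_rows):
        if i not in in_bag:  # row i is out-of-bag for this model
            oob_sums[i] += model(xs[i])
            oob_counts[i] += 1

# Only rows that were out-of-bag for at least one model get an OOB prediction.
covered = [i for i in range(n_rows) if oob_counts[i] > 0]
oob_pred = [oob_sums[i] / oob_counts[i] for i in covered]
oob_true = [ys[i] for i in covered]

mean_true = sum(oob_true) / len(oob_true)
ss_res = sum((t - p) ** 2 for t, p in zip(oob_true, oob_pred))
ss_tot = sum((t - mean_true) ** 2 for t in oob_true)
oob_r2 = 1 - ss_res / ss_tot
print('rows with an OOB prediction:', len(covered), 'of', n_rows)
print('OOB R^2:', round(oob_r2, 3))
```

No separate validation set is needed here: every row is scored only by models whose bootstrap sample did not contain it, which is the quantity that the `oob_score_` attribute summarises for a real forest."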
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE on train: 16755.861\n", "R^2 on train: 0.97\n", "RMSE on valid: 44145.602\n", "R^2 on valid: 0.797\n", "OOB R^2: 0.786\n" ] } ], "source": [ "model = RandomForestRegressor(n_estimators=40, bootstrap=True, n_jobs=-1, oob_score=True, random_state=42)\n", "model.fit(X_train, y_train)\n", "print_rf_scores(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Minimum number of samples per leaf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hyperparameter `min_samples_leaf` sets the minimum number of training samples that must end up in each leaf, so a node is only split if both children keep at least that many samples. By default, `min_samples_leaf = 1`, so each tree can split all the way down to a single sample per leaf, but in practice it can be useful to try values such as 3, 5, 10, or 25 and see if the performance improves." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #3\n", "Train several Random Forest models with `min_samples_leaf` values of 3, 5, 10, and 25. By comparing the scores, do you see an improvement in performance compared to our baseline?\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Maximum number of features per split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another good hyperparameter to tune is `max_features`, which controls how many _**columns**_ (given as a number or a fraction) are _**randomly**_ sampled as candidates when making a single split at a tree node. Here, the motivation is that we might have situations where a few columns in our data are highly predictive, so each tree will be biased towards picking the same splits, which makes the trees correlated and thus reduces the generalisation power of our ensemble. To counteract that, we can tune `max_features`, where good values to try are `1.0`, `0.5`, `'log2'`, or `'sqrt'`." 
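, "

To get a feel for what these settings mean, the sketch below converts each value into a number of candidate columns and then draws that many columns at random for one split. The helper `n_candidate_columns` is hypothetical; its rounding rules follow my understanding of how scikit-learn resolves `max_features`, so treat the exact counts as an assumption and check the documentation of your installed version. The column count of 19 assumes that $X$ contains every column of `housing_processed.csv` except the target.

```python
import math
import random


def n_candidate_columns(max_features, n_columns):
    # Translate a max_features setting into a number of columns per split.
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_columns)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_columns)))
    if isinstance(max_features, float):
        # A float is read as a fraction of all columns.
        return max(1, int(max_features * n_columns))
    # An integer is read as an absolute number of columns.
    return int(max_features)


n_columns = 19  # assumed number of feature columns
rng = random.Random(42)

for setting in [1.0, 0.5, 'log2', 'sqrt']:
    k = n_candidate_columns(setting, n_columns)
    # Columns considered at one particular split; a fresh draw is made per split.
    candidates = sorted(rng.sample(range(n_columns), k))
    print(setting, '->', k, 'columns, e.g.', candidates)
```

Note the difference between the float `1.0` (all columns) and the integer `1` (a single column per split). Smaller values make the trees less alike, at the price of each individual tree being somewhat less accurate."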
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #4\n", "\n", "Train a Random Forest model with `max_features` values of 0.5, 1.0, 'log2', and 'sqrt'. By comparing the scores, do you see an improvement in performance compared to our baseline? Hint: you may find it useful to create a for-loop over the list `[0.5, 1.0, 'log2', 'sqrt']`.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }