{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lesson 4 - Introduction to random forests\n",
"\n",
"> How to train a Random Forest model to predict housing prices and evaluate the quality of the model's predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson04_intro-to-random-forests.ipynb) \n",
"[![slides](https://img.shields.io/static/v1?label=slides&message=lesson04_intro-to-random-forests.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=18yESZldXJrdXiaOQWmiA35-8vIqOnNK8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Understand the main steps involved in training a machine learning model\n",
"* Gain an introduction to scikit-learn's API\n",
"* Understand the need to generate a training and validation set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This lesson is adapted from Jeremy Howard's fantastic online course [_Introduction to Machine Learning for Coders_](https://course18.fast.ai/ml), in particular:\n",
"\n",
"* [1 - Introduction to Random Forests](https://course18.fast.ai/lessonsml1/lesson1.html)\n",
"\n",
"You may also find the following textbook chapters and blog posts useful:\n",
"\n",
"* Chapters 2 & 5 of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron\n",
"* [About Train, Validation and Test Sets in Machine Learning](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Homework\n",
"\n",
"* Solve the exercises included in this notebook\n",
"* Read chapters 6 and 7 of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lesson we will analyse the preprocessed table of clean housing data and their addresses that we prepared in lesson 3:\n",
"\n",
"* `housing_processed.csv`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is a machine learning model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Tom Mitchell](https://en.wikipedia.org/wiki/Tom_M._Mitchell), one of the pioneers of machine learning, proposed this definition:\n",
"\n",
"> A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.\n",
"\n",
"Framed in our example to predict housing prices in California (task $T$), we can run a Random Forest algorithm on data about past housing prices (experience $E$) and, if it has successfully \"learned\", it will then do better at predicting future housing prices (performance measure $P$)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# reload modules before executing user code\n",
"%load_ext autoreload\n",
"# reload all modules every time before executing Python code\n",
"%autoreload 2\n",
"# render plots in notebook\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# data wrangling\n",
"import pandas as pd\n",
"import numpy as np\n",
"from dslectures.core import *\n",
"from pathlib import Path\n",
"\n",
"# data viz\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"sns.set(color_codes=True)\n",
"sns.set_palette(sns.color_palette(\"muted\"))\n",
"\n",
"# ml magic\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, r2_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual, we can download our dataset using our helper function `get_dataset`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download of housing_processed.csv dataset complete.\n"
]
}
],
"source": [
"get_dataset('housing_processed.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also make use of the `pathlib` library to handle our filepaths:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"housing.csv housing_processed.csv\n",
"housing_addresses.csv imdb.csv\n",
"housing_gmaps_data_raw.csv uc\n",
"housing_merged.csv word2vec-google-news-300.pkl\n"
]
}
],
"source": [
"DATA = Path('../data/')\n",
"!ls {DATA}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 \\\n",
"longitude -122.230000 -122.220000 -122.240000 \n",
"latitude 37.880000 37.860000 37.850000 \n",
"housing_median_age 41.000000 21.000000 52.000000 \n",
"total_rooms 880.000000 7099.000000 1467.000000 \n",
"total_bedrooms 129.000000 1106.000000 190.000000 \n",
"population 322.000000 2401.000000 496.000000 \n",
"households 126.000000 1138.000000 177.000000 \n",
"median_income 8.325200 8.301400 7.257400 \n",
"median_house_value 452600.000000 358500.000000 352100.000000 \n",
"city 69.000000 620.000000 620.000000 \n",
"postal_code 94705.000000 94611.000000 94618.000000 \n",
"rooms_per_household 6.984127 6.238137 8.288136 \n",
"bedrooms_per_household 1.023810 0.971880 1.073446 \n",
"bedrooms_per_room 0.146591 0.155797 0.129516 \n",
"population_per_household 2.555556 2.109842 2.802260 \n",
"ocean_proximity_INLAND 0.000000 0.000000 0.000000 \n",
"ocean_proximity_<1H OCEAN 0.000000 0.000000 0.000000 \n",
"ocean_proximity_NEAR BAY 1.000000 1.000000 1.000000 \n",
"ocean_proximity_NEAR OCEAN 0.000000 0.000000 0.000000 \n",
"ocean_proximity_ISLAND 0.000000 0.000000 0.000000 \n",
"\n",
" 3 4 \n",
"longitude -122.250000 -122.250000 \n",
"latitude 37.850000 37.850000 \n",
"housing_median_age 52.000000 52.000000 \n",
"total_rooms 1274.000000 1627.000000 \n",
"total_bedrooms 235.000000 280.000000 \n",
"population 558.000000 565.000000 \n",
"households 219.000000 259.000000 \n",
"median_income 5.643100 3.846200 \n",
"median_house_value 341300.000000 342200.000000 \n",
"city 620.000000 620.000000 \n",
"postal_code 94618.000000 94618.000000 \n",
"rooms_per_household 5.817352 6.281853 \n",
"bedrooms_per_household 1.073059 1.081081 \n",
"bedrooms_per_room 0.184458 0.172096 \n",
"population_per_household 2.547945 2.181467 \n",
"ocean_proximity_INLAND 0.000000 0.000000 \n",
"ocean_proximity_<1H OCEAN 0.000000 0.000000 \n",
"ocean_proximity_NEAR BAY 1.000000 1.000000 \n",
"ocean_proximity_NEAR OCEAN 0.000000 0.000000 \n",
"ocean_proximity_ISLAND 0.000000 0.000000 "
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housing_data = pd.read_csv(DATA / 'housing_processed.csv')\n",
"housing_data.head().T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"#### Exercise #1\n",
"\n",
"Even though you may be told a dataset has been cleaned and prepared for training a model, you should always perform some sanity checks! \n",
"\n",
"* Check that `housing_data` is free of missing values\n",
"* Check that all columns are numerical\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select a performance measure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we can train any model, we need to decide which performance measure to optimise. For regression problems the Root Mean Square Error (RMSE) is often used: it can be read as the _**standard deviation**_ of the errors the algorithm makes in its predictions, and it gives a higher weight to large errors. For example, if the errors are roughly normally distributed around zero, an RMSE of 50,000 means that about 68% of the algorithm's predictions fall within 50,000 USD of the actual value, and about 95% fall within 100,000 USD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: In general, lower values of RMSE indicate a better fit to the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mathematically, the formula for RMSE is:\n",
"\n",
"$$ \\mathrm{RMSE} = \\sqrt{\\frac{1}{m}\\sum_{i=1}^m \\left(\\hat{y}_i - y_i\\right)^2}$$\n",
"\n",
"where $m$ is the number of instances in the dataset you are measuring the RMSE on, $\\hat{y}_i$ is the model's prediction for the $i^{th}$ instance, and $y_i$ is the actual label. Let's create a simple function that uses scikit-learn's [mean_squared_error function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (which returns the mean squared error, i.e. RMSE$^2$):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def rmse(y, yhat):\n",
" \"\"\"A utility function to calculate the Root Mean Square Error (RMSE).\n",
" \n",
" Args:\n",
" y (array): Actual values for target.\n",
" yhat (array): Predicted values for target.\n",
" \n",
" Returns:\n",
" rmse (double): The RMSE.\n",
" \"\"\"\n",
" return np.sqrt(mean_squared_error(y, yhat))"
]
},
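{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what it means for RMSE to give a higher weight to large errors, here is a small sketch that compares it with the mean absolute error (MAE). MAE is not used elsewhere in this lesson, and the two sets of predictions below are made up purely for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n",
"\n",
"# Hypothetical targets and two hypothetical sets of predictions\n",
"y_true = np.zeros(4)\n",
"predictions = {\n",
"    'even': np.array([2.0, 2.0, 2.0, 2.0]),   # four moderate errors\n",
"    'spiky': np.array([0.0, 0.0, 0.0, 8.0]),  # a single large error\n",
"}\n",
"\n",
"results = {}\n",
"for name, y_hat in predictions.items():\n",
"    mae = mean_absolute_error(y_true, y_hat)\n",
"    rmse_value = np.sqrt(mean_squared_error(y_true, y_hat))\n",
"    results[name] = (mae, rmse_value)\n",
"    print(f'{name}: MAE = {mae:.1f}, RMSE = {rmse_value:.1f}')\n",
"```\n",
"\n",
"Both sets of predictions have an MAE of 2, but the RMSE rises from 2 to 4 when the same total error is concentrated in a single prediction."
]
},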
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Exercise #2\n",
"\n",
"Whenever you create a Python function it is a good idea to test that it behaves as you expect on some dummy data. Given the two NumPy arrays:\n",
"\n",
"```python\n",
"y_dummy = np.array([2,2,3])\n",
"yhat_dummy = np.array([0,2,6])\n",
"```\n",
"\n",
"check that our `rmse` function matches what you would get by calculating the explicit formula for RMSE. You may find the `numpy.sum()` function and the `array.size` attribute useful.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: Another common metric for evaluating regression models is the coefficient of determination $R^2 = 1 - u/v$, where $u = \\sum_i (y_{i} - \\hat{y}_{i})^2$ is the residual sum of squares, $v=\\sum_i(y_{i} - \\bar{y})^2$ is the total sum of squares, and $\\bar{y}$ is the mean of the actual values. Better models have $R^2$ values closer to 1."
]
},
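{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the $R^2$ definition, the sketch below computes it by hand on a few made-up numbers and compares the result with scikit-learn's `r2_score`. The arrays are hypothetical and have nothing to do with the housing data:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# Hypothetical actual values and predictions\n",
"y_true = np.array([3.0, 5.0, 7.0, 9.0])\n",
"y_hat = np.array([2.5, 5.5, 7.0, 8.0])\n",
"\n",
"u = np.sum((y_true - y_hat) ** 2)          # residual sum of squares\n",
"v = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares\n",
"r2_manual = 1 - u / v\n",
"\n",
"print(r2_manual, r2_score(y_true, y_hat))\n",
"```\n",
"\n",
"Here $u = 1.5$ and $v = 20$, so both values should come out at 0.925."
]
},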
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introducing scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've checked that the training data is clean and free from obvious anomalies, it's time to train our model! To do so, we will make use of the scikit-learn library.\n",
"\n",
"scikit-learn is one of the best known Python libraries for machine learning and provides efficient implementations of a large number of common algorithms. It has a uniform _Estimator API_ as well as excellent online documentation. The main benefit of its API is that once you understand the basic use and syntax of scikit-learn for one type of model, switching to a new model or algorithm is very easy.\n",
"\n",
"**Basics of the API**\n",
"\n",
"The most common steps one takes when building a model in scikit-learn are:\n",
"1. Choose a class of model by importing the appropriate estimator class from scikit-learn.\n",
"2. Choose model _hyperparameters_ by instantiating this class with the desired values.\n",
"3. Arrange data into a feature matrix and target vector (see discussion below).\n",
"4. Fit the model to your data by calling the `fit()` method.\n",
"5. Evaluate the predictions of the model:\n",
" * For supervised learning we typically predict _labels_ for new data using the `predict()` method.\n",
" * For unsupervised learning, we often transform or infer properties of the data using the `transform()` or `predict()` methods.\n",
" \n",
"Let's go through each of these steps to build a Random Forest regressor to predict California housing prices."
]
},
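{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before doing so, here is a minimal sketch of the same five steps on a toy problem, to show how little of the code depends on the model. It assumes a `LinearRegression` estimator and noise-free synthetic data, neither of which is used in the rest of this lesson:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression  # 1. choose a model class\n",
"\n",
"# 2. choose hyperparameters by instantiating the class\n",
"toy_model = LinearRegression(fit_intercept=True)\n",
"\n",
"# 3. arrange data into a feature matrix and a target vector\n",
"rng = np.random.default_rng(42)\n",
"X_toy = rng.uniform(0, 10, size=(50, 1))  # shape [n_samples, n_features]\n",
"y_toy = 3.0 * X_toy[:, 0] + 1.0           # synthetic target: y = 3x + 1\n",
"\n",
"# 4. fit the model to the data\n",
"toy_model.fit(X_toy, y_toy)\n",
"\n",
"# 5. predict labels and inspect what was learned\n",
"y_toy_pred = toy_model.predict(X_toy)\n",
"print(toy_model.coef_, toy_model.intercept_)\n",
"```\n",
"\n",
"Swapping in a different estimator only changes steps 1 and 2; the calls to `fit()` and `predict()` stay the same."
]
},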
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choose a model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In scikit-learn, every class of model is represented by a Python class. We want a Random Forest regressor, so looking at the online [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) we should import the `RandomForestRegressor`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestRegressor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choose hyperparameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have chosen our model class, there are still some options open to us:\n",
"\n",
"* What is the maximum depth of each tree? The default is `None`, which means the nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples.\n",
"* Other parameters can be found in the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor), but for now we take a simple model with just 10 trees (`n_estimators=10`).\n",
"\n",
"The above choices are often referred to as _hyperparameters_, i.e. parameters that must be set before the model is fit to the data. We can instantiate the `RandomForestRegressor` class and specify the desired hyperparameters as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note: You will often see the function argument `random_state` in scikit-learn and other libraries in the PyData stack. This parameter usually controls the random seed used to generate the function's output, and setting it explicitly allows us to have reproducible results."
]
},
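{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of what reproducibility means in practice: calling scikit-learn's `train_test_split` twice with the same `random_state` on a made-up array should shuffle the data in exactly the same way both times.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"data = np.arange(10)  # hypothetical data: the integers 0 to 9\n",
"\n",
"# Two independent calls with the same seed\n",
"a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)\n",
"b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)\n",
"\n",
"same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)\n",
"print(same_split)\n",
"```\n",
"\n",
"If you leave `random_state` unset, the split will generally differ from run to run, which makes results harder to compare."
]
},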
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arrange data into a feature matrix and target vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn requires that the data be arranged into a two-dimensional feature matrix and a one-dimensional target array. By convention: \n",
"\n",
"* The feature matrix is often stored in a variable called `X`. This matrix is typically two-dimensional with shape `[n_samples, n_features]`, where `n_samples` is the number of rows (i.e. housing districts in our example) and `n_features` is the number of feature columns, i.e. all columns except `median_house_value`, which is our target.\n",
"* The target or label array is usually denoted by `y`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"#### Exercise #3\n",
"\n",
"Create the feature matrix `X` and target vector `y` from `housing_data`.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit the model to your data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it is time to apply our model to data! This can be done with the `fit()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" max_samples=None, min_impurity_decrease=0.0,\n",
" min_impurity_split=None, min_samples_leaf=1,\n",
" min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
" n_estimators=10, n_jobs=-1, oob_score=False,\n",
" random_state=42, verbose=0, warm_start=False)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step is to generate predictions and evaluate them with our chosen performance metric, in this case the RMSE."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yhat = model.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19186.0842161522"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rmse(y, yhat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is not a bad score since the middle half of the house prices fall in the range of roughly $115,000-250,000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 19443.000000\n",
"mean 191793.406162\n",
"std 96775.724042\n",
"min 14999.000000\n",
"25% 116700.000000\n",
"50% 173400.000000\n",
"75% 247100.000000\n",
"max 499100.000000\n",
"Name: median_house_value, dtype: float64"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housing_data['median_house_value'].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and thus we are looking at an error of roughly 8-17% for house prices in that range."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Warning: Evaluating our model's predictions on the same data it was trained on is usually a recipe for disaster! Why? The problem is that the model may memorise the structure of the data it sees and fail to provide good predictions when shown new data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Better evaluation using training and validation splits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to measure how well a model will generalise to new cases is to split your data into two sets: the _**training set**_ and the _**validation set**_. As these names imply, you train your model using the training set and validate it using the validation set. The error rate on new cases is called the _**generalisation error**_, and by evaluating your model on the validation set you get an estimate of this error.\n",
"\n",
"Creating a validation set is theoretically quite simple: just pick some instances randomly and set them aside (we set the random number generator's seed `random_state` so that it always generates the same shuffled indices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"15554 train rows + 3889 valid rows\n"
]
}
],
"source": [
"X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"print(f'{len(X_train)} train rows + {len(X_valid)} valid rows')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With these two datasets, we first fit on the training set and evaluate the prediction on the validation one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" max_samples=None, min_impurity_decrease=0.0,\n",
" min_impurity_split=None, min_samples_leaf=1,\n",
" min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
" n_estimators=10, n_jobs=-1, oob_score=False,\n",
" random_state=42, verbose=0, warm_start=False)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"46106.026083048855"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred = model.predict(X_valid)\n",
"rmse(y_valid, y_pred)"
]
},
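{
"cell_type": "markdown",
"metadata": {},
"source": [
"The validation RMSE of about 46,100 is more than twice the 19,200 we measured on the training data, which is the gap the warning above was pointing at. The sketch below reproduces the same effect on synthetic data generated with scikit-learn's `make_regression`. The dataset, its size and the noise level are assumptions chosen for illustration, not part of the housing analysis:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic regression problem with noisy targets\n",
"X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)\n",
"X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)\n",
"\n",
"forest = RandomForestRegressor(n_estimators=10, random_state=0)\n",
"forest.fit(X_tr, y_tr)\n",
"\n",
"# Score the same fitted model on data it has seen and on data it has not\n",
"rmse_train = np.sqrt(mean_squared_error(y_tr, forest.predict(X_tr)))\n",
"rmse_valid = np.sqrt(mean_squared_error(y_va, forest.predict(X_va)))\n",
"print(f'train RMSE: {rmse_train:.1f}, validation RMSE: {rmse_valid:.1f}')\n",
"```\n",
"\n",
"Because the fully grown trees can partly memorise the noisy training targets, the training error should come out noticeably lower than the validation error."
]
},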
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualising the model errors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although numerical scores are useful, deeper insights can often be gained by _visualising_ the errors the model makes. Let's look at two common ways to diagnose _regression_ models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prediction errors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a sense of how often our model is predicting values that are close to the expected values, we'll plot the actual `median_house_value` labels from the validation dataset against the predicted values generated by our fitted model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def plot_prediction_error(fitted_model, X, y):\n",
"    \"\"\"\n",
"    A utility function to visualise the prediction errors of regression models.\n",
"\n",
"    Args:\n",
"        fitted_model: A fitted scikit-learn regression model.\n",
"        X: The feature matrix to generate predictions on.\n",
"        y: The target vector to compare the predictions against.\n",
"    \"\"\"\n",
"    # use the model passed in as an argument, not the global `model`\n",
"    y_pred = fitted_model.predict(X)\n",
"    plt.figure(figsize=(8, 4))\n",
"    # keyword arguments are required by recent seaborn versions\n",
"    sns.scatterplot(x=y, y=y_pred)\n",
"    # red diagonal marks where predicted == actual\n",
"    sns.lineplot(x=[y.min(), y.max()], y=[y.min(), y.max()], lw=2, color=\"r\")\n",
"    plt.xlabel(\"Actual Median House Price\")\n",
"    plt.ylabel(\"Predicted Median House Price\")\n",
"    plt.title(f\"Prediction Error for {fitted_model.__class__.__name__}\")\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"