{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 4 - Introduction to random forests\n", "\n", "> How to train a Random Forest model to predict housing prices and evaluate the quality of the model's predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lewtun/dslectures/master?urlpath=lab/tree/notebooks%2Flesson04_intro-to-random-forests.ipynb) \n", "[![slides](https://img.shields.io/static/v1?label=slides&message=lesson04_intro-to-random-forests.pdf&color=blue&logo=Google-drive)](https://drive.google.com/open?id=18yESZldXJrdXiaOQWmiA35-8vIqOnNK8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Understand the main steps involved in training a machine learning model\n", "* Gain an introduction to scikit-learn's API\n", "* Understand the need to generate a training and validation set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This lesson is adapted from Jeremy Howard's fantastic online course [_Introduction to Machine Learning for Coders_](https://course18.fast.ai/ml), in particular:\n", "\n", "* [1 - Introduction to Random Forests](https://course18.fast.ai/lessonsml1/lesson1.html)\n", "\n", "You may also find the following textbook chapters and blog posts useful:\n", "\n", "* Chapters 2 & 5 of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron\n", "* [About Train, Validation and Test Sets in Machine Learning](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework\n", "\n", "* Solve the exercises included in this notebook\n", "* Read chapters 6 and 7 of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lesson we will analyse the preprocessed table of clean housing data and their addresses that we prepared in lesson 3:\n", "\n", "* `housing_processed.csv`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a machine learning model?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Tom Mitchell](https://en.wikipedia.org/wiki/Tom_M._Mitchell), one of the pioneers of machine learning, proposed this definition:\n", "\n", "> A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.\n", "\n", "Framed in terms of our example of predicting housing prices in California (task $T$), we can run a Random Forest algorithm on data about past housing prices (experience $E$) and, if it has successfully \"learned\", it will then do better at predicting future housing prices, as judged by a performance measure $P$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reload modules before executing user code\n", "%load_ext autoreload\n", "# reload all modules every time before executing Python code\n", "%autoreload 2\n", "# render plots in notebook\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data wrangling\n", "import pandas as pd\n", "import numpy as np\n", "from dslectures.core import *\n", "from pathlib import Path\n", "\n", "# data viz\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set(color_codes=True)\n", "sns.set_palette(sns.color_palette(\"muted\"))\n", "\n", "# ml magic\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error, r2_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, we can download our dataset using our helper function `get_dataset`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Download of housing_processed.csv dataset complete.\n" ] } ], "source": [ "get_dataset('housing_processed.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also make use of the `pathlib` library to handle our filepaths:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "housing.csv housing_processed.csv\n", "housing_addresses.csv imdb.csv\n", "housing_gmaps_data_raw.csv uc\n", "housing_merged.csv word2vec-google-news-300.pkl\n" ] } ], "source": [ "DATA = Path('../data/')\n", "!ls {DATA}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>city</th><th>postal_code</th><th>rooms_per_household</th><th>bedrooms_per_household</th><th>bedrooms_per_room</th><th>population_per_household</th><th>ocean_proximity_INLAND</th><th>ocean_proximity_&lt;1H OCEAN</th><th>ocean_proximity_NEAR BAY</th><th>ocean_proximity_NEAR OCEAN</th><th>ocean_proximity_ISLAND</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>69</td><td>94705</td><td>6.984127</td><td>1.023810</td><td>0.146591</td><td>2.555556</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>620</td><td>94611</td><td>6.238137</td><td>0.971880</td><td>0.155797</td><td>2.109842</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>620</td><td>94618</td><td>8.288136</td><td>1.073446</td><td>0.129516</td><td>2.802260</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>620</td><td>94618</td><td>5.817352</td><td>1.073059</td><td>0.184458</td><td>2.547945</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>620</td><td>94618</td><td>6.281853</td><td>1.081081</td><td>0.172096</td><td>2.181467</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value city \\\n", "0 322.0 126.0 8.3252 452600.0 69 \n", "1 2401.0 1138.0 8.3014 358500.0 620 \n", "2 496.0 177.0 7.2574 352100.0 620 \n", "3 558.0 219.0 5.6431 341300.0 620 \n", "4 565.0 259.0 3.8462 342200.0 620 \n", "\n", " postal_code rooms_per_household bedrooms_per_household \\\n", "0 94705 6.984127 1.023810 \n", "1 94611 6.238137 0.971880 \n", "2 94618 8.288136 1.073446 \n", "3 94618 5.817352 1.073059 \n", "4 94618 6.281853 1.081081 \n", "\n", " bedrooms_per_room population_per_household ocean_proximity_INLAND \\\n", "0 0.146591 2.555556 0 \n", "1 0.155797 2.109842 0 \n", "2 0.129516 2.802260 0 \n", "3 0.184458 2.547945 0 \n", "4 0.172096 2.181467 0 \n", "\n", " ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY \\\n", "0 0 1 \n", "1 0 1 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", "\n", " ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND \n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_data = pd.read_csv(DATA/'housing_processed.csv'); housing_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Tip: We often use `DataFrame.head()` to peek at the first 5 rows of a `pandas.DataFrame`. When you have a lot of columns, you may find it is simpler to peek at the _transpose_ with `DataFrame.head().T`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>longitude</th><td>-122.230000</td><td>-122.220000</td><td>-122.240000</td><td>-122.250000</td><td>-122.250000</td></tr>\n",
"<tr><th>latitude</th><td>37.880000</td><td>37.860000</td><td>37.850000</td><td>37.850000</td><td>37.850000</td></tr>\n",
"<tr><th>housing_median_age</th><td>41.000000</td><td>21.000000</td><td>52.000000</td><td>52.000000</td><td>52.000000</td></tr>\n",
"<tr><th>total_rooms</th><td>880.000000</td><td>7099.000000</td><td>1467.000000</td><td>1274.000000</td><td>1627.000000</td></tr>\n",
"<tr><th>total_bedrooms</th><td>129.000000</td><td>1106.000000</td><td>190.000000</td><td>235.000000</td><td>280.000000</td></tr>\n",
"<tr><th>population</th><td>322.000000</td><td>2401.000000</td><td>496.000000</td><td>558.000000</td><td>565.000000</td></tr>\n",
"<tr><th>households</th><td>126.000000</td><td>1138.000000</td><td>177.000000</td><td>219.000000</td><td>259.000000</td></tr>\n",
"<tr><th>median_income</th><td>8.325200</td><td>8.301400</td><td>7.257400</td><td>5.643100</td><td>3.846200</td></tr>\n",
"<tr><th>median_house_value</th><td>452600.000000</td><td>358500.000000</td><td>352100.000000</td><td>341300.000000</td><td>342200.000000</td></tr>\n",
"<tr><th>city</th><td>69.000000</td><td>620.000000</td><td>620.000000</td><td>620.000000</td><td>620.000000</td></tr>\n",
"<tr><th>postal_code</th><td>94705.000000</td><td>94611.000000</td><td>94618.000000</td><td>94618.000000</td><td>94618.000000</td></tr>\n",
"<tr><th>rooms_per_household</th><td>6.984127</td><td>6.238137</td><td>8.288136</td><td>5.817352</td><td>6.281853</td></tr>\n",
"<tr><th>bedrooms_per_household</th><td>1.023810</td><td>0.971880</td><td>1.073446</td><td>1.073059</td><td>1.081081</td></tr>\n",
"<tr><th>bedrooms_per_room</th><td>0.146591</td><td>0.155797</td><td>0.129516</td><td>0.184458</td><td>0.172096</td></tr>\n",
"<tr><th>population_per_household</th><td>2.555556</td><td>2.109842</td><td>2.802260</td><td>2.547945</td><td>2.181467</td></tr>\n",
"<tr><th>ocean_proximity_INLAND</th><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>ocean_proximity_&lt;1H OCEAN</th><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>ocean_proximity_NEAR BAY</th><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td><td>1.000000</td></tr>\n",
"<tr><th>ocean_proximity_NEAR OCEAN</th><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>ocean_proximity_ISLAND</th><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " 0 1 2 \\\n", "longitude -122.230000 -122.220000 -122.240000 \n", "latitude 37.880000 37.860000 37.850000 \n", "housing_median_age 41.000000 21.000000 52.000000 \n", "total_rooms 880.000000 7099.000000 1467.000000 \n", "total_bedrooms 129.000000 1106.000000 190.000000 \n", "population 322.000000 2401.000000 496.000000 \n", "households 126.000000 1138.000000 177.000000 \n", "median_income 8.325200 8.301400 7.257400 \n", "median_house_value 452600.000000 358500.000000 352100.000000 \n", "city 69.000000 620.000000 620.000000 \n", "postal_code 94705.000000 94611.000000 94618.000000 \n", "rooms_per_household 6.984127 6.238137 8.288136 \n", "bedrooms_per_household 1.023810 0.971880 1.073446 \n", "bedrooms_per_room 0.146591 0.155797 0.129516 \n", "population_per_household 2.555556 2.109842 2.802260 \n", "ocean_proximity_INLAND 0.000000 0.000000 0.000000 \n", "ocean_proximity_<1H OCEAN 0.000000 0.000000 0.000000 \n", "ocean_proximity_NEAR BAY 1.000000 1.000000 1.000000 \n", "ocean_proximity_NEAR OCEAN 0.000000 0.000000 0.000000 \n", "ocean_proximity_ISLAND 0.000000 0.000000 0.000000 \n", "\n", " 3 4 \n", "longitude -122.250000 -122.250000 \n", "latitude 37.850000 37.850000 \n", "housing_median_age 52.000000 52.000000 \n", "total_rooms 1274.000000 1627.000000 \n", "total_bedrooms 235.000000 280.000000 \n", "population 558.000000 565.000000 \n", "households 219.000000 259.000000 \n", "median_income 5.643100 3.846200 \n", "median_house_value 341300.000000 342200.000000 \n", "city 620.000000 620.000000 \n", "postal_code 94618.000000 94618.000000 \n", "rooms_per_household 5.817352 6.281853 \n", "bedrooms_per_household 1.073059 1.081081 \n", "bedrooms_per_room 0.184458 0.172096 \n", "population_per_household 2.547945 2.181467 \n", "ocean_proximity_INLAND 0.000000 0.000000 \n", "ocean_proximity_<1H OCEAN 0.000000 0.000000 \n", "ocean_proximity_NEAR BAY 1.000000 1.000000 \n", "ocean_proximity_NEAR OCEAN 0.000000 0.000000 \n", "ocean_proximity_ISLAND 0.000000 
0.000000 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_data.head().T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #1\n", "\n", "Even though you may be told a dataset has been cleaned and prepared for training a model, you should always perform some sanity checks! \n", "\n", "* Check that `housing_data` is free of missing values\n", "* Check that all columns are numerical\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select a performance measure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we can train any model, we need to think about which performance measure we wish to optimise for. For regression problems the Root Mean Square Error (RMSE) is often used as it measures the _**standard deviation**_ of the errors the algorithm makes in its predictions and gives a higher weight to large errors. For example, if the errors are roughly normally distributed around zero, an RMSE equal to 50,000 means that about 68% of the algorithm's predictions fall within 50,000 CHF of the actual value, and about 95% fall within 100,000 CHF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: In general, lower values of RMSE indicate a better fit to the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mathematically, the formula for RMSE is:\n", "\n", "$$ \\mathrm{RMSE} = \\sqrt{\\frac{1}{m}\\sum_{i=1}^m \\left(\\hat{y}_i - y_i\\right)^2}$$\n", "\n", "where $m$ is the number of instances in the dataset you are measuring the RMSE on, $\\hat{y}_i$ is the model's prediction for the $i^{th}$ instance, and $y_i$ is the actual label. 
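As a worked instance of the formula, the sketch below computes the RMSE by hand for four made-up predictions. The numbers and variable names are illustrative assumptions, not values from the housing data.

```python
import math

# made-up values for illustration (not taken from the housing data)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 9.0, 12.0]

# squared error (yhat_i - y_i)^2 for each instance
squared_errors = [(p - t) ** 2 for p, t in zip(y_hat, y_true)]

# average over the m instances, then take the square root
m = len(y_true)
rmse_value = math.sqrt(sum(squared_errors) / m)

print(squared_errors)        # [1.0, 0.0, 4.0, 9.0]
print(round(rmse_value, 4))  # 1.8708
```

The single error of size 3 contributes 9 of the 14 units of squared error, which is what giving a higher weight to large errors means in practice.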
Let's create a simple function that uses scikit-learn's [mean_squared_error function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (which returns the mean squared error, i.e. RMSE$^2$):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def rmse(y, yhat):\n", " \"\"\"A utility function to calculate the Root Mean Square Error (RMSE).\n", " \n", " Args:\n", " y (array): Actual values for target.\n", " yhat (array): Predicted values for target.\n", " \n", " Returns:\n", " rmse (float): The RMSE.\n", " \"\"\"\n", " return np.sqrt(mean_squared_error(y, yhat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "#### Exercise #2\n", "\n", "Whenever you create a Python function it is a good idea to test that it behaves as you expect on some dummy data. Given the two NumPy arrays:\n", "\n", "```python\n", "y_dummy = np.array([2,2,3])\n", "yhat_dummy = np.array([0,2,6])\n", "```\n", "\n", "check that our `rmse` function matches what you would get by calculating the explicit formula for RMSE. You may find the `numpy.sum()` function and the `array.size` attribute to be useful.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Another common metric for evaluating regression models is the coefficient of determination $R^2 = 1 - u/v$, where $u = \\sum_i (y_{i} - \\hat{y}_{i})^2$ is the residual sum of squares and $v=\\sum_i(y_{i} - \\bar{y})^2$ is the total sum of squares, with $\\bar{y}$ the mean of the actual values. Better models have $R^2$ values closer to 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've checked that the training data is clean and free from obvious anomalies, it's time to train our model! 
To do so, we will make use of the scikit-learn library.\n", "\n", "scikit-learn is one of the best known Python libraries for machine learning and provides efficient implementations of a large number of common algorithms. It has a uniform _Estimator API_ as well as excellent online documentation. The main benefit of its API is that once you understand the basic use and syntax of scikit-learn for one type of model, switching to a new model or algorithm is very easy.\n", "\n", "**Basics of the API**\n", "\n", "The most common steps one takes when building a model in scikit-learn are:\n", "1. Choose a class of model by importing the appropriate estimator class from scikit-learn.\n", "2. Choose model _hyperparameters_ by instantiating this class with the desired values.\n", "3. Arrange data into a feature matrix and target vector (see discussion below).\n", "4. Fit the model to your data by calling the `fit()` method.\n", "5. Evaluate the predictions of the model:\n", " * For supervised learning we typically predict _labels_ for new data using the `predict()` method.\n", " * For unsupervised learning, we often transform or infer properties of the data using the `transform()` or `predict()` methods.\n", " \n", "Let's go through each of these steps to build a Random Forest regressor to predict California housing prices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choose a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In scikit-learn, every class of model is represented by a Python class. 
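Because every estimator exposes the same `fit()` and `predict()` methods, the steps listed above look identical for different model classes. The following is a minimal sketch on synthetic data; the arrays `X_toy` and `y_toy` and the choice of `LinearRegression` as a second model are assumptions made for illustration and are not part of this lesson's housing workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# synthetic data, made up for illustration: 200 rows, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# steps 1-2: pick estimator classes and instantiate them with hyperparameters
estimators = (
    LinearRegression(),
    RandomForestRegressor(n_estimators=10, random_state=42),
)

# steps 4-5: the same fit() and predict() calls work for every estimator
predictions = {}
for estimator in estimators:
    estimator.fit(X_toy, y_toy)
    predictions[type(estimator).__name__] = estimator.predict(X_toy)

for name, yhat_toy in predictions.items():
    print(name, yhat_toy.shape)
```

Swapping one model for another only changes the line that instantiates the estimator; the rest of the code stays the same.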
We want a Random Forest regressor, so looking at the online [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor) we should import the `RandomForestRegressor`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choose hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have chosen our model class, there are still some options open to us:\n", "\n", "* What is the maximum depth of each tree? The default is `None`, which means the nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples.\n", "* Other parameters can be found in the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor), but for now we take a simple model with just 10 trees.\n", "\n", "The above choices are often referred to as _hyperparameters_, i.e. parameters that must be set before the model is fit to the data. We can instantiate the `RandomForestRegressor` class and specify the desired hyperparameters as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: You will often see the function argument `random_state` in scikit-learn and other libraries in the PyData stack. This parameter usually controls the random seed used to generate the function's output, and setting it explicitly allows us to have reproducible results." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arrange data into a feature matrix and target vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn requires that the data be arranged into a two-dimensional feature matrix and a one-dimensional target array. By convention: \n", "\n", "* The feature matrix is often stored in a variable called `X`. This matrix is typically two-dimensional with shape `[n_samples, n_features]`, where `n_samples` refers to the number of rows (i.e. housing districts in our example) and `n_features` refers to the number of feature columns (i.e. all columns except `median_house_value`, which is our target).\n", "* The target or label array is usually denoted by `y`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #3\n", "\n", "Create the feature matrix `X` and target vector `y` from `housing_data`.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit the model to your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it is time to apply our model to data! 
This can be done with the `fit()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final step is to generate predictions and evaluate them with our chosen performance metric, in this case the RMSE." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "yhat = model.predict(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19186.0842161522" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rmse(y, yhat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not a bad score since the majority of the house prices fall in the range of $115,000-250,000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 19443.000000\n", "mean 191793.406162\n", "std 96775.724042\n", "min 14999.000000\n", "25% 116700.000000\n", "50% 173400.000000\n", "75% 247100.000000\n", "max 499100.000000\n", "Name: median_house_value, dtype: float64" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_data['median_house_value'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "and thus we are looking at roughly a 10-20% error in our predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Warning: Evaluating our model's predictions on the same data it was trained on is usually a recipe for disaster! Why? The problem is that the model may memorise the structure of the data it sees and fail to provide good predictions when shown new data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better evaluation using training and validation splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to measure how well a model will generalise to new cases is to split your data into two sets: the _**training set**_ and the _**validation set**_. As these names imply, you train your model using the training set and validate it using the validation set. The error rate on new cases is called the _**generalisation error**_ and by evaluating your model on the validation set, you get an estimate of this error.\n", "\n", "Creating a validation set is theoretically quite simple: just pick some instances randomly and set them aside (we set the random number generator's seed via `random_state` so that it always generates the same shuffled indices):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15554 train rows + 3889 valid rows\n" ] } ], "source": [ "X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "print(f'{len(X_train)} train rows + {len(X_valid)} valid rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With these two datasets, we first fit the model on the training set and then evaluate its predictions on the validation set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', 
max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=-1, oob_score=False,\n", " random_state=42, verbose=0, warm_start=False)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "46106.026083048855" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred = model.predict(X_valid)\n", "rmse(y_valid, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualising the model errors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although numerical scores are useful, deeper insights can often be gained by _visualising_ the errors the model makes. Let's look at two common ways to diagnose _regression_ models." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prediction errors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a sense of how often our model is predicting values that are close to the expected values, we'll plot the actual `median_house_value` labels from the validation set against the predicted values generated by our model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_prediction_error(fitted_model, X, y):\n", " \"\"\"\n", " A utility function to visualise the prediction errors of regression models.\n", " \n", " Args:\n", " fitted_model: A fitted scikit-learn regression model.\n", " X: The feature matrix to generate predictions on.\n", " y: The target vector to compare the predictions against.\n", " \"\"\"\n", " # use the model passed as an argument rather than the global `model`\n", " y_pred = fitted_model.predict(X)\n", " plt.figure(figsize=(8, 4))\n", " # pass x and y by keyword; newer seaborn versions reject positional x/y\n", " sns.scatterplot(x=y, y=y_pred)\n", " # red diagonal marks perfect predictions (predicted == actual)\n", " sns.lineplot(x=[y.min(), y.max()], y=[y.min(), y.max()], lw=2, color=\"r\")\n", " plt.xlabel(\"Actual Median House Price\")\n", " plt.ylabel(\"Predicted Median House Price\")\n", " plt.title(f\"Prediction Error for {fitted_model.__class__.__name__}\")\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_prediction_error(model, X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we’re looking for here is a clear, linear relationship between the predicted and actual values. The red line denotes what could be considered an \"optimal\" model, so we want our points to be bunched around this line. We can see that, apart from a few outliers, the random forest performs fairly well. (In fact those outliers might suggest something is fishy with the data or that these houses are special for reasons not reflected in the data.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Residual plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A residual is the difference between the labeled value and the predicted value for each instance in our dataset: \n", "\n", "$$ \\mathrm{residual} = y_\\mathrm{actual} - y_\\mathrm{predicted} $$\n", "\n", "We can plot residuals to visualize the extent to which our model has captured the behavior of the data. By plotting the residuals for a series of instances, we can check whether they’re consistent with random error; we should not be able to predict the error for any given instance. If the data points appear to be evenly (randomly) dispersed around the plotted line, our model is performing well. In some sense, the resulting plot is a rotated version of our prediction error one above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_residuals(fitted_model, X, y):\n", " '''\n", " A utility function to visualise the residuals of regression models.\n", " \n", " Args:\n", " fitted_model: A fitted scikit-learn regression model.\n", " X: The feature matrix to generate predictions on.\n", " y: The target vector to compare the predictions against. 
\n", " '''\n", " # use the model passed as an argument rather than the global `model`\n", " y_pred = fitted_model.predict(X)\n", "\n", " # pass x and y by keyword; newer seaborn versions reject positional x/y\n", " sns.residplot(x=y_pred, y=y - y_pred)\n", " plt.ylabel('Residuals')\n", " plt.xlabel('Predicted Median House Price')\n", " plt.title(f'Residuals for {fitted_model.__class__.__name__}')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_residuals(model, X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we’re looking for is a mostly symmetrical distribution with points that tend to cluster towards the middle of the plot, ideally around smaller numbers of the y-axis. If we observe some kind of structure that does not coincide with the plotted line, we have failed to capture the behavior of the data and should consider some feature engineering, selecting a new model, or exploring the hyperparameters.\n", "\n", "In the case above, we see that again the outliers suggest some room for improvement with our Random Forest model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### Exercise #4\n", "\n", "Use the `plt.subplots()` functionality from lesson 3 to create a new function `plot_errors_and_residuals` that combines the above plots into a single figure. You may find the `ax.set_xlabel()`, `ax.set_ylabel()`, and `ax.set_title()` methods useful for configuring the labels and title on each individual plot.\n", "\n", "\n", "#### Exercise #5\n", "\n", "Instead of using an ensemble of decision trees, scikit-learn also provides an estimator to train a _single_ decision tree on the data (see documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)). Repeat the same 5 steps above for a decision tree regressor, using the default hyperparameters. Do you notice anything unusual in the performance metrics if you fit the model on the whole dataset?\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }