{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient-boosting decision tree\n", "\n", "In this notebook, we present the gradient boosting decision tree (GBDT) algorithm.\n", "\n", "Even if AdaBoost and GBDT are both boosting algorithms, they are different in\n", "nature: the former assigns weights to specific samples, whereas GBDT fits\n", "successive decision trees on the residual errors (hence the name \"gradient\") of\n", "their preceding tree. Therefore, each new tree in the ensemble tries to refine\n", "its predictions by specifically addressing the errors made by the previous\n", "learner, instead of predicting the target directly.\n", "\n", "In this section, we provide some intuitions on the way learners are combined\n", "to give the final prediction. For such purpose, we tackle a single-feature\n", "regression problem, which is more intuitive for demonstrating the underlying\n", "machinery.\n", "\n", "Later in this notebook we compare the performance of GBDT (boosting) with that\n", "of a Random Forest (bagging) for a particular dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "\n", "def generate_data(n_samples=50):\n", " \"\"\"Generate synthetic dataset. Returns `data_train`, `data_test`,\n", " `target_train`.\"\"\"\n", " x_max, x_min = 1.4, -1.4\n", " rng = np.random.default_rng(0) # Create a random number generator\n", " x = rng.uniform(x_min, x_max, size=(n_samples,))\n", " noise = rng.normal(size=(n_samples,)) * 0.3\n", " y = x**3 - 0.5 * x**2 + noise\n", "\n", " data_train = pd.DataFrame(x, columns=[\"Feature\"])\n", " data_test = pd.DataFrame(\n", " np.linspace(x_max, x_min, num=300), columns=[\"Feature\"]\n", " )\n", " target_train = pd.Series(y, name=\"Target\")\n", "\n", " return data_train, data_test, target_train\n", "\n", "\n", "data_train, data_test, target_train = generate_data()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "_ = plt.title(\"Synthetic regression dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we previously discussed, boosting is based on assembling a sequence of\n", "learners. We start by creating a decision tree regressor. We set the depth of\n", "the tree to underfit the data on purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n", "tree.fit(data_train, target_train)\n", "\n", "target_train_predicted = tree.predict(data_train)\n", "target_test_predicted = tree.predict(data_test)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Using the term \"test\" here refers to data not used for training. It should not\n", "be confused with data coming from a train-test split, as it was generated in\n", "equally-spaced intervals for the visual evaluation of the predictions.\n", "\n", "To avoid writing the same code in multiple places we define a helper function\n", "to plot the data samples as well as the decision tree predictions and\n", "residuals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_decision_tree_with_residuals(y_train, y_train_pred, y_test_pred):\n", " \"\"\"Plot the synthetic data, predictions, and residuals for a decision tree.\n", " Handles are returned to allow custom legends for the plot.\"\"\"\n", " _fig_, ax = plt.subplots()\n", " # plot the data\n", " sns.scatterplot(\n", " x=data_train[\"Feature\"], y=y_train, color=\"black\", alpha=0.5, ax=ax\n", " )\n", " # plot the predictions\n", " line_predictions = ax.plot(data_test[\"Feature\"], y_test_pred, \"--\")\n", "\n", " # plot the residuals\n", " for value, true, predicted in zip(\n", " data_train[\"Feature\"], y_train, y_train_pred\n", " ):\n", " lines_residuals = ax.plot(\n", " [value, value], [true, predicted], color=\"red\"\n", " )\n", "\n", " handles = [line_predictions[0], lines_residuals[0]]\n", "\n", " return handles, ax" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "handles, ax = plot_decision_tree_with_residuals(\n", " target_train, target_train_predicted, target_test_predicted\n", ")\n", "legend_labels = [\"Initial decision tree\", \"Initial residuals\"]\n", "ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = ax.set_title(\"Decision Tree together \\nwith errors on the training set\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Tip

\n", "

In the cell above, we manually edited the legend to get only a single label\n", "for all the residual lines.

\n", "
\n", "Since the tree underfits the data, its accuracy is far from perfect on the\n", "training data. We can observe this in the figure above by looking at the\n", "difference between the predictions and the ground-truth data. We represent\n", "these errors, called \"residuals\", using solid red lines.\n", "\n", "Indeed, our initial tree is not expressive enough to handle the complexity of\n", "the data, as shown by the residuals. In a gradient-boosting algorithm, the\n", "idea is to create a second tree which, given the same `data`, tries to predict\n", "the residuals instead of the vector `target`, i.e. we have a second tree that\n", "is able to predict the errors made by the initial tree.\n", "\n", "Let's train such a tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "residuals = target_train - target_train_predicted\n", "\n", "tree_residuals = DecisionTreeRegressor(max_depth=5, random_state=0)\n", "tree_residuals.fit(data_train, residuals)\n", "\n", "target_train_predicted_residuals = tree_residuals.predict(data_train)\n", "target_test_predicted_residuals = tree_residuals.predict(data_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "handles, ax = plot_decision_tree_with_residuals(\n", " residuals,\n", " target_train_predicted_residuals,\n", " target_test_predicted_residuals,\n", ")\n", "legend_labels = [\n", " \"Predicted residuals\",\n", " \"Residuals of the\\npredicted residuals\",\n", "]\n", "ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = ax.set_title(\"Prediction of the initial residuals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that this new tree only manages to fit some of the residuals. We now\n", "focus on a specific sample from the training set (as we know that the sample\n", "can be well predicted using two successive trees). We will use this sample to\n", "explain how the predictions of both trees are combined. Let's first select\n", "this sample in `data_train`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = data_train.iloc[[-7]]\n", "x_sample = sample[\"Feature\"].iloc[0]\n", "target_true = target_train.iloc[-7]\n", "target_true_residual = residuals.iloc[-7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the original data, the predictions of the initial decision tree and\n", "highlight our sample of interest, i.e. this is just a zoom of the plot\n", "displaying the initial shallow tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "handles, ax = plot_decision_tree_with_residuals(\n", " target_train, target_train_predicted, target_test_predicted\n", ")\n", "ax.scatter(\n", " sample, target_true, label=\"Sample of interest\", color=\"tab:orange\", s=200\n", ")\n", "ax.set_xlim([-1, 0])\n", "ax.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = ax.set_title(\"Zoom of sample of interest\\nin the initial decision tree\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly we plot a zoom of the plot with the prediction of the initial residuals" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "handles, ax = plot_decision_tree_with_residuals(\n", " residuals,\n", " target_train_predicted_residuals,\n", " target_test_predicted_residuals,\n", ")\n", "plt.scatter(\n", " sample,\n", " target_true_residual,\n", " label=\"Sample of interest\",\n", " color=\"tab:orange\",\n", " s=200,\n", ")\n", "legend_labels = [\n", " \"Predicted residuals\",\n", " \"Residuals of the\\npredicted residuals\",\n", "]\n", "ax.set_xlim([-1, 0])\n", "ax.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = ax.set_title(\"Zoom of sample of interest\\nin the initial residuals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our sample of interest, our initial tree is making an error (small\n", "residual). When fitting the second tree, the residual in this case is\n", "perfectly fitted and predicted. We can quantitatively check this prediction\n", "using the fitted tree. First, let's check the prediction of the initial tree\n", "and compare it with the true value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"True value to predict for f(x={x_sample:.3f}) = {target_true:.3f}\")\n", "\n", "y_pred_first_tree = tree.predict(sample)[0]\n", "print(\n", " f\"Prediction of the first decision tree for x={x_sample:.3f}: \"\n", " f\"y={y_pred_first_tree:.3f}\"\n", ")\n", "print(f\"Error of the tree: {target_true - y_pred_first_tree:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we visually observed, we have a small error. Now, we can use the second\n", "tree to try to predict this residual." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"Prediction of the residual for x={x_sample:.3f}: \"\n", " f\"{tree_residuals.predict(sample)[0]:.3f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that our second tree is capable of predicting the exact residual\n", "(error) of our first tree. Therefore, we can predict the value of `x` by\n", "summing the prediction of all the trees in the ensemble." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_first_and_second_tree = (\n", " y_pred_first_tree + tree_residuals.predict(sample)[0]\n", ")\n", "print(\n", " \"Prediction of the first and second decision trees combined for \"\n", " f\"x={x_sample:.3f}: y={y_pred_first_and_second_tree:.3f}\"\n", ")\n", "print(f\"Error of the tree: {target_true - y_pred_first_and_second_tree:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose a sample for which only two trees were enough to make the perfect\n", "prediction. However, we saw in the previous plot that two trees were not\n", "enough to correct the residuals of all samples. Therefore, one needs to add\n", "several trees to the ensemble to successfully correct the error (i.e. the\n", "second tree corrects the first tree's error, while the third tree corrects the\n", "second tree's error and so on).\n", "\n", "## First comparison of GBDT vs. random forests\n", "\n", "We now compare the generalization performance of random-forest and gradient\n", "boosting on the California housing dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import cross_validate\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "gradient_boosting = GradientBoostingRegressor(n_estimators=200)\n", "cv_results_gbdt = cross_validate(\n", " gradient_boosting,\n", " data,\n", " target,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Gradient Boosting Decision Tree\")\n", "print(\n", " \"Mean absolute error via cross-validation: \"\n", " f\"{-cv_results_gbdt['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results_gbdt['test_score'].std():.3f} k$\"\n", ")\n", "print(f\"Average fit time: {cv_results_gbdt['fit_time'].mean():.3f} seconds\")\n", "print(\n", " f\"Average score time: {cv_results_gbdt['score_time'].mean():.3f} seconds\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "random_forest = RandomForestRegressor(n_estimators=200, n_jobs=2)\n", "cv_results_rf = cross_validate(\n", " random_forest,\n", " data,\n", " target,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Random Forest\")\n", "print(\n", " \"Mean absolute error via cross-validation: \"\n", " f\"{-cv_results_rf['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results_rf['test_score'].std():.3f} k$\"\n", ")\n", "print(f\"Average fit time: {cv_results_rf['fit_time'].mean():.3f} seconds\")\n", "print(f\"Average score time: {cv_results_rf['score_time'].mean():.3f} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In terms of computing performance, the forest can be parallelized and then\n", "benefit from using multiple cores of the CPU. In terms of scoring performance,\n", "both algorithms lead to very close results.\n", "\n", "However, we see that gradient boosting is overall faster than random forest.\n", "One of the reasons is that random forests typically rely on deep trees (that\n", "overfit individually) whereas boosting models build shallow trees (that\n", "underfit individually) which are faster to fit and predict. In the following\n", "exercise we will explore more in depth how these two models compare." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }