{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient-boosting decision tree (GBDT)\n", "\n", "In this notebook, we will present the gradient boosting decision tree\n", "algorithm and contrast it with AdaBoost.\n", "\n", "Gradient-boosting differs from AdaBoost in the following way: instead of\n", "assigning weights to specific samples, GBDT fits a decision tree on the\n", "residual errors (hence the name \"gradient\") of the previous tree. Therefore,\n", "each new tree in the ensemble predicts the error made by the previous learner\n", "instead of predicting the target directly.\n", "\n", "In this section, we will provide some intuition about the way learners are\n", "combined to give the final prediction. In this regard, let's go back to our\n", "regression problem, which is more intuitive for demonstrating the underlying\n", "machinery." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Create a random number generator that will be used to set the randomness\n", "rng = np.random.RandomState(0)\n", "\n", "\n", "def generate_data(n_samples=50):\n", " \"\"\"Generate synthetic dataset. Returns `data_train`, `data_test`,\n", " `target_train`.\"\"\"\n", " x_max, x_min = 1.4, -1.4\n", " len_x = x_max - x_min\n", " x = rng.rand(n_samples) * len_x - len_x / 2\n", " noise = rng.randn(n_samples) * 0.3\n", " y = x**3 - 0.5 * x**2 + noise\n", "\n", " data_train = pd.DataFrame(x, columns=[\"Feature\"])\n", " data_test = pd.DataFrame(\n", " np.linspace(x_max, x_min, num=300), columns=[\"Feature\"]\n", " )\n", " target_train = pd.Series(y, name=\"Target\")\n", "\n", " return data_train, data_test, target_train\n", "\n", "\n", "data_train, data_test, target_train = generate_data()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "_ = plt.title(\"Synthetic regression dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we previously discussed, boosting is based on assembling a sequence of\n", "learners. We will start by creating a decision tree regressor. We will set the\n", "depth of the tree so that the resulting learner underfits the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n", "tree.fit(data_train, target_train)\n", "\n", "target_train_predicted = tree.predict(data_train)\n", "target_test_predicted = tree.predict(data_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The term \"test\" here refers to data that was not used for training. It\n", "should not be confused with data coming from a train-test split, as it was\n", "generated at equally-spaced intervals for the visual evaluation of the\n", "predictions."
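] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting the predictions, we can quantify, as an optional check added\n", "here for illustration, how far this shallow tree is from the training data\n", "using the mean absolute error, the same metric used for cross-validation at\n", "the end of this notebook. The variable name `train_error` is only\n", "illustrative." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "# Quantify the underfitting of the shallow tree on the training set\n", "train_error = mean_absolute_error(target_train, target_train_predicted)\n", "print(f\"Mean absolute error on the training set: {train_error:.3f}\")"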
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the data\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "# plot the predictions\n", "line_predictions = plt.plot(data_test[\"Feature\"], target_test_predicted, \"--\")\n", "\n", "# plot the residuals\n", "for value, true, predicted in zip(\n", " data_train[\"Feature\"], target_train, target_train_predicted\n", "):\n", " lines_residuals = plt.plot([value, value], [true, predicted], color=\"red\")\n", "\n", "plt.legend(\n", " [line_predictions[0], lines_residuals[0]], [\"Fitted tree\", \"Residuals\"]\n", ")\n", "_ = plt.title(\"Prediction function together \\nwith errors on the training set\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"admonition tip alert alert-warning\">\n", "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n", "<p class=\"last\">In the cell above, we manually edited the legend to get only a single label\n", "for all the residual lines.</p>\n", "</div>
\n", "Since the tree underfits the data, its accuracy is far from perfect on the\n", "training data. We can observe this in the figure by looking at the difference\n", "between the predictions and the ground-truth data. We represent these errors,\n", "called \"Residuals\", by unbroken red lines.\n", "\n", "Indeed, our initial tree was not expressive enough to handle the complexity of\n", "the data, as shown by the residuals. In a gradient-boosting algorithm, the\n", "idea is to create a second tree which, given the same data `data`, will try to\n", "predict the residuals instead of the vector `target`. We would therefore have\n", "a tree that is able to predict the errors made by the initial tree.\n", "\n", "Let's train such a tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "residuals = target_train - target_train_predicted\n", "\n", "tree_residuals = DecisionTreeRegressor(max_depth=5, random_state=0)\n", "tree_residuals.fit(data_train, residuals)\n", "\n", "target_train_predicted_residuals = tree_residuals.predict(data_train)\n", "target_test_predicted_residuals = tree_residuals.predict(data_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=data_train[\"Feature\"], y=residuals, color=\"black\", alpha=0.5)\n", "line_predictions = plt.plot(\n", " data_test[\"Feature\"], target_test_predicted_residuals, \"--\"\n", ")\n", "\n", "# plot the residuals of the predicted residuals\n", "for value, true, predicted in zip(\n", " data_train[\"Feature\"], residuals, target_train_predicted_residuals\n", "):\n", " lines_residuals = plt.plot([value, value], [true, predicted], color=\"red\")\n", "\n", "plt.legend(\n", " [line_predictions[0], lines_residuals[0]],\n", " [\"Fitted tree\", \"Residuals\"],\n", " bbox_to_anchor=(1.05, 0.8),\n", " loc=\"upper left\",\n", ")\n", "_ = plt.title(\"Prediction of the previous residuals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that this new tree only manages to fit some of the residuals. We will\n", "focus on a specific sample from the training set (i.e. we know that the sample\n", "will be well predicted using two successive trees). We will use this sample to\n", "explain how the predictions of both trees are combined. Let's first select\n", "this sample in `data_train`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = data_train.iloc[[-2]]\n", "x_sample = sample[\"Feature\"].iloc[0]\n", "target_true = target_train.iloc[-2]\n", "target_true_residual = residuals.iloc[-2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the previous information and highlight our sample of interest.\n", "Let's start by plotting the original data and the prediction of the first\n", "decision tree." 
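] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numeric view of this sample before the plots (an optional addition\n", "that only reuses the variables defined in the previous cell), we can print its\n", "feature value, its target, and the residual left by the first tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the values that the next plots highlight (optional sanity check)\n", "print(f\"Feature value of the sample of interest: {x_sample:.3f}\")\n", "print(f\"True target for this sample: {target_true:.3f}\")\n", "print(f\"Residual left by the first tree: {target_true_residual:.3f}\")"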
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the previous information:\n", "# * the dataset\n", "# * the predictions\n", "# * the residuals\n", "\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "plt.plot(data_test[\"Feature\"], target_test_predicted, \"--\")\n", "for value, true, predicted in zip(\n", " data_train[\"Feature\"], target_train, target_train_predicted\n", "):\n", " lines_residuals = plt.plot([value, value], [true, predicted], color=\"red\")\n", "\n", "# Highlight the sample of interest\n", "plt.scatter(\n", " sample, target_true, label=\"Sample of interest\", color=\"tab:orange\", s=200\n", ")\n", "plt.xlim([-1, 0])\n", "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = plt.title(\"Tree predictions\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's plot the residuals information. We will plot the residuals computed\n", "from the first decision tree and show the residual predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the previous information:\n", "# * the residuals committed by the first tree\n", "# * the residual predictions\n", "# * the residuals of the residual predictions\n", "\n", "sns.scatterplot(x=data_train[\"Feature\"], y=residuals, color=\"black\", alpha=0.5)\n", "plt.plot(data_test[\"Feature\"], target_test_predicted_residuals, \"--\")\n", "for value, true, predicted in zip(\n", " data_train[\"Feature\"], residuals, target_train_predicted_residuals\n", "):\n", " lines_residuals = plt.plot([value, value], [true, predicted], color=\"red\")\n", "\n", "# Highlight the sample of interest\n", "plt.scatter(\n", " sample,\n", " target_true_residual,\n", " label=\"Sample of interest\",\n", " color=\"tab:orange\",\n", " s=200,\n", ")\n", "plt.xlim([-1, 0])\n", "plt.legend()\n", "_ = plt.title(\"Prediction of the residuals\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our sample of interest, our initial tree is making an error (small\n", "residual). When fitting the second tree, the residual in this case is\n", "perfectly fitted and predicted. We will quantitatively check this prediction\n", "using the fitted tree. First, let's check the prediction of the initial tree\n", "and compare it with the true value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"True value to predict for f(x={x_sample:.3f}) = {target_true:.3f}\")\n", "\n", "y_pred_first_tree = tree.predict(sample)[0]\n", "print(\n", " f\"Prediction of the first decision tree for x={x_sample:.3f}: \"\n", " f\"y={y_pred_first_tree:.3f}\"\n", ")\n", "print(f\"Error of the tree: {target_true - y_pred_first_tree:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we visually observed, we have a small error. Now, we can use the second\n", "tree to try to predict this residual." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"Prediction of the residual for x={x_sample:.3f}: \"\n", " f\"{tree_residuals.predict(sample)[0]:.3f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that our second tree is capable of predicting the exact residual\n", "(error) of our first tree. Therefore, we can predict the value of `x` by\n", "summing the prediction of all the trees in the ensemble." 
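] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This construction generalizes to any number of trees: each new tree is fit on\n", "the residuals left by the current ensemble, and the final prediction is the\n", "sum of the predictions of all trees. The cell below is a minimal sketch of\n", "this loop (an illustration assuming a squared-error loss and no shrinkage,\n", "with made-up names such as `boosted_trees`; it is not scikit-learn's actual\n", "implementation). The cell after it applies this idea to our sample of\n", "interest by summing the predictions of our two trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch of the boosting loop: repeatedly fit a tree on the residuals\n", "# of the current ensemble (squared-error loss, no learning rate). Names are\n", "# illustrative and not reused in the rest of the notebook.\n", "boosted_trees = []\n", "remaining_residuals = target_train.copy()\n", "for _ in range(3):\n", "    new_tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n", "    new_tree.fit(data_train, remaining_residuals)\n", "    boosted_trees.append(new_tree)\n", "    # keep only what the ensemble still gets wrong\n", "    remaining_residuals = remaining_residuals - new_tree.predict(data_train)\n", "\n", "# the ensemble prediction is the sum of the individual tree predictions\n", "ensemble_predictions = sum(t.predict(data_test) for t in boosted_trees)"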
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_first_and_second_tree = (\n", " y_pred_first_tree + tree_residuals.predict(sample)[0]\n", ")\n", "print(\n", " \"Prediction of the first and second decision trees combined for \"\n", " f\"x={x_sample:.3f}: y={y_pred_first_and_second_tree:.3f}\"\n", ")\n", "print(f\"Error of the tree: {target_true - y_pred_first_and_second_tree:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We chose a sample for which only two trees were enough to make the perfect\n", "prediction. However, we saw in the previous plot that two trees were not\n", "enough to correct the residuals of all samples. Therefore, one needs to add\n", "several trees to the ensemble to successfully correct the error (i.e. the\n", "second tree corrects the first tree's error, the third tree corrects the\n", "remaining error of the first two trees, and so on).\n", "\n", "We will now compare the generalization performance of random forest and\n", "gradient boosting on the California housing dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import cross_validate\n", "\n", "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n", "target *= 100 # rescale the target in k$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "gradient_boosting = GradientBoostingRegressor(n_estimators=200)\n", "cv_results_gbdt = cross_validate(\n", " gradient_boosting,\n", " data,\n", " target,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Gradient Boosting Decision Tree\")\n", "print(\n", " \"Mean absolute error via cross-validation: \"\n", " f\"{-cv_results_gbdt['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results_gbdt['test_score'].std():.3f} k$\"\n", ")\n", "print(f\"Average fit time: {cv_results_gbdt['fit_time'].mean():.3f} seconds\")\n", "print(\n", " f\"Average score time: {cv_results_gbdt['score_time'].mean():.3f} seconds\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "random_forest = RandomForestRegressor(n_estimators=200, n_jobs=2)\n", "cv_results_rf = cross_validate(\n", " random_forest,\n", " data,\n", " target,\n", " scoring=\"neg_mean_absolute_error\",\n", " n_jobs=2,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Random Forest\")\n", "print(\n", " \"Mean absolute error via cross-validation: \"\n", " f\"{-cv_results_rf['test_score'].mean():.3f} \u00b1 \"\n", " f\"{cv_results_rf['test_score'].std():.3f} k$\"\n", ")\n", "print(f\"Average fit time: {cv_results_rf['fit_time'].mean():.3f} seconds\")\n", "print(f\"Average score time: {cv_results_rf['score_time'].mean():.3f} seconds\")" ] }
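, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional aside (an addition to the original comparison, assuming the\n", "default `max_depth` of `GradientBoostingRegressor`), we can fit the\n", "gradient-boosting model once on the full dataset and inspect the depth of its\n", "individual trees; their shallowness relates to the prediction-speed\n", "discussion below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: the individual trees of the gradient-boosting model stay\n", "# shallow, unlike the fully grown trees of the random forest\n", "gradient_boosting.fit(data, target)\n", "tree_depths = [t.get_depth() for t in gradient_boosting.estimators_.ravel()]\n", "print(f\"Number of trees: {len(tree_depths)}\")\n", "print(f\"Maximum depth of the individual trees: {max(tree_depths)}\")" ] }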
, { "cell_type": "markdown", "metadata": {}, "source": [ "In terms of computational performance, the random forest can be parallelized\n", "and benefits from using multiple cores of the CPU. In terms of scoring\n", "performance, both algorithms lead to very close results.\n", "\n", "However, gradient boosting is faster at prediction time than the random\n", "forest because it uses shallow trees. In the next notebook, we will go into\n", "the details of the hyperparameters to consider when optimizing ensemble\n", "methods." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }