{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient-boosting decision tree\n", "\n", "In this notebook, we present the gradient boosting decision tree (GBDT) algorithm.\n", "\n", "Even if AdaBoost and GBDT are both boosting algorithms, they are different in\n", "nature: the former assigns weights to specific samples, whereas GBDT fits\n", "successive decision trees on the residual errors (hence the name \"gradient\") of\n", "their preceding tree. Therefore, each new tree in the ensemble tries to refine\n", "its predictions by specifically addressing the errors made by the previous\n", "learner, instead of predicting the target directly.\n", "\n", "In this section, we provide some intuitions on the way learners are combined\n", "to give the final prediction. For such purpose, we tackle a single-feature\n", "regression problem, which is more intuitive for demonstrating the underlying\n", "machinery.\n", "\n", "Later in this notebook we compare the performance of GBDT (boosting) with that\n", "of a Random Forest (bagging) for a particular dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "\n", "def generate_data(n_samples=50):\n", " \"\"\"Generate synthetic dataset. Returns `data_train`, `data_test`,\n", " `target_train`.\"\"\"\n", " x_max, x_min = 1.4, -1.4\n", " rng = np.random.default_rng(0) # Create a random number generator\n", " x = rng.uniform(x_min, x_max, size=(n_samples,))\n", " noise = rng.normal(size=(n_samples,)) * 0.3\n", " y = x**3 - 0.5 * x**2 + noise\n", "\n", " data_train = pd.DataFrame(x, columns=[\"Feature\"])\n", " data_test = pd.DataFrame(\n", " np.linspace(x_max, x_min, num=300), columns=[\"Feature\"]\n", " )\n", " target_train = pd.Series(y, name=\"Target\")\n", "\n", " return data_train, data_test, target_train\n", "\n", "\n", "data_train, data_test, target_train = generate_data()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "_ = plt.title(\"Synthetic regression dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we previously discussed, boosting is based on assembling a sequence of\n", "learners. We start by creating a decision tree regressor. We set the depth of\n", "the tree to underfit the data on purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n", "tree.fit(data_train, target_train)\n", "\n", "target_train_predicted = tree.predict(data_train)\n", "target_test_predicted = tree.predict(data_test)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Using the term \"test\" here refers to data not used for training. It should not\n", "be confused with data coming from a train-test split, as it was generated in\n", "equally-spaced intervals for the visual evaluation of the predictions.\n", "\n", "To avoid writing the same code in multiple places we define a helper function\n", "to plot the data samples as well as the decision tree predictions and\n", "residuals." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_decision_tree_with_residuals(y_train, y_train_pred, y_test_pred):\n", " \"\"\"Plot the synthetic data, predictions, and residuals for a decision tree.\n", " Handles are returned to allow custom legends for the plot.\"\"\"\n", " _fig_, ax = plt.subplots()\n", " # plot the data\n", " sns.scatterplot(\n", " x=data_train[\"Feature\"], y=y_train, color=\"black\", alpha=0.5, ax=ax\n", " )\n", " # plot the predictions\n", " line_predictions = ax.plot(data_test[\"Feature\"], y_test_pred, \"--\")\n", "\n", " # plot the residuals\n", " for value, true, predicted in zip(\n", " data_train[\"Feature\"], y_train, y_train_pred\n", " ):\n", " lines_residuals = ax.plot(\n", " [value, value], [true, predicted], color=\"red\"\n", " )\n", "\n", " handles = [line_predictions[0], lines_residuals[0]]\n", "\n", " return handles, ax" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "handles, ax = plot_decision_tree_with_residuals(\n", " target_train, target_train_predicted, target_test_predicted\n", ")\n", "legend_labels = [\"Initial decision tree\", \"Initial residuals\"]\n", "ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n", "_ = ax.set_title(\"Decision Tree together \\nwith errors on the training set\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
**Tip**\n", "\n", "In the cell above, we manually edited the legend to get only a single label\n", "for all the residual lines.
\n", "