{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient-boosting decision tree (GBDT)\n", "\n", "In this notebook, we will present the gradient boosting decision tree\n", "algorithm and contrast it with AdaBoost.\n", "\n", "Gradient-boosting differs from AdaBoost due to the following reason: instead\n", "of assigning weights to specific samples, GBDT will fit a decision tree on the\n", "residuals error (hence the name \"gradient\") of the previous tree. Therefore,\n", "each new tree in the ensemble predicts the error made by the previous learner\n", "instead of predicting the target directly.\n", "\n", "In this section, we will provide some intuition about the way learners are\n", "combined to give the final prediction. In this regard, let's go back to our\n", "regression problem which is more intuitive for demonstrating the underlying\n", "machinery." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Create a random number generator that will be used to set the randomness\n", "rng = np.random.RandomState(0)\n", "\n", "\n", "def generate_data(n_samples=50):\n", " \"\"\"Generate synthetic dataset. Returns `data_train`, `data_test`,\n", " `target_train`.\"\"\"\n", " x_max, x_min = 1.4, -1.4\n", " len_x = x_max - x_min\n", " x = rng.rand(n_samples) * len_x - len_x / 2\n", " noise = rng.randn(n_samples) * 0.3\n", " y = x**3 - 0.5 * x**2 + noise\n", "\n", " data_train = pd.DataFrame(x, columns=[\"Feature\"])\n", " data_test = pd.DataFrame(\n", " np.linspace(x_max, x_min, num=300), columns=[\"Feature\"]\n", " )\n", " target_train = pd.Series(y, name=\"Target\")\n", "\n", " return data_train, data_test, target_train\n", "\n", "\n", "data_train, data_test, target_train = generate_data()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "_ = plt.title(\"Synthetic regression dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we previously discussed, boosting will be based on assembling a sequence of\n", "learners. We will start by creating a decision tree regressor. We will set the\n", "depth of the tree so that the resulting learner will underfit the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeRegressor\n", "\n", "tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n", "tree.fit(data_train, target_train)\n", "\n", "target_train_predicted = tree.predict(data_train)\n", "target_test_predicted = tree.predict(data_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the term \"test\" here refers to data that was not used for training. It\n", "should not be confused with data coming from a train-test split, as it was\n", "generated in equally-spaced intervals for the visual evaluation of the\n", "predictions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot the data\n", "sns.scatterplot(\n", " x=data_train[\"Feature\"], y=target_train, color=\"black\", alpha=0.5\n", ")\n", "# plot the predictions\n", "line_predictions = plt.plot(data_test[\"Feature\"], target_test_predicted, \"--\")\n", "\n", "# plot the residuals\n", "for value, true, predicted in zip(\n", " data_train[\"Feature\"], target_train, target_train_predicted\n", "):\n", " lines_residuals = plt.plot([value, value], [true, predicted], color=\"red\")\n", "\n", "plt.legend(\n", " [line_predictions[0], lines_residuals[0]], [\"Fitted tree\", \"Residuals\"]\n", ")\n", "_ = plt.title(\"Prediction function together \\nwith errors on the training set\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
**Tip**\n", "\n", "In the cell above, we manually edited the legend to get only a single label\n", "for all the residual lines.\n", "