{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "435a256a-7d50-410e-9bfa-b303d4eaff95", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "id": "4bb18763", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "e44880af", "metadata": {}, "source": [ "# Gradient boosting example" ] }, { "cell_type": "markdown", "id": "50b29d0b", "metadata": {}, "source": [ "## What is gradient boosting?\n", "\n", "If you're in the world of machine learning, you have surely heard of gradient boosting algorithms such as XGBoost or LightGBM. Indeed, gradient boosting represents the state of the art for many machine learning tasks, but how does it work? We'll try to answer this question specifically for the case of gradient boosting with trees, which is by far the most popular variant today.\n", "\n", "For me, understanding gradient boosting is all about understanding its link to gradient descent. Recall that gradient descent is the algorithm that minimizes a loss function $L(\\theta)$ by subtracting the gradient from the parameters (the learning rate is omitted here for simplicity)\n", "\n", "$$\n", "\\theta = \\theta - \\frac{\\partial L(\\theta)}{\\partial \\theta}\n", "$$" ] }, { "cell_type": "markdown", "id": "f8600839", "metadata": {}, "source": [ "At first glance it doesn't make much sense; trees are built with a split-gain function (Gini, entropy), not a loss function. Moreover, what would $\\theta$ be?\n", "\n", "Recall that $\\theta = (\\theta_1,\\ldots,\\theta_n)$ are the learned parameters we use for making predictions, and they are also the arguments of the loss function. In gradient boosting, we instead consider the loss function as a function of the predictions: we want to find $\\min_{p}L(y,p)$, and the way to achieve that is analogous to gradient descent, i.e., updating the predictions in the opposite direction of the gradients. But how can you update the predictions? After all, once a tree is built it has a fixed structure.\n", "\n", "Here comes the idea of additive modeling, in which the predictions of several models are added (sequentially, in this case) to get better performance. The first tree makes predictions on the original target; then, with the second tree, we try to minimize the loss function by adding something which is not exactly minus the loss gradient, as in gradient descent, but instead *predictions of the loss gradient*, i.e., we fit a tree whose target is the loss gradient. So, if $X, y$ represent the original data and $p$ the current predictions (initially those of the first tree), the update is\n", "\n", "$$\n", "p = p-\\text{predict}\\left(X,\\dfrac{\\partial L(y,p)}{\\partial p}\\right)\n", "$$ " ] }, { "cell_type": "markdown", "id": "892cfad9", "metadata": {}, "source": [ "This equation is the *gradient descent* version of trees. Maybe this is nothing new to you, but I hope it gives you a better or new way of seeing gradient boosting."
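, "\n", "The update rule above can be sketched in a few lines of code before we move to real data. This is a minimal illustration, not the implementation used later in this notebook: it assumes the squared-error loss $L(y,p)=\\frac{1}{2}(y-p)^2$, whose gradient with respect to $p$ is $p-y$, so fitting a tree on the negative gradient means fitting it on the residuals $y-p$. A small shrinkage factor of 0.1, which the equation above omits, is included for stability.\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.tree import DecisionTreeRegressor\n", "\n", "rng = np.random.default_rng(0)\n", "X = rng.uniform(-3, 3, size=(200, 1))\n", "y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)\n", "\n", "# Start from a constant prediction: the mean minimizes squared error\n", "p = np.full_like(y, y.mean())\n", "mse0 = np.mean((y - p) ** 2)\n", "\n", "trees = []\n", "for _ in range(50):\n", "    grad = p - y                   # dL/dp for L = 0.5 * (y - p)^2\n", "    tree = DecisionTreeRegressor(max_depth=2)\n", "    tree.fit(X, -grad)             # fit the new tree on the negative gradient\n", "    p = p + 0.1 * tree.predict(X)  # p = p - lr * predict(X, grad)\n", "    trees.append(tree)\n", "\n", "print(mse0, np.mean((y - p) ** 2))  # the training error shrinks as trees are added\n", "```\n", "\n", "Each iteration plays the role of one gradient descent step, except that the step is *predicted* by a tree rather than computed exactly."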
] }, { "cell_type": "markdown", "id": "4b3d6fd9", "metadata": {}, "source": [ "## Implementing gradient boosting\n", "\n", "Now let's get our hands on some data and see a working example." ] }, { "cell_type": "code", "execution_count": 2, "id": "9e0da69f", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n", "from sklearn.metrics import f1_score, roc_auc_score\n", "import matplotlib.pyplot as plt\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore')\n", "seed = 1301" ] }, { "cell_type": "markdown", "id": "db2cfc66", "metadata": {}, "source": [ "We will work with a modified version of our familiar Titanic dataset. This is a ready-to-train version: it contains no missing values, and the categorical features have already been encoded." ] }, { "cell_type": "code", "execution_count": 3, "id": "993b0cca", "metadata": { "attributes": { "classes": [ "code-cell" ], "id": "" }, "tags": [ "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "
| \n", " | Passengerid | \n", "Age | \n", "Fare | \n", "Sex | \n", "sibsp | \n", "zero | \n", "zero.1 | \n", "zero.2 | \n", "zero.3 | \n", "zero.4 | \n", "... | \n", "zero.12 | \n", "zero.13 | \n", "zero.14 | \n", "Pclass | \n", "zero.15 | \n", "zero.16 | \n", "Embarked | \n", "zero.17 | \n", "zero.18 | \n", "2urvived | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "1 | \n", "22.0 | \n", "7.2500 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "2.0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 1 | \n", "2 | \n", "38.0 | \n", "71.2833 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0 | \n", "1 | \n", "
| 2 | \n", "3 | \n", "26.0 | \n", "7.9250 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "2.0 | \n", "0 | \n", "0 | \n", "1 | \n", "
| 3 | \n", "4 | \n", "35.0 | \n", "53.1000 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "2.0 | \n", "0 | \n", "0 | \n", "1 | \n", "
| 4 | \n", "5 | \n", "35.0 | \n", "8.0500 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "2.0 | \n", "0 | \n", "0 | \n", "0 | \n", "
5 rows × 28 columns
\n", "