{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Early stopping in Gradient Boosting\n\nGradient Boosting is an ensemble technique that combines multiple weak\nlearners, typically decision trees, to create a robust and powerful\npredictive model. It does so in an iterative fashion, where each new stage\n(tree) corrects the errors of the previous ones.\n\nEarly stopping is a technique in Gradient Boosting that allows us to find\nthe optimal number of iterations required to build a model that generalizes\nwell to unseen data and avoids overfitting. The concept is simple: we set\naside a portion of our dataset as a validation set (specified using\n`validation_fraction`) to assess the model's performance during training.\nAs the model is iteratively built with additional stages (trees), its\nperformance on the validation set is monitored as a function of the\nnumber of steps.\n\nEarly stopping becomes effective when the model's performance on the\nvalidation set plateaus or worsens (within deviations specified by `tol`)\nover a certain number of consecutive stages (specified by `n_iter_no_change`).\nThis signals that the model has reached a point where further iterations may\nlead to overfitting, and it's time to stop training.\n\nThe number of estimators (trees) in the final model, when early stopping is\napplied, can be accessed using the `n_estimators_` attribute. Overall, early\nstopping is a valuable tool to strike a balance between model performance and\nefficiency in gradient boosting.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\nFirst we load and prepares the California Housing Prices dataset for\ntraining and evaluation. It subsets the dataset, splits it into training\nand validation sets.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.metrics import mean_squared_error\nfrom sklearn.model_selection import train_test_split\n\ndata = fetch_california_housing()\nX, y = data.data[:600], data.target[:600]\n\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training and Comparison\nTwo :class:`~sklearn.ensemble.GradientBoostingRegressor` models are trained:\none with and another without early stopping. The purpose is to compare their\nperformance. It also calculates the training time and the `n_estimators_`\nused by both models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "params = dict(n_estimators=1000, max_depth=5, learning_rate=0.1, random_state=42)\n\ngbm_full = GradientBoostingRegressor(**params)\ngbm_early_stopping = GradientBoostingRegressor(\n **params,\n validation_fraction=0.1,\n n_iter_no_change=10,\n)\n\nstart_time = time.time()\ngbm_full.fit(X_train, y_train)\ntraining_time_full = time.time() - start_time\nn_estimators_full = gbm_full.n_estimators_\n\nstart_time = time.time()\ngbm_early_stopping.fit(X_train, y_train)\ntraining_time_early_stopping = time.time() - start_time\nestimators_early_stopping = gbm_early_stopping.n_estimators_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error Calculation\nThe code calculates the :func:`~sklearn.metrics.mean_squared_error` for both\ntraining and validation datasets for the models trained in the previous\nsection. It computes the errors for each boosting iteration. The purpose is\nto assess the performance and convergence of the models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_errors_without = []\nval_errors_without = []\n\ntrain_errors_with = []\nval_errors_with = []\n\nfor i, (train_pred, val_pred) in enumerate(\n zip(\n gbm_full.staged_predict(X_train),\n gbm_full.staged_predict(X_val),\n )\n):\n train_errors_without.append(mean_squared_error(y_train, train_pred))\n val_errors_without.append(mean_squared_error(y_val, val_pred))\n\nfor i, (train_pred, val_pred) in enumerate(\n zip(\n gbm_early_stopping.staged_predict(X_train),\n gbm_early_stopping.staged_predict(X_val),\n )\n):\n train_errors_with.append(mean_squared_error(y_train, train_pred))\n val_errors_with.append(mean_squared_error(y_val, val_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize Comparison\nIt includes three subplots:\n\n1. Plotting training errors of both models over boosting iterations.\n2. Plotting validation errors of both models over boosting iterations.\n3. Creating a bar chart to compare the training times and the estimator used\n of the models with and without early stopping.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, axes = plt.subplots(ncols=3, figsize=(12, 4))\n\naxes[0].plot(train_errors_without, label=\"gbm_full\")\naxes[0].plot(train_errors_with, label=\"gbm_early_stopping\")\naxes[0].set_xlabel(\"Boosting Iterations\")\naxes[0].set_ylabel(\"MSE (Training)\")\naxes[0].set_yscale(\"log\")\naxes[0].legend()\naxes[0].set_title(\"Training Error\")\n\naxes[1].plot(val_errors_without, label=\"gbm_full\")\naxes[1].plot(val_errors_with, label=\"gbm_early_stopping\")\naxes[1].set_xlabel(\"Boosting Iterations\")\naxes[1].set_ylabel(\"MSE (Validation)\")\naxes[1].set_yscale(\"log\")\naxes[1].legend()\naxes[1].set_title(\"Validation Error\")\n\ntraining_times = [training_time_full, training_time_early_stopping]\nlabels = [\"gbm_full\", \"gbm_early_stopping\"]\nbars = axes[2].bar(labels, training_times)\naxes[2].set_ylabel(\"Training Time (s)\")\n\nfor bar, n_estimators in zip(bars, [n_estimators_full, estimators_early_stopping]):\n height = bar.get_height()\n axes[2].text(\n bar.get_x() + bar.get_width() / 2,\n height + 0.001,\n f\"Estimators: {n_estimators}\",\n ha=\"center\",\n va=\"bottom\",\n )\n\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference in training error between the `gbm_full` and the\n`gbm_early_stopping` stems from the fact that `gbm_early_stopping` sets\naside `validation_fraction` of the training data as internal validation set.\nEarly stopping is decided based on this internal validation score.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\nIn our example with the :class:`~sklearn.ensemble.GradientBoostingRegressor`\nmodel on the California Housing Prices dataset, we have demonstrated the\npractical benefits of early stopping:\n\n- **Preventing Overfitting:** We showed how the validation error stabilizes\n or starts to increase after a certain point, indicating that the model\n generalizes better to unseen data. This is achieved by stopping the training\n process before overfitting occurs.\n- **Improving Training Efficiency:** We compared training times between\n models with and without early stopping. The model with early stopping\n achieved comparable accuracy while requiring significantly fewer\n estimators, resulting in faster training.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }