{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Gradient Boosting regularization\n\nIllustration of the effect of different regularization strategies\nfor Gradient Boosting. The example is taken from Hastie et al 2009 [1]_.\n\nThe loss function used is binomial deviance. Regularization via\nshrinkage (``learning_rate < 1.0``) improves performance considerably.\nIn combination with shrinkage, stochastic gradient boosting\n(``subsample < 1.0``) can produce more accurate models by reducing the\nvariance via bagging.\nSubsampling without shrinkage usually does poorly.\nAnother strategy to reduce the variance is by subsampling the features\nanalogous to the random splits in Random Forests\n(via the ``max_features`` parameter).\n\n.. [1] T. Hastie, R. Tibshirani and J. Friedman, \"Elements of Statistical\n Learning Ed. 2\", Springer, 2009.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn import datasets, ensemble\nfrom sklearn.metrics import log_loss\nfrom sklearn.model_selection import train_test_split\n\nX, y = datasets.make_hastie_10_2(n_samples=4000, random_state=1)\n\n# map labels from {-1, 1} to {0, 1}\nlabels, y = np.unique(y, return_inverse=True)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)\n\noriginal_params = {\n \"n_estimators\": 400,\n \"max_leaf_nodes\": 4,\n \"max_depth\": None,\n \"random_state\": 2,\n \"min_samples_split\": 5,\n}\n\nplt.figure()\n\nfor label, color, setting in [\n (\"No shrinkage\", \"orange\", {\"learning_rate\": 1.0, \"subsample\": 1.0}),\n (\"learning_rate=0.2\", \"turquoise\", {\"learning_rate\": 0.2, \"subsample\": 1.0}),\n (\"subsample=0.5\", \"blue\", {\"learning_rate\": 1.0, \"subsample\": 0.5}),\n (\n \"learning_rate=0.2, subsample=0.5\",\n \"gray\",\n {\"learning_rate\": 0.2, \"subsample\": 0.5},\n ),\n (\n \"learning_rate=0.2, max_features=2\",\n \"magenta\",\n {\"learning_rate\": 0.2, \"max_features\": 2},\n ),\n]:\n params = dict(original_params)\n params.update(setting)\n\n clf = ensemble.GradientBoostingClassifier(**params)\n clf.fit(X_train, y_train)\n\n # compute test set deviance\n test_deviance = np.zeros((params[\"n_estimators\"],), dtype=np.float64)\n\n for i, y_proba in enumerate(clf.staged_predict_proba(X_test)):\n test_deviance[i] = 2 * log_loss(y_test, y_proba[:, 1])\n\n plt.plot(\n (np.arange(test_deviance.shape[0]) + 1)[::5],\n test_deviance[::5],\n \"-\",\n color=color,\n label=label,\n )\n\nplt.legend(loc=\"upper right\")\nplt.xlabel(\"Boosting Iterations\")\nplt.ylabel(\"Test Set Deviance\")\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }