{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Prediction Intervals for Gradient Boosting Regression\n\nThis example shows how quantile regression can be used to create prediction\nintervals. See `sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py`\nfor an example showcasing some other features of\n:class:`~ensemble.HistGradientBoostingRegressor`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate some data for a synthetic regression problem by applying the\nfunction f to uniformly sampled random inputs.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.model_selection import train_test_split\n\n\ndef f(x):\n \"\"\"The function to predict.\"\"\"\n return x * np.sin(x)\n\n\nrng = np.random.RandomState(42)\nX = np.atleast_2d(rng.uniform(0, 10.0, size=1000)).T\nexpected_y = f(X).ravel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the problem interesting, we generate observations of the target y as\nthe sum of a deterministic term computed by the function f and a random noise\nterm that follows a centered [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution). To make this even\nmore interesting we consider the case where the amplitude of the noise\ndepends on the input variable x (heteroscedastic noise).\n\nThe lognormal distribution is non-symmetric and long tailed: observing large\noutliers is likely but it is impossible to observe small outliers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sigma = 0.5 + X.ravel() / 10\nnoise = rng.lognormal(sigma=sigma) - np.exp(sigma**2 / 2)\ny = expected_y + noise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split into train, test datasets:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting non-linear quantile and least squares regressors\n\nFit gradient boosting models trained with the quantile loss and\nalpha=0.05, 0.5, 0.95.\n\nThe models obtained for alpha=0.05 and alpha=0.95 produce a 90% confidence\ninterval (95% - 5% = 90%).\n\nThe model trained with alpha=0.5 produces a regression of the median: on\naverage, there should be the same number of target observations above and\nbelow the predicted values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.metrics import mean_pinball_loss, mean_squared_error\n\nall_models = {}\ncommon_params = dict(\n learning_rate=0.05,\n n_estimators=200,\n max_depth=2,\n min_samples_leaf=9,\n min_samples_split=9,\n)\nfor alpha in [0.05, 0.5, 0.95]:\n gbr = GradientBoostingRegressor(loss=\"quantile\", alpha=alpha, **common_params)\n all_models[\"q %1.2f\" % alpha] = gbr.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that :class:`~sklearn.ensemble.HistGradientBoostingRegressor` is much\nfaster than 
{ "cell_type": "markdown", "metadata": {}, "source": [ "For the sake of comparison, we also fit a baseline model trained with the\nusual (mean) squared error (MSE).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gbr_ls = GradientBoostingRegressor(loss=\"squared_error\", **common_params)\nall_models[\"mse\"] = gbr_ls.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create an evenly spaced evaluation set of input values spanning the [0, 10]\nrange.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "xx = np.atleast_2d(np.linspace(0, 10, 1000)).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the true conditional mean function f, the predictions of the conditional\nmean (loss equals squared error), the conditional median and the conditional\n90% interval (from 5th to 95th conditional percentiles).\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\ny_pred = all_models[\"mse\"].predict(xx)\ny_lower = all_models[\"q 0.05\"].predict(xx)\ny_upper = all_models[\"q 0.95\"].predict(xx)\ny_med = all_models[\"q 0.50\"].predict(xx)\n\nfig = plt.figure(figsize=(10, 10))\nplt.plot(xx, f(xx), \"g:\", linewidth=3, label=r\"$f(x) = x\\,\\sin(x)$\")\nplt.plot(X_test, y_test, \"b.\", markersize=10, label=\"Test observations\")\nplt.plot(xx, y_med, \"r-\", label=\"Predicted median\")\nplt.plot(xx, y_pred, \"r-\", label=\"Predicted mean\")\nplt.plot(xx, y_upper, \"k-\")\nplt.plot(xx, y_lower, \"k-\")\nplt.fill_between(\n    xx.ravel(), y_lower, y_upper, alpha=0.4, label=\"Predicted 90% interval\"\n)\nplt.xlabel(\"$x$\")\nplt.ylabel(\"$f(x)$\")\nplt.ylim(-10, 25)\nplt.legend(loc=\"upper left\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the predicted median with the predicted mean, we note that the\nmedian is on average below the mean as the noise is skewed towards high\nvalues (large outliers). The median estimate also seems to be smoother\nbecause of its natural robustness to outliers.\n\nAlso observe that the inductive bias of gradient boosting trees\nunfortunately prevents the 0.05 quantile from fully capturing the sinusoidal\nshape of the signal, in particular around x=8. Tuning hyper-parameters can\nreduce this effect as shown in the last part of this notebook.\n\n## Analysis of the error metrics\n\nMeasure the models with :func:`~sklearn.metrics.mean_squared_error` and\n:func:`~sklearn.metrics.mean_pinball_loss` metrics on the training dataset.\n\n" ] },
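{ "cell_type": "markdown", "metadata": {}, "source": [ "As a reminder, for a quantile level alpha the pinball loss weights\nunder-predictions by alpha and over-predictions by (1 - alpha), which is why\nits minimizer is the conditional alpha-quantile. The following cell is only a\nsmall illustration of the metric: it recomputes the loss by hand for the 5th\npercentile model and should match\n:func:`~sklearn.metrics.mean_pinball_loss` up to floating point rounding.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def manual_pinball_loss(y_true, y_pred, alpha):\n    # Under-predictions (positive residuals) are weighted by alpha,\n    # over-predictions (negative residuals) by (1 - alpha).\n    diff = y_true - y_pred\n    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))\n\n\ny_pred_05 = all_models[\"q 0.05\"].predict(X_train)\nprint(manual_pinball_loss(y_train, y_pred_05, alpha=0.05))\nprint(mean_pinball_loss(y_train, y_pred_05, alpha=0.05))" ] },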
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\n\ndef highlight_min(x):\n    x_min = x.min()\n    return [\"font-weight: bold\" if v == x_min else \"\" for v in x]\n\n\nresults = []\nfor name, gbr in sorted(all_models.items()):\n    metrics = {\"model\": name}\n    y_pred = gbr.predict(X_train)\n    for alpha in [0.05, 0.5, 0.95]:\n        metrics[\"pbl=%1.2f\" % alpha] = mean_pinball_loss(y_train, y_pred, alpha=alpha)\n    metrics[\"MSE\"] = mean_squared_error(y_train, y_pred)\n    results.append(metrics)\n\npd.DataFrame(results).set_index(\"model\").style.apply(highlight_min)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column shows all models evaluated with the same metric. The minimum\nvalue in a column should be obtained when the model is trained and measured\nwith that same metric. This should always be the case on the training set if\nthe training converged.\n\nNote that because the target distribution is asymmetric, the expected\nconditional mean and conditional median are significantly different, and\ntherefore one cannot use the squared error model to get a good estimate of\nthe conditional median, nor the converse.\n\nIf the target distribution were symmetric and had no outliers (e.g. with\nGaussian noise), then the median estimator and the least squares estimator\nwould yield similar predictions.\n\nWe then do the same on the test set.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = []\nfor name, gbr in sorted(all_models.items()):\n    metrics = {\"model\": name}\n    y_pred = gbr.predict(X_test)\n    for alpha in [0.05, 0.5, 0.95]:\n        metrics[\"pbl=%1.2f\" % alpha] = mean_pinball_loss(y_test, y_pred, alpha=alpha)\n    metrics[\"MSE\"] = mean_squared_error(y_test, y_pred)\n    results.append(metrics)\n\npd.DataFrame(results).set_index(\"model\").style.apply(highlight_min)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Errors are higher, meaning the models slightly overfit the data. It still\nshows that the best test metric is obtained when the model is trained by\nminimizing that same metric.\n\nNote that the conditional median estimator is competitive with the squared\nerror estimator in terms of MSE on the test set: this can be explained by\nthe fact that the squared error estimator is very sensitive to large\noutliers, which can cause significant overfitting. This can be seen on the\nright-hand side of the previous plot. The conditional median estimator is\nbiased (underestimation for this asymmetric noise) but is also naturally\nrobust to outliers and overfits less.\n\n## Calibration of the confidence interval\n\nWe can also evaluate the ability of the two extreme quantile estimators to\nproduce a well-calibrated conditional 90%-confidence interval.\n\nTo do this, we can compute the fraction of observations that fall between the\npredictions:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def coverage_fraction(y, y_low, y_high):\n    return np.mean(np.logical_and(y >= y_low, y <= y_high))\n\n\ncoverage_fraction(\n    y_train,\n    all_models[\"q 0.05\"].predict(X_train),\n    all_models[\"q 0.95\"].predict(X_train),\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the training set, the calibration is very close to the expected coverage\nvalue for a 90% confidence interval.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coverage_fraction(\n    y_test, all_models[\"q 0.05\"].predict(X_test), all_models[\"q 0.95\"].predict(X_test)\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the test set, the estimated confidence interval is slightly too narrow.\nNote, however, that we would need to wrap those metrics in a cross-validation\nloop to assess their variability under data resampling.\n\n" ] },
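{ "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is only an illustrative sketch of such a loop: the two\nextreme quantile models are refit on each training fold and the coverage of\nthe resulting interval is measured on the corresponding validation fold. The\nrest of the example does not depend on it.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.base import clone\nfrom sklearn.model_selection import KFold\n\n# Illustrative sketch: refit the 5th and 95th percentile models on each\n# training fold and record the coverage on the held-out fold.\ncv = KFold(n_splits=5, shuffle=True, random_state=0)\ncv_coverages = []\nfor train_idx, test_idx in cv.split(X):\n    lower = clone(all_models[\"q 0.05\"]).fit(X[train_idx], y[train_idx])\n    upper = clone(all_models[\"q 0.95\"]).fit(X[train_idx], y[train_idx])\n    cv_coverages.append(\n        coverage_fraction(\n            y[test_idx], lower.predict(X[test_idx]), upper.predict(X[test_idx])\n        )\n    )\nprint(f\"coverage: {np.mean(cv_coverages):.3f} +/- {np.std(cv_coverages):.3f}\")" ] },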
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning the hyper-parameters of the quantile regressors\n\nIn the plot above, we observed that the 5th percentile regressor seems to\nunderfit and could not adapt to the sinusoidal shape of the signal.\n\nThe hyper-parameters of the model were approximately hand-tuned for the\nmedian regressor and there is no reason the same hyper-parameters should be\nsuitable for the 5th percentile regressor.\n\nTo confirm this hypothesis, we tune the hyper-parameters of a new regressor\nof the 5th percentile by selecting the best model parameters by\ncross-validation on the pinball loss with alpha=0.05:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.experimental import enable_halving_search_cv  # noqa\nfrom sklearn.model_selection import HalvingRandomSearchCV\nfrom sklearn.metrics import make_scorer\nfrom pprint import pprint\n\nparam_grid = dict(\n    learning_rate=[0.05, 0.1, 0.2],\n    max_depth=[2, 5, 10],\n    min_samples_leaf=[1, 5, 10, 20],\n    min_samples_split=[5, 10, 20, 30, 50],\n)\nalpha = 0.05\nneg_mean_pinball_loss_05p_scorer = make_scorer(\n    mean_pinball_loss,\n    alpha=alpha,\n    greater_is_better=False,  # maximize the negative loss\n)\ngbr = GradientBoostingRegressor(loss=\"quantile\", alpha=alpha, random_state=0)\nsearch_05p = HalvingRandomSearchCV(\n    gbr,\n    param_grid,\n    resource=\"n_estimators\",\n    max_resources=250,\n    min_resources=50,\n    scoring=neg_mean_pinball_loss_05p_scorer,\n    n_jobs=2,\n    random_state=0,\n).fit(X_train, y_train)\npprint(search_05p.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the hyper-parameters that were hand-tuned for the median\nregressor are in the same range as the hyper-parameters suitable for the 5th\npercentile regressor.\n\n" ] },
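{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can compare the test pinball loss (at alpha=0.05,\nlower is better) of the tuned 5th percentile regressor with that of the\nuntuned one:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Pinball loss at alpha=0.05 on the held-out test set (lower is better).\nprint(\n    \"untuned q 0.05:\",\n    mean_pinball_loss(y_test, all_models[\"q 0.05\"].predict(X_test), alpha=0.05),\n)\nprint(\n    \"tuned q 0.05:  \",\n    mean_pinball_loss(y_test, search_05p.predict(X_test), alpha=0.05),\n)" ] },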
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's now tune the hyper-parameters for the 95th percentile regressor. We\nneed to redefine the `scoring` metric used to select the best model, along\nwith adjusting the alpha parameter of the inner gradient boosting estimator\nitself:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.base import clone\n\nalpha = 0.95\nneg_mean_pinball_loss_95p_scorer = make_scorer(\n    mean_pinball_loss,\n    alpha=alpha,\n    greater_is_better=False,  # maximize the negative loss\n)\nsearch_95p = clone(search_05p).set_params(\n    estimator__alpha=alpha,\n    scoring=neg_mean_pinball_loss_95p_scorer,\n)\nsearch_95p.fit(X_train, y_train)\npprint(search_95p.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result shows that the hyper-parameters for the 95th percentile regressor\nidentified by the search procedure are roughly in the same range as the\nhand-tuned hyper-parameters for the median regressor and the hyper-parameters\nidentified by the search procedure for the 5th percentile regressor. However,\nthe hyper-parameter searches did lead to an improved 90% confidence interval\nthat is bounded by the predictions of those two tuned quantile regressors.\nNote that the prediction of the upper 95th percentile has a much coarser shape\nthan the prediction of the lower 5th percentile because of the outliers:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_lower = search_05p.predict(xx)\ny_upper = search_95p.predict(xx)\n\nfig = plt.figure(figsize=(10, 10))\nplt.plot(xx, f(xx), \"g:\", linewidth=3, label=r\"$f(x) = x\\,\\sin(x)$\")\nplt.plot(X_test, y_test, \"b.\", markersize=10, label=\"Test observations\")\nplt.plot(xx, y_upper, \"k-\")\nplt.plot(xx, y_lower, \"k-\")\nplt.fill_between(\n    xx.ravel(), y_lower, y_upper, alpha=0.4, label=\"Predicted 90% interval\"\n)\nplt.xlabel(\"$x$\")\nplt.ylabel(\"$f(x)$\")\nplt.ylim(-10, 25)\nplt.legend(loc=\"upper left\")\nplt.title(\"Prediction with tuned hyper-parameters\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot looks qualitatively better than for the untuned models, especially\nfor the shape of the lower quantile.\n\nWe now quantitatively evaluate the joint calibration of the pair of\nestimators:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coverage_fraction(y_train, search_05p.predict(X_train), search_95p.predict(X_train))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "coverage_fraction(y_test, search_05p.predict(X_test), search_95p.predict(X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The calibration of the tuned pair is sadly not better on the test set: the\nwidth of the estimated confidence interval is still too narrow.\n\nAgain, we would need to wrap this study in a cross-validation loop to\nbetter assess the variability of those estimates.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }