{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Features in Histogram Gradient Boosting Trees\n\n`histogram_based_gradient_boosting` (HGBT) models may be one of the most\nuseful supervised learning models in scikit-learn. They are based on a modern\ngradient boosting implementation comparable to LightGBM and XGBoost. As such,\nHGBT models are more feature rich than and often outperform alternative models\nlike random forests, especially when the number of samples is larger than some\nten thousands (see\n`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`).\n\nThe top usability features of HGBT models are:\n\n1. Several available loss functions for mean and quantile regression tasks, see\n `Quantile loss `.\n2. `categorical_support_gbdt`, see\n `sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`.\n3. Early stopping.\n4. `nan_support_hgbt`, which avoids the need for an imputer.\n5. `monotonic_cst_gbdt`.\n6. `interaction_cst_hgbt`.\n\nThis example aims at showcasing all points except 2 and 6 in a real life\nsetting.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the data\nThe [electricity dataset](http://www.openml.org/d/151) consists of data\ncollected from the Australian New South Wales Electricity Market. In this\nmarket, prices are not fixed and are affected by supply and demand. They are\nset every five minutes. Electricity transfers to/from the neighboring state of\nVictoria were done to alleviate fluctuations.\n\nThe dataset, originally named ELEC2, contains 45,312 instances dated from 7\nMay 1996 to 5 December 1998. Each sample of the dataset refers to a period of\n30 minutes, i.e. there are 48 instances for each time period of one day. Each\nsample on the dataset has 7 columns:\n\n- date: between 7 May 1996 to 5 December 1998. Normalized between 0 and 1;\n- day: day of week (1-7);\n- period: half hour intervals over 24 hours. Normalized between 0 and 1;\n- nswprice/nswdemand: electricity price/demand of New South Wales;\n- vicprice/vicdemand: electricity price/demand of Victoria.\n\nOriginally, it is a classification task, but here we use it for the regression\ntask to predict the scheduled electricity transfer between states.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\nelectricity = fetch_openml(\n name=\"electricity\", version=1, as_frame=True, parser=\"pandas\"\n)\ndf = electricity.frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This particular dataset has a stepwise constant target for the first 17,760\nsamples:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df[\"transfer\"][:17_760].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us drop those entries and explore the hourly electricity transfer over\ndifferent days of the week:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport seaborn as sns\n\ndf = electricity.frame.iloc[17_760:]\nX = df.drop(columns=[\"transfer\", \"class\"])\ny = df[\"transfer\"]\n\nfig, ax = plt.subplots(figsize=(15, 10))\npointplot = sns.lineplot(x=df[\"period\"], y=df[\"transfer\"], hue=df[\"day\"], ax=ax)\nhandles, lables = ax.get_legend_handles_labels()\nax.set(\n title=\"Hourly energy transfer for different days of the week\",\n xlabel=\"Normalized time of the day\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend(handles, [\"Sun\", \"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that energy transfer increases systematically during weekends.\n\n## Effect of number of trees and early stopping\nFor the sake of illustrating the effect of the (maximum) number of trees, we\ntrain a :class:`~sklearn.ensemble.HistGradientBoostingRegressor` over the\ndaily electricity transfer using the whole dataset. Then we visualize its\npredictions depending on the `max_iter` parameter. Here we don't try to\nevaluate the performance of the model and its capacity to generalize but\nrather its capability to learn from the training data.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=False)\n\nprint(f\"Training sample size: {X_train.shape[0]}\")\nprint(f\"Test sample size: {X_test.shape[0]}\")\nprint(f\"Number of features: {X_train.shape[1]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "max_iter_list = [5, 50]\naverage_week_demand = (\n df.loc[X_test.index].groupby([\"day\", \"period\"], observed=False)[\"transfer\"].mean()\n)\ncolors = sns.color_palette(\"colorblind\")\nfig, ax = plt.subplots(figsize=(10, 5))\naverage_week_demand.plot(color=colors[0], label=\"recorded average\", linewidth=2, ax=ax)\n\nfor idx, max_iter in enumerate(max_iter_list):\n hgbt = HistGradientBoostingRegressor(\n max_iter=max_iter, categorical_features=None, random_state=42\n )\n hgbt.fit(X_train, y_train)\n\n y_pred = hgbt.predict(X_test)\n prediction_df = df.loc[X_test.index].copy()\n prediction_df[\"y_pred\"] = y_pred\n average_pred = prediction_df.groupby([\"day\", \"period\"], observed=False)[\n \"y_pred\"\n ].mean()\n average_pred.plot(\n color=colors[idx + 1], label=f\"max_iter={max_iter}\", linewidth=2, ax=ax\n )\n\nax.set(\n title=\"Predicted average energy transfer during the week\",\n xticks=[(i + 0.2) * 48 for i in range(7)],\n xticklabels=[\"Sun\", \"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\"],\n xlabel=\"Time of the week\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With just a few iterations, HGBT models can achieve convergence (see\n`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`),\nmeaning that adding more trees does not improve the model anymore. In the\nfigure above, 5 iterations are not enough to get good predictions. With 50\niterations, we are already able to do a good job.\n\nSetting `max_iter` too high might degrade the prediction quality and cost a lot of\navoidable computing resources. Therefore, the HGBT implementation in scikit-learn\nprovides an automatic **early stopping** strategy. With it, the model\nuses a fraction of the training data as internal validation set\n(`validation_fraction`) and stops training if the validation score does not\nimprove (or degrades) after `n_iter_no_change` iterations up to a certain\ntolerance (`tol`).\n\nNotice that there is a trade-off between `learning_rate` and `max_iter`:\nGenerally, smaller learning rates are preferable but require more iterations\nto converge to the minimum loss, while larger learning rates converge faster\n(less iterations/trees needed) but at the cost of a larger minimum loss.\n\nBecause of this high correlation between the learning rate the number of iterations,\na good practice is to tune the learning rate along with all (important) other\nhyperparameters, fit the HBGT on the training set with a large enough value\nfor `max_iter` and determine the best `max_iter` via early stopping and some\nexplicit `validation_fraction`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "common_params = {\n \"max_iter\": 1_000,\n \"learning_rate\": 0.3,\n \"validation_fraction\": 0.2,\n \"random_state\": 42,\n \"categorical_features\": None,\n \"scoring\": \"neg_root_mean_squared_error\",\n}\n\nhgbt = HistGradientBoostingRegressor(early_stopping=True, **common_params)\nhgbt.fit(X_train, y_train)\n\n_, ax = plt.subplots()\nplt.plot(-hgbt.validation_score_)\n_ = ax.set(\n xlabel=\"number of iterations\",\n ylabel=\"root mean squared error\",\n title=f\"Loss of hgbt with early stopping (n_iter={hgbt.n_iter_})\",\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then overwrite the value for `max_iter` to a reasonable value and avoid\nthe extra computational cost of the inner validation. Rounding up the number\nof iterations may account for variability of the training set:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\n\ncommon_params[\"max_iter\"] = math.ceil(hgbt.n_iter_ / 100) * 100\ncommon_params[\"early_stopping\"] = False\nhgbt = HistGradientBoostingRegressor(**common_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Note

The inner validation done during early stopping is not optimal for\n time series.

\n\n## Support for missing values\nHGBT models have native support of missing values. During training, the tree\ngrower decides where samples with missing values should go (left or right\nchild) at each split, based on the potential gain. When predicting, these\nsamples are sent to the learnt child accordingly. If a feature had no missing\nvalues during training, then for prediction, samples with missing values for that\nfeature are sent to the child with the most samples (as seen during fit).\n\nThe present example shows how HGBT regressions deal with values missing\ncompletely at random (MCAR), i.e. the missingness does not depend on the\nobserved data or the unobserved data. We can simulate such scenario by\nrandomly replacing values from randomly selected features with `nan` values.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn.metrics import root_mean_squared_error\n\nrng = np.random.RandomState(42)\nfirst_week = slice(0, 336) # first week in the test set as 7 * 48 = 336\nmissing_fraction_list = [0, 0.01, 0.03]\n\n\ndef generate_missing_values(X, missing_fraction):\n total_cells = X.shape[0] * X.shape[1]\n num_missing_cells = int(total_cells * missing_fraction)\n row_indices = rng.choice(X.shape[0], num_missing_cells, replace=True)\n col_indices = rng.choice(X.shape[1], num_missing_cells, replace=True)\n X_missing = X.copy()\n X_missing.iloc[row_indices, col_indices] = np.nan\n return X_missing\n\n\nfig, ax = plt.subplots(figsize=(12, 6))\nax.plot(y_test.values[first_week], label=\"Actual transfer\")\n\nfor missing_fraction in missing_fraction_list:\n X_train_missing = generate_missing_values(X_train, missing_fraction)\n X_test_missing = generate_missing_values(X_test, missing_fraction)\n hgbt.fit(X_train_missing, y_train)\n y_pred = hgbt.predict(X_test_missing[first_week])\n rmse = root_mean_squared_error(y_test[first_week], y_pred)\n ax.plot(\n y_pred[first_week],\n label=f\"missing_fraction={missing_fraction}, RMSE={rmse:.3f}\",\n alpha=0.5,\n )\nax.set(\n title=\"Daily energy transfer predictions on data with MCAR values\",\n xticks=[(i + 0.2) * 48 for i in range(7)],\n xticklabels=[\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"],\n xlabel=\"Time of the week\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend(loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the model degrades as the proportion of missing values increases.\n\n## Support for quantile loss\n\nThe quantile loss in regression enables a view of the variability or\nuncertainty of the target variable. For instance, predicting the 5th and 95th\npercentiles can provide a 90% prediction interval, i.e. the range within which\nwe expect a new observed value to fall with 90% probability.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import mean_pinball_loss\n\nquantiles = [0.95, 0.05]\npredictions = []\n\nfig, ax = plt.subplots(figsize=(12, 6))\nax.plot(y_test.values[first_week], label=\"Actual transfer\")\n\nfor quantile in quantiles:\n hgbt_quantile = HistGradientBoostingRegressor(\n loss=\"quantile\", quantile=quantile, **common_params\n )\n hgbt_quantile.fit(X_train, y_train)\n y_pred = hgbt_quantile.predict(X_test[first_week])\n\n predictions.append(y_pred)\n score = mean_pinball_loss(y_test[first_week], y_pred)\n ax.plot(\n y_pred[first_week],\n label=f\"quantile={quantile}, pinball loss={score:.2f}\",\n alpha=0.5,\n )\n\nax.fill_between(\n range(len(predictions[0][first_week])),\n predictions[0][first_week],\n predictions[1][first_week],\n color=colors[0],\n alpha=0.1,\n)\nax.set(\n title=\"Daily energy transfer predictions with quantile loss\",\n xticks=[(i + 0.2) * 48 for i in range(7)],\n xticklabels=[\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"],\n xlabel=\"Time of the week\",\n ylabel=\"Normalized energy transfer\",\n)\n_ = ax.legend(loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe a tendence to over-estimate the energy transfer. This could be be\nquantitatively confirmed by computing empirical coverage numbers as done in\nthe `calibration of confidence intervals section `.\nKeep in mind that those predicted percentiles are just estimations from a\nmodel. One can still improve the quality of such estimations by:\n\n- collecting more data-points;\n- better tuning of the model hyperparameters, see\n `sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py`;\n- engineering more predictive features from the same data, see\n `sphx_glr_auto_examples_applications_plot_cyclical_feature_engineering.py`.\n\n## Monotonic constraints\n\nGiven specific domain knowledge that requires the relationship between a\nfeature and the target to be monotonically increasing or decreasing, one can\nenforce such behaviour in the predictions of a HGBT model using monotonic\nconstraints. This makes the model more interpretable and can reduce its\nvariance (and potentially mitigate overfitting) at the risk of increasing\nbias. Monotonic constraints can also be used to enforce specific regulatory\nrequirements, ensure compliance and align with ethical considerations.\n\nIn the present example, the policy of transferring energy from Victoria to New\nSouth Wales is meant to alleviate price fluctuations, meaning that the model\npredictions have to enforce such goal, i.e. transfer should increase with\nprice and demand in New South Wales, but also decrease with price and demand\nin Victoria, in order to benefit both populations.\n\nIf the training data has feature names, it\u2019s possible to specify the monotonic\nconstraints by passing a dictionary with the convention:\n\n- 1: monotonic increase\n- 0: no constraint\n- -1: monotonic decrease\n\nAlternatively, one can pass an array-like object encoding the above convention by\nposition.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.inspection import PartialDependenceDisplay\n\nmonotonic_cst = {\n \"date\": 0,\n \"day\": 0,\n \"period\": 0,\n \"nswdemand\": 1,\n \"nswprice\": 1,\n \"vicdemand\": -1,\n \"vicprice\": -1,\n}\nhgbt_no_cst = HistGradientBoostingRegressor(\n categorical_features=None, random_state=42\n).fit(X, y)\nhgbt_cst = HistGradientBoostingRegressor(\n monotonic_cst=monotonic_cst, categorical_features=None, random_state=42\n).fit(X, y)\n\nfig, ax = plt.subplots(nrows=2, figsize=(15, 10))\ndisp = PartialDependenceDisplay.from_estimator(\n hgbt_no_cst,\n X,\n features=[\"nswdemand\", \"nswprice\"],\n line_kw={\"linewidth\": 2, \"label\": \"unconstrained\", \"color\": \"tab:blue\"},\n ax=ax[0],\n)\nPartialDependenceDisplay.from_estimator(\n hgbt_cst,\n X,\n features=[\"nswdemand\", \"nswprice\"],\n line_kw={\"linewidth\": 2, \"label\": \"constrained\", \"color\": \"tab:orange\"},\n ax=disp.axes_,\n)\ndisp = PartialDependenceDisplay.from_estimator(\n hgbt_no_cst,\n X,\n features=[\"vicdemand\", \"vicprice\"],\n line_kw={\"linewidth\": 2, \"label\": \"unconstrained\", \"color\": \"tab:blue\"},\n ax=ax[1],\n)\nPartialDependenceDisplay.from_estimator(\n hgbt_cst,\n X,\n features=[\"vicdemand\", \"vicprice\"],\n line_kw={\"linewidth\": 2, \"label\": \"constrained\", \"color\": \"tab:orange\"},\n ax=disp.axes_,\n)\n_ = plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that `nswdemand` and `vicdemand` seem already monotonic without constraint.\nThis is a good example to show that the model with monotonicity constraints is\n\"overconstraining\".\n\nAdditionally, we can verify that the predictive quality of the model is not\nsignificantly degraded by introducing the monotonic constraints. For such\npurpose we use :class:`~sklearn.model_selection.TimeSeriesSplit`\ncross-validation to estimate the variance of the test score. By doing so we\nguarantee that the training data does not succeed the testing data, which is\ncrucial when dealing with data that have a temporal relationship.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import make_scorer, root_mean_squared_error\nfrom sklearn.model_selection import TimeSeriesSplit, cross_validate\n\nts_cv = TimeSeriesSplit(n_splits=5, gap=48, test_size=336) # a week has 336 samples\nscorer = make_scorer(root_mean_squared_error)\n\ncv_results = cross_validate(hgbt_no_cst, X, y, cv=ts_cv, scoring=scorer)\nrmse = cv_results[\"test_score\"]\nprint(f\"RMSE without constraints = {rmse.mean():.3f} +/- {rmse.std():.3f}\")\n\ncv_results = cross_validate(hgbt_cst, X, y, cv=ts_cv, scoring=scorer)\nrmse = cv_results[\"test_score\"]\nprint(f\"RMSE with constraints = {rmse.mean():.3f} +/- {rmse.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That being said, notice the comparison is between two different models that\nmay be optimized by a different combination of hyperparameters. That is the\nreason why we do no use the `common_params` in this section as done before.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }