{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Combine predictors using stacking\n\n.. currentmodule:: sklearn\n\nStacking is an `ensemble method `. In this strategy, the\nout-of-fold predictions from several base estimators are used to train a\nmeta-model that combines their outputs at inference time. Unlike\n:class:`~sklearn.ensemble.VotingRegressor`, which averages predictions with\nfixed (optionally user-specified) weights,\n:class:`~sklearn.ensemble.StackingRegressor` learns the combination through its\n`final_estimator`.\n\nIn this example, we illustrate the use case in which different regressors are\nstacked together and a final regularized linear regressor is used to output the\nprediction. We compare the performance of each individual regressor with the\nstacking strategy. Here, stacking slightly improves the overall performance.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n\nWe use synthetic data generated from a sinusoid plus a linear trend with\nheteroscedastic Gaussian noise. A sudden drop is introduced, as it cannot be\ndescribed by a linear model, but a tree-based model can naturally deal with\nit.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nimport pandas as pd\n\nrng = np.random.RandomState(42)\nX = rng.uniform(-3, 3, size=500)\ntrend = 2.4 * X\nseasonal = 3.1 * np.sin(3.2 * X)\ndrop = 10.0 * (X > 2).astype(float)\nsigma = 0.75 + 0.75 * X**2\ny = trend + seasonal - drop + rng.normal(loc=0.0, scale=np.sqrt(sigma))\n\ndf = pd.DataFrame({\"X\": X, \"y\": y})\n_ = df.plot.scatter(x=\"X\", y=\"y\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stack of predictors on a single data set\n\nIt is sometimes not evident which model is more suited for a given task, as\ndifferent model families can achieve similar performance while exhibiting\ndifferent strengths and weaknesses. Stacking combines their outputs to exploit\nthese complementary behaviors and can correct systematic errors that no single\nmodel can fix on its own. With appropriate regularization in the\n`final_estimator`, the :class:`~sklearn.ensemble.StackingRegressor` often\nmatches the strongest base model, and can outperform it when base learners'\nerrors are only partially correlated, allowing the combination to reduce\nindividual bias/variance.\n\nHere, we combine 3 learners (linear and non-linear) and use the default\n:class:`~sklearn.linear_model.RidgeCV` regressor to combine their outputs\ntogether.\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>Although some base learners include preprocessing (such as the\n:class:`~sklearn.preprocessing.StandardScaler`), the `final_estimator` does\nnot need additional preprocessing when using the default\n`passthrough=False`, as it receives only the base learners' predictions. If\n`passthrough=True`, `final_estimator` should be a pipeline with proper\npreprocessing.</p></div>
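\n\nAs a rough sketch (not executed in this example), a `passthrough=True` variant would wrap the `final_estimator` in such a pipeline; `stacking_with_passthrough` is a hypothetical name and `estimators` refers to the list of base learners built in the next cell:\n\n```python\n# Hypothetical sketch: with passthrough=True the final estimator also receives\n# the raw features next to the base predictions, so it gets its own scaling step.\nfrom sklearn.ensemble import StackingRegressor\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nstacking_with_passthrough = StackingRegressor(\n    estimators=estimators,  # base learners defined in the next cell\n    final_estimator=make_pipeline(StandardScaler(), RidgeCV()),\n    passthrough=True,\n)\n```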
\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor, StackingRegressor\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import PolynomialFeatures, SplineTransformer, StandardScaler\n\nlinear_ridge = make_pipeline(StandardScaler(), RidgeCV())\n\nspline_ridge = make_pipeline(\n    SplineTransformer(n_knots=6, degree=3),\n    PolynomialFeatures(interaction_only=True),\n    RidgeCV(),\n)\n\nhgbt = HistGradientBoostingRegressor(random_state=0)\n\nestimators = [\n    (\"Linear Ridge\", linear_ridge),\n    (\"Spline Ridge\", spline_ridge),\n    (\"HGBT\", hgbt),\n]\n\nstacking_regressor = StackingRegressor(estimators=estimators, final_estimator=RidgeCV())\nstacking_regressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Measure and plot the results\n\nWe can directly plot the predictions. Indeed, the sudden drop is correctly\ndescribed by the :class:`~sklearn.ensemble.HistGradientBoostingRegressor`\nmodel (HGBT), but the spline model is smoother and overfits less. The stacked\nregressor then turns out to be a smoother version of the HGBT.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nX = X.reshape(-1, 1)\nlinear_ridge.fit(X, y)\nspline_ridge.fit(X, y)\nhgbt.fit(X, y)\nstacking_regressor.fit(X, y)\n\nx_plot = np.linspace(X.min() - 0.1, X.max() + 0.1, 500).reshape(-1, 1)\npreds = {\n    \"Linear Ridge\": linear_ridge.predict(x_plot),\n    \"Spline Ridge\": spline_ridge.predict(x_plot),\n    \"HGBT\": hgbt.predict(x_plot),\n    \"Stacking (Ridge final estimator)\": stacking_regressor.predict(x_plot),\n}\n\nfig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)\naxes = axes.ravel()\nfor ax, (name, y_pred) in zip(axes, preds.items()):\n    ax.scatter(\n        X[:, 0],\n        y,\n        s=6,\n        alpha=0.35,\n        linewidths=0,\n        label=\"observed (sample)\",\n    )\n\n    ax.plot(x_plot.ravel(), y_pred, linewidth=2, alpha=0.9, label=name)\n    ax.set_title(name)\n    ax.set_xlabel(\"x\")\n    ax.set_ylabel(\"y\")\n    ax.legend(loc=\"lower right\")\n\nplt.suptitle(\"Base Models Predictions versus Stacked Predictions\", y=1)\nplt.tight_layout()\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot the prediction errors as well and evaluate the performance of the\nindividual predictors and the stacked regressor.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n\nfrom sklearn.metrics import PredictionErrorDisplay\nfrom sklearn.model_selection import cross_val_predict, cross_validate\n\nfig, axs = plt.subplots(2, 2, figsize=(9, 7))\naxs = np.ravel(axs)\n\nfor ax, (name, est) in zip(\n    axs, estimators + [(\"Stacking Regressor\", stacking_regressor)]\n):\n    scorers = {r\"$R^2$\": \"r2\", \"MAE\": \"neg_mean_absolute_error\"}\n\n    start_time = time.time()\n    scores = cross_validate(est, X, y, scoring=list(scorers.values()), n_jobs=-1)\n    elapsed_time = time.time() - start_time\n\n    y_pred = cross_val_predict(est, X, y, n_jobs=-1)\n    scores = {\n        key: (\n            f\"{np.abs(np.mean(scores[f'test_{value}'])):.2f}\"\n            r\" $\\pm$ \"\n            f\"{np.std(scores[f'test_{value}']):.2f}\"\n        )\n        for key, value in scorers.items()\n    }\n\n    display = PredictionErrorDisplay.from_predictions(\n        y_true=y,\n        y_pred=y_pred,\n        kind=\"actual_vs_predicted\",\n        ax=ax,\n        scatter_kwargs={\"alpha\": 0.2, \"color\": \"tab:blue\"},\n        line_kwargs={\"color\": \"tab:red\"},\n    )\n    ax.set_title(f\"{name}\\nEvaluation in {elapsed_time:.2f} seconds\")\n\n    for name, score in scores.items():\n        ax.plot([], [], \" \", label=f\"{name}: {score}\")\n    ax.legend(loc=\"upper left\")\n\nplt.suptitle(\"Prediction Errors of Base versus Stacked Predictors\", y=1)\nplt.tight_layout()\nplt.subplots_adjust(top=0.9)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the scores overlap considerably after cross-validation, the predictions\nfrom the stacked regressor are slightly better.\n\nOnce fitted, we can inspect the coefficients (or meta-weights) of the trained\n`final_estimator_` (as long as it is a linear model). They reveal how much the\nindividual estimators contribute to the stacked regressor:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "stacking_regressor.fit(X, y)\nstacking_regressor.final_estimator_.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that in this case, the HGBT model dominates, with the spline\nridge also contributing meaningfully. The plain linear model does not add\nuseful signal once those two are included; with\n:class:`~sklearn.linear_model.RidgeCV` as the `final_estimator`, it is not\ndropped, but receives a small negative weight to correct its residual bias.\n\nIf we use :class:`~sklearn.linear_model.LassoCV` as the\n`final_estimator`, that small, unhelpful contribution is set exactly to zero,\nyielding a simpler blend of the spline ridge and HGBT models.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LassoCV\n\nstacking_regressor = StackingRegressor(estimators=estimators, final_estimator=LassoCV())\nstacking_regressor.fit(X, y)\nstacking_regressor.final_estimator_.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to mimic SuperLearner with scikit-learn\n\nThe `SuperLearner` [Polley2010]_ is a stacking strategy implemented as [an R\npackage](https://cran.r-project.org/web/packages/SuperLearner/index.html), but\nnot available off-the-shelf in Python. 
It is closely related to the\n:class:`~sklearn.ensemble.StackingRegressor`, as both train the meta-model on\nout-of-fold predictions from the base estimators.\n\nThe key difference is that `SuperLearner` estimates a convex set of\nmeta-weights (non-negative and summing to 1) and omits an intercept; by\ncontrast, :class:`~sklearn.ensemble.StackingRegressor` uses an unconstrained\nmeta-learner with an intercept by default (and can optionally include raw\nfeatures via passthrough).\n\nWithout an intercept, the meta-weights are directly interpretable as\nfractional contributions to the final prediction.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n\nlinear_reg = LinearRegression(fit_intercept=False, positive=True)\nsuper_learner_like = StackingRegressor(\n estimators=estimators, final_estimator=linear_reg\n)\nsuper_learner_like.fit(X, y)\nsuper_learner_like.final_estimator_.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sum of meta-weights in the stacked regressor is close to 1.0, but not\nexactly one:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "super_learner_like.final_estimator_.coef_.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beyond interpretability, the normalization to 1.0 constraint in the `SuperLearner`\npresents the following advantages:\n\n- Consensus-preserving: if all base models output the same value at a point,\n the ensemble returns that same value (no artificial amplification or\n attenuation).\n- Translation-equivariant: adding a constant to every base prediction shifts\n the ensemble by the same constant.\n- Removes one degree of freedom: avoiding redundancy with a constant term and\n modestly stabilizing weights under collinearity.\n\nThe cleanest way to enforce the coefficient normalization with scikit-learn is\nby defining a custom estimator, but doing so is beyond the scope of this\ntutorial.\n\n## Conclusions\n\nThe stacked regressor combines the strengths of the different regressors.\nHowever, notice that training the stacked regressor is much more\ncomputationally expensive than selecting the best performing model.\n\n.. rubric:: References\n\n.. [Polley2010] Polley, E. C. and van der Laan, M. J., [Super Learner In\n Prediction](https://biostats.bepress.com/cgi/viewcontent.cgi?article=1269&context=ucbbiostat),\n 2010.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 0 }