{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# StackingCVRegressor: stacking with cross-validation for regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An ensemble-learning meta-regressor for stacking regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.regressor import StackingCVRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stacking is an ensemble learning technique to combine multiple regression models via a meta-regressor. The `StackingCVRegressor` extends the standard stacking algorithm (implemented as [`StackingRegressor`](StackingRegressor.md)) using out-of-fold predictions to prepare the input data for the level-2 regressor.\n", "\n", "In the standard stacking procedure, the first-level regressors are fit to the same training set that is used prepare the inputs for the second-level regressor, which may lead to overfitting. The `StackingCVRegressor`, however, uses the concept of out-of-fold predictions: the dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit the first level regressor. In each round, the first-level regressors are then applied to the remaining 1 subset that was not used for model fitting in each iteration. The resulting predictions are then stacked and provided -- as input data -- to the second-level regressor. After the training of the `StackingCVRegressor`, the first-level regressors are fit to the entire dataset for optimal predicitons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](./StackingCVRegressor_files/stacking_cv_regressor_overview.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Breiman, Leo. \"[Stacked regressions.](https://link.springer.com/article/10.1023/A:1018046112532#page-1)\" Machine learning 24.1 (1996): 49-64.\n", "- Analogous implementation: [`StackingCVClassifier`](../classifier/StackingCVClassifier.md)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Boston Housing Data Predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we evaluate some basic prediction models on the boston housing dataset and see how the $R^2$ and MSE scores are affected by combining the models with `StackingCVRegressor`. The code output below demonstrates that the stacked model performs the best on this dataset -- slightly better than the best single regression model." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5-fold cross validation scores:\n", "\n", "R^2 Score: 0.46 (+/- 0.29) [SVM]\n", "R^2 Score: 0.43 (+/- 0.14) [Lasso]\n", "R^2 Score: 0.53 (+/- 0.28) [Random Forest]\n", "R^2 Score: 0.57 (+/- 0.24) [StackingCVRegressor]\n" ] } ], "source": [ "from mlxtend.regressor import StackingCVRegressor\n", "from sklearn.datasets import load_boston\n", "from sklearn.svm import SVR\n", "from sklearn.linear_model import Lasso\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import cross_val_score\n", "import numpy as np\n", "\n", "RANDOM_SEED = 42\n", "\n", "X, y = load_boston(return_X_y=True)\n", "\n", "svr = SVR(kernel='linear')\n", "lasso = Lasso()\n", "rf = RandomForestRegressor(n_estimators=5, \n", " random_state=RANDOM_SEED)\n", "\n", "# Starting from v0.16.0, StackingCVRegressor supports\n", "# `random_state` to get deterministic result.\n", "stack = StackingCVRegressor(regressors=(svr, lasso, rf),\n", " meta_regressor=lasso,\n", " random_state=RANDOM_SEED)\n", "\n", "print('5-fold cross validation scores:\\n')\n", "\n", "for clf, label in zip([svr, lasso, rf, stack], ['SVM', 'Lasso', \n", " 'Random Forest', \n", " 'StackingCVRegressor']):\n", " scores = cross_val_score(clf, X, y, cv=5)\n", " print(\"R^2 Score: %0.2f (+/- %0.2f) [%s]\" % (\n", " scores.mean(), scores.std(), label))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5-fold cross validation scores:\n", "\n", "Neg. MSE Score: -33.33 (+/- 22.36) [SVM]\n", "Neg. MSE Score: -35.53 (+/- 16.99) [Lasso]\n", "Neg. MSE Score: -27.08 (+/- 15.67) [Random Forest]\n", "Neg. MSE Score: -25.85 (+/- 17.22) [StackingCVRegressor]\n" ] } ], "source": [ "stack = StackingCVRegressor(regressors=(svr, lasso, rf),\n", " meta_regressor=lasso)\n", "\n", "print('5-fold cross validation scores:\\n')\n", "\n", "for clf, label in zip([svr, lasso, rf, stack], ['SVM', 'Lasso', \n", " 'Random Forest', \n", " 'StackingCVRegressor']):\n", " scores = cross_val_score(clf, X, y, cv=5, scoring='neg_mean_squared_error')\n", " print(\"Neg. MSE Score: %0.2f (+/- %0.2f) [%s]\" % (\n", " scores.mean(), scores.std(), label))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: GridSearchCV with Stacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this second example we demonstrate how `StackingCVRegressor` works in combination with `GridSearchCV`. 
The stack still allows tuning the hyperparameters of both the base and the meta models!\n", "\n", "For instance, we can use `estimator.get_params().keys()` to get a full list of tunable parameters.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best: 0.679151 using {'lasso__alpha': 1.6, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}\n" ] } ], "source": [ "from mlxtend.regressor import StackingCVRegressor\n", "from sklearn.datasets import load_boston\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import Ridge\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "X, y = load_boston(return_X_y=True)\n", "\n", "ridge = Ridge(random_state=RANDOM_SEED)\n", "lasso = Lasso(random_state=RANDOM_SEED)\n", "rf = RandomForestRegressor(random_state=RANDOM_SEED)\n", "\n", "stack = StackingCVRegressor(regressors=(lasso, ridge),\n", "                            meta_regressor=rf, \n", "                            random_state=RANDOM_SEED,\n", "                            use_features_in_secondary=True)\n", "\n", "grid = GridSearchCV(\n", "    estimator=stack, \n", "    param_grid={\n", "        'lasso__alpha': [x/5.0 for x in range(1, 10)],\n", "        'ridge__alpha': [x/20.0 for x in range(1, 10)],\n", "        'meta_regressor__n_estimators': [10, 100]\n", "    }, \n", "    cv=5,\n", "    refit=True\n", ")\n", "\n", "grid.fit(X, y)\n", "\n", "print(\"Best: %f using %s\" % (grid.best_score_, grid.best_params_))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.632 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.05}\n", "0.645 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.1}\n", "0.641 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.15}\n", "0.653 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.2}\n", "0.622 +/- 0.10 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.25}\n", "0.630 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.3}\n", "0.630 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.35}\n", "0.642 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}\n", "0.654 +/- 0.08 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.45}\n", "0.642 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.05}\n", "0.645 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.1}\n", "0.648 +/- 0.09 {'lasso__alpha': 0.2, 'meta_regressor__n_estimators': 100, 'ridge__alpha': 0.15}\n", "...\n", "Best parameters: {'lasso__alpha': 1.6, 'meta_regressor__n_estimators': 10, 'ridge__alpha': 0.4}\n", "Best CV R^2 score: 0.68\n" ] } ], "source": [ "cv_keys = ('mean_test_score', 'std_test_score', 'params')\n", "\n", "# Show the first dozen grid-search results; half of the standard\n", "# deviation is reported as the +/- range:\n", "for r, _ in enumerate(grid.cv_results_['mean_test_score']):\n", "    print(\"%0.3f +/- %0.2f %r\"\n", "          % (grid.cv_results_[cv_keys[0]][r],\n", "             grid.cv_results_[cv_keys[1]][r] / 2.0,\n", "             grid.cv_results_[cv_keys[2]][r]))\n", "    if r > 10:\n", "        break\n", "print('...')\n", "\n", "print('Best parameters: %s' % grid.best_params_)\n", "print('Best CV R^2 score: %.2f' % grid.best_score_)" ] }, { "cell_type": "markdown", "metadata": 
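{}, "source": [ "The parameter names used in the grid above (e.g., `lasso__alpha` or `meta_regressor__n_estimators`) can be looked up via `get_params()`; the exact set of keys depends on the installed mlxtend and scikit-learn versions:\n", "\n", "    # List every hyperparameter name that GridSearchCV can address\n", "    # for this particular stack\n", "    for key in sorted(stack.get_params().keys()):\n", "        print(key)" ] }, { "cell_type": "markdown", "metadata": 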
{}, "source": [ "**Note**\n", "\n", "The `StackingCVRegressor` also enables grid search over the `regressors` and even a single base regressor. When there are level-mixed hyperparameters, `GridSearchCV` will try to replace hyperparameters in a top-down order, i.e., `regressors` -> single base regressor -> regressor hyperparameter. For instance, given a hyperparameter grid such as\n", "\n", " params = {'randomforestregressor__n_estimators': [1, 100],\n", " 'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}\n", " \n", "it will first use the instance settings of either `(regr1, regr2, regr3)` or `(regr2, regr3)` . Then it will replace the `'n_estimators'` settings for a matching regressor based on `'randomforestregressor__n_estimators': [1, 100]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## StackingCVRegressor\n", "\n", "*StackingCVRegressor(regressors, meta_regressor, cv=5, shuffle=True, random_state=None, verbose=0, refit=True, use_features_in_secondary=False, store_train_meta_features=False, n_jobs=None, pre_dispatch='2*n_jobs', multi_output=False)*\n", "\n", "A 'Stacking Cross-Validation' regressor for scikit-learn estimators.\n", "\n", "**Parameters**\n", "\n", "- `regressors` : array-like, shape = [n_regressors]\n", "\n", " A list of regressors.\n", " Invoking the `fit` method on the `StackingCVRegressor` will fit clones\n", " of these original regressors that will\n", " be stored in the class attribute `self.regr_`.\n", "\n", "\n", "- `meta_regressor` : object\n", "\n", " The meta-regressor to be fitted on the ensemble of\n", " regressor\n", "\n", "\n", "- `cv` : int, cross-validation generator or iterable, optional (default: 5)\n", "\n", " Determines the cross-validation splitting strategy.\n", " Possible inputs for cv are:\n", " - None, to use the default 5-fold cross validation,\n", " - integer, to specify the number of folds in a `KFold`,\n", " - An object to be used as a cross-validation generator.\n", " - An iterable yielding train, test splits.\n", " For integer/None inputs, it will use `KFold` cross-validation\n", "\n", "\n", "- `shuffle` : bool (default: True)\n", "\n", " If True, and the `cv` argument is integer, the training data will\n", " be shuffled at fitting stage prior to cross-validation. If the `cv`\n", " argument is a specific cross validation technique, this argument is\n", " omitted.\n", "\n", "\n", "- `random_state` : int, RandomState instance or None, optional (default: None)\n", "\n", " Constrols the randomness of the cv splitter. Used when `cv` is\n", " integer and `shuffle=True`. New in v0.16.0.\n", "\n", "\n", "- `verbose` : int, optional (default=0)\n", "\n", " Controls the verbosity of the building process. New in v0.16.0\n", "\n", "\n", "- `refit` : bool (default: True)\n", "\n", " Clones the regressors for stacking regression if True (default)\n", " or else uses the original ones, which will be refitted on the dataset\n", " upon calling the `fit` method. 
Setting refit=False is\n", " recommended if you are working with estimators that support\n", " the scikit-learn fit/predict API interface but are not compatible\n", " with scikit-learn's `clone` function.\n", "\n", "\n", "- `use_features_in_secondary` : bool (default: False)\n", "\n", " If True, the meta-regressor will be trained both on\n", " the predictions of the original regressors and the\n", " original dataset.\n", " If False, the meta-regressor will be trained only on\n", " the predictions of the original regressors.\n", "\n", "\n", "- `store_train_meta_features` : bool (default: False)\n", "\n", " If True, the meta-features computed from the training data\n", " used for fitting the\n", " meta-regressor are stored in the `self.train_meta_features_` array,\n", " which can be\n", " accessed after calling `fit`.\n", "\n", "\n", "- `n_jobs` : int or None, optional (default=None)\n", "\n", " The number of CPUs to use to do the computation.\n", " ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n", " ``-1`` means using all processors. See :term:`Glossary <n_jobs>`\n", " for more details. New in v0.16.0.\n", "\n", "\n", "- `pre_dispatch` : int, or string, optional\n", "\n", " Controls the number of jobs that get dispatched during parallel\n", " execution. Reducing this number can be useful to avoid an\n", " explosion of memory consumption when more jobs get dispatched\n", " than CPUs can process. This parameter can be:\n", " - None, in which case all the jobs are immediately\n", " created and spawned. Use this for lightweight and\n", " fast-running jobs, to avoid delays due to on-demand\n", " spawning of the jobs\n", " - An int, giving the exact number of total jobs that are\n", " spawned\n", " - A string, giving an expression as a function of n_jobs,\n", " as in '2*n_jobs'\n", "\n", "\n", "- `multi_output` : bool (default: False)\n", "\n", " If True, allow multi-output targets, but forbid nan or inf values.\n", " If False, `y` will be checked to be a vector. (New in v0.19.0.)\n", "\n", "**Attributes**\n", "\n", "- `train_meta_features` : numpy array, shape = [n_samples, n_regressors]\n", "\n", " meta-features for training data, where n_samples is the\n", " number of samples\n", " in training data and len(self.regressors) is the number of regressors.\n", "\n", "**Examples**\n", "\n", "For usage examples, please see\n", " https://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/\n", "\n", "### Methods\n", "\n", "
\n", "\n", "*fit(X, y, groups=None, sample_weight=None)*\n", "\n", "Fit ensemble regressors and the meta-regressor.\n", "\n", "**Parameters**\n", "\n", "- `X` : numpy array, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "\n", "- `y` : numpy array, shape = [n_samples] or [n_samples, n_targets]\n", "\n", " Target values. Multiple targets are supported only if\n", " self.multi_output is True.\n", "\n", "\n", "- `groups` : numpy array/None, shape = [n_samples]\n", "\n", " The group that each sample belongs to. This is used by specific\n", " folding strategies such as GroupKFold()\n", "\n", "\n", "- `sample_weight` : array-like, shape = [n_samples], optional\n", "\n", " Sample weights passed as sample_weights to each regressor\n", " in the regressors list as well as the meta_regressor.\n", " Raises error if some regressor does not support\n", " sample_weight in the fit() method.\n", "\n", "**Returns**\n", "\n", "- `self` : object\n", "\n", "\n", "
\n", "\n", "*fit_transform(X, y=None, **fit_params)*\n", "\n", "Fit to data, then transform it.\n", "\n", " Fits transformer to `X` and `y` with optional parameters `fit_params`\n", " and returns a transformed version of `X`.\n", "\n", "**Parameters**\n", "\n", "- `X` : array-like of shape (n_samples, n_features)\n", "\n", " Input samples.\n", "\n", "\n", "- `y` : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None\n", "\n", " Target values (None for unsupervised transformations).\n", "\n", "\n", "- `**fit_params` : dict\n", "\n", " Additional fit parameters.\n", "\n", "**Returns**\n", "\n", "- `X_new` : ndarray array of shape (n_samples, n_features_new)\n", "\n", " Transformed array.\n", "\n", "
\n", "\n", "*get_params(deep=True)*\n", "\n", "Get parameters for this estimator.\n", "\n", "**Parameters**\n", "\n", "- `deep` : bool, default=True\n", "\n", " If True, will return the parameters for this estimator and\n", " contained subobjects that are estimators.\n", "\n", "**Returns**\n", "\n", "- `params` : dict\n", "\n", " Parameter names mapped to their values.\n", "\n", "
\n", "\n", "*predict(X)*\n", "\n", "Predict target values for X.\n", "\n", "**Parameters**\n", "\n", "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n", "\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "**Returns**\n", "\n", "- `y_target` : array-like, shape = [n_samples] or [n_samples, n_targets]\n", "\n", " Predicted target values.\n", "\n", "
\n", "\n", "*predict_meta_features(X)*\n", "\n", "Get meta-features of test-data.\n", "\n", "**Parameters**\n", "\n", "- `X` : numpy array, shape = [n_samples, n_features]\n", "\n", " Test vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", "**Returns**\n", "\n", "- `meta-features` : numpy array, shape = [n_samples, len(self.regressors)]\n", "\n", " meta-features for test data, where n_samples is the number of\n", " samples in test data and len(self.regressors) is the number\n", " of regressors. If self.multi_output is True, then the number of\n", " columns is len(self.regressors) * n_targets.\n", "\n", "
\n", "\n", "*score(X, y, sample_weight=None)*\n", "\n", "Return the coefficient of determination :math:`R^2` of the\n", " prediction.\n", "\n", " The coefficient :math:`R^2` is defined as :math:`(1 - \\frac{u}{v})`,\n", " where :math:`u` is the residual sum of squares ``((y_true - y_pred)\n", "** 2).sum()`` and :math:`v` is the total sum of squares ``((y_true -\n", "y_true.mean()) ** 2).sum()``. The best possible score is 1.0 and it\n", "\n", "can be negative (because the model can be arbitrarily worse). A\n", "\n", "constant model that always predicts the expected value of `y`,\n", " disregarding the input features, would get a :math:`R^2` score of\n", " 0.0.\n", "\n", "**Parameters**\n", "\n", "- `X` : array-like of shape (n_samples, n_features)\n", "\n", " Test samples. For some estimators this may be a precomputed\n", " kernel matrix or a list of generic objects instead with shape\n", " ``(n_samples, n_samples_fitted)``, where ``n_samples_fitted``\n", " is the number of samples used in the fitting for the estimator.\n", "\n", "\n", "- `y` : array-like of shape (n_samples,) or (n_samples, n_outputs)\n", "\n", " True values for `X`.\n", "\n", "\n", "- `sample_weight` : array-like of shape (n_samples,), default=None\n", "\n", " Sample weights.\n", "\n", "**Returns**\n", "\n", "- `score` : float\n", "\n", " :math:`R^2` of ``self.predict(X)`` wrt. `y`.\n", "\n", "**Notes**\n", "\n", "The :math:`R^2` score used when calling ``score`` on a regressor uses\n", " ``multioutput='uniform_average'`` from version 0.23 to keep consistent\n", " with default value of :func:`~sklearn.metrics.r2_score`.\n", " This influences the ``score`` method of all the multioutput\n", " regressors (except for\n", " :class:`~sklearn.multioutput.MultiOutputRegressor`).\n", "\n", "
\n", "\n", "*set_params(**params)*\n", "\n", "Set the parameters of this estimator.\n", "\n", " Valid parameter keys can be listed with ``get_params()``.\n", "\n", "**Returns**\n", "\n", "self\n", "\n", "### Properties\n", "\n", "
\n", "\n", "*named_regressors*\n", "\n", "**Returns**\n", "\n", "List of named estimator tuples, like [('svc', SVC(...))]\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.regressor/StackingCVRegressor.md', 'r') as f:\n", " print(f.read())" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }