{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Single estimator versus bagging: bias-variance decomposition\n\nThis example illustrates and compares the bias-variance decomposition of the\nexpected mean squared error of a single estimator against a bagging ensemble.\n\nIn regression, the expected mean squared error of an estimator can be\ndecomposed in terms of bias, variance and noise. On average over datasets of\nthe regression problem, the bias term measures the average amount by which the\npredictions of the estimator differ from the predictions of the best possible\nestimator for the problem (i.e., the Bayes model). The variance term measures\nthe variability of the predictions of the estimator when fit over different\nrandom instances of the same problem. Each problem instance is noted \"LS\", for\n\"Learning Sample\", in the following. Finally, the noise measures the irreducible part\nof the error which is due the variability in the data.\n\nThe upper left figure illustrates the predictions (in dark red) of a single\ndecision tree trained over a random dataset LS (the blue dots) of a toy 1d\nregression problem. It also illustrates the predictions (in light red) of other\nsingle decision trees trained over other (and different) randomly drawn\ninstances LS of the problem. Intuitively, the variance term here corresponds to\nthe width of the beam of predictions (in light red) of the individual\nestimators. The larger the variance, the more sensitive are the predictions for\n`x` to small changes in the training set. The bias term corresponds to the\ndifference between the average prediction of the estimator (in cyan) and the\nbest possible model (in dark blue). On this problem, we can thus observe that\nthe bias is quite low (both the cyan and the blue curves are close to each\nother) while the variance is large (the red beam is rather wide).\n\nThe lower left figure plots the pointwise decomposition of the expected mean\nsquared error of a single decision tree. It confirms that the bias term (in\nblue) is low while the variance is large (in green). It also illustrates the\nnoise part of the error which, as expected, appears to be constant and around\n`0.01`.\n\nThe right figures correspond to the same plots but using instead a bagging\nensemble of decision trees. In both figures, we can observe that the bias term\nis larger than in the previous case. In the upper right figure, the difference\nbetween the average prediction (in cyan) and the best possible model is larger\n(e.g., notice the offset around `x=2`). In the lower right figure, the bias\ncurve is also slightly higher than in the lower left figure. In terms of\nvariance however, the beam of predictions is narrower, which suggests that the\nvariance is lower. Indeed, as the lower right figure confirms, the variance\nterm (in green) is lower than for single decision trees. Overall, the bias-\nvariance decomposition is therefore no longer the same. The tradeoff is better\nfor bagging: averaging several decision trees fit on bootstrap copies of the\ndataset slightly increases the bias term but allows for a larger reduction of\nthe variance, which results in a lower overall mean squared error (compare the\nred curves int the lower figures). The script output also confirms this\nintuition. 
\nFor further details on bias-variance decomposition, see section 7.3 of [1]_.\n\n## References\n\n.. [1] T. Hastie, R. Tibshirani and J. Friedman,\n \"Elements of Statistical Learning\", Springer, 2009.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.ensemble import BaggingRegressor\nfrom sklearn.tree import DecisionTreeRegressor\n\n# Settings\nn_repeat = 50  # Number of iterations for computing expectations\nn_train = 50  # Size of the training set\nn_test = 1000  # Size of the test set\nnoise = 0.1  # Standard deviation of the noise\nnp.random.seed(0)\n\n# Change this for exploring the bias-variance decomposition of other\n# estimators. This should work well for estimators with high variance (e.g.,\n# decision trees or KNN), but poorly for estimators with low variance (e.g.,\n# linear models).\nestimators = [\n    (\"Tree\", DecisionTreeRegressor()),\n    (\"Bagging(Tree)\", BaggingRegressor(DecisionTreeRegressor())),\n]\n\nn_estimators = len(estimators)\n\n\n# Generate data\ndef f(x):\n    x = x.ravel()\n\n    return np.exp(-(x**2)) + 1.5 * np.exp(-((x - 2) ** 2))\n\n\ndef generate(n_samples, noise, n_repeat=1):\n    X = np.random.rand(n_samples) * 10 - 5\n    X = np.sort(X)\n\n    if n_repeat == 1:\n        y = f(X) + np.random.normal(0.0, noise, n_samples)\n    else:\n        y = np.zeros((n_samples, n_repeat))\n\n        for i in range(n_repeat):\n            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)\n\n    X = X.reshape((n_samples, 1))\n\n    return X, y\n\n\nX_train = []\ny_train = []\n\nfor i in range(n_repeat):\n    X, y = generate(n_samples=n_train, noise=noise)\n    X_train.append(X)\n    y_train.append(y)\n\nX_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)\n\nplt.figure(figsize=(10, 8))\n\n# Loop over estimators to compare\nfor n, (name, estimator) in enumerate(estimators):\n    # Compute predictions\n    y_predict = np.zeros((n_test, n_repeat))\n\n    for i in range(n_repeat):\n        estimator.fit(X_train[i], y_train[i])\n        y_predict[:, i] = estimator.predict(X_test)\n\n    # Bias^2 + Variance + Noise decomposition of the mean squared error,\n    # estimated by averaging over all pairs (training set i, noise realization j)\n    y_error = np.zeros(n_test)\n\n    for i in range(n_repeat):\n        for j in range(n_repeat):\n            y_error += (y_test[:, j] - y_predict[:, i]) ** 2\n\n    y_error /= n_repeat * n_repeat\n\n    y_noise = np.var(y_test, axis=1)\n    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2\n    y_var = np.var(y_predict, axis=1)\n\n    print(\n        \"{0}: {1:.4f} (error) = {2:.4f} (bias^2) \"\n        \" + {3:.4f} (var) + {4:.4f} (noise)\".format(\n            name, np.mean(y_error), np.mean(y_bias), np.mean(y_var), np.mean(y_noise)\n        )\n    )\n\n    # Plot figures\n    plt.subplot(2, n_estimators, n + 1)\n    plt.plot(X_test, f(X_test), \"b\", label=\"$f(x)$\")\n    plt.plot(X_train[0], y_train[0], \".b\", label=\"LS ~ $y = f(x)+noise$\")\n\n    for i in range(n_repeat):\n        if i == 0:\n            plt.plot(X_test, y_predict[:, i], \"r\", label=r\"$\\^y(x)$\")\n        else:\n            plt.plot(X_test, y_predict[:, i], \"r\", alpha=0.05)\n\n    plt.plot(X_test, np.mean(y_predict, axis=1), \"c\", label=r\"$\\mathbb{E}_{LS} \\^y(x)$\")\n\n    plt.xlim([-5, 5])\n    plt.title(name)\n\n    if n == n_estimators - 1:\n        plt.legend(loc=(1.1, 0.5))\n\n    plt.subplot(2, n_estimators, n_estimators + n + 1)\n    plt.plot(X_test, 
y_error, \"r\", label=\"$error(x)$\")\n plt.plot(X_test, y_bias, \"b\", label=\"$bias^2(x)$\"),\n plt.plot(X_test, y_var, \"g\", label=\"$variance(x)$\"),\n plt.plot(X_test, y_noise, \"c\", label=\"$noise(x)$\")\n\n plt.xlim([-5, 5])\n plt.ylim([0, 0.1])\n\n if n == n_estimators - 1:\n plt.legend(loc=(1.1, 0.5))\n\nplt.subplots_adjust(right=0.75)\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }