{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.2 Bias-Variance 分解" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果我们假定基函数的形式和数目都是确定的时候，我们知道，最小二乘回归，或者极大似然方法，在模型过于复杂的时候可能会带来严重的过拟合问题，而限制基函数的数目防止过拟合又限制了模型的拟合能力。虽然我们可以通过加入正则项的方法解决问题，但是这又引申出如何选择合适的正则系数 $\\lambda$ 的问题。显然，我们不可能同时优化 $\\mathbf w,\\lambda$ 来最小化损失函数，因为这样很明显会得到非正则化的解 $\\lambda=0$。\n", "\n", "为了防止极大似然的过拟合问题，我们从贝叶斯理论的角度来考虑这个问题。在考虑这个问题之前，我们需要了解一个在统计观点中重要的模型复杂度问题，偏差-方差权衡问题（`bias-variance trade-off`）。 \n", "\n", "在 [1.5.5](../Chap-01-Introduction/01-05-Decision-Theory.ipynb#1.5.5-回归问题的损失函数) 节，我们讨论了在给定条件分布 $p(t|\\mathbf x)$ 的情况下，不同损失函数下的最优解。我们知道，对于平方损失函数，最优的预测解 $h(\\mathbf x)$ 为条件期望：\n", "\n", "$$\n", "h(\\mathbf x)=\\mathbb E[t|\\mathbf x] = \\int tp(t|\\mathbf x)dt\n", "$$\n", "\n", "在 [1.5.5](../Chap-01-Introduction/01-05-Decision-Theory.ipynb#1.5.5-回归问题的损失函数) 节，我们知道，平方损失可以写成：\n", "\n", "$$\n", "\\mathbb E[L]=\\int\\{y(\\mathbf x)-h(\\mathbf x)\\}^2 p(\\mathbf x) d\\mathbf x + \\int\\{h(\\mathbf x)-t\\}^2 p(\\mathbf x,t) d\\mathbf xdt\n", "$$\n", "\n", "第二部分是与 $y(\\mathbf x)$ 独立的部分，是当 $y(\\mathbf x) = h(\\mathbf x)$ 时，我们能达到的最小损失，对应于数据本身的噪声部分。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 偏差和方差" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果我们有足够多的数据，我们就能很好的模拟出 $h(\\mathbf x)$，从而得到最优的 $y(\\mathbf x)=h(\\mathbf x)$；但是现实情况下，我们的数据点是有限的，因此事实上，我们并不知道 $h(\\mathbf x)$ 的精确值。\n", "\n", "为了对 $h(x)$ 进行建模，我们考虑使用 $y(\\mathbf{x,w})$ 的模型，其中 $\\mathbf w$ 是参数向量。进行参数估计时，贝叶斯学派的观点是我们首先对参数 $\\mathbf w$ 的先验分布进行建模；统计学派的观点是我们使用一组数据集 $\\mathcal D$ 对 $\\mathbf w$ 进行估计。\n", "\n", "假设我们从分布 $p(t,\\mathbf x)$ 取出一组大小为 $N$ 的数据集 $\\mathcal D$；对于任何一个数据集 $\\mathcal D$，我们可以得到一个预测函数 $y(\\mathbf x,\\mathcal D)$。系综（`ensemble`）中的不同数据集会给出不同的预测函数，从而得到不同的平方损失函数。因此，对于一个特定的算法，其性能由这个系综的平均性能决定。\n", "\n", "现在考虑平方损失函数的第一部分：\n", "\n", "$$\n", "\\{y(\\mathbf x;\\mathcal D)-h(\\mathbf x)\\}^2\n", "$$\n", "\n", "这个部分会依赖于特定数据集 $\\mathcal D$ 的选择，因此我们需要将其对 $\\mathbf D$ 做一个期望。我们考虑将它拆分为两部分：\n", "\n", "$$\n", "\\begin{aligned}\n", "& \\{y(\\mathbf x;\\mathcal D)-\\mathbb E_D[y(\\mathbf x;\\mathcal D)]+\\mathbb E_D[y(\\mathbf x;\\mathcal D)]-h(\\mathbf x)\\}^2 \\\\ \n", "= & \\{y(\\mathbf x;\\mathcal D)-\\mathbb E_D[y(\\mathbf x;\\mathcal D)]\\}^2 + \\{\\mathbb E_D[y(\\mathbf x;\\mathcal D)]-h(\\mathbf x)\\}^2 \\\\\n", "& + 2\\{y(\\mathbf x;\\mathcal D)-\\mathbb E_D[y(\\mathbf x;\\mathcal D)]\\}\\{\\mathbb E_D[y(\\mathbf x;\\mathcal D)]-h(\\mathbf x)\\}\\\\\n", "\\end{aligned}\n", "$$\n", "\n", "并对 $\\mathcal D$ 求期望（最后一项会消掉）：\n", "\n", "$$\n", "\\begin{aligned}\n", "& \\mathbb E_D\\left[\\{y(\\mathbf x;\\mathcal D)-h(\\mathbf x)\\}^2\\right] \\\\\n", "= & \\underbrace{\\left\\{\\mathbb E_D[y(\\mathbf x;\\mathcal D)]-h(\\mathbf x)\\right\\}^2}_\\color{blue}{(\\text{bias})^2} +\n", "\\underbrace{\\mathbb E_D\\left[\\{y(\\mathbf x;\\mathcal D)-\\mathbb E_D[y(\\mathbf x;\\mathcal D)]\\}^2\\right]}_\\color{red}{\\text{variance}}\n", "\\end{aligned}\n", "$$\n", "\n", "我们看到，$y(\\mathbf x;\\mathcal D)$ 与 $h(\\mathbf x)$ 的平方误差期望可以表示为两个部分，第一部分叫偏差（`bias`）的平方，表示在不同数据集上的期望与回归函数之间的误差，第二部分叫方差（`variance`），衡量的是这个特定数据集的 $y(\\mathbf x;\\mathcal D)$ 对回归函数的敏感程度。\n", "\n", "如果我们回到平方损失函数，我们得到：\n", "\n", "$$\n", "\\mathbb E[L] = \\rm expected~loss = (bias)^2 + variance + noise\n", "$$\n", "\n", "其中\n", "\n", "$$\n", "\\begin{aligned}\n", "{\\rm (bias)^2} & = \\int \\left\\{\\mathbb E_D[y(\\mathbf x;\\mathcal D)]-h(\\mathbf x)\\right\\}^2 p(\\mathbf x)d\\mathbf x \\\\\n", "{\\rm variance} & = \\int \\mathbb E_D\\left[\\{y(\\mathbf x;\\mathcal D)-\\mathbb E_D[y(\\mathbf x;\\mathcal D)]\\}^2\\right] p(\\mathbf x)d\\mathbf x \\\\\n", "{\\rm noise} & = \\int\\{h(\\mathbf x)-t\\}^2 p(\\mathbf x,t) d\\mathbf xdt\n", "\\end{aligned}\n", "$$\n", "\n", "这里偏差和方差项是在平均意义下的结果。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 偏差方差权衡" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "最小化 $\\mathbb E[L]$ 对应与最小化这三项的和，偏差的平方，方差，和常数噪声项。\n", "\n", "高自由度的模型（低 $\\lambda$）通常是低偏差、高方差，而低自由度（高 $\\lambda$）的模型则对应于高偏差，低偏差的。一个好的模型需要权衡两者之间的平衡。\n", "\n", "虽然偏差-方差分解提供了一种从统计上理解模型复杂度的方法，但是从实际角度来说，我们通常只有一组观测数据，因此对一组系综的数据集进行均值操作是不现实的。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }