{ "cells": [ { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# %load ../../preconfig.py\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['axes.grid'] = False\n", "import seaborn as sns\n", "sns.set(color_codes=True)\n", "\n", "import numpy as np\n", "import pandas as pd\n", "#import itertools\n", "\n", "import sklearn.metrics\n", "\n", "import logging\n", "logger = logging.getLogger()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercises\n", "=========" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.1\n", "##### Question\n", "Suppose each of the $K$ classes has an associated target $t_k$, which is a vector of all zeros, except a one in the $k$th position. Show that classifying to the largest element of $\\hat{y}$ amounts to choosing the closest target, $\\min_k \\|t_k - \\hat{y}\\|$, if the elements of $\\hat{y}$ sum to one.\n", "\n", "\n", "##### Solution\n", "Given: \n", "$t_k = e_k$; \n", "$\\sum_i \\hat{y}_i = 1$.\n", "\n", "We need to prove: $\\text{arg max}_i \\hat{y}_i = \\text{arg min}_k \\|t_k - \\hat{y}\\|$.\n", "\n", "Proof (reading $\\|\\cdot\\|$ as the $\\ell_1$ norm and assuming $\\hat{y}_j \\geq 0$, which together with the sum-to-one condition gives $\\hat{y}_k \\leq 1$): \n", "\n", "\\begin{align}\n", "\\text{arg min}_k \\|t_k - \\hat{y}\\| &= \\text{arg min}_k \\left ( \\displaystyle \\sum_{j \\neq k} \\hat{y}_j + | \\hat{y}_k - 1 | \\right )\\\\\n", " &= \\text{arg min}_k \\left ( \\displaystyle \\sum_{j \\neq k} \\hat{y}_j + 1 - \\hat{y}_k \\right ) \\\\\n", " &= \\text{arg min}_k \\left ( 1 - \\hat{y}_k + 1 - \\hat{y}_k \\right ) \\\\\n", " &= \\text{arg min}_k 2(1 - \\hat{y}_k) \\\\\n", " &= \\text{arg max}_k \\hat{y}_k\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.2\n", "##### Question\n", "Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.\n", "\n", "\n", "##### Solution\n", "[ref: Elements of Statistical Learning - Andrew 
Tulloch](https://github.com/ajtulloch/Elements-of-Statistical-Learning/blob/master/ESL-Solutions.pdf)\n", "\n", "As in Eq. (2.22) of the textbook:\n", "\\begin{align}\n", " \\hat{G}(x) &= \\text{arg min}_{g \\in \\mathcal{G}} [ 1 - P(g | X = x) ] \\\\\n", " &= \\text{arg max}_{g \\in \\mathcal{G}} P(g | X = x)\n", "\\end{align}\n", "\n", "The optimal Bayes decision boundary is where:\n", "\\begin{align}\n", " P(\\text{orange} | X = x) &= P(\\text{blue} | X = x) \\\\\n", " \\frac{P( X = x | \\text{orange} ) P( \\text{orange} )}{P(X = x)} &= \\frac{P( X = x | \\text{blue} ) P( \\text{blue} )}{P(X = x)} \\\\\n", " P( X = x | \\text{orange} ) P( \\text{orange} ) &= P( X = x | \\text{blue} ) P( \\text{blue} )\n", "\\end{align}\n", "\n", "As described in Sec 2.3.3, $P(\\text{orange})$ is the same as $P(\\text{blue})$, and the class-conditional densities $P(X = x | \\text{orange})$ and $P(X = x | \\text{blue})$ are Gaussian mixtures that are known exactly in the simulation. Hence we can work out the optimal Bayes decision boundary exactly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.3\n", "##### Question\n", "Derive equation (2.24).\n", "\n", "Suppose there are $N$ points, uniformly distributed in the unit sphere in $\\mathbb{R}^p$. What is the median distance from the origin to the closest data point? \n", "\n", "##### Solution\n", "[ref: Example Sheet 1: Solutions](http://www2.stat.duke.edu/~banks/cam-lectures.dir/Examples1-solns.pdf)\n", "\n", "(1)\n", "Suppose $r$ is the median distance from the origin to the closest data point.\n", "\n", "Let $d$ denote the (random) distance from the origin to the closest of the $N$ points. \n", "Because $r$ is the median of $d$,\n", "$$P(d \\geq r) = \\frac{1}{2}$$\n", "\n", "Since $d$ is the minimum of the $N$ distances, $d \\geq r$ holds exactly when every point has distance $\\geq r$. \n", "\n", "Together, we get:\n", "$$P(\\text{all N points have distance } \\geq r) = \\frac{1}{2}$$\n", "\n", "(2)\n", "First, all points are uniformly distributed in the unit sphere in $\\mathbb{R}^p$. 
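\n",
"\n",
"A quick numerical sanity check (a sketch, not part of the derivation; it assumes the standard trick of drawing a normalized Gaussian direction with a $U^{1/p}$ radius to sample uniformly in the unit ball): the fraction of sampled points with $\\|x\\| < r$ should be close to $r^p$, the volume ratio used below.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"\n",
"def sample_unit_ball(n, p, rng):\n",
"    # uniform direction (normalized Gaussian) times a U**(1/p) radius\n",
"    x = rng.randn(n, p)\n",
"    x /= np.linalg.norm(x, axis=1, keepdims=True)\n",
"    return x * (rng.rand(n) ** (1.0 / p))[:, None]\n",
"\n",
"p, r = 5, 0.8\n",
"pts = sample_unit_ball(100000, p, rng)\n",
"frac = np.mean(np.linalg.norm(pts, axis=1) < r)\n",
"# frac is close to r**p = 0.32768\n",
"```\n",
"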
\n", "\n", "Second, [the p-dimensional volume of a Euclidean ball of radius R in p-dimensional Euclidean space is](https://en.wikipedia.org/wiki/Volume_of_an_n-ball):\n", "$$V_p(R) = \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}R^p$$\n", "\n", "Together, for any point $x$, \n", "\\begin{align}\n", "P(x \\text{ has distance } \\geq r) &= 1 - P(x \\text{ has distance } < r) \\\\\n", " &= 1 - \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}r^p \\Big/ \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}1^p \\\\\n", " &= 1 - r^p\n", "\\end{align}\n", "\n", "Then, since the points are independent:\n", "\\begin{align}\n", " P(\\text{all N points have distance } \\geq r) &= P^N (x \\text{ has distance } \\geq r) \\\\\n", " &= (1 - r^p)^N\n", "\\end{align}\n", "\n", "(3)\n", "In all,\n", "$$\\frac{1}{2} = P(\\text{all N points have distance } \\geq r) = (1 - r^p)^N$$\n", "\n", "we get the solution:\n", "$$r = \\left ( 1 - \\left ( \\frac{1}{2} \\right )^{1/N} \\right )^{1/p}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.4\n", "##### Question\n", "The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. \n", "\n", "Consider inputs drawn from a spherical multinormal distribution $X \\sim N(0,I_p)$. The squared distance from any sample point to the origin has a $\\chi^2_p$ distribution with mean $p$. Consider a prediction point $x_0$ drawn from this distribution, and let $a = x_0 \\big/ \\| x_0 \\|$ be an associated unit vector. Let $z_i = a^T x_i$ be the projection of each of the training points on this direction.\n", "\n", "Show that the $z_i$ are distributed $N(0,1)$ with expected squared distance from the origin 1, while the target point has expected squared distance $p$ from the origin.\n", "\n", "##### Solution\n", "$z_i = a^T x_i$, which is a linear combination. 
Moreover, $x_i \\sim N(0, I_p)$ means that its elements are all independent.\n", "\n", "As [the variance of a linear combination](https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_.28Bienaym.C3.A9_formula.29) is:\n", "$$\\operatorname{Var}\\left( \\sum_{i=1}^N a_iX_i\\right) = \\sum_{i=1}^N a_i^2\\operatorname{Var}(X_i)$$\n", "\n", "We get:\n", "\\begin{align}\n", "E(z_i) &= a^T E(x_i) \\\\\n", " & = 0\n", "\\end{align}\n", "and\n", "\\begin{align}\n", "\\operatorname{Var} (z_i) &= \\sum_j a_j^2 \\operatorname{Var}(x_{ij}) \\\\\n", " & = \\sum_j a_j^2 \\\\\n", " & = \\| a \\|^2_2 \\\\\n", " & = 1\n", "\\end{align}\n", "\n", "Thus, as a linear combination of independent Gaussians, $z_i \\sim N(0,1)$.\n", "\n", "The target point $x_0$ has expected squared distance $E \\| x_0 \\|_2^2 = p$, since $\\| x_0 \\|_2^2 \\sim \\chi^2_p$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.5\n", "#### (a)\n", "Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument. \n", "\n", "#### Solution\n", "First, we give:\n", "\n", "1. for $y_0 = x_0^T \\beta + \\epsilon; \\ \\epsilon \\sim N(0, \\sigma^2)$:\n", " + $E_{y_0 | x_0}(y_0) = E(y_0 | x_0) = E(x_0^T \\beta + \\epsilon) = x_0^T \\beta$\n", " + $\\operatorname{Var}_{y_0 | x_0}(y_0) = \\operatorname{Var}(y_0 | x_0) = \\operatorname{Var}(x_0^T \\beta + \\epsilon) = \\operatorname{Var}(\\epsilon) = \\sigma^2$\n", "\n", "2. for $\\hat{y}_0 = x_0^T \\hat{\\beta} = x_0^T \\beta + x_0^T (X^T X)^{-1} X^T \\epsilon$: \n", " + expected value:\n", " $$\n", " E_{\\tau}(\\hat{y}_0) = x_0^T \\beta = E(y_0 | x_0) \\quad \\text{(unbiased)}\n", " $$\n", " + variance:\n", " \\begin{align}\n", " \\operatorname{Var}_{\\tau}(\\hat{y_0}) &= \\operatorname{Var}_{\\tau}(x_0^T \\hat{\\beta}) \\\\\n", " &= x_0^T \\operatorname{Var}_{\\tau}(\\hat{\\beta}) x_0 \\\\\n", " &= x_0^T E_{\\tau}((X^T X)^{-1} \\sigma^2) x_0 \\quad \\text{see Eq 3.8} \\\\\n", " &= E_{\\tau} x_0^T (X^T X)^{-1} x_0 \\sigma^2\n", " \\end{align}\n", " \n", "3. 
[Proof of variance and bias relationship](https://en.wikipedia.org/wiki/Mean_squared_error): \n", "\\begin{align}\n", " E( (\\hat{\\theta} - \\theta)^2 ) &= E( \\left (\\hat{\\theta} - E(\\hat{\\theta}) \\right )^2 ) + (E(\\hat{\\theta}) - \\theta)^2 \\\\\n", " &= \\operatorname{Var}(\\hat{\\theta}) + \\operatorname{Bias}^2 (\\hat{\\theta}, \\theta)\n", "\\end{align}\n", "\n", "Thus, \n", "\\begin{align}\n", " \\operatorname{EPE}(x_0) &= E_{y_0 | x_0} E_{\\tau} (y_0 - \\hat{y}_0)^2 \\\\\n", " &= E_{\\tau} E_{\\color{blue}{y_0 | x_0}} (\\color{blue}{y_0} - \\hat{y}_0)^2 \\\\\n", " &= E_{\\tau} \\left ( \\operatorname{Var}(y_0 | x_0) + (E_{y_0 | x_0}(y_0) - \\hat{y}_0)^2 \\right ) \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + E_{\\color{blue}{\\tau}} \\left( E(y_0 | x_0) - \\color{blue}{\\hat{y}_0} \\right )^2 \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + \\operatorname{Var}_{\\tau}(\\hat{y}_0) + \\left ( E_{\\tau}(\\hat{y}_0) - E(y_0 | x_0) \\right)^2 \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + \\operatorname{Var}_{\\tau}(\\hat{y}_0) + \\left ( E_{\\tau}(\\hat{y}_0) - x_0^T \\beta \\right)^2 \\\\\n", " &= \\sigma^2 + E_{\\tau} x_0^T (X^T X)^{-1} x_0 \\sigma^2 + 0^2\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (b)\n", "Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).\n", "\n", "\n", "#### Solution\n", "[ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf)\n", "\n", "1. 
$x_0$ is a $p \\times 1$ vector, and $\\mathbf{X}$ is an $N \\times p$ matrix, hence $x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 = C_{1 \\times 1} = \\operatorname{trace}(C_{1 \\times 1})$\n", "\n", "2. [property of the covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) \n", "\\begin{align}\n", " \\operatorname{Cov}(x_0) &= E(x_0 x_0^T) - E(x_0) E(x_0)^T \\\\\n", " &= E(x_0 x_0^T) \\quad \\text{as } E(\\mathbf{X}) = 0 \\text{ and $x_0$ is picked randomly}\n", "\\end{align}\n", "\n", "Thus,\n", "\\begin{align}\n", " E_{x_0} \\operatorname{EPE}(x_0) &= E_{x_0} x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 \\sigma^2 + \\sigma^2 \\\\\n", " &= E_{x_0} \\operatorname{trace} \\left ( x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 \\right ) \\sigma^2 + \\sigma^2 \\\\\n", " &= E_{x_0} \\operatorname{trace} \\left ( (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 x_0^T \\right ) \\sigma^2 + \\sigma^2 \\quad \\text{cyclic property} \\\\\n", " &\\approx E_{x_0} \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) x_0 x_0^T \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{as } \\mathbf{X}^T \\mathbf{X} \\to N \\operatorname{Cov}(\\mathbf{X}) \\\\\n", " &= \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) \\, E_{x_0}(x_0 x_0^T) \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{linearity, interchange}\\\\\n", " &= \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) \\, \\operatorname{Cov}(x_0) \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{see 2. above}\\\\\n", " &= \\operatorname{trace} (I_p) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{as } \\operatorname{Cov}(x_0) = \\operatorname{Cov}(\\mathbf{X}) \\\\\n", " &= p \\frac{\\sigma^2}{N} + \\sigma^2\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.6\n", "#### Question\n", "Consider a regression problem with inputs $x_i$ and outputs $y_i$, and a parameterized model $f_{\\theta}(x)$ to be fit by least squares. 
Show that if there are observations with tied or identical values of $x$, then the fit can be obtained from a reduced weighted least squares problem.\n", "\n", "#### Solution\n", "[ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf)\n", "\n", "Assume we have $N$ training samples, and let $N_u$ be the number of *unique* inputs $x$. For the $i$th unique input $x_i$, let $y_i = \\{ y_{i,1}, y_{i,2}, \\dotsc, y_{i, n_i} \\}$ be its $n_i$ observations.\n", "\n", "\\begin{align}\n", " \\displaystyle \\operatorname{argmin}_{\\theta} \\sum_{k=1}^{N} (y_k - f_{\\theta}(x_k))^2 &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} (y_{ij} - f_{\\theta}(x_i))^2 \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} \\left ( y_{ij}^2 - 2 f_{\\theta}(x_i) y_{ij} + f_{\\theta}(x_i)^2 \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{blue}{\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2} - 2 f_{\\theta}(x_i) \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij} + f_{\\theta}(x_i)^2 \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{red}{(\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij})^2} - 2 f_{\\theta}(x_i) \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij} + f_{\\theta}(x_i)^2 - \\color{red}{(\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij})^2} + \\color{blue}{\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2} \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{red}{\\bar{y}_i^2} - 2 f_{\\theta}(x_i) \\bar{y}_i + f_{\\theta}(x_i)^2 - \\color{red}{\\bar{y}_i^2} + \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\quad \\text{def: } \\bar{y}_i = \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}\\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 - \\bar{y}_i^2 + \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\left ( \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 - \\sum_{i=1}^{N_u} n_i \\bar{y}_i^2 + \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\left ( \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 + \\mathcal{C} \\right ) \\quad \\text{as the $y_{ij}$ are fixed} \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 \n", "\\end{align}\n", "\n", "Thus it is a *weighted* least squares problem, as each squared residual $( \\bar{y}_i - f_{\\theta}(x_i) )^2$ is weighted by $n_i$, and a *reduced* one, since $N_u < N$ when ties are present." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.7\n", "Suppose we have a sample of $N$ pairs $x_i, y_i$ drawn i.i.d. from the distribution characterized as follows:\n", "\\begin{align}\n", " &x_i \\sim h(x), &\\text{the design density} \\\\\n", " &y_i = f(x_i) + \\epsilon_i, &\\text{$f$ is the regression function} \\\\\n", " &\\epsilon_i \\sim (0, \\sigma^2), &\\text{mean zero, variance $\\sigma^2$}\n", "\\end{align}\n", "\n", "We construct an estimator for $f$ *linear* in the $y_i$,\n", "$$\\hat{f}(x_0) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i,$$ \n", "where the weights $\\ell_i(x_0; \\mathcal{X})$ do not depend on the $y_i$, but do depend on the entire training sequence of $x_i$, denoted here by $\\mathcal{X}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (a) \n", "Show that linear regression and $k$-nearest-neighbor regression are members of this class of estimators. 
Describe explicitly the weights $\\ell_i(x_0; \\mathcal{X})$ in each of these cases.\n", "\n", "#### solution\n", "1. for linear regression \n", " $\\hat{\\beta} = (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{y}$\n", " \n", " so:\n", " \\begin{align}\n", " \\hat{f}(x_0) &= x_0^T \\hat{\\beta} \\\\\n", " &= x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{y} \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\left [ x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\right ]_i \\ y_i\n", " \\end{align}\n", " \n", "2. for $k$-nearest-neighbor regression \n", " \\begin{align}\n", " \\hat{f}(x_0) &= \\frac{1}{k} \\displaystyle \\sum_{x_i \\in N_k(x_0)} y_i \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\frac{1}{k} \\ I(x_i \\in N_k(x_0)) \\ y_i\n", " \\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (b)\n", "Decompose the conditional mean-squared error \n", "$$E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2$$\n", "into a conditional squared bias and a conditional variance component. Like $\\mathcal{X}$, $\\mathcal{Y}$ represents the entire training sequence of $y_i$.\n", "\n", "#### solution\n", "[ref: Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)\n", "Here $\\mathcal{X}$ is fixed, and $\\mathcal{Y}$ varies. Also $x_0$ and $f(x_0)$ are fixed. 
\n", "so: \n", "\\begin{align}\n", "E_{\\mathcal{Y} | \\mathcal{X}}(\\hat{f}(x_0)) &= E_{\\mathcal{Y} | \\mathcal{X}} \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( y_i ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) + \\epsilon_i ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, \\left ( E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) ) + E_{\\mathcal{Y} | \\mathcal{X}} ( \\epsilon_i ) \\right ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) \\\\\n", " &= C \\quad \\text{constant when $\\mathcal{X}$ is fixed}\n", "\\end{align}\n", "\n", "Thus:\n", "\n", "\\begin{align}\n", " {} & E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} ( \\hat{f}(x_0) - f(x_0) )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) + E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0) \\right )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right )^2 + E_{\\mathcal{Y} | \\mathcal{X}} \\left [ 2 \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right ) \\left ( \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0)}_\\text{constant} \\right ) \\right ] + \\left ( \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0)}_\\text{constant} \\right )^2 \\\\\n", " = & \\operatorname{Var}(\\hat{f}(x_0)) + 2 \\left ( C - f(x_0) \\right ) \\, \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right )}_{= 0} + \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0)) \\\\\n", " = & \\operatorname{Var}(\\hat{f}(x_0)) + \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0))\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (c)\n", "Decompose the (unconditional) mean-squared error \n", "$$E_{\\mathcal{Y}, \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2$$\n", "into a squared bias and a variance component. \n", "\n", "#### solution\n", "\\begin{align}\n", " {} & E_{\\mathcal{Y}, \\mathcal{X}} (f(x_0) - \\hat{f}(x_0))^2 \\\\\n", " = & E_{\\mathcal{X}} E_{\\mathcal{Y} | \\mathcal{X}} (f(x_0) - \\hat{f}(x_0))^2 \\\\\n", "\\end{align}\n", "then decompose the inner conditional expectation as in (b), and take $E_{\\mathcal{X}}$ of the conditional variance and squared bias terms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (d)\n", "Establish a relationship between the squared biases and variances in the above two cases.\n", "\n", "#### solution\n", "##### 1. 
for variance,\n", "As shown in (b), \n", "$$E_{\\mathcal{Y} | \\mathcal{X}}(\\hat{f}(x_0)) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i)$$\n", "\n", "and also:\n", "$$\\hat{f}(x_0) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i$$ \n", "$$y_i = f(x_i) + \\epsilon_i$$\n", "\n", "Thus,\n", "\\begin{align}\n", " \\operatorname{Var}(\\hat{f}(x_0)) &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) \\right ) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i - \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\left ( y_i - f(x_i) \\right ) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\epsilon_i \\right )^2 \\\\\n", " &= \\sigma^2 \\displaystyle \\sum_{i=1}^N \\ell_i^2(x_0; \\mathcal{X}) \\quad \\text{if the $\\epsilon_i$ are independent} \\\\\n", "\\end{align}\n", "\n", "##### 2. for squared bias,\n", "\\begin{align}\n", " \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0)) &= \\left ( E_{\\mathcal{Y} | \\mathcal{X}} \\hat{f}(x_0) - f(x_0) \\right )^2 \\\\\n", " &= \\left( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) - f(x_0) \\right )^2 \\\\\n", "\\end{align}\n", "\n", "##### 3. guess\n", "The variance depends only on the noise $\\epsilon$, while the squared bias depends only on the regression function $f$. Because $\\epsilon$ is independent of $f(x)$, there may be no direct relationship between the variance and the squared bias. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.8\n", "Compare the classification performance of linear regression and $k$-nearest neighbor classification on the zipcode data. In particular, consider only the 2’s and 3’s, and $k$ = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. 
The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# keep only the 2s and 3s; zip.test is assumed to sit beside zip.train under ./res/\n", "def load_digits(path):\n", "    dat = pd.read_csv(path, header=None, sep=' ')\n", "    dat.rename(columns={0: 'digital'}, inplace=True)\n", "    dat = dat.query('digital == 2 or digital == 3')\n", "    return dat.dropna(axis=1)\n", "\n", "train_dat = load_digits('./res/zip.train')\n", "test_dat = load_digits('./res/zip.test')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_dat.set_index('digital', inplace=True)\n", "test_dat.set_index('digital', inplace=True)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
[garbled HTML previews of two DataFrames (3 rows x 256 columns of pixel values each) omitted; the text/plain rendering below shows test_dat.head(3)]
" ], "text/plain": [ " 1 2 3 4 5 6 7 8 9 10 \\\n", "digital \n", "3 -1.000 -1.000 -1.000 -0.593 0.700 1.000 1.000 1.000 1.000 0.853 \n", "2 -0.996 0.572 0.396 0.063 -0.506 -0.847 -1.000 -1.000 -1.000 -1.000 \n", "2 -1.000 -1.000 0.469 0.413 1.000 1.000 0.462 -0.116 -0.937 -1.000 \n", "\n", " ... 247 248 249 250 251 252 253 254 255 256 \n", "digital ... \n", "3 ... 1.000 0.717 0.333 0.162 -0.393 -1.000 -1 -1 -1 -1 \n", "2 ... -0.668 -1.000 -1.000 -1.000 -1.000 -1.000 -1 -1 -1 -1 \n", "2 ... 1.000 1.000 1.000 0.270 -0.280 -0.855 -1 -1 -1 -1 \n", "\n", "[3 rows x 256 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_dat.head(3)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def eva(conf, train_dat, test_dat):\n", " x_train = train_dat.values\n", " y_train = train_dat.index.values\n", " \n", " conf['cls'].fit(x_train, y_train)\n", " \n", " x_test = test_dat.values\n", " y_test = test_dat.index.values\n", " \n", " y_pred = conf['cls'].predict(x_test)\n", " \n", " accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)\n", " print('{}, parameter: {}, accuracy: {:.4}'.format(conf['name'], conf['parameter'], accuracy))" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LogisticRegression, parameter: None, accuracy: 0.9643\n", "KNeighborsClassifier, parameter: N=1, accuracy: 0.9753\n", "KNeighborsClassifier, parameter: N=3, accuracy: 0.9698\n", "KNeighborsClassifier, parameter: N=5, accuracy: 0.9698\n", "KNeighborsClassifier, parameter: N=7, accuracy: 0.967\n", "KNeighborsClassifier, parameter: N=15, accuracy: 0.9615\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "configuration = [\n", " {'cls': LogisticRegression(), 'name': 'LogisticRegression', 
'parameter': None},\n", " {'cls': KNeighborsClassifier(n_neighbors=1), 'name': 'KNeighborsClassifier', 'parameter': 'N=1'},\n", " {'cls': KNeighborsClassifier(n_neighbors=3), 'name': 'KNeighborsClassifier', 'parameter': 'N=3'},\n", " {'cls': KNeighborsClassifier(n_neighbors=5), 'name': 'KNeighborsClassifier', 'parameter': 'N=5'},\n", " {'cls': KNeighborsClassifier(n_neighbors=7), 'name': 'KNeighborsClassifier', 'parameter': 'N=7'},\n", " {'cls': KNeighborsClassifier(n_neighbors=15), 'name': 'KNeighborsClassifier', 'parameter': 'N=15'},\n", "]\n", "\n", "for conf in configuration:\n", " eva(conf, train_dat, test_dat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.9\n", "Consider a linear regression model with $p$ parameters, fit by least squares to a set of training data $(x_1,y_1), \\dotsc,(x_N,y_N)$ drawn at random from a population. Let $\\hat{\\beta}$ be the least squares estimate. Suppose we have some test data $(\\hat{x}_1, \\hat{y}_1), \\dotsc,(\\hat{x}_M, \\hat{y}_M)$ drawn at random from the same population as the training data.\n", "\n", "If $R_{tr}(\\beta) = \\frac{1}{N} \\sum_1^N (y_i - \\beta^T x_i)^2$, and $R_{te}(\\beta) = \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\beta^T \\hat{x}_i)^2$, prove that\n", "$$E[R_{tr}(\\hat{\\beta})] \\leq E[R_{te}(\\hat{\\beta})],$$\n", "where the expectations are taken over all that is random in each expression.\n", "\n", "#### solution\n", "Ref: Hint from [Homework 2 - Hector Corrada Bravo](http://www.cbcb.umd.edu/~hcorrada/PracticalML/assignments/hw2.pdf) \n", "\n", "Define $\\beta^{\\ast}$ to be the least squares estimate fitted on the **test data**.\n", "\n", "1. \n", "Take $M = N$ for simplicity. The training and test data are then i.i.d. samples of the same size from the same population, and $\\hat{\\beta}$ and $\\beta^{\\ast}$ are fitted in the same way, so $R_{tr}(\\hat{\\beta})$ and $R_{te}(\\beta^{\\ast})$ are identically distributed:\n", "$$E[R_{tr}(\\hat{\\beta})] = E[R_{te}(\\beta^{\\ast})]$$\n", "\n", "2. 
\n", "On the other hand, because $\\beta^{\\ast}$ is the least squares estimate on the **test data**, namely, \n", "$$\\beta^\\ast = \\operatorname{argmin}_\\beta \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\beta^T \\hat{x}_i)^2$$\n", "we have:\n", "$$R_{te}(\\beta^\\ast) = \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\left (\\color{blue}{\\beta^{\\ast}} \\right )^T \\hat{x}_i)^2 \\leq \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\color{blue}{\\hat{\\beta}}^T \\hat{x}_i)^2 = R_{te}(\\hat{\\beta})$$\n", "Thus:\n", "$$E[R_{te}(\\beta^\\ast)] \\leq E[R_{te}(\\hat{\\beta})]$$\n", "\n", "3. \n", "Finally, we get:\n", "$$E[R_{tr}(\\hat{\\beta})] = E[R_{te}(\\beta^{\\ast})] \\leq E[R_{te}(\\hat{\\beta})]$$\n", "namely,\n", "$$E[R_{tr}(\\hat{\\beta})] \\leq E[R_{te}(\\hat{\\beta})]$$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }