{ "cells": [ { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# %load ../../preconfig.py\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['axes.grid'] = False\n", "import seaborn as sns\n", "sns.set(color_codes=True)\n", "\n", "import numpy as np\n", "import pandas as pd\n", "#import itertools\n", "\n", "import sklearn.metrics\n", "\n", "import logging\n", "logger = logging.getLogger()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercises\n", "=========" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.1\n", "##### Question\n", "Suppose each of the $K$ classes has an associated target $t_k$, which is a vector of all zeros, except a one in the $k$th position. Show that classifying to the largest element of $\\hat{y}$ amounts to choosing the closest target, $\\min_k \\|t_k - \\hat{y}\\|$, if the elements of $\\hat{y}$ sum to one.\n", "\n", "\n", "##### Solution\n", "Given: \n", "$t_k = e_k$; \n", "$\\sum_i \\hat{y}_i = 1$.\n", "\n", "We need to prove: $\\text{arg max}_i \\hat{y}_i = \\text{arg min}_k \\|t_k - \\hat{y}\\|$.\n", "\n", "Proof (reading $\\|\\cdot\\|$ as the $\\ell_1$ norm and assuming $\\hat{y}_j \\geq 0$, which together with the sum-to-one condition gives $\\hat{y}_k \\leq 1$): \n", "\n", "\\begin{align}\n", "\\text{arg min}_k \\|t_k - \\hat{y}\\| &= \\text{arg min}_k \\left ( \\displaystyle \\sum_{j \\neq k} \\hat{y}_j + | \\hat{y}_k - 1 | \\right )\\\\\n", " &= \\text{arg min}_k \\left ( \\displaystyle \\sum_{j \\neq k} \\hat{y}_j + 1 - \\hat{y}_k \\right ) \\\\\n", " &= \\text{arg min}_k \\left ( 1 - \\hat{y}_k + 1 - \\hat{y}_k \\right ) \\\\\n", " &= \\text{arg min}_k 2(1 - \\hat{y}_k) \\\\\n", " &= \\text{arg max}_k \\hat{y}_k\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.2\n", "##### Question\n", "Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.\n", "\n", "\n", "##### Solution\n", "[ref: Elements of Statistical Learning - Andrew 
Tulloch](https://github.com/ajtulloch/Elements-of-Statistical-Learning/blob/master/ESL-Solutions.pdf)\n", "\n", "As in Eq. (2.22) of the textbook:\n", "\\begin{align}\n", " \\hat{G}(x) &= \\text{arg min}_{g \\in \\mathcal{G}} [ 1 - P(g | X = x) ] \\\\\n", " &= \\text{arg max}_{g \\in \\mathcal{G}} P(g | X = x)\n", "\\end{align}\n", "\n", "The optimal Bayes decision boundary is where:\n", "\\begin{align}\n", " P(\\text{orange} | X = x) &= P(\\text{blue} | X = x) \\\\\n", " \\frac{P( X = x | \\text{orange} ) P( \\text{orange} )}{P(X = x)} &= \\frac{P( X = x | \\text{blue} ) P( \\text{blue} )}{P(X = x)} \\\\\n", " P( X = x | \\text{orange} ) P( \\text{orange} ) &= P( X = x | \\text{blue} ) P( \\text{blue} )\n", "\\end{align}\n", "\n", "As described in Sec 2.3.3, $P(\\text{orange})$ is the same as $P(\\text{blue})$, and the class-conditional densities $P(X = x | \\text{orange})$ and $P(X = x | \\text{blue})$ are Gaussian mixtures that are known exactly in the simulation. Hence we can work out the optimal Bayes decision boundary exactly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.3\n", "##### Question\n", "Derive equation (2.24).\n", "\n", "Suppose there are $N$ points, uniformly distributed in the unit sphere in $\\mathbb{R}^p$. What is the median distance from the origin to the closest data point? \n", "\n", "##### Solution\n", "[ref: Example Sheet 1: Solutions](http://www2.stat.duke.edu/~banks/cam-lectures.dir/Examples1-solns.pdf)\n", "\n", "(1)\n", "Suppose $r$ is the median distance from the origin to the closest data point.\n", "\n", "Let $d$ denote the (random) distance from the origin to the closest of the $N$ points. \n", "Because $r$ is the median of $d$,\n", "$$P(d \\geq r) = \\frac{1}{2}$$\n", "\n", "Since $d$ is the minimum of the $N$ distances, $d \\geq r$ holds exactly when every point has distance $\\geq r$. \n", "\n", "Together, we get:\n", "$$P(\\text{all N points have distance } \\geq r) = \\frac{1}{2}$$\n", "\n", "(2)\n", "First, all points are uniformly distributed in the unit sphere in $\\mathbb{R}^p$. 
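\n",
"\n",
"A quick numerical sanity check (a sketch, not part of the derivation; it assumes the standard trick of drawing a normalized Gaussian direction with a $U^{1/p}$ radius to sample uniformly in the unit ball): the fraction of sampled points with $\\|x\\| < r$ should be close to $r^p$, the volume ratio used below.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"\n",
"def sample_unit_ball(n, p, rng):\n",
"    # uniform direction (normalized Gaussian) times a U**(1/p) radius\n",
"    x = rng.randn(n, p)\n",
"    x /= np.linalg.norm(x, axis=1, keepdims=True)\n",
"    return x * (rng.rand(n) ** (1.0 / p))[:, None]\n",
"\n",
"p, r = 5, 0.8\n",
"pts = sample_unit_ball(100000, p, rng)\n",
"frac = np.mean(np.linalg.norm(pts, axis=1) < r)\n",
"# frac is close to r**p = 0.32768\n",
"```\n",
"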
\n", "\n", "Second, [the p-dimensional volume of a Euclidean ball of radius R in p-dimensional Euclidean space is](https://en.wikipedia.org/wiki/Volume_of_an_n-ball):\n", "$$V_p(R) = \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}R^p$$\n", "\n", "Together, for any point $x$, \n", "\\begin{align}\n", "P(x \\text{ has distance } \\geq r) &= 1 - P(x \\text{ has distance } < r) \\\\\n", " &= 1 - \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}r^p \\Big/ \\frac{\\pi^{p/2}}{\\Gamma(\\frac{p}{2} + 1)}1^p \\\\\n", " &= 1 - r^p\n", "\\end{align}\n", "\n", "Then, since the points are independent:\n", "\\begin{align}\n", " P(\\text{all N points have distance } \\geq r) &= P^N (x \\text{ has distance } \\geq r) \\\\\n", " &= (1 - r^p)^N\n", "\\end{align}\n", "\n", "(3)\n", "In all,\n", "$$\\frac{1}{2} = P(\\text{all N points have distance } \\geq r) = (1 - r^p)^N$$\n", "\n", "we get the solution:\n", "$$r = \\left ( 1 - \\left ( \\frac{1}{2} \\right )^{1/N} \\right )^{1/p}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.4\n", "##### Question\n", "The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. \n", "\n", "Consider inputs drawn from a spherical multinormal distribution $X \\sim N(0,I_p)$. The squared distance from any sample point to the origin has a $\\chi^2_p$ distribution with mean $p$. Consider a prediction point $x_0$ drawn from this distribution, and let $a = x_0 \\big/ \\| x_0 \\|$ be an associated unit vector. Let $z_i = a^T x_i$ be the projection of each of the training points on this direction.\n", "\n", "Show that the $z_i$ are distributed $N(0,1)$ with expected squared distance from the origin 1, while the target point has expected squared distance $p$ from the origin.\n", "\n", "##### Solution\n", "$z_i = a^T x_i$, which is a linear combination. 
Moreover, $x_i \\sim N(0, I_p)$ means that its elements are all independent.\n", "\n", "As [the variance of a linear combination](https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_.28Bienaym.C3.A9_formula.29) is:\n", "$$\\operatorname{Var}\\left( \\sum_{i=1}^N a_iX_i\\right) = \\sum_{i=1}^N a_i^2\\operatorname{Var}(X_i)$$\n", "\n", "We get:\n", "\\begin{align}\n", "E(z_i) &= a^T E(x_i) \\\\\n", " & = 0\n", "\\end{align}\n", "and\n", "\\begin{align}\n", "\\operatorname{Var} (z_i) &= \\sum_j a_j^2 \\operatorname{Var}(x_{ij}) \\\\\n", " & = \\sum_j a_j^2 \\\\\n", " & = \\| a \\|^2_2 \\\\\n", " & = 1\n", "\\end{align}\n", "\n", "Thus, as a linear combination of independent Gaussians, $z_i \\sim N(0,1)$.\n", "\n", "The target point $x_0$ has expected squared distance $E \\| x_0 \\|_2^2 = p$, since $\\| x_0 \\|_2^2 \\sim \\chi^2_p$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.5\n", "#### (a)\n", "Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument. \n", "\n", "#### Solution\n", "First, we give:\n", "\n", "1. for $y_0 = x_0^T \\beta + \\epsilon; \\ \\epsilon \\sim N(0, \\sigma^2)$:\n", " + $E_{y_0 | x_0}(y_0) = E(y_0 | x_0) = E(x_0^T \\beta + \\epsilon) = x_0^T \\beta$\n", " + $\\operatorname{Var}_{y_0 | x_0}(y_0) = \\operatorname{Var}(y_0 | x_0) = \\operatorname{Var}(x_0^T \\beta + \\epsilon) = \\operatorname{Var}(\\epsilon) = \\sigma^2$\n", "\n", "2. for $\\hat{y}_0 = x_0^T \\hat{\\beta} = x_0^T \\beta + x_0^T (X^T X)^{-1} X^T \\epsilon$: \n", " + expected value:\n", " $$\n", " E_{\\tau}(\\hat{y}_0) = x_0^T \\beta = E(y_0 | x_0) \\quad \\text{(unbiased)}\n", " $$\n", " + variance:\n", " \\begin{align}\n", " \\operatorname{Var}_{\\tau}(\\hat{y_0}) &= \\operatorname{Var}_{\\tau}(x_0^T \\hat{\\beta}) \\\\\n", " &= x_0^T \\operatorname{Var}_{\\tau}(\\hat{\\beta}) x_0 \\\\\n", " &= x_0^T E_{\\tau}((X^T X)^{-1} \\sigma^2) x_0 \\quad \\text{see Eq 3.8} \\\\\n", " &= E_{\\tau} x_0^T (X^T X)^{-1} x_0 \\sigma^2\n", " \\end{align}\n", " \n", "3. 
[Proof of variance and bias relationship](https://en.wikipedia.org/wiki/Mean_squared_error): \n", "\\begin{align}\n", " E( (\\hat{\\theta} - \\theta)^2 ) &= E( \\left (\\hat{\\theta} - E(\\hat{\\theta}) \\right )^2 ) + (E(\\hat{\\theta}) - \\theta)^2 \\\\\n", " &= \\operatorname{Var}(\\hat{\\theta}) + \\operatorname{Bias}^2 (\\hat{\\theta}, \\theta)\n", "\\end{align}\n", "\n", "Thus, \n", "\\begin{align}\n", " \\operatorname{EPE}(x_0) &= E_{y_0 | x_0} E_{\\tau} (y_0 - \\hat{y}_0)^2 \\\\\n", " &= E_{\\tau} E_{\\color{blue}{y_0 | x_0}} (\\color{blue}{y_0} - \\hat{y}_0)^2 \\\\\n", " &= E_{\\tau} \\left ( \\operatorname{Var}(y_0 | x_0) + (E_{y_0 | x_0}(y_0) - \\hat{y}_0)^2 \\right ) \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + E_{\\color{blue}{\\tau}} \\left( E(y_0 | x_0) - \\color{blue}{\\hat{y}_0} \\right )^2 \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + \\operatorname{Var}_{\\tau}(\\hat{y}_0) + \\left ( E_{\\tau}(\\hat{y}_0) - E(y_0 | x_0) \\right)^2 \\\\\n", " &= \\operatorname{Var}(y_0 | x_0) + \\operatorname{Var}_{\\tau}(\\hat{y}_0) + \\left ( E_{\\tau}(\\hat{y}_0) - x_0^T \\beta \\right)^2 \\\\\n", " &= \\sigma^2 + E_{\\tau} x_0^T (X^T X)^{-1} x_0 \\sigma^2 + 0^2\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (b)\n", "Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).\n", "\n", "\n", "#### Solution\n", "[ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf)\n", "\n", "1. 
$x_0$ is a $p \\times 1$ vector, and $\\mathbf{X}$ is an $N \\times p$ matrix, hence $x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 = C_{1 \\times 1} = \\operatorname{trace}(C_{1 \\times 1})$\n", "\n", "2. [property of the covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) \n", "\\begin{align}\n", " \\operatorname{Cov}(x_0) &= E(x_0 x_0^T) - E(x_0) E(x_0)^T \\\\\n", " &= E(x_0 x_0^T) \\quad \\text{as } E(\\mathbf{X}) = 0 \\text{ and $x_0$ is picked randomly}\n", "\\end{align}\n", "\n", "Thus,\n", "\\begin{align}\n", " E_{x_0} \\operatorname{EPE}(x_0) &= E_{x_0} x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 \\sigma^2 + \\sigma^2 \\\\\n", " &= E_{x_0} \\operatorname{trace} \\left ( x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 \\right ) \\sigma^2 + \\sigma^2 \\\\\n", " &= E_{x_0} \\operatorname{trace} \\left ( (\\mathbf{X}^T \\mathbf{X})^{-1} x_0 x_0^T \\right ) \\sigma^2 + \\sigma^2 \\quad \\text{cyclic property} \\\\\n", " &\\approx E_{x_0} \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) x_0 x_0^T \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{as } \\mathbf{X}^T \\mathbf{X} \\to N \\operatorname{Cov}(\\mathbf{X}) \\\\\n", " &= \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) \\, E_{x_0}(x_0 x_0^T) \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{linearity, interchange}\\\\\n", " &= \\operatorname{trace} \\left ( \\operatorname{Cov}^{-1}(\\mathbf{X}) \\, \\operatorname{Cov}(x_0) \\right ) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{see 2. above}\\\\\n", " &= \\operatorname{trace} (I_p) \\frac{\\sigma^2}{N} + \\sigma^2 \\quad \\text{as } \\operatorname{Cov}(x_0) = \\operatorname{Cov}(\\mathbf{X}) \\\\\n", " &= p \\frac{\\sigma^2}{N} + \\sigma^2\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.6\n", "#### Question\n", "Consider a regression problem with inputs $x_i$ and outputs $y_i$, and a parameterized model $f_{\\theta}(x)$ to be fit by least squares. 
Show that if there are observations with tied or identical values of $x$, then the fit can be obtained from a reduced weighted least squares problem.\n", "\n", "#### Solution\n", "[ref: A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani](http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf)\n", "\n", "Assume we have $N$ training samples, and let $N_u$ be the number of *unique* inputs $x$. For the $i$th unique input $x_i$, let $y_i = \\{ y_{i,1}, y_{i,2}, \\dotsc, y_{i, n_i} \\}$ be its $n_i$ observations.\n", "\n", "\\begin{align}\n", " \\displaystyle \\operatorname{argmin}_{\\theta} \\sum_{k=1}^{N} (y_k - f_{\\theta}(x_k))^2 &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} (y_{ij} - f_{\\theta}(x_i))^2 \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} \\left ( y_{ij}^2 - 2 f_{\\theta}(x_i) y_{ij} + f_{\\theta}(x_i)^2 \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{blue}{\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2} - 2 f_{\\theta}(x_i) \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij} + f_{\\theta}(x_i)^2 \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{red}{(\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij})^2} - 2 f_{\\theta}(x_i) \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij} + f_{\\theta}(x_i)^2 - \\color{red}{(\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij})^2} + \\color{blue}{\\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2} \\right ) \\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\color{red}{\\bar{y}_i^2} - 2 f_{\\theta}(x_i) \\bar{y}_i + f_{\\theta}(x_i)^2 - \\color{red}{\\bar{y}_i^2} + \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\quad \\text{def: } \\bar{y}_i = \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}\\\\\n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 - \\bar{y}_i^2 + \\frac{1}{n_i} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\left ( \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 - \\sum_{i=1}^{N_u} n_i \\bar{y}_i^2 + \\sum_{i=1}^{N_u} \\sum_{j=1}^{n_i} y_{ij}^2 \\right ) \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\left ( \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 + \\mathcal{C} \\right ) \\quad \\text{as the $y_{ij}$ are fixed} \\\\ \n", " &= \\operatorname{argmin}_{\\theta} \\sum_{i=1}^{N_u} n_i \\left ( \\bar{y}_i - f_{\\theta}(x_i) \\right )^2 \n", "\\end{align}\n", "\n", "Thus it is a *weighted* least squares problem, as each squared residual $( \\bar{y}_i - f_{\\theta}(x_i) )^2$ is weighted by $n_i$, and a *reduced* one, since $N_u < N$ when ties are present." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.7\n", "Suppose we have a sample of $N$ pairs $x_i, y_i$ drawn i.i.d. from the distribution characterized as follows:\n", "\\begin{align}\n", " &x_i \\sim h(x), &\\text{the design density} \\\\\n", " &y_i = f(x_i) + \\epsilon_i, &\\text{$f$ is the regression function} \\\\\n", " &\\epsilon_i \\sim (0, \\sigma^2), &\\text{mean zero, variance $\\sigma^2$}\n", "\\end{align}\n", "\n", "We construct an estimator for $f$ *linear* in the $y_i$,\n", "$$\\hat{f}(x_0) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i,$$ \n", "where the weights $\\ell_i(x_0; \\mathcal{X})$ do not depend on the $y_i$, but do depend on the entire training sequence of $x_i$, denoted here by $\\mathcal{X}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (a) \n", "Show that linear regression and $k$-nearest-neighbor regression are members of this class of estimators. 
Describe explicitly the weights $\\ell_i(x_0; \\mathcal{X})$ in each of these cases.\n", "\n", "#### solution\n", "1. for linear regression \n", " $\\hat{\\beta} = (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{y}$\n", " \n", " so:\n", " \\begin{align}\n", " \\hat{f}(x_0) &= x_0^T \\hat{\\beta} \\\\\n", " &= x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{y} \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\left [ x_0^T (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T \\right ]_i \\ y_i\n", " \\end{align}\n", " \n", "2. for $k$-nearest-neighbor regression \n", " \\begin{align}\n", " \\hat{f}(x_0) &= \\frac{1}{k} \\displaystyle \\sum_{x_i \\in N_k(x_0)} y_i \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\frac{1}{k} \\ I(x_i \\in N_k(x_0)) \\ y_i\n", " \\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (b)\n", "Decompose the conditional mean-squared error \n", "$$E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2$$\n", "into a conditional squared bias and a conditional variance component. Like $\\mathcal{X}$, $\\mathcal{Y}$ represents the entire training sequence of $y_i$.\n", "\n", "#### solution\n", "[ref: Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)\n", "Here $\\mathcal{X}$ is fixed, and $\\mathcal{Y}$ varies. Also $x_0$ and $f(x_0)$ are fixed. 
\n", "so: \n", "\\begin{align}\n", "E_{\\mathcal{Y} | \\mathcal{X}}(\\hat{f}(x_0)) &= E_{\\mathcal{Y} | \\mathcal{X}} \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( y_i ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) + \\epsilon_i ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, \\left ( E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) ) + E_{\\mathcal{Y} | \\mathcal{X}} ( \\epsilon_i ) \\right ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_i) ) \\\\\n", " &= \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) \\\\\n", " &= C \\quad \\text{constant when $\\mathcal{X}$ is fixed}\n", "\\end{align}\n", "\n", "Thus:\n", "\n", "\\begin{align}\n", " {} & E_{\\mathcal{Y} | \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} ( \\hat{f}(x_0) - f(x_0) )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) + E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0) \\right )^2 \\\\\n", " = & E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right )^2 + E_{\\mathcal{Y} | \\mathcal{X}} \\left [ 2 \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right ) \\left ( \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0)}_\\text{constant} \\right ) \\right ] + \\left ( \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) - f(x_0)}_\\text{constant} \\right )^2 \\\\\n", " = & \\operatorname{Var}(\\hat{f}(x_0)) + 2 \\left ( C - f(x_0) \\right ) \\, \\underbrace{E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} (\\hat{f}(x_0)) \\right )}_{= 0} + \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0)) \\\\\n", " = & \\operatorname{Var}(\\hat{f}(x_0)) + \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0))\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (c)\n", "Decompose the (unconditional) mean-squared error \n", "$$E_{\\mathcal{Y}, \\mathcal{X}} ( f(x_0) - \\hat{f}(x_0) )^2$$\n", "into a squared bias and a variance component. \n", "\n", "#### solution\n", "\\begin{align}\n", " {} & E_{\\mathcal{Y}, \\mathcal{X}} (f(x_0) - \\hat{f}(x_0))^2 \\\\\n", " = & E_{\\mathcal{X}} E_{\\mathcal{Y} | \\mathcal{X}} (f(x_0) - \\hat{f}(x_0))^2 \\\\\n", "\\end{align}\n", "then decompose the inner conditional expectation as in (b), and take $E_{\\mathcal{X}}$ of the conditional variance and squared bias terms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (d)\n", "Establish a relationship between the squared biases and variances in the above two cases.\n", "\n", "#### solution\n", "##### 1. 
for variance,\n", "As shown in (b), \n", "$$E_{\\mathcal{Y} | \\mathcal{X}}(\\hat{f}(x_0)) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i)$$\n", "\n", "and also:\n", "$$\\hat{f}(x_0) = \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i$$ \n", "$$y_i = f(x_i) + \\epsilon_i$$\n", "\n", "Thus,\n", "\\begin{align}\n", " \\operatorname{Var}(\\hat{f}(x_0)) &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) - E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\hat{f}(x_0) \\right ) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) y_i - \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\left ( y_i - f(x_i) \\right ) \\right )^2 \\\\\n", " &= E_{\\mathcal{Y} | \\mathcal{X}} \\left ( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\epsilon_i \\right )^2 \\\\\n", " &= \\sigma^2 \\displaystyle \\sum_{i=1}^N \\ell_i^2(x_0; \\mathcal{X}) \\quad \\text{if the $\\epsilon_i$ are independent} \\\\\n", "\\end{align}\n", "\n", "##### 2. for squared bias,\n", "\\begin{align}\n", " \\operatorname{Bias}^2 (\\hat{f}(x_0), f(x_0)) &= \\left ( E_{\\mathcal{Y} | \\mathcal{X}} \\hat{f}(x_0) - f(x_0) \\right )^2 \\\\\n", " &= \\left( \\displaystyle \\sum_{i=1}^N \\ell_i(x_0; \\mathcal{X}) \\, f(x_i) - f(x_0) \\right )^2 \\\\\n", "\\end{align}\n", "\n", "##### 3. guess\n", "The variance depends only on the noise $\\epsilon$, while the squared bias depends only on the regression function $f$. Because $\\epsilon$ is independent of $f(x)$, there may be no direct relationship between the variance and the squared bias. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.8\n", "Compare the classification performance of linear regression and $k$-nearest neighbor classification on the zipcode data. In particular, consider only the 2’s and 3’s, and $k$ = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. 
The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# keep only the 2s and 3s; zip.test is assumed to sit beside zip.train under ./res/\n", "def load_digits(path):\n", "    dat = pd.read_csv(path, header=None, sep=' ')\n", "    dat.rename(columns={0: 'digital'}, inplace=True)\n", "    dat = dat.query('digital == 2 or digital == 3')\n", "    return dat.dropna(axis=1)\n", "\n", "train_dat = load_digits('./res/zip.train')\n", "test_dat = load_digits('./res/zip.test')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_dat.set_index('digital', inplace=True)\n", "test_dat.set_index('digital', inplace=True)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
[garbled HTML previews of two DataFrames (3 rows x 256 columns of pixel values each) omitted; the text/plain rendering below shows test_dat.head(3)]
" ], "text/plain": [ " 1 2 3 4 5 6 7 8 9 10 \\\n", "digital \n", "3 -1.000 -1.000 -1.000 -0.593 0.700 1.000 1.000 1.000 1.000 0.853 \n", "2 -0.996 0.572 0.396 0.063 -0.506 -0.847 -1.000 -1.000 -1.000 -1.000 \n", "2 -1.000 -1.000 0.469 0.413 1.000 1.000 0.462 -0.116 -0.937 -1.000 \n", "\n", " ... 247 248 249 250 251 252 253 254 255 256 \n", "digital ... \n", "3 ... 1.000 0.717 0.333 0.162 -0.393 -1.000 -1 -1 -1 -1 \n", "2 ... -0.668 -1.000 -1.000 -1.000 -1.000 -1.000 -1 -1 -1 -1 \n", "2 ... 1.000 1.000 1.000 0.270 -0.280 -0.855 -1 -1 -1 -1 \n", "\n", "[3 rows x 256 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_dat.head(3)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def eva(conf, train_dat, test_dat):\n", " x_train = train_dat.values\n", " y_train = train_dat.index.values\n", " \n", " conf['cls'].fit(x_train, y_train)\n", " \n", " x_test = test_dat.values\n", " y_test = test_dat.index.values\n", " \n", " y_pred = conf['cls'].predict(x_test)\n", " \n", " accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)\n", " print('{}, parameter: {}, accuracy: {:.4}'.format(conf['name'], conf['parameter'], accuracy))" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LogisticRegression, parameter: None, accuracy: 0.9643\n", "KNeighborsClassifier, parameter: N=1, accuracy: 0.9753\n", "KNeighborsClassifier, parameter: N=3, accuracy: 0.9698\n", "KNeighborsClassifier, parameter: N=5, accuracy: 0.9698\n", "KNeighborsClassifier, parameter: N=7, accuracy: 0.967\n", "KNeighborsClassifier, parameter: N=15, accuracy: 0.9615\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "configuration = [\n", " {'cls': LogisticRegression(), 'name': 'LogisticRegression', 
'parameter': None},\n", " {'cls': KNeighborsClassifier(n_neighbors=1), 'name': 'KNeighborsClassifier', 'parameter': 'N=1'},\n", " {'cls': KNeighborsClassifier(n_neighbors=3), 'name': 'KNeighborsClassifier', 'parameter': 'N=3'},\n", " {'cls': KNeighborsClassifier(n_neighbors=5), 'name': 'KNeighborsClassifier', 'parameter': 'N=5'},\n", " {'cls': KNeighborsClassifier(n_neighbors=7), 'name': 'KNeighborsClassifier', 'parameter': 'N=7'},\n", " {'cls': KNeighborsClassifier(n_neighbors=15), 'name': 'KNeighborsClassifier', 'parameter': 'N=15'},\n", "]\n", "\n", "for conf in configuration:\n", " eva(conf, train_dat, test_dat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ex 2.9\n", "Consider a linear regression model with $p$ parameters, fit by least squares to a set of training data $(x_1,y_1), \\dotsc,(x_N,y_N)$ drawn at random from a population. Let $\\hat{\\beta}$ be the least squares estimate. Suppose we have some test data $(\\hat{x}_1, \\hat{y}_1), \\dotsc,(\\hat{x}_M, \\hat{y}_M)$ drawn at random from the same population as the training data.\n", "\n", "If $R_{tr}(\\beta) = \\frac{1}{N} \\sum_1^N (y_i - \\beta^T x_i)^2$, and $R_{te}(\\beta) = \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\beta^T \\hat{x}_i)^2$, prove that\n", "$$E[R_{tr}(\\hat{\\beta})] \\leq E[R_{te}(\\hat{\\beta})],$$\n", "where the expectations are taken over all that is random in each expression.\n", "\n", "#### solution\n", "Ref: Hint from [Homework 2 - Hector Corrada Bravo](http://www.cbcb.umd.edu/~hcorrada/PracticalML/assignments/hw2.pdf) \n", "\n", "Define $\\beta^{\\ast}$ to be the least squares estimate fitted on the **test data**.\n", "\n", "1. \n", "Take $M = N$ for simplicity. The training and test data are then i.i.d. samples of the same size from the same population, and $\\hat{\\beta}$ and $\\beta^{\\ast}$ are fitted in the same way, so $R_{tr}(\\hat{\\beta})$ and $R_{te}(\\beta^{\\ast})$ are identically distributed:\n", "$$E[R_{tr}(\\hat{\\beta})] = E[R_{te}(\\beta^{\\ast})]$$\n", "\n", "2. 
\n", "On the other hand, because $\\beta^{\\ast}$ is the least squares estimate on the **test data**, namely, \n", "$$\\beta^\\ast = \\operatorname{argmin}_\\beta \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\beta^T \\hat{x}_i)^2$$\n", "we have:\n", "$$R_{te}(\\beta^\\ast) = \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\left (\\color{blue}{\\beta^{\\ast}} \\right )^T \\hat{x}_i)^2 \\leq \\frac{1}{M} \\sum_1^M (\\hat{y}_i - \\color{blue}{\\hat{\\beta}}^T \\hat{x}_i)^2 = R_{te}(\\hat{\\beta})$$\n", "Thus:\n", "$$E[R_{te}(\\beta^\\ast)] \\leq E[R_{te}(\\hat{\\beta})]$$\n", "\n", "3. \n", "Finally, we get:\n", "$$E[R_{tr}(\\hat{\\beta})] = E[R_{te}(\\beta^{\\ast})] \\leq E[R_{te}(\\hat{\\beta})]$$\n", "namely,\n", "$$E[R_{tr}(\\hat{\\beta})] \\leq E[R_{te}(\\hat{\\beta})]$$" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }