{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.6 Statistical Models, Supervised Learning and Function Approximation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to find a useful approximation $\\hat{f(x)}$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs. \n", "\n", "- We saw that squared error loss lead us to the regression function $f(x)=E(Y|X = x)$ for a qualitive response.\n", "\n", "- The nearest-neighbor methods estimates directly the conditional expections, but may result in large errors for the high dimension input spaces. (*The curse of dimensionality*)\n", "\n", "We anticipate using other classes of models for f(x) to overcome the dimensionality problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)\n", "\n", "(2.29) Suppose that our data arose from a statistical model: $$Y = f(X) + \\varepsilon$$\n", "\n", "where the random error $\\varepsilon$ has $E(\\varepsilon) = 0$ and independent of X.\n", "\n", "The additive error model is a useful approximation to the truth. Generally there will be unmeasured variables that also contribute to the output, including measurement error. The additive model assumes that can capture all departures via the error $\\varepsilon$.\n", "\n", "The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary. For example, simple modifications can be made to avoid the independence assumption, e.g $Var(Y| X = x) = \\sigma(x)$, and now both the mean and variance depend on X." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6.2 Supervised Learning\n", "\n", "Suppose that errors are additive and that the model $Y = f(X) + \\varepsilon$. Supervised learning attempts to learn $f$ through a teacher and learns by examples (i.e by a training set of observations ($\\mathcal{T} = (x_i, y_i), i = 1...N$)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6.3 Function Approximation\n", "\n", "The goal is to obtain a useful approximation to f(x) for all x in some region of $\\mathbb{R}^p$, given the representations in $\\mathcal{T}$. \n", "\n", "Many of the approximations have associated a set of parameters $\\theta$ that can be modified to suit the data at hand, e.g the linear model $f(x)=x^T\\beta$ has $\\theta=\\beta$. Another class of useful approximators can be expressed as *linear basis expansions*:\n", "\n", "$$f_\\theta(x) = \\sum_{k=1}^{K}h_k(x)\\theta_k$$\n", "\n", "where the $h_k$ are a suitable set of functions or transformations of the input vector x. We also encounter *nonlinear expansions*, such as the sigmoid transformation: \n", "\n", "$$h_k(x) = \\frac{1}{1+exp(-x^T\\beta_k)}$$\n", "\n", "We can use least squares to estimate the parameters $\\theta$ in $f_\\theta$, by minimizing the residual sum-of-squares: \n", "\n", "$$RSS(\\theta)=\\sum_{i=1}^N(y_i - f_\\theta(x_i)) ^ 2$$\n", "\n", "While least squares is very convenient, it is not only criterion used and in some cases would not make sense. A more general principle for estimation is *maximum likelihood estimation*. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "A more general principle for estimation is *maximum likelihood estimation*. Suppose we have a random sample $y_i,\\ i = 1, \\dots, N$ from a density $Pr_\\theta(y)$ indexed by some parameters $\\theta$. The log-probability of the observed sample is\n", "\n", "$$L(\\theta)=\\sum_{i=1}^N \\log Pr_\\theta(y_i)$$\n", "\n", "Least squares for the additive error model $Y = f_\\theta(X)+\\varepsilon$, with $\\varepsilon \\sim N(0, \\sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood\n", "\n", "$$Pr(Y|X, \\theta)=N(f_\\theta(X), \\sigma^2)$$\n", "\n", "The log-likelihood of the data is\n", "\n", "$$\n", "\\begin{align}\n", "L(\\theta) &= \\sum_{i=1}^N \\log \\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2}}e^{-\\frac{(y_i-f_\\theta(x_i))^2}{2\\sigma^2}}\\right)\\\\\n", "&= \\sum_{i=1}^N \\log \\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\right) - \\sum_{i=1}^N \\frac{(y_i-f_\\theta(x_i))^2}{2\\sigma^2}\\\\\n", "&= \\sum_{i=1}^N \\log \\left((2\\pi\\sigma^2)^{-\\frac{1}{2}}\\right) - \\frac{1}{2\\sigma^2}\\sum_{i=1}^N(y_i-f_\\theta(x_i))^2\\\\\n", "&= \\sum_{i=1}^N \\left(-\\frac{1}{2}\\log(2\\pi)-\\log\\sigma\\right) - \\frac{1}{2\\sigma^2}\\sum_{i=1}^N(y_i-f_\\theta(x_i))^2\\\\\n", "&= -\\frac{N}{2}\\log(2\\pi)-N\\log\\sigma - \\frac{1}{2\\sigma^2}\\sum_{i=1}^N(y_i-f_\\theta(x_i))^2\n", "\\end{align}\n", "$$\n", "\n", "and the only term involving $\\theta$ is the last, which is $RSS(\\theta)$ up to a scalar negative multiplier, so maximizing the likelihood is equivalent to minimizing $RSS(\\theta)$.\n", "\n", "A more interesting example is the multinomial likelihood for the regression function $Pr(G|X)$ for a qualitative output $G$. Suppose we have a model $Pr(G = \\mathcal{G}_k\\,|\\,X = x) = p_{k, \\theta}(x),\\ k = 1, \\dots, K$, indexed by $\\theta$. Then the log-likelihood (also known as the cross-entropy) is\n", "\n", "$$L(\\theta)=\\sum_{i=1}^N \\log p_{g_i, \\theta}(x_i)$$" ] }
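, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below is a rough numerical check (not from the text) of the equivalence derived above: for a linear model $f_\\theta(x) = x^T\\theta$ with simulated data and $\\sigma$ treated as known, minimizing $RSS(\\theta)$ and maximizing the Gaussian log-likelihood should produce essentially the same $\\hat{\\theta}$. The true coefficients, noise level and sample size are illustrative choices, and SciPy is assumed to be available for the generic optimizer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.optimize import minimize\n", "\n", "# Simulated data from Y = X beta + eps with eps ~ N(0, sigma^2) (illustrative values)\n", "rng = np.random.default_rng(1)\n", "N, p = 200, 3\n", "X = rng.normal(size=(N, p))\n", "beta_true = np.array([1.0, -2.0, 0.5])\n", "sigma = 0.5\n", "y = X @ beta_true + rng.normal(scale=sigma, size=N)\n", "\n", "def rss(theta):\n", "    # Residual sum-of-squares for the linear model f_theta(x) = x^T theta\n", "    return np.sum((y - X @ theta) ** 2)\n", "\n", "def neg_log_lik(theta):\n", "    # Negative Gaussian log-likelihood, with sigma treated as known\n", "    resid = y - X @ theta\n", "    return N / 2 * np.log(2 * np.pi * sigma ** 2) + np.sum(resid ** 2) / (2 * sigma ** 2)\n", "\n", "theta0 = np.zeros(p)\n", "theta_ls = minimize(rss, theta0).x\n", "theta_ml = minimize(neg_log_lik, theta0).x\n", "print('least squares     :', theta_ls)\n", "print('maximum likelihood:', theta_ml)  # should agree up to optimizer tolerance" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }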