{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.8 Classes of Restricted Estimators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. Each of the classes has associated with it one or more parameters, sometimes called *smoothing* parameters, that control the effective size of the local neighborhood." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.8.1 Roughness Penalty and Bayesian Methods\n", "\n", "Here the class of functions is controlled by explicitly penalizing RSS(f) with a roughness penalty \n", "\n", "$$PRSS(f; \\lambda) = RSS(f) + \\lambda J(f)$$\n", "\n", "The user-selected $J(f)$ will be large for functions $f$ that vary too rapidly over small regions of input space. For example, the popular *cubic smoothing spline* for one-dimensional inputs is the solution to the PRSS:\n", "\n", "$$PRSS(f; \\lambda) = \\sum_{i=1}^N(y_i - f(x_i))^2 + \\lambda\\int[f^{''}(x)]^2dx$$\n", "\n", "The amount of penalty is dictated by $\\lambda>=0$:\n", "- For $\\lambda=0$, no penalty imposed, and any interpolating function will do\n", "- While for $\\lambda = \\infty$ only linear functions are permitted\n", "\n", "Penalty function, or *regularization* methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework:\n", "- The penalty J corresponds to a log-prior\n", "- and $PRSS(f; \\lambda)$ the log-posterior distribution;\n", "- and minimizing $PRSS(f; \\lambda)$ amounts to finding the posterior mode." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.8.2 Kernel Methods and Local Regression\n", "\n", "These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally.\n", "\n", "The local neighborhood is specified by a *kernel function* $K_\\lambda(x_0, x)$ which assigns weights to points x in a region around $x_0$, e.g the *Gaussian kernel* based on the Gaussian density function:\n", "\n", "$$K_\\lambda(x_0, x) = \\frac{1}{\\lambda}exp\\left[-\\frac{||x-x_0||^2}{2\\lambda}\\right]$$\n", "\n", "and assigns weights to points that die exponentially with their squared euclidean distance from $x_0$ and $\\lambda$ corresponds to the variance of the Gaussian density and controls the width of the neighborhood.\n", "\n", "(2.40) The simplest form of kernel estimate is the Nadaraya-Watson weighted average:\n", "\n", "$$\\hat{f}(x_0)=\\frac{\\sum_{i=1}^N K_\\lambda(x_0, x_i)y_i}{\\sum_{i=1}^N K_\\lambda(x_0, x_i)}$$\n", "\n", "(2.41) In general we can define a local regression estimate of $f(x_0)$ as $f_\\hat{\\theta}(x_0)$, where $\\hat{\\theta}$ minimizes:\n", "\n", "$$RSS(f_\\theta, x_0) = \\sum_{i=1}^N K_\\lambda(x_0, x_i)(y_i - f_\\theta(x_i))^2$$\n", "\n", "and $f_\\theta$ is some parameterized function, e.g:\n", "\n", "- $f_\\theta(x) = \\theta_0$, the constant function; this results in the Nadaraya-Watson estimate in (2.41)\n", "\n", "- $f_\\theta(x) = \\theta_0+\\theta_1x$ gives the local linear regression model.\n", "\n", "These methods needs to be modified in high dimensions, to avoid curse of dimensionality." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.8.3 Basis Functions and Dictionary Methods\n", "\n", "This class of methods includes the linear and polynomial expansions, but more importantly a wide variety of more flexible models. The models for $f$ is a *linear expansion of basis functions*:\n", "\n", "$$f_\\theta(x) = \\sum_{m=1}^M \\theta_{m}h_m(x)$$\n", "\n", "the term linear here refers to the action of the parameters $\\theta$.\n", "\n", "*TODO*: 1D polynomial splines of degrees K.\n", "\n", "*Radial basis functions* are symmetric p-dimensional kernels located at particular centroids:\n", "\n", "$$f_\\theta(x) = \\sum_{m=1}^M{K_{\\lambda_m}(\\mu_m, x)}\\theta_m$$\n", "\n", "e.g the Gaussian kernel $K_\\lambda(\\mu, x) = e^{-||x-\\mu||^2}/2\\lambda$ is popular. Radial basis functions have centroids $\\mu_m$ and scales $\\lambda_m$ that have to be determined. In general, we would like the data to dictate them.\n", "\n", "A single-layer feed-forward neural networks model with linear output weights can be thought of as an adaptive basis function methods: \n", "\n", "$$f_\\theta(x) = \\sum_{m=1}^M{\\beta_m\\sigma(\\alpha_m^T{x}+b_m)}$$\n", "\n", "where $\\sigma(x) = 1/(1+e^{-x})$ is know as the *activation* function.\n", "\n", "\n", "These adaptively chosen basis function methods are also know as *dictionary* methods, where one has available a infinite set or dictionary $\\mathcal{D}$ of candidate basis function from which to choose." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }