{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Newton's Method and Fisher Scoring (KL Chapter 14)\n", "\n", "Consider maximizing log-likelihood function $L(\\mathbf{\\theta})$, $\\theta \\in \\Theta \\subset \\mathbb{R}^p$.\n", "\n", "## Notations\n", "\n", "* **Gradient** (or **score**) of $L$:\n", " $$\n", " \\nabla L(\\theta) = \\begin{pmatrix}\n", " \\frac{\\partial L(\\theta)}{\\partial \\theta_1} \\\\\n", " \\vdots \\\\\n", " \\frac{\\partial L(\\theta)}{\\partial \\theta_p}\n", " \\end{pmatrix}\n", " $$\n", "\n", "* **Differential** of $L$:\n", " $$\n", " dL(\\theta) = [\\nabla L(\\theta)]^T\n", " $$\n", " \n", "* **Hessian** of $L$: \n", " $$\n", " \\nabla^2 L(\\theta) = d^2 L(\\theta) = \\begin{pmatrix}\n", " \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_1 \\partial \\theta_1} & \\dots & \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_1 \\partial \\theta_p} \\\\\n", " \\vdots & \\ddots & \\vdots \\\\\n", " \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_p \\partial \\theta_1} & \\dots & \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_p \\partial \\theta_p}\n", " \\end{pmatrix}\n", " $$\n", " \n", "* **Observed information matrix**: \n", "\n", "$$\n", " - \\nabla^2 L(\\theta) = - d^2 L(\\theta).\n", "$$\n", " \n", "* **Expected (Fisher) information matrix**: \n", "$$\n", " \\mathbf{E}[- \\nabla^2 L(\\theta)] = \\mathbf{E}[- d^2 L(\\theta)].\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton's method\n", "\n", "* Newton's method was originally developed for finding roots of nonlinear equations\n", "$f(\\mathbf{x}) = \\mathbf{0}$ (KL 5.4).\n", "\n", "* Newton's method, aka **Newton-Raphson method**, is considered the gold standard for its fast (quadratic) convergence\n", "$$\n", "\t\\frac{\\|\\mathbf{\\theta}^{(t+1)} - \\mathbf{\\theta}^*\\|}{\\|\\mathbf{\\theta}^{(t)} - \\mathbf{\\theta}^*\\|^2} \\to \\text{constant}.\n", "$$\n", "\n", "* Idea: iterative quadratic approximation. \n", "\n", "* Taylor expansion around the current iterate $\\mathbf{\\theta}^{(t)}$\n", "$$\n", "\tL(\\mathbf{\\theta}) \\approx L(\\mathbf{\\theta}^{(t)}) + dL(\\mathbf{\\theta}^{(t)}) (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)}) + \\frac 12 (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)})^T d^2L(\\mathbf{\\theta}^{(t)}) (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)})\n", "$$\n", "and then maximize the quadratic approximation.\n", "\n", "* To maximize the quadratic function, we equate its gradient to zero\n", "$$\n", "\t\\nabla L(\\theta^{(t)}) + [d^2L(\\theta^{(t)})] (\\theta - \\theta^{(t)}) = \\mathbf{0}_p,\n", "$$\n", "which suggests the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\theta^{(t+1)} &=& \\theta^{(t)} - [d^2L(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)}) \\\\\n", "\t&=& \\theta^{(t)} + [-d^2L(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "We call this **naive Newton's method**.\n", "\n", "* **Stability issue**: naive Newton's iterate is **not** guaranteed to be an ascent algorithm. It's equally happy to head uphill or downhill. Remedies: \n", " 0. approximate $-d^2L(\\theta^{(t)})$ by a positive definite $\\mathbf{A}$ (if it's not), **and** \n", " 0. line search (backtracking). \n", "Why insist on a _positive definite_ approximation of Hessian? 
By first-order Taylor expansion,\n",
"$$\n",
"\\begin{eqnarray*}\n",
" & & L(\\theta^{(t)} + s \\Delta \\theta^{(t)}) - L(\\theta^{(t)}) \\\\\n",
" &=& dL(\\theta^{(t)}) s \\Delta \\theta^{(t)} + o(s) \\\\\n",
" &=& s dL(\\theta^{(t)}) [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)}) + o(s).\n",
"\\end{eqnarray*}\n",
"$$\n",
"For $s$ sufficiently small, the right-hand side is strictly positive when $\\mathbf{A}^{(t)}$ is positive definite.\n",
"\n",
"* In summary, a **Newton-type algorithm** iterates according to\n",
"$$\n",
"\t\\boxed{ \\theta^{(t+1)} = \\theta^{(t)} + s [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)}) \n",
"\t= \\theta^{(t)} + s \\Delta \\theta^{(t)} }\n",
"$$\n",
"where $\\mathbf{A}^{(t)}$ is a positive definite (pd) approximation of $-d^2L(\\theta^{(t)})$ and $s$ is a step length.\n",
"\n",
"* For strictly concave $L$, $-d^2L(\\theta^{(t)})$ is always positive definite. Line search is still needed to guarantee convergence.\n",
"\n",
"* Line search strategies: step-halving ($s=1,1/2,\\ldots$), golden section search, cubic interpolation, Armijo rule, ... Note the **Newton direction** \n",
"$$\n",
"\\Delta \\theta^{(t)} = [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)})\n",
"$$\n",
"only needs to be calculated once per iteration. The cost of line search lies mainly in objective function evaluations.\n",
"\n",
"* How to approximate $-d^2L(\\theta)$? More of an art than a science; it often requires problem-specific analysis. \n",
"\n",
"* Taking $\\mathbf{A} = \\mathbf{I}$ leads to the method of **steepest ascent**, aka **gradient ascent**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fisher's scoring method\n",
"\n",
"* **Fisher's scoring method**: replace $-d^2L(\\theta)$ by the expected Fisher information matrix\n",
"$$\n",
"\t\\mathbf{I}(\\theta) = \\mathbf{E}[-d^2L(\\theta)] = \\mathbf{E}[\\nabla L(\\theta) dL(\\theta)] \\succeq \\mathbf{0}_{p \\times p},\n",
"$$\n",
"which is positive semidefinite (psd) under exchangeability of expectation and differentiation.\n",
"\n",
" Therefore the Fisher scoring algorithm iterates according to\n",
"$$\n",
"\t\\boxed{ \\theta^{(t+1)} = \\theta^{(t)} + s [\\mathbf{I}(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)})}.\n",
"$$" ] },
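{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the boxed iteration concrete, below is a minimal Julia sketch (not from KL) of a generic Newton-type ascent with step-halving line search. The function name `newton_type` and its arguments `L`, `∇L`, `A` (functions returning the objective, the score, and a positive definite approximation of $-d^2L(\\theta)$) are illustrative assumptions, not a library API.\n",
"\n",
"```julia\n",
"# Generic Newton-type ascent with step-halving line search (sketch).\n",
"# L(θ): objective, ∇L(θ): score, A(θ): pd approximation of -d²L(θ).\n",
"function newton_type(L, ∇L, A, θ0; maxiter = 100, tol = 1e-8, maxhalf = 20)\n",
"    θ = copy(θ0)\n",
"    for iter in 1:maxiter\n",
"        g  = ∇L(θ)\n",
"        Δθ = A(θ) \\\\ g             # Newton direction, computed once per iteration\n",
"        s  = 1.0                   # start from the full step\n",
"        Lcur = L(θ)\n",
"        θnew = θ + s * Δθ\n",
"        nhalf = 0\n",
"        while L(θnew) < Lcur && nhalf < maxhalf\n",
"            s /= 2                 # step-halving until the step is an ascent\n",
"            θnew = θ + s * Δθ\n",
"            nhalf += 1\n",
"        end\n",
"        θ = θnew\n",
"        maximum(abs.(g)) < tol && break   # stop when the score is numerically zero\n",
"    end\n",
"    return θ\n",
"end\n",
"```\n",
"\n",
"With `A(θ)` equal to the observed information $-d^2L(\\theta)$ (when pd) this is Newton's method; with `A(θ)` equal to the expected information $\\mathbf{I}(\\theta)$ it is Fisher scoring." ] },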
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized linear model (GLM) (KL 14.7)\n",
"\n",
"### Logistic regression\n",
"\n",
"Let's consider a concrete example: logistic regression.\n",
"\n",
"* The goal is to predict whether a credit card transaction is fraudulent ($y_i=1$) or not ($y_i=0$). Predictors ($\\mathbf{x}_i$) include: time of transaction, last location, merchant, ...\n",
"\n",
"* $y_i \\in \\{0,1\\}$, $\\mathbf{x}_i \\in \\mathbb{R}^{p}$. Model $y_i \\sim \\text{Bernoulli}(p_i)$.\n",
"\n",
"* Logistic regression. The density of $y_i$ is\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\\\\n",
"\t&=& e^{y_i \\ln p_i + (1-y_i) \\ln (1-p_i)} \\\\\n",
"\t&=& e^{y_i \\ln \\frac{p_i}{1-p_i} + \\ln (1-p_i)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = p_i &=& \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+ e^{\\mathbf{x}_i^T \\beta}} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\mathbf{x}_i^T \\beta &=& \\ln \\left( \\frac{p_i}{1-p_i} \\right) \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"$$\n",
"\n",
"* Given data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$,\n",
"\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\tL_n(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\ln p_i + (1-y_i) \\ln (1-p_i) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\ln (1 + e^{\\mathbf{x}_i^T \\beta}) \\right] \\\\\n",
"\t\\nabla L_n(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+e^{\\mathbf{x}_i^T \\beta}} \\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - p_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})\t\\\\\n",
"\t- d^2L_n(\\beta) &=& \\sum_{i=1}^n p_i(1-p_i) \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} = \\text{diag}(w_1,\\ldots,w_n), \\, w_i = p_i (1-p_i) \\\\\n",
"\t\\mathbf{I}_n(\\beta) &=& \\mathbf{E} [- d^2L_n(\\beta)] = - d^2L_n(\\beta).\n",
"\\end{eqnarray*}\n",
"$$\n",
"\n",
"* Newton's method == Fisher's scoring iteration: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-d^2 L(\\beta^{(t)})]^{-1} \\nabla L(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is equivalent to solving a weighted least squares problem $\\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2$. Thus the name **IRWLS (iteratively re-weighted least squares)**.\n",
"\n",
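"To see the updates in action, here is a minimal hand-rolled IRWLS sketch in Julia; the function name `logistic_irwls`, the full step $s=1$, and a well-conditioned design matrix are illustrative assumptions, not part of KL.\n",
"\n",
"```julia\n",
"# Hand-rolled IRWLS (= Newton = Fisher scoring) for logistic regression.\n",
"# X: n-by-p design matrix, y: 0/1 response vector; full step s = 1 assumed.\n",
"function logistic_irwls(X, y; maxiter = 25, tol = 1e-8)\n",
"    p = size(X, 2)\n",
"    β = zeros(p)\n",
"    for iter in 1:maxiter\n",
"        η  = X * β\n",
"        pr = 1 ./ (1 .+ exp.(-η))            # fitted probabilities p_i\n",
"        w  = pr .* (1 .- pr)                 # weights w_i = p_i (1 - p_i)\n",
"        z  = η .+ (y .- pr) ./ w             # working responses z_i\n",
"        βnew = (X' * (w .* X)) \\\\ (X' * (w .* z))   # weighted least squares step\n",
"        maximum(abs.(βnew .- β)) < tol && return βnew\n",
"        β = βnew\n",
"    end\n",
"    return β\n",
"end\n",
"```\n",
"\n",
"Each iteration solves exactly the weighted least squares problem displayed above, with the weights $w_i$ and working responses $z_i$ recomputed from the current $\\beta^{(t)}$.\n",
"\n",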
"### GLM \n",
"\n",
"Let's consider the more general class of generalized linear models (GLM).\n",
"\n",
"| Family | Canonical Link | Variance Function |\n",
"|------------------|-------------------------------|-------------------|\n",
"| Normal | $\\eta=\\mu$ | 1 |\n",
"| Poisson | $\\eta=\\log \\mu$ | $\\mu$ |\n",
"| Binomial | $\\eta=\\log \\left( \\frac{\\mu}{1 - \\mu} \\right)$ | $\\mu (1 - \\mu)$ |\n",
"| Gamma | $\\eta = \\mu^{-1}$ | $\\mu^2$ |\n",
"| Inverse Gaussian | $\\eta = \\mu^{-2}$ | $\\mu^3$ |\n",
"\n",
"* $Y$ belongs to an exponential family with density\n",
"$$\n",
"\tp(y|\\theta,\\phi) = \\exp \\left\\{ \\frac{y\\theta - b(\\theta)}{a(\\phi)} + c(y,\\phi) \\right\\}.\n",
"$$\n",
"    * $\\theta$: natural parameter. \n",
"    * $\\phi>0$: dispersion parameter. \n",
"\n",
"* GLM relates the mean $\\mu = \\mathbf{E}(Y|\\mathbf{x})$ to the linear predictor via a strictly increasing link function\n",
"$$\n",
"\tg(\\mu) = \\eta = \\mathbf{x}^T \\beta, \\quad \\mu = g^{-1}(\\eta).\n",
"$$\n",
"\n",
"* Score, Hessian, information:\n",
"\n",
"$$\n",
"\\begin{eqnarray*}\n",
" \t\\nabla L_n(\\beta) &=& \\sum_{i=1}^n \\frac{(y_i-\\mu_i) \\mu_i'(\\eta_i)}{\\sigma_i^2} \\mathbf{x}_i \\\\\n",
"\t- d^2 L_n(\\beta) &=& \\sum_{i=1}^n \\frac{[\\mu_i'(\\eta_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T - \\sum_{i=1}^n \\frac{(y_i - \\mu_i) \\theta''(\\eta_i)}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T\t\\\\\n",
"\t\\mathbf{I}_n(\\beta) &=& \\mathbf{E} [- d^2 L_n(\\beta)] = \\sum_{i=1}^n \\frac{[\\mu_i'(\\eta_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}.\n",
"\\end{eqnarray*} \n",
"$$\n",
"\n",
"* The Fisher scoring method\n",
"$$\n",
" \t\\beta^{(t+1)} = \\beta^{(t)} + s [\\mathbf{I}(\\beta^{(t)})]^{-1} \\nabla L_n(\\beta^{(t)})\n",
"$$\n",
"is IRWLS with weights $w_i = [\\mu_i'(\\eta_i)]^2/\\sigma_i^2$ and suitable working responses $z_i$.\n",
"\n",
"* For a **canonical link**, $\\theta = \\eta$, the second term of the Hessian vanishes and the Hessian coincides with the Fisher information matrix. Convex problem 😄\n",
"$$\n",
"\t\\text{Fisher's scoring = Newton's method}.\n",
"$$\n",
" \n",
"* For a non-canonical link, the problem is in general non-convex 😞\n",
"$$\n",
"\t\\text{Fisher's scoring algorithm} \\ne \\text{Newton's method}.\n",
"$$\n",
"Example: probit regression (binary response with probit link),\n",
"$$\n",
"\\begin{eqnarray*}\n",
" y_i &\\sim& \\text{Bernoulli}(p_i) \\\\\n",
" p_i &=& \\Phi(\\mathbf{x}_i^T \\beta) \\\\\n",
" \\eta_i &=& \\mathbf{x}_i^T \\beta = \\Phi^{-1}(p_i),\n",
"\\end{eqnarray*}\n",
"$$\n",
"where $\\Phi(\\cdot)$ is the cdf of a standard normal.\n",
" \n",
"* Julia, R and Matlab implement the Fisher scoring method, aka IRWLS, for GLMs.\n",
"    * [GLM.jl](https://github.com/JuliaStats/GLM.jl) package." ] },
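{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, a small GLM.jl usage sketch on simulated data (the variable names and the simulated model below are illustrative, not from KL). `LogitLink()` is the canonical link for the Bernoulli/binomial family, while `ProbitLink()` is non-canonical, so the latter fit relies on Fisher scoring rather than Newton's method.\n",
"\n",
"```julia\n",
"# Fisher scoring (IRWLS) for GLMs via the GLM.jl package -- a sketch on simulated data.\n",
"using GLM, DataFrames\n",
"\n",
"n = 500\n",
"x = randn(n)\n",
"p = 1 ./ (1 .+ exp.(-(0.5 .+ 1.5 .* x)))   # true success probabilities\n",
"y = Float64.(rand(n) .< p)                 # simulated 0/1 responses\n",
"df = DataFrame(x = x, y = y)\n",
"\n",
"logit_fit  = glm(@formula(y ~ x), df, Binomial(), LogitLink())    # canonical link\n",
"probit_fit = glm(@formula(y ~ x), df, Binomial(), ProbitLink())   # non-canonical link\n",
"coef(logit_fit), coef(probit_fit)\n",
"```" ] },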
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear regression - Gauss-Newton method (KL 14.4-14.6)\n",
"\n",
"* Now we finally get to the problem Gauss faced in 1800! Relocate Ceres by fitting 41 observations to a 6-parameter (nonlinear) orbit.\n",
"\n",
"* Nonlinear least squares (curve fitting): \n",
"$$\n",
"\t\\text{minimize} \\,\\, f(\\beta) = \\frac{1}{2} \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i, \\beta)]^2.\n",
"$$\n",
"For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve\n",
"$$\n",
"\t\\mu(x, \\beta_1,\\beta_2,\\beta_3) = \\frac{\\beta_3}{1 + e^{-\\beta_1 - \\beta_2 x}}.\n",
"$$\n",
"\n",
"* \"Score\" and \"information matrices\":\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\nabla f(\\beta) &=& - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla \\mu_i(\\beta) \\\\\n",
"\td^2 f(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\beta) d \\mu_i(\\beta) - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] d^2 \\mu_i(\\beta) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\beta) d \\mu_i(\\beta) = \\mathbf{J}(\\beta)^T \\mathbf{J}(\\beta),\n",
"\\end{eqnarray*}\n",
"$$\n",
"where $\\mathbf{J}(\\beta)^T = [\\nabla \\mu_1(\\beta), \\ldots, \\nabla \\mu_n(\\beta)] \\in \\mathbb{R}^{p \\times n}$.\n",
"\n",
"* **Gauss-Newton** (= \"Fisher's scoring algorithm\") uses $\\mathbf{I}(\\beta)$, which is always psd, as the Hessian approximation; since we are _minimizing_ $f$, the update is\n",
"$$\n",
"\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)})]^{-1} \\nabla f(\\beta^{(t)}) }\n",
"$$\n",
"\n",
"* **Levenberg-Marquardt** method, aka **damped least squares algorithm (DLS)**, adds a ridge term to the approximate Hessian\n",
"$$\n",
"\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)}) + \\tau \\mathbf{I}_p]^{-1} \\nabla f(\\beta^{(t)}) }\n",
"$$\n",
"bridging between Gauss-Newton and steepest descent.\n",
"\n",
"* Other approximations to the Hessian arise in nonlinear GLMs. See KL 14.4 for examples." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "153px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }