{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Newton's Method and Fisher Scoring (KL Chapter 14)\n", "\n", "Consider maximizing log-likelihood function $L(\\mathbf{\\theta})$, $\\theta \\in \\Theta \\subset \\mathbb{R}^p$.\n", "\n", "## Notations\n", "\n", "* **Gradient** (or **score**) of $L$:\n", " $$\n", " \\nabla L(\\theta) = \\begin{pmatrix}\n", " \\frac{\\partial L(\\theta)}{\\partial \\theta_1} \\\\\n", " \\vdots \\\\\n", " \\frac{\\partial L(\\theta)}{\\partial \\theta_p}\n", " \\end{pmatrix}\n", " $$\n", "\n", "* **Differential** of $L$:\n", " $$\n", " dL(\\theta) = [\\nabla L(\\theta)]^T\n", " $$\n", " \n", "* **Hessian** of $L$: \n", " $$\n", " \\nabla^2 L(\\theta) = d^2 L(\\theta) = \\begin{pmatrix}\n", " \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_1 \\partial \\theta_1} & \\dots & \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_1 \\partial \\theta_p} \\\\\n", " \\vdots & \\ddots & \\vdots \\\\\n", " \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_p \\partial \\theta_1} & \\dots & \\frac{\\partial^2 L(\\theta)}{\\partial \\theta_p \\partial \\theta_p}\n", " \\end{pmatrix}\n", " $$\n", " \n", "* **Observed information matrix**: \n", "\n", "$$\n", " - \\nabla^2 L(\\theta) = - d^2 L(\\theta).\n", "$$\n", " \n", "* **Expected (Fisher) information matrix**: \n", "$$\n", " \\mathbf{E}[- \\nabla^2 L(\\theta)] = \\mathbf{E}[- d^2 L(\\theta)].\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton's method\n", "\n", "* Newton's method was originally developed for finding roots of nonlinear equations\n", "$f(\\mathbf{x}) = \\mathbf{0}$ (KL 5.4).\n", "\n", "* Newton's method, aka **Newton-Raphson method**, is considered the gold standard for its fast (quadratic) convergence\n", "$$\n", "\t\\frac{\\|\\mathbf{\\theta}^{(t+1)} - \\mathbf{\\theta}^*\\|}{\\|\\mathbf{\\theta}^{(t)} - \\mathbf{\\theta}^*\\|^2} \\to \\text{constant}.\n", "$$\n", "\n", "* Idea: iterative quadratic approximation. \n", "\n", "* Taylor expansion around the current iterate $\\mathbf{\\theta}^{(t)}$\n", "$$\n", "\tL(\\mathbf{\\theta}) \\approx L(\\mathbf{\\theta}^{(t)}) + dL(\\mathbf{\\theta}^{(t)}) (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)}) + \\frac 12 (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)})^T d^2L(\\mathbf{\\theta}^{(t)}) (\\mathbf{\\theta} - \\mathbf{\\theta}^{(t)})\n", "$$\n", "and then maximize the quadratic approximation.\n", "\n", "* To maximize the quadratic function, we equate its gradient to zero\n", "$$\n", "\t\\nabla L(\\theta^{(t)}) + [d^2L(\\theta^{(t)})] (\\theta - \\theta^{(t)}) = \\mathbf{0}_p,\n", "$$\n", "which suggests the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\theta^{(t+1)} &=& \\theta^{(t)} - [d^2L(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)}) \\\\\n", "\t&=& \\theta^{(t)} + [-d^2L(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)}).\n", "\\end{eqnarray*}\n", "$$\n", "We call this **naive Newton's method**.\n", "\n", "* **Stability issue**: naive Newton's iterate is **not** guaranteed to be an ascent algorithm. It's equally happy to head uphill or downhill. Remedies: \n", " 0. approximate $-d^2L(\\theta^{(t)})$ by a positive definite $\\mathbf{A}$ (if it's not), **and** \n", " 0. line search (backtracking). \n", "Why insist on a _positive definite_ approximation of Hessian? 
By first-order Taylor expansion,\n",
"$$\n",
"\\begin{eqnarray*}\n",
" & & L(\\theta^{(t)} + s \\Delta \\theta^{(t)}) - L(\\theta^{(t)}) \\\\\n",
" &=& dL(\\theta^{(t)}) s \\Delta \\theta^{(t)} + o(s) \\\\\n",
" &=& s dL(\\theta^{(t)}) [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)}) + o(s).\n",
"\\end{eqnarray*}\n",
"$$\n",
"For $s$ sufficiently small, the right-hand side is strictly positive when $\\mathbf{A}^{(t)}$ is positive definite.\n",
"\n",
"* In summary, a **Newton-type algorithm** iterates according to\n",
"$$\n",
"\t\\boxed{ \\theta^{(t+1)} = \\theta^{(t)} + s [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)}) \n",
"\t= \\theta^{(t)} + s \\Delta \\theta^{(t)} }\n",
"$$\n",
"where $\\mathbf{A}^{(t)}$ is a positive definite (pd) approximation of $-d^2L(\\theta^{(t)})$ and $s$ is a step length.\n",
"\n",
"* For strictly concave $L$, $-d^2L(\\theta^{(t)})$ is always positive definite. Line search is still needed to guarantee convergence.\n",
"\n",
"* Line search strategies: step-halving ($s=1,1/2,\\ldots$), golden section search, cubic interpolation, Armijo rule, ... Note the **Newton direction** \n",
"$$\n",
"\\Delta \\theta^{(t)} = [\\mathbf{A}^{(t)}]^{-1} \\nabla L(\\theta^{(t)})\n",
"$$\n",
"only needs to be calculated once per iteration. The cost of line search lies mainly in objective function evaluations.\n",
"\n",
"* How to approximate $-d^2L(\\theta)$? More of an art than a science; it often requires problem-specific analysis. \n",
"\n",
"* Taking $\\mathbf{A} = \\mathbf{I}$ leads to the method of **steepest ascent**, aka **gradient ascent**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fisher's scoring method\n",
"\n",
"* **Fisher's scoring method**: replace $-d^2L(\\theta)$ by the expected Fisher information matrix\n",
"$$\n",
"\t\\mathbf{I}(\\theta) = \\mathbf{E}[-d^2L(\\theta)] = \\mathbf{E}[\\nabla L(\\theta) dL(\\theta)] \\succeq \\mathbf{0}_{p \\times p},\n",
"$$\n",
"which is positive semidefinite (psd) under exchangeability of expectation and differentiation.\n",
"\n",
" Therefore the Fisher scoring algorithm iterates according to\n",
"$$\n",
"\t\\boxed{ \\theta^{(t+1)} = \\theta^{(t)} + s [\\mathbf{I}(\\theta^{(t)})]^{-1} \\nabla L(\\theta^{(t)})}.\n",
"$$" ] },
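{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the boxed iteration concrete, below is a minimal Julia sketch (not from KL) of a generic Newton-type ascent with step-halving line search. The function name `newton_type` and its arguments `L`, `∇L`, `A` (functions returning the objective, the score, and a positive definite approximation of $-d^2L(\\theta)$) are illustrative assumptions, not a library API.\n",
"\n",
"```julia\n",
"# Generic Newton-type ascent with step-halving line search (sketch).\n",
"# L(θ): objective, ∇L(θ): score, A(θ): pd approximation of -d²L(θ).\n",
"function newton_type(L, ∇L, A, θ0; maxiter = 100, tol = 1e-8, maxhalf = 20)\n",
"    θ = copy(θ0)\n",
"    for iter in 1:maxiter\n",
"        g  = ∇L(θ)\n",
"        Δθ = A(θ) \\\\ g             # Newton direction, computed once per iteration\n",
"        s  = 1.0                   # start from the full step\n",
"        Lcur = L(θ)\n",
"        θnew = θ + s * Δθ\n",
"        nhalf = 0\n",
"        while L(θnew) < Lcur && nhalf < maxhalf\n",
"            s /= 2                 # step-halving until the step is an ascent\n",
"            θnew = θ + s * Δθ\n",
"            nhalf += 1\n",
"        end\n",
"        θ = θnew\n",
"        maximum(abs.(g)) < tol && break   # stop when the score is numerically zero\n",
"    end\n",
"    return θ\n",
"end\n",
"```\n",
"\n",
"With `A(θ)` equal to the observed information $-d^2L(\\theta)$ (when pd) this is Newton's method; with `A(θ)` equal to the expected information $\\mathbf{I}(\\theta)$ it is Fisher scoring." ] },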
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized linear model (GLM) (KL 14.7)\n",
"\n",
"### Logistic regression\n",
"\n",
"Let's consider a concrete example: logistic regression.\n",
"\n",
"* The goal is to predict whether a credit card transaction is fraudulent ($y_i=1$) or not ($y_i=0$). Predictors ($\\mathbf{x}_i$) include: time of transaction, last location, merchant, ...\n",
"\n",
"* $y_i \\in \\{0,1\\}$, $\\mathbf{x}_i \\in \\mathbb{R}^{p}$. Model $y_i \\sim \\text{Bernoulli}(p_i)$.\n",
"\n",
"* Logistic regression. The density of $y_i$ is\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\\\\n",
"\t&=& e^{y_i \\ln p_i + (1-y_i) \\ln (1-p_i)} \\\\\n",
"\t&=& e^{y_i \\ln \\frac{p_i}{1-p_i} + \\ln (1-p_i)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = p_i &=& \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+ e^{\\mathbf{x}_i^T \\beta}} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\mathbf{x}_i^T \\beta &=& \\ln \\left( \\frac{p_i}{1-p_i} \\right) \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"$$\n",
"\n",
"* Given data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$,\n",
"\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\tL_n(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\ln p_i + (1-y_i) \\ln (1-p_i) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\ln (1 + e^{\\mathbf{x}_i^T \\beta}) \\right] \\\\\n",
"\t\\nabla L_n(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+e^{\\mathbf{x}_i^T \\beta}} \\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - p_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})\t\\\\\n",
"\t- d^2L_n(\\beta) &=& \\sum_{i=1}^n p_i(1-p_i) \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} = \\text{diag}(w_1,\\ldots,w_n), \\, w_i = p_i (1-p_i) \\\\\n",
"\t\\mathbf{I}_n(\\beta) &=& \\mathbf{E} [- d^2L_n(\\beta)] = - d^2L_n(\\beta).\n",
"\\end{eqnarray*}\n",
"$$\n",
"\n",
"* Newton's method == Fisher's scoring iteration: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-d^2 L(\\beta^{(t)})]^{-1} \\nabla L(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is equivalent to solving a weighted least squares problem $\\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2$. Thus the name **IRWLS (iteratively re-weighted least squares)**.\n",
"\n",
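"To see the updates in action, here is a minimal hand-rolled IRWLS sketch in Julia; the function name `logistic_irwls`, the full step $s=1$, and a well-conditioned design matrix are illustrative assumptions, not part of KL.\n",
"\n",
"```julia\n",
"# Hand-rolled IRWLS (= Newton = Fisher scoring) for logistic regression.\n",
"# X: n-by-p design matrix, y: 0/1 response vector; full step s = 1 assumed.\n",
"function logistic_irwls(X, y; maxiter = 25, tol = 1e-8)\n",
"    p = size(X, 2)\n",
"    β = zeros(p)\n",
"    for iter in 1:maxiter\n",
"        η  = X * β\n",
"        pr = 1 ./ (1 .+ exp.(-η))            # fitted probabilities p_i\n",
"        w  = pr .* (1 .- pr)                 # weights w_i = p_i (1 - p_i)\n",
"        z  = η .+ (y .- pr) ./ w             # working responses z_i\n",
"        βnew = (X' * (w .* X)) \\\\ (X' * (w .* z))   # weighted least squares step\n",
"        maximum(abs.(βnew .- β)) < tol && return βnew\n",
"        β = βnew\n",
"    end\n",
"    return β\n",
"end\n",
"```\n",
"\n",
"Each iteration solves exactly the weighted least squares problem displayed above, with the weights $w_i$ and working responses $z_i$ recomputed from the current $\\beta^{(t)}$.\n",
"\n",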
"### GLM \n",
"\n",
"Let's consider the more general class of generalized linear models (GLM).\n",
"\n",
"| Family | Canonical Link | Variance Function |\n",
"|------------------|-------------------------------|-------------------|\n",
"| Normal | $\\eta=\\mu$ | 1 |\n",
"| Poisson | $\\eta=\\log \\mu$ | $\\mu$ |\n",
"| Binomial | $\\eta=\\log \\left( \\frac{\\mu}{1 - \\mu} \\right)$ | $\\mu (1 - \\mu)$ |\n",
"| Gamma | $\\eta = \\mu^{-1}$ | $\\mu^2$ |\n",
"| Inverse Gaussian | $\\eta = \\mu^{-2}$ | $\\mu^3$ |\n",
"\n",
"* $Y$ belongs to an exponential family with density\n",
"$$\n",
"\tp(y|\\theta,\\phi) = \\exp \\left\\{ \\frac{y\\theta - b(\\theta)}{a(\\phi)} + c(y,\\phi) \\right\\}.\n",
"$$\n",
"    * $\\theta$: natural parameter. \n",
"    * $\\phi>0$: dispersion parameter. \n",
"\n",
"* GLM relates the mean $\\mu = \\mathbf{E}(Y|\\mathbf{x})$ to the linear predictor via a strictly increasing link function\n",
"$$\n",
"\tg(\\mu) = \\eta = \\mathbf{x}^T \\beta, \\quad \\mu = g^{-1}(\\eta).\n",
"$$\n",
"\n",
"* Score, Hessian, information:\n",
"\n",
"$$\n",
"\\begin{eqnarray*}\n",
" \t\\nabla L_n(\\beta) &=& \\sum_{i=1}^n \\frac{(y_i-\\mu_i) \\mu_i'(\\eta_i)}{\\sigma_i^2} \\mathbf{x}_i \\\\\n",
"\t- d^2 L_n(\\beta) &=& \\sum_{i=1}^n \\frac{[\\mu_i'(\\eta_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T - \\sum_{i=1}^n \\frac{(y_i - \\mu_i) \\theta''(\\eta_i)}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T\t\\\\\n",
"\t\\mathbf{I}_n(\\beta) &=& \\mathbf{E} [- d^2 L_n(\\beta)] = \\sum_{i=1}^n \\frac{[\\mu_i'(\\eta_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}.\n",
"\\end{eqnarray*} \n",
"$$\n",
"\n",
"* The Fisher scoring method\n",
"$$\n",
" \t\\beta^{(t+1)} = \\beta^{(t)} + s [\\mathbf{I}(\\beta^{(t)})]^{-1} \\nabla L_n(\\beta^{(t)})\n",
"$$\n",
"is IRWLS with weights $w_i = [\\mu_i'(\\eta_i)]^2/\\sigma_i^2$ and suitable working responses $z_i$.\n",
"\n",
"* For a **canonical link**, $\\theta = \\eta$, the second term of the Hessian vanishes and the Hessian coincides with the Fisher information matrix. Convex problem 😄\n",
"$$\n",
"\t\\text{Fisher's scoring = Newton's method}.\n",
"$$\n",
" \n",
"* For a non-canonical link, the problem is in general non-convex 😞\n",
"$$\n",
"\t\\text{Fisher's scoring algorithm} \\ne \\text{Newton's method}.\n",
"$$\n",
"Example: probit regression (binary response with probit link),\n",
"$$\n",
"\\begin{eqnarray*}\n",
" y_i &\\sim& \\text{Bernoulli}(p_i) \\\\\n",
" p_i &=& \\Phi(\\mathbf{x}_i^T \\beta) \\\\\n",
" \\eta_i &=& \\mathbf{x}_i^T \\beta = \\Phi^{-1}(p_i),\n",
"\\end{eqnarray*}\n",
"$$\n",
"where $\\Phi(\\cdot)$ is the cdf of a standard normal.\n",
" \n",
"* Julia, R and Matlab implement the Fisher scoring method, aka IRWLS, for GLMs.\n",
"    * [GLM.jl](https://github.com/JuliaStats/GLM.jl) package." ] },
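{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, a small GLM.jl usage sketch on simulated data (the variable names and the simulated model below are illustrative, not from KL). `LogitLink()` is the canonical link for the Bernoulli/binomial family, while `ProbitLink()` is non-canonical, so the latter fit relies on Fisher scoring rather than Newton's method.\n",
"\n",
"```julia\n",
"# Fisher scoring (IRWLS) for GLMs via the GLM.jl package -- a sketch on simulated data.\n",
"using GLM, DataFrames\n",
"\n",
"n = 500\n",
"x = randn(n)\n",
"p = 1 ./ (1 .+ exp.(-(0.5 .+ 1.5 .* x)))   # true success probabilities\n",
"y = Float64.(rand(n) .< p)                 # simulated 0/1 responses\n",
"df = DataFrame(x = x, y = y)\n",
"\n",
"logit_fit  = glm(@formula(y ~ x), df, Binomial(), LogitLink())    # canonical link\n",
"probit_fit = glm(@formula(y ~ x), df, Binomial(), ProbitLink())   # non-canonical link\n",
"coef(logit_fit), coef(probit_fit)\n",
"```" ] },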
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear regression - Gauss-Newton method (KL 14.4-14.6)\n",
"\n",
"* Now we finally get to the problem Gauss faced in 1800! Relocate Ceres by fitting 41 observations to a 6-parameter (nonlinear) orbit.\n",
"\n",
"* Nonlinear least squares (curve fitting): \n",
"$$\n",
"\t\\text{minimize} \\,\\, f(\\beta) = \\frac{1}{2} \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i, \\beta)]^2.\n",
"$$\n",
"For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve\n",
"$$\n",
"\t\\mu(x, \\beta_1,\\beta_2,\\beta_3) = \\frac{\\beta_3}{1 + e^{-\\beta_1 - \\beta_2 x}}.\n",
"$$\n",
"\n",
"* \"Score\" and \"information matrices\":\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\nabla f(\\beta) &=& - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla \\mu_i(\\beta) \\\\\n",
"\td^2 f(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\beta) d \\mu_i(\\beta) - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] d^2 \\mu_i(\\beta) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\beta) d \\mu_i(\\beta) = \\mathbf{J}(\\beta)^T \\mathbf{J}(\\beta),\n",
"\\end{eqnarray*}\n",
"$$\n",
"where $\\mathbf{J}(\\beta)^T = [\\nabla \\mu_1(\\beta), \\ldots, \\nabla \\mu_n(\\beta)] \\in \\mathbb{R}^{p \\times n}$.\n",
"\n",
"* **Gauss-Newton** (= \"Fisher's scoring algorithm\") uses $\\mathbf{I}(\\beta)$, which is always psd, as the Hessian approximation; since we are _minimizing_ $f$, the update is\n",
"$$\n",
"\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)})]^{-1} \\nabla f(\\beta^{(t)}) }\n",
"$$\n",
"\n",
"* **Levenberg-Marquardt** method, aka **damped least squares algorithm (DLS)**, adds a ridge term to the approximate Hessian\n",
"$$\n",
"\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)}) + \\tau \\mathbf{I}_p]^{-1} \\nabla f(\\beta^{(t)}) }\n",
"$$\n",
"bridging between Gauss-Newton and steepest descent.\n",
"\n",
"* Other approximations to the Hessian arise in nonlinear GLMs. See KL 14.4 for examples." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "153px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }