{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/plain": [ "R version 4.2.2 (2022-10-31)\n", "Platform: x86_64-apple-darwin17.0 (64-bit)\n", "Running under: macOS Catalina 10.15.7\n", "\n", "Matrix products: default\n", "BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib\n", "LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib\n", "\n", "locale:\n", "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", "\n", "attached base packages:\n", "[1] stats graphics grDevices utils datasets methods base \n", "\n", "loaded via a namespace (and not attached):\n", " [1] fansi_1.0.4 crayon_1.5.2 digest_0.6.31 utf8_1.2.2 \n", " [5] IRdisplay_1.1 repr_1.1.5 lifecycle_1.0.3 jsonlite_1.8.4 \n", " [9] evaluate_0.20 pillar_1.8.1 rlang_1.0.6 cli_3.6.0 \n", "[13] uuid_1.1-0 vctrs_0.5.2 IRkernel_1.3.2 tools_4.2.2 \n", "[17] glue_1.6.2 fastmap_1.1.0 compiler_4.2.2 base64enc_0.1-3\n", "[21] pbdZMQ_0.3-9 htmltools_0.5.4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sessionInfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Optimization\n", "\n", "### Unconstrained optimization \n", "\n", "* Problem\n", "$$\n", "\\min_{\\mathbf{x}\\in \\mathbb{R}^d} f(\\mathbf{x})\n", "$$\n", "\n", "* Maximization: $\\min_{\\mathbf{x}\\in\\mathbb{R}^d} -f(\\mathbf{x})$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Golden section search\n", "\n", "* $d=1$.\n", "\n", "* Applicable to functions defined on an interval of the real line.\n", "\n", "* Assumption: $f$ is continuous and *strictly unimodal* on the interval\n", "\\\n", "\n", "\n", " A function $f: [a,b] \\to \\mathbb{R}$ is strictly unimodal if \n", " $$\n", " f(\\lambda x + (1-\\lambda) y) < \\max\\{ f(x), f(y) \\},\n", " \\quad\n", " x, y \\in [a, b],~x\\neq y, ~\\lambda \\in (0, 1).\n", " $$\n", "\n", "* Idea: keep track of three points $a < x_1 < b$ such that $f(x_1) < \\min\\{f(a), f(b)\\}$\n", "\n", "* Choose a new point $x_0$ such that $x_0$ belongs to the longer of intervals $(a, x_1)$ and $(x_1, b)$.\n", "\n", "* WLOG, suppose $a < x_0 < x_1$.\n", " - If $f(x_0) > f(x_1)$, then the minimum lies in the interval $(x_0, b)$. Hence set $(a, x_1, b) := (x_0, x_1, b)$.\n", " - Otherwise, the minimum lies in the interval $(a, x_1)$. Hence set $(a, x_1, b) := (a, x_0, x_1)$. \n", "\n", " \n", "\n", "\n", "\n", "* Above rules do not say how to choose $x_0$. A possibility is to keep the lengths of the two possible candidate intervals the same: $b - x_0 = x_1 - a$, and the relative position of the intermediate points ($x_1$ to $(x_0, b)$; $x_0$ to $(a, x_1)$) the same. \n", "\n", "* Let $\\alpha = \\frac{x_0 - a}{b - a}$ and $y = \\frac{x_1 - x_0}{b -a}$. 
Then we require $\\frac{b - x_1}{b - a} = \\alpha$ and\n", "$$\n", " 1:\\alpha = y + \\alpha : y, \n", " \\quad\n", " \\alpha + y + \\alpha = 1\n", " .\n", "$$\n", "By solving this equation, we have $\\alpha = \\frac{3 - \\sqrt{5}}{2}$ or $1 - \\alpha = \\frac{\\sqrt{5} - 1}{2}$, the golden section.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "goldensection <- function(f, a, b, tol=1e-6) {\n", " alpha <- (3 - sqrt(5)) * 0.5\n", " f\n", " iter <- 0\n", " while ( b - a > tol ) {\n", " x0 <- a + alpha * (b - a)\n", " x1 <- b - alpha * (b - a)\n", " print(c(a, x1, b))\n", " if ( f(x0) > f(x1) ) {\n", " a <- x0\n", " } else { # if ( f(x0) < f(x1) ) {\n", " b <- x1\n", " x1 <- x0\n", " } #else {\n", " # if ( f(b) < f(a) ) {\n", " # a <- x0\n", " # } else {\n", " # b <- x1\n", " # x1 <- x0 \n", " # }\n", " #}\n", " iter <- iter + 1\n", " }\n", " ret <- list(val = (a + b) / 2, iter = iter)\n", "}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 0.0100000 0.6156733 0.9900000\n", "[1] 0.3843267 0.7586534 0.9900000\n", "[1] 0.6156733 0.8470199 0.9900000\n", "[1] 0.6156733 0.7586534 0.8470199\n", "[1] 0.6156733 0.7040399 0.7586534\n", "[1] 0.6702868 0.7249004 0.7586534\n", "[1] 0.6702868 0.7040399 0.7249004\n", "[1] 0.6911473 0.7120079 0.7249004\n", "[1] 0.6911473 0.7040399 0.7120079\n", "[1] 0.6911473 0.6991154 0.7040399\n", "[1] 0.6960718 0.7009963 0.7040399\n", "[1] 0.6960718 0.6991154 0.7009963\n", "[1] 0.6979528 0.6998338 0.7009963\n", "[1] 0.6991154 0.7002779 0.7009963\n", "[1] 0.6991154 0.6998338 0.7002779\n", "[1] 0.6995594 0.7000034 0.7002779\n", "[1] 0.6998338 0.7001083 0.7002779\n", "[1] 0.6998338 0.7000034 0.7001083\n", "[1] 0.6999387 0.7000435 0.7001083\n", "[1] 0.6999387 0.7000034 0.7000435\n", "[1] 0.6999787 0.7000187 0.7000435\n", "[1] 0.6999787 0.7000034 0.7000187\n", "[1] 0.6999940 0.7000093 0.7000187\n", "[1] 0.6999940 0.7000034 0.7000093\n", "[1] 0.6999940 0.6999998 0.7000034\n", "[1] 0.6999976 0.7000012 0.7000034\n", "[1] 0.6999976 0.6999998 0.7000012\n", "[1] 0.6999990 0.7000004 0.7000012\n", "[1] 0.6999990 0.6999998 0.7000004\n", "[1] 0.6999995 0.7000000 0.7000004\n", "[1] 0.6999998 0.7000002 0.7000004\n", "[1] 0.6999998 0.7000000 0.7000002\n", "[1] 0.7000000 0.7000001 0.7000002\n", "[1] 0.7000000 0.7000000 0.7000001\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n" ] } ], "source": [ "# binomial log-likelihood of 7 successes and 3 failures\n", "\n", "F <- function(x) { -7 * log(x) - 3 * log(1 - x) }\n", "goldensection(F, 0.01, 0.99, tol=1e-8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convergence of golden section search is slow but sure to find the global minimum (if the assumptions are met)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton's Method\n", "\n", "* First-order optimality condition\n", "$$\n", "\\nabla f(\\mathbf{x}^\\star) = 0.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Apply Newton-Raphson to $g(\\mathbf{x})=\\nabla f(\\mathbf{x})$.\n", "\n", "\n", "* Idea: iterative quadratic approximation. 
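\n",
"\n",
"* As a quick preview of the update derived below, Newton's iteration applied to the binomial example from the golden section search above (a hedged sketch, not part of the original notes; `dF` and `d2F` are the first two derivatives of `F`, written out here for illustration):\n",
"```r\n",
"dF  <- function(x) -7 / x + 3 / (1 - x)      # F'(x)\n",
"d2F <- function(x)  7 / x^2 + 3 / (1 - x)^2  # F''(x), positive on (0, 1)\n",
"x <- 0.5\n",
"for (i in 1:5) x <- x - dF(x) / d2F(x)       # Newton update x <- x - F'(x) / F''(x)\n",
"x                                            # lands on the minimizer 0.7 within a few iterations\n",
"```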
\n", "\n", "* Second-order Taylor expansion of the objective function around the current iterate $\\mathbf{x}^{(t)}$\n", "$$\n", "\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)}) + \\frac {1}{2} (\\mathbf{x} - \\mathbf{x}^{(t)})^T [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)})\n", "$$\n", "and then minimize the quadratic approximation.\n", "\n", "* To maximize the quadratic appriximation function, we equate its gradient to zero\n", "$$\n", "\t\\nabla f(\\mathbf{x}^{(t)}) + [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)}) = \\mathbf{0},\n", "$$\n", "which suggests the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\mathbf{x}^{(t+1)} &=& \\mathbf{x}^{(t)} - [\\nabla^2 f(\\mathbf{x}^{(t)})]^{-1} \\nabla f(\\mathbf{x}^{(t)}),\n", "\\end{eqnarray*}\n", "$$\n", "a complete analogue of the univariate Newton-Raphson for solving $g(x)=0$.\n", "\n", "* Considered the gold standard for its fast (quadratic) convergence: $\\mathbf{x}^{(t)} \\to \\mathbf{x}^{\\star}$ and\n", "$$\n", "\t\\frac{\\|\\mathbf{x}^{(t+1)} - \\mathbf{x}^{\\star}\\|}{\\|\\mathbf{x}^{(t)} - \\mathbf{x}^{\\star}\\|^2} \\to \\text{constant}.\n", "$$\n", "In words, the estimate gets accurate by two decimal digits per iteration.\n", "\n", "\n", "We call this **naive Newton's method**.\n", "\n", "* **Stability issue**: naive Newton's iterate is **not** guaranteed to be a descent algorithm, i.e., that ensures $f(\\mathbf{x}^{(t+1)}) \\le f(\\mathbf{x}^{(t)})$. It's equally happy to head uphill or downhill. Following example shows that the Newton iterate converges to a local maximum, converges to a local minimum, or diverges depending on starting points." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "purenewton <- function(f, df, d2f, x0, maxiter=10, tol=1e-6) {\n", " xold <- x0\n", " stop <- FALSE\n", " iter <- 1\n", " x <- x0\n", " while ((!stop) && (iter < maxiter)) {\n", " x <- x - df(x) / d2f(x)\n", " print(x)\n", " xdiff <- x - xold\n", " if (abs(xdiff) < tol) stop <- TRUE \n", " xold <- x\n", " iter <- iter + 1\n", " }\n", " return(list(val=x, iter=iter))\n", "}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "f <- function(x) sin(x) # objective function\n", "df <- function(x) cos(x) # gradient\n", "d2f <- function(x) -sin(x) # hessian" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 1.542342\n", "[1] 1.570804\n", "[1] 1.570796\n", "[1] 1.570796\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
1.5707963267949
\n", "\t
$iter
\n", "\t\t
5
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 1.5707963267949\n", "\\item[\\$iter] 5\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 1.5707963267949\n", "$iter\n", ": 5\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 1.570796\n", "\n", "$iter\n", "[1] 5\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 0.328211\n", "[1] 3.264834\n", "[1] 11.33789\n", "[1] 10.98155\n", "[1] 10.99558\n", "[1] 10.99557\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
10.9955742875643
\n", "\t
$iter
\n", "\t\t
7
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 10.9955742875643\n", "\\item[\\$iter] 7\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 10.9955742875643\n", "$iter\n", ": 7\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 10.99557\n", "\n", "$iter\n", "[1] 7\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 2.75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 4.863691\n", "[1] 4.711224\n", "[1] 4.712389\n", "[1] 4.712389\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
4.71238898038469
\n", "\t
$iter
\n", "\t\t
5
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 4.71238898038469\n", "\\item[\\$iter] 5\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 4.71238898038469\n", "$iter\n", ": 5\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 4.712389\n", "\n", "$iter\n", "[1] 5\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical Newton\n", "\n", "* A remedy for the instability issue:\n", " 1. approximate $\\nabla^2 f(\\mathbf{x}^{(t)})$ by a positive definite $\\mathbf{H}^{(t)}$ (if it's not), **and** \n", " 2. line search (backtracking) to ensure the descent property. \n", "\n", "* Why insist on a _positive definite_ approximation of Hessian? First define the **Newton direction**:\n", "$$\n", "\\Delta \\mathbf{x}^{(t)} = [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}).\n", "$$\n", "By first-order Taylor expansion,\n", "$$\n", "\\begin{align*}\n", " f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)}) \n", " &= \\nabla f(\\mathbf{x}^{(t)})^T s \\Delta \\mathbf{x}^{(t)} + o(s) \\\\\n", " &= - s \\cdot \\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) + o(s).\n", "\\end{align*}\n", "$$\n", "For $s$ sufficiently small, $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)})$ is strictly negative if $\\mathbf{H}^{(t)}$ is positive definite. \n", "\\\n", "The quantity $\\{\\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)})\\}^{1/2}$ is termed the **Newton decrement**.\n", "\n", "* In summary, a **practical Newton-type algorithm** iterates according to\n", "$$\n", "\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - s [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) \n", "\t= \\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)} }\n", "$$\n", "where $\\mathbf{H}^{(t)}$ is a positive definite approximation to $\\nabla^2 f(\\mathbf{x}^{(t)})$ and $s$ is a step size.\n", "\n", "* For strictly convex $f$, $\\nabla^2 f(\\mathbf{x}^{(t)})$ is always positive definite. In this case, the above algorithm is called **damped Newton**. However, *line search* is still needed (at least for a finite number of times) to guarantee convergence.\n", "\n", "### Line search\n", "\n", "* Line search: compute the Newton direction and search $s$ such that $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)})$ is minimized.\n", "\n", "* Note the Newton direction only needs to be calculated once. Cost of line search mainly lies in objective function evaluation.\n", "\n", "* Full line search: $s = \\arg\\min_{\\alpha} f(\\mathbf{x}^{(t)} + \\alpha \\Delta \\mathbf{x}^{(t)})$. May use golden section search.\n", "\n", "* Approximate line search: step-halving ($s=1,1/2,\\ldots$), Amijo rule, ...\n", "\n", "\n", "* Backtracking line search (Armijo rule)\n", "```r\n", " # Backtracking line search\n", " # given: descent direction ∆x, x ∈ domf, α ∈ (0,0.5), β ∈ (0,1).\n", " t <- 1.0\n", " while (f(x + t * delx) > f(x) + alpha * t * sum(gradf(x) * delx) {\n", " t <- beta * t\n", " }\n", "```\n", "\n", "\n", "\n", "> The lower dashed line shows the linear extrapolation of $f$, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that $f$ lies below the upper dashed line, i.e., $0 \\le t \\le t_0$.\n", "\n", "* How to approximate $\\nabla^2 f(\\mathbf{x})$? More of an art than science. 
"\n",
"* Taking $\\mathbf{H}^{(t)} = \\mathbf{I}$ leads to the **gradient descent method** (see below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Fisher scoring\n",
"\n",
"* Consider MLE in which $f(\\mathbf{x}) = -\\ell(\\boldsymbol{\\theta})$, where $\\ell(\\boldsymbol{\\theta})$ is the log-likelihood of parameter $\\boldsymbol{\\theta}$.\n",
"\n",
"* **Fisher scoring method**: replace $- \\nabla^2 \\ell(\\boldsymbol{\\theta})$ by the expected Fisher information matrix\n",
"$$\n",
"\t\\mathbf{I}(\\boldsymbol{\\theta}) = \\mathbf{E}[-\\nabla^2\\ell(\\boldsymbol{\\theta})] = \\mathbf{E}[\\nabla \\ell(\\boldsymbol{\\theta}) \\nabla \\ell(\\boldsymbol{\\theta})^T] \\succeq \\mathbf{0},\n",
"$$\n",
"which holds when the order of expectation and differentiation can be exchanged (true for most common distributions).\n",
"\n",
"    Therefore we set $\\mathbf{H}^{(t)}=\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})$ and obtain the Fisher scoring algorithm: \n",
"$$\n",
"\t\\boxed{ \\boldsymbol{\\theta}^{(t+1)} = \\boldsymbol{\\theta}^{(t)} + s [\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})]^{-1} \\nabla \\ell(\\boldsymbol{\\theta}^{(t)})}.\n",
"$$\n",
"\n",
"* Combined with line search, a descent algorithm can be devised." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Example: logistic regression\n",
"\n",
"* Binary data: response $y_i \\in \\{0,1\\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$.  \n",
"\n",
"* Model: $y_i \\sim $Bernoulli$(p_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = p_i &=& g^{-1}(\\eta_i) = \\frac{e^{\\eta_i}}{1+ e^{\\eta_i}} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(p_i) = \\log \\left( \\frac{p_i}{1-p_i} \\right) \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\\\\n",
"\t&=& e^{y_i \\log p_i + (1-y_i) \\log (1-p_i)} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\frac{p_i}{1-p_i} + \\log (1-p_i)\\right).\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log p_i + (1-y_i) \\log (1-p_i) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right] \\\\\n",
"\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+e^{\\mathbf{x}_i^T \\beta}} \\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - p_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})\t\\\\\n",
"\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n p_i(1-p_i) \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} = \\text{diag}(w_1,\\ldots,w_n), ~ w_i = p_i (1-p_i) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n",
"\\end{eqnarray*}\n",
"    (why the last line?)\n",
"\n",
"* Therefore for this problem **Newton's method == Fisher scoring**: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is equivalent to solving a *weighted* least squares problem \n",
"$$\n",
"\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n",
"$$\n",
"which we already know how to solve well.\n",
"Thus the name **IRLS (iteratively re-weighted least squares)**.\n",
"\n",
"* Implication: if a weighted least squares solver is at hand, then logistic regression models can be fitted.\n",
"    - IRLS == Fisher scoring == Newton's method" ] },
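{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged IRLS sketch for logistic regression, mirroring the derivation above (not part of the original notes; `logisticreg_irls` is a made-up name, `X` is assumed to contain an intercept column, and `lm()` with `weights` plays the role of the weighted least squares solver):\n", "```r\n", "logisticreg_irls <- function(X, y, maxiter = 25, tol = 1e-8) {\n", "    beta <- rep(0, ncol(X))\n", "    for (iter in 1:maxiter) {\n", "        eta <- drop(X %*% beta)\n", "        p   <- 1 / (1 + exp(-eta))    # mean function\n", "        w   <- p * (1 - p)            # IRLS weights, assumed to stay positive\n", "        z   <- eta + (y - p) / w      # working responses (step size s = 1)\n", "        betanew <- coef(lm(z ~ X - 1, weights = w))   # weighted least squares\n", "        done <- sum(abs(betanew - beta)) < tol\n", "        beta <- betanew\n", "        if (done) break\n", "    }\n", "    beta\n", "}\n", "```\n", "With a design matrix `X <- cbind(1, x1)` and a binary response `y`, the result should agree with `coef(glm(y ~ x1, family = binomial()))` up to the convergence tolerance." ] },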
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Example: Poisson regression\n",
"\n",
"* Count data: response $y_i \\in \\{0, 1, 2, \\dotsc \\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$.  \n",
"\n",
"* Model: $y_i \\sim $Poisson$(\\lambda_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = \\lambda_i &=& g^{-1}(\\eta_i) = e^{\\eta_i} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(\\lambda_i) = \\log \\lambda_i \\quad \\text{(log link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|\\lambda_i) &=& e^{-\\lambda_i}\\frac{\\lambda_i^{y_i}}{y_i!} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) \\right)\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) 
\\right] \\\\\n", "\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n", "\t&=& \\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\t\\\\\n", "\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n \\lambda_i \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n", "\t\\text{where } \\mathbf{W} = \\text{diag}(\\lambda_1,\\ldots,\\lambda_n) \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n", "\\end{eqnarray*}\n", " (why the last line?)\n", "\n", "* Therefore for this problem **Newton's method == Fisher scoring**: \n", "$$\n", "\\begin{eqnarray*}\n", "\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n", "\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\\\\n", "\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\right] \\\\\n", "\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n", "\\end{eqnarray*}\n", "$$\n", "where \n", "$$\n", "\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)})\n", "$$ \n", "are the working responses. A Newton iteration is again equivalent to solving a *weighed* least squares problem \n", "$$\n", "\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n", "$$\n", "which leads to the IRLS.\n", "\n", "* Implication: if a weighted least squares solver is at hand, then Poisson regression models can be fitted.\n", " - IRLS == Fisher scoring == Newton's method" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 4\n", "\\item 5\n", "\\item 6\n", "\\item 7\n", "\\item 8\n", "\\item 9\n", "\\item 10\n", "\\item 11\n", "\\item 12\n", "\\item 13\n", "\\item 14\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1\n", "2. 2\n", "3. 3\n", "4. 4\n", "5. 5\n", "6. 6\n", "7. 7\n", "8. 8\n", "9. 9\n", "10. 10\n", "11. 11\n", "12. 12\n", "13. 13\n", "14. 14\n", "\n", "\n" ], "text/plain": [ " [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Quarterly count of AIDS deaths in Australia (from Dobson, 1990)\n", "deaths <- c(0, 1, 2, 3, 1, 4, 9, 18, 23, 31, 20, 25, 37, 45)\n", "(quarters <- seq_along(deaths))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "# Poisson regression using Fisher scoring (IRLS) and step halving\n", "# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n", "poissonreg <- function(x, y, maxiter=10, tol=1e-6) {\n", " beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n", " betaold <- beta0\n", " stop <- FALSE\n", " iter <- 1\n", " inneriter <- rep(0, maxiter) # to count no. step halving\n", " beta <- beta0\n", " lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n", " likold <- lik(betaold)\n", " while ((!stop) && (iter < maxiter)) {\n", " eta <- beta[1] + x * beta[2]\n", " w <- exp(eta) # lambda\n", " # line search by step halving\n", " s <- 1.0\n", " for (i in 1:5) {\n", " z <- eta + s * (y / w - 1) # working response\n", " m <- lm(z ~ x, weights=w) # weighted least squares\n", " beta <- as.matrix(coef(m))\n", " curlik <- lik(beta)\n", " if (curlik > likold) break\n", " s <- s * 0.5\n", " inneriter[iter] <- inneriter[iter] + 1\n", " }\n", " print(c(as.numeric(beta), inneriter[iter], curlik))\n", " betadiff <- beta - betaold\n", " if (norm(betadiff, \"F\") < tol) stop <- TRUE \n", " likold <- curlik\n", " betaold <- beta\n", " iter <- iter + 1\n", " }\n", " return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n", "}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] -1.3076923 0.4184066 3.0000000 443.4763188\n", "[1] 0.6456032 0.2401380 0.0000000 469.9257845\n", "[1] 0.3743785 0.2541525 0.0000000 472.0483495\n", "[1] 0.3400344 0.2564929 0.0000000 472.0625465\n", "[1] 0.3396340 0.2565236 0.0000000 472.0625479\n", "[1] 0.3396339 0.2565236 0.0000000 472.0625479\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
\n", "
  1. 0.339633920708135
  2. 0.256523593717915
\n", "
\n", "\t
$iter
\n", "\t\t
7
\n", "\t
$inneriter
\n", "\t\t
\n", "
  1. 3
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
\n", "
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] \\begin{enumerate*}\n", "\\item 0.339633920708135\n", "\\item 0.256523593717915\n", "\\end{enumerate*}\n", "\n", "\\item[\\$iter] 7\n", "\\item[\\$inneriter] \\begin{enumerate*}\n", "\\item 3\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\end{enumerate*}\n", "\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 1. 0.339633920708135\n", "2. 0.256523593717915\n", "\n", "\n", "\n", "$iter\n", ": 7\n", "$inneriter\n", ": 1. 3\n", "2. 0\n", "3. 0\n", "4. 0\n", "5. 0\n", "6. 0\n", "7. 0\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 0.3396339 0.2565236\n", "\n", "$iter\n", "[1] 7\n", "\n", "$inneriter\n", "[1] 3 0 0 0 0 0 0\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "poissonreg(quarters, deaths)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[(Intercept)] 0.339633920708136\n", "\\item[quarters] 0.256523593717915\n", "\\end{description*}\n" ], "text/markdown": [ "(Intercept)\n", ": 0.339633920708136quarters\n", ": 0.256523593717915\n", "\n" ], "text/plain": [ "(Intercept) quarters \n", " 0.3396339 0.2565236 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m <- glm(deaths ~ quarters, family = poisson())\n", "coef(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Homework: repeat this using the Armijo rule.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generalized Linear Models (GLM)\n", "\n", "That IRLS == Fisher scoring == Newton's method for both logistic and Poisson regression is not a coincidence. Let's consider a more general class of generalized linear models (GLM).\n", "\n", "#### Exponential families\n", "\n", "* Random variable $Y$ belongs to an exponential family if the density\n", "$$\n", "\tp(y|\\eta,\\phi) = \\exp \\left\\{ \\frac{y\\eta - b(\\eta)}{a(\\phi)} + c(y,\\phi) \\right\\}.\n", "$$\n", " * $\\eta$: natural parameter. \n", " * $\\phi>0$: dispersion parameter. \n", " * Mean: $\\mu= b'(\\eta)$. When $b'(\\cdot)$ is invertible, function $g(\\cdot)=[b']^{-1}(\\cdot)$ is called the canonical link function.\n", " * Variance $\\mathbf{Var}{Y}=b''(\\eta)a(\\phi)$.\n", "\n", "\n", "* For example, if $Y \\sim \\text{Ber}(\\mu)$, then\n", "$$\n", "p(y|\\eta,\\phi) = \\exp\\left( y \\log \\frac{\\mu}{1-\\mu} + \\log (1-\\mu)\\right).\n", "$$\n", "Hence\n", "$$\n", "\\eta = \\log \\frac{\\mu}{1-\\mu}, \\quad\n", "\\mu = \\frac{e^{\\eta}}{1+e^{\\eta}},\n", "\\quad\n", "b(\\eta) = -\\log (1-\\mu) = \\log(1+e^{\\eta})\n", "$$\n", "Hence\n", "$$\n", "b'(\\eta) = \\frac{e^{\\eta}}{1+e^{\\eta}} = g^{-1}(\\eta).\n", "$$\n", "as above.\n", "\n", "\n", "| Family | Canonical Link | Variance Function |\n", "|------------------|------------------------------------------------|-------------------|\n", "| Normal (unit variance) | $\\eta=\\mu$ | 1 |\n", "| Poisson | $\\eta=\\log \\mu$ | $\\mu$ |\n", "| Binomial | $\\eta=\\log \\left(\\frac{\\mu}{1 - \\mu} \\right)$ | $\\mu (1 - \\mu)$ |\n", "| Gamma | $\\eta = \\mu^{-1}$ | $\\mu^2$ |\n", "| Inverse Gaussian | $\\eta = \\mu^{-2}$ | $\\mu^3$ |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generalized linear models\n", "\n", "GLM models the conditional distribution of $Y$ given predictors $\\mathbf{x}$ through\n", "the conditional mean $\\mu = \\mathbf{E}(Y|\\mathbf{x})$ via a strictly increasing link function\n", "$$\n", "\tg(\\mu) = \\mathbf{x}^T \\beta, \\quad \\mu = g^{-1}(\\mathbf{x}^T\\beta) = b'(\\eta)\n", "$$\n", "\n", "From these relations we have (assuming no overdispersion, i.e., $a(\\phi)\\equiv 1$)\n", "$$\n", "\\mathbf{x}^T d\\beta = g'(\\mu)d\\mu,\n", "\\quad\n", "d\\mu = b''(\\eta)d\\eta, \n", "\\quad\n", "b''(\\eta) = \\mathbf{Var}[Y] = \\sigma^2,\n", "\\quad\n", "d\\eta = \\frac{1}{b''(\\eta)}d\\mu = \\frac{1}{b''(\\eta)g'(\\mu)}\\mathbf{x}^T d\\beta.\n", "$$\n", "Therefore\n", "$$\n", " d\\mu = \\frac{1}{g'(\\mu)}\\mathbf{x}^T d\\beta\n", " .\n", "$$\n", "\n", "Then, after some workout with matrix calculus, we have for $n$ samples:\n", "\n", "* Score, Hessian, information\n", "\n", "\\begin{eqnarray*}\n", " \t\\nabla\\ell(\\beta) &=& \\sum_{i=1}^n \\frac{(y_i-\\mu_i) [1/g'(\\mu_i)]}{\\sigma_i^2} \\mathbf{x}_i, \\quad \\mu_i = g^{-1}(\\mathbf{x}_i^T\\beta), ~\\sigma_i^2 = b''(\\eta_i), \\\\\n", "\t- \\nabla^2 \\ell(\\beta) &=& 
\\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T - \\sum_{i=1}^n \\frac{(y_i - \\mu_i)[b'''(\\eta_i)/[b''(\\eta_i)(g'(\\mu_i)^2] - g''(\\mu_i)/[g'(\\mu_i)]^3]}{\\sigma^2} \\mathbf{x}_i \\mathbf{x}_i^T, \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}.\n", "\\end{eqnarray*} \n", "\n", "\n", "* Fisher scoring method:\n", "$$\n", " \t\\beta^{(t+1)} = \\beta^{(t)} + s [\\mathbf{I}(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\n", "$$\n", "IRLS with weights $w_i = [1/g'(\\mu_i)]^2/\\sigma_i^2$ and some working responses $z_i$.\n", "\n", "* For *canonical link*, $\\mathbf{x}^T\\beta = g(\\mu) =[b']^{-1}(\\mu) = \\eta$. The second term of Hessian vanishes because $d\\eta=\\mathbf{x}^Td\\beta$ and $b''(\\eta)=1/g'(\\mu)$. The Hessian coincides with Fisher information matrix. **IRLS == Fisher scoring == Newton's method**. Hence MLE is a *convex* optimization problem.\n", "\n", " \n", "* Non-canonical link, **Fisher scoring != Newton's method**, and MLE is in general a *non-convex* optimization problem.\n", "\n", " Example: Probit regression (binary response with probit link).\n", "\\begin{eqnarray*}\n", " y_i &\\sim& \\text{Ber}(p_i) \\\\\n", " p_i &=& \\Phi(\\mathbf{x}_i^T \\beta) \\\\\n", " \\eta_i &=& \\log\\left(\\frac{p_i}{1-p_i}\\right) \\neq \\mathbf{x}_i^T \\beta = \\Phi^{-1}(p_i).\n", "\\end{eqnarray*}\n", " where $\\Phi(\\cdot)$ is the cdf of a standard normal. Exceptionally, this problem is a convex optimization problem.\n", " \n", "* R implements the Fisher scoring method, aka IRLS, for GLMs in function `glm()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear regression - Gauss-Newton method\n", "\n", "* Now we finally get to the problem Gauss faced in 1801! 
\n", "Relocate the dwarf planet Ceres by fitting 24 observations to a 6-parameter (nonlinear) orbit.\n", " - In 1801, Jan 1 -- Feb 11 (41 days), astronomer Piazzi discovered Ceres, which was lost behind the Sun after observing its orbit 24 times.\n", " - Aug -- Sep, futile search by top astronomers; Laplace claimed it unsolvable.\n", " - Oct -- Nov, Gauss did calculations by method of least squares, sent his results to astronomer von Zach.\n", " - Dec 31, von Zach relocated Ceres according to Gauss’ calculation.\n", "\n", "* Nonlinear least squares (curve fitting): \n", "$$\n", "\t\\text{minimize} \\,\\, f(\\beta) = \\frac{1}{2} \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i, \\beta)]^2\n", "$$\n", "For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve\n", "$$\n", "\t\\mu(x, \\beta_1,\\beta_2,\\beta_3) = \\frac{\\beta_3}{1 + \\exp(-\\beta_1 - \\beta_2 x)}.\n", "$$\n", "\n", "\n", "\n", "* If $\\mu_i$ is a linear function of $\\beta$, i.e., $\\mathbf{x}_i^T\\beta$, then NLLS reduces to the usual least squares.\n", "\n", "* \"Score\" and \"information matrices\"\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\nabla f(\\beta) &=& - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\t\\nabla^2 f(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T = \\mathbf{J}(\\beta)^T \\mathbf{J}(\\beta),\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{J}(\\beta)^T = [\\nabla \\mu_1(\\mathbf{x}_1,\\beta), \\ldots, \\nabla \\mu_n(\\mathbf{x}_n,\\beta)] \\in \\mathbb{R}^{p \\times n}$.\n", "\n", "* **Gauss-Newton** (= \"Fisher scoring method\") uses $\\mathbf{I}(\\beta)$, which is always positive semidefinite.\n", "$$\n", "\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)})]^{-1} \\nabla f(\\beta^{(t)}) }\n", "$$\n", "\n", "* Justification\n", " 1. Residuals $y_i - \\mu_i(\\mathbf{x}_i, \\beta)$ are small, or $\\mu_i$ is nearly linear;\n", " 2. Model the data as $y_i \\sim N(\\mu_i(\\mathbf{x}_i, \\beta), \\sigma^2)$, where $\\sigma^2$ is assumed to be known. Then, \n", "\\begin{align*}\n", " \\ell(\\beta) &= -\\frac{1}{2\\sigma^2}\\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n", " \\nabla\\ell(\\beta) &= -\\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n", " -\\nabla^2 \\ell(\\beta) &= \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\\end{align*}\n", " Thus\n", " $$\n", " \\mathbf{E}[-\\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\nabla \\mu_i(\\beta) \\nabla \\mu_i(\\beta)^T.\n", " $$\n", " \n", "* In fact Gauss invented the Gaussian (normal) distribution to justify this method!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent method\n", "\n", "* Iteration:\n", "$$\n", "\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - \\gamma_t\\nabla f(\\mathbf{x}^{(t)}) }\n", "$$\n", "for some step size $\\gamma_t$. 
This is a special case of the Netwon-type algorithms with $\\mathbf{H}_t = \\mathbf{I}$ and $\\Delta\\mathbf{x} = \\nabla f(\\mathbf{x}^{(t)})$.\n", "\n", "* Idea: iterative linear approximation (first-order Taylor series expansion). \n", "$$\n", "\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)})\n", "$$\n", "and then minimize the linear approximation within a compact set: let $\\Delta\\mathbf{x} = \\mathbf{x}^{(t+1)}-\\mathbf{x}^{(t)}$ to choose. Solve\n", "$$\n", " \\min_{\\|\\Delta\\mathbf{x}\\|_2 \\le 1} \\nabla f(\\mathbf{x}^{(t)})^T\\Delta\\mathbf{x}\n", "$$\n", "to obtain $\\Delta\\mathbf{x} = -\\nabla f(\\mathbf{x}^{(t)})/\\|\\nabla f(\\mathbf{x}^{(t)})\\|_2 \\propto -\\nabla f(\\mathbf{x}^{(t)})$.\n", "\n", "\n", "\n", "* Step sizes are chosen so that the descent property is maintained (e.g., line search).\n", " - Step size *must* be chosen at any time, since unlike Netwon's method, first-order approximation is always unbounded below.\n", "\n", "* Pros\n", " - Each iteration is inexpensive.\n", " - No need to derive, compute, store and invert Hessians; attractive in large scale problems. \n", " \n", "* Cons\n", " - Slow convergence (zigzagging).\n", "\n", " \n", " \n", " - Do not work for non-smooth problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convergence\n", "\n", "* If \n", " 1. $f$ is convex and differentiable over $\\mathbb{R}^d$;\n", " 2. $f$ has $L$-Lipschitz gradients, i.e., $\\|\\nabla f(\\mathbf{x}) - \\nabla f(\\mathbf{y})\\|_2 \\le L\\|\\mathbf{x} - \\mathbf{y} \\|_2$;\n", " 3. $p^{\\star} = \\inf_{\\mathbf{x}\\in\\mathbb{R}^d} f(x) > -\\infty$ and attained at $\\mathbf{x}^{\\star}$,\n", " \n", " then \n", "$$\n", " f(\\mathbf{x}^{(t)}) - p^{\\star} \\le \\frac{1}{2\\gamma t}\\|\\mathbf{x}^{(0)} - \\mathbf{x}^{\\star}\\|_2^2 = O(1/t).\n", "$$\n", " for constant step size ($\\gamma_t = \\gamma \\in (0, 1/L]$), a *sublinear* convergence. Similar upper bound with line search.\n", "\n", "* In general, the best we can obtain from gradient descent is linear convergence (cf. quadratic convergence of Newton)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "* Least squares: $f(\\beta) = \\frac{1}{2}\\|\\mathbf{X}\\beta - \\mathbf{y}\\|_2^2 = \\frac{1}{2}\\beta^T\\mathbf{X}^T\\mathbf{X}\\beta - \\beta^T\\mathbf{X}^T\\mathbf{y} + \\frac{1}{2}\\mathbf{y}^T\\mathbf{y}$\n", "\n", " - Gradient $\\nabla f(\\beta) = \\mathbf{X}^T\\mathbf{X}\\beta - \\mathbf{X}^T\\mathbf{y}$ is $\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz. Hence $\\gamma \\in (0, 1/\\sigma_{\\max}(\\mathbf{X})^2]$ guarantees descent property and convergence.\n", "\n", "* Logistic regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right]$\n", "\n", " - Gradient $\\nabla f(\\beta) = - \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})$ is $\\frac{1}{4}\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz.\n", "\n", " - May be unbounded below, e.g., all the data are of the same class.\n", "\n", " - Adding a ridge penalty $\\frac{\\rho}{2}\\|\\beta\\|_2^2$ makes the problem identifiable.\n", " \n", "* Poisson regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\exp(\\mathbf{x}_i^T \\beta) - \\log(y_i!) 
\\right]$\n", " - Gradient\n", "\\begin{eqnarray*}\n", "\t\\nabla f(\\beta) &=& -\\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n", "\t&=& -\\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = -\\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\n", "\\end{eqnarray*}\n", " is *not* Lipschitz (HW). What is the consequence?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "# Poisson regression using gradient descent\n", "# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n", "poissonreg_grad <- function(x, y, maxiter=10, tol=1e-6) {\n", " beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n", " betaold <- beta0\n", " iter <- 1\n", " inneriter <- rep(0, maxiter) # to count no. step halving\n", " betamat <- matrix(NA, nrow=maxiter, ncol=length(beta0) + 2)\n", " colnames(betamat) <- c(\"beta0\", \"beta1\", \"inneriter\", \"lik\")\n", " beta <- beta0\n", " lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n", " likold <- lik(betaold)\n", " while (iter < maxiter) {\n", " eta <- beta[1] + x * beta[2]\n", " lam <- exp(eta) # lambda\n", " grad <- as.matrix(c(sum(lam - y), sum((lam - y) * x)))\n", " # line search by step halving\n", " #s <- 0.00001\n", " s <- 0.00005\n", " for (i in 1:5) {\n", " beta <- beta - s * grad\n", " curlik <- lik(beta)\n", " if (curlik > likold) break\n", " s <- s * 0.5\n", " inneriter[iter] <- inneriter[iter] + 1\n", " }\n", " #print(c(as.numeric(beta), inneriter[iter], curlik))\n", " betamat[iter,] <- c(as.numeric(beta), inneriter[iter], curlik)\n", " betadiff <- beta - betaold\n", " if (norm(betadiff, \"F\") < tol) break\n", " likold <- curlik\n", " betaold <- beta\n", " iter <- iter + 1\n", " }\n", " #return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n", " return(betamat[1:iter,])\n", "}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A matrix: 10 × 4 of type dbl\n", "\\begin{tabular}{llll}\n", " beta0 & beta1 & inneriter & lik\\\\\n", "\\hline\n", "\t 0.01025000 & 0.1149500 & 0 & 241.3754\\\\\n", "\t 0.01933954 & 0.2178634 & 0 & 423.2101\\\\\n", "\t 0.02505110 & 0.2826001 & 0 & 471.3151\\\\\n", "\t 0.02532401 & 0.2830995 & 0 & 471.3176\\\\\n", "\t 0.02553404 & 0.2828481 & 0 & 471.3190\\\\\n", "\t 0.02577206 & 0.2829335 & 0 & 471.3201\\\\\n", "\t 0.02599725 & 0.2828674 & 0 & 471.3212\\\\\n", "\t 0.02622798 & 0.2828694 & 0 & 471.3222\\\\\n", "\t 0.02645600 & 0.2828408 & 0 & 471.3233\\\\\n", "\t 0.02668501 & 0.2828259 & 0 & 471.3243\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A matrix: 10 × 4 of type dbl\n", "\n", "| beta0 | beta1 | inneriter | lik |\n", "|---|---|---|---|\n", "| 0.01025000 | 0.1149500 | 0 | 241.3754 |\n", "| 0.01933954 | 0.2178634 | 0 | 423.2101 |\n", "| 0.02505110 | 0.2826001 | 0 | 471.3151 |\n", "| 0.02532401 | 0.2830995 | 0 | 471.3176 |\n", "| 0.02553404 | 0.2828481 | 0 | 471.3190 |\n", "| 0.02577206 | 0.2829335 | 0 | 471.3201 |\n", "| 0.02599725 | 0.2828674 | 0 | 471.3212 |\n", "| 0.02622798 | 0.2828694 | 0 | 471.3222 |\n", "| 0.02645600 | 0.2828408 | 0 | 471.3233 |\n", "| 0.02668501 | 0.2828259 | 0 | 471.3243 |\n", "\n" ], "text/plain": [ " beta0 beta1 inneriter lik \n", " [1,] 0.01025000 0.1149500 0 241.3754\n", " [2,] 0.01933954 0.2178634 0 423.2101\n", " [3,] 0.02505110 0.2826001 0 471.3151\n", " [4,] 0.02532401 0.2830995 0 471.3176\n", " [5,] 0.02553404 0.2828481 0 471.3190\n", " [6,] 0.02577206 0.2829335 0 471.3201\n", " [7,] 0.02599725 0.2828674 0 471.3212\n", " [8,] 0.02622798 0.2828694 0 471.3222\n", " [9,] 0.02645600 0.2828408 0 471.3233\n", "[10,] 0.02668501 0.2828259 0 471.3243" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A matrix: 10 × 4 of type dbl\n", "\\begin{tabular}{r|llll}\n", " & beta0 & beta1 & inneriter & lik\\\\\n", "\\hline\n", "\t{[}12951,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12952,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12953,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12954,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12955,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12956,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12957,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12958,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12959,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12960,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A matrix: 10 × 4 of type dbl\n", "\n", "| | beta0 | beta1 | inneriter | lik |\n", "|---|---|---|---|---|\n", "| [12951,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12952,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12953,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12954,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12955,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12956,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12957,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12958,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "| [12959,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "| [12960,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "\n" ], "text/plain": [ " beta0 beta1 inneriter lik \n", "[12951,] 0.3396212 0.2565247 0 472.0625\n", "[12952,] 0.3396212 0.2565247 0 472.0625\n", "[12953,] 0.3396212 0.2565247 0 472.0625\n", "[12954,] 0.3396212 0.2565247 0 472.0625\n", "[12955,] 0.3396212 0.2565247 0 472.0625\n", "[12956,] 0.3396212 0.2565247 0 472.0625\n", "[12957,] 0.3396212 0.2565247 0 472.0625\n", "[12958,] 0.3396213 0.2565247 0 472.0625\n", "[12959,] 0.3396213 0.2565247 0 472.0625\n", "[12960,] 0.3396213 0.2565247 0 472.0625" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Australian AIDS data from the IRLS example above\n", "pm <- poissonreg_grad(quarters, deaths, tol=1e-8, maxiter=20000)\n", "head(pm, 10)\n", "tail(pm, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[(Intercept)] 0.339633920708136\n", "\\item[quarters] 0.256523593717915\n", "\\end{description*}\n" ], "text/markdown": [ "(Intercept)\n", ": 0.339633920708136quarters\n", ": 0.256523593717915\n", "\n" ], "text/plain": [ "(Intercept) quarters \n", " 0.3396339 0.2565236 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m <- glm(deaths ~ quarters, family = poisson())\n", "coef(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Coordinate descent\n", "\n", "* Idea: coordinate-wise minimization is often easier than the full-dimensional minimization\n", "$$\n", " x_j^{(t+1)} = \\arg\\min_{x_j} f(x_1^{(t+1)}, \\dotsc, x_{j-1}^{(t+1)}, x_j, x_{j+1}^{(t)}, \\dotsc, x_d^{(t)})\n", "$$\n", "\n", "\n", "\n", "* Similar to the Gauss-Seidel method for solving linear equations. \n", "\n", "* *Block descent*: a vector version of coordinate descent\n", "$$\n", " \\mathbf{x}_j^{(t+1)} = \\arg\\min_{\\mathbf{x}_j} f(\\mathbf{x}_1^{(t+1)}, \\dotsc, \\mathbf{x}_{j-1}^{(t+1)}, \\mathbf{x}_j, \\mathbf{x}_{j+1}^{(t)}, \\dotsc, \\mathbf{x}_d^{(t)})\n", "$$\n", "\n", "* Q: why objective value converges?\n", " - A: descent property\n", "\n", "* Caution: even if the objective function is convex, coordinate descent may *not* converge to the global minimum and converge to a suboptimal point.\n", "\n", "" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "153px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "441.3333435058594px", "left": "0px", "right": "903.3333129882813px", "top": "140.6666717529297px", "width": "166px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }