{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/plain": [ "R version 4.2.2 (2022-10-31)\n", "Platform: x86_64-apple-darwin17.0 (64-bit)\n", "Running under: macOS Catalina 10.15.7\n", "\n", "Matrix products: default\n", "BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib\n", "LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib\n", "\n", "locale:\n", "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", "\n", "attached base packages:\n", "[1] stats graphics grDevices utils datasets methods base \n", "\n", "loaded via a namespace (and not attached):\n", " [1] fansi_1.0.4 crayon_1.5.2 digest_0.6.31 utf8_1.2.2 \n", " [5] IRdisplay_1.1 repr_1.1.5 lifecycle_1.0.3 jsonlite_1.8.4 \n", " [9] evaluate_0.20 pillar_1.8.1 rlang_1.0.6 cli_3.6.0 \n", "[13] uuid_1.1-0 vctrs_0.5.2 IRkernel_1.3.2 tools_4.2.2 \n", "[17] glue_1.6.2 fastmap_1.1.0 compiler_4.2.2 base64enc_0.1-3\n", "[21] pbdZMQ_0.3-9 htmltools_0.5.4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sessionInfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Optimization\n", "\n", "### Unconstrained optimization \n", "\n", "* Problem\n", "$$\n", "\\min_{\\mathbf{x}\\in \\mathbb{R}^d} f(\\mathbf{x})\n", "$$\n", "\n", "* Maximization: $\\min_{\\mathbf{x}\\in\\mathbb{R}^d} -f(\\mathbf{x})$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Golden section search\n", "\n", "* $d=1$.\n", "\n", "* Applicable to functions defined on an interval of the real line.\n", "\n", "* Assumption: $f$ is continuous and *strictly unimodal* on the interval\n", "\\\n", "\n", "\n", " A function $f: [a,b] \\to \\mathbb{R}$ is strictly unimodal if \n", " $$\n", " f(\\lambda x + (1-\\lambda) y) < \\max\\{ f(x), f(y) \\},\n", " \\quad\n", " x, y \\in [a, b],~x\\neq y, ~\\lambda \\in (0, 1).\n", " $$\n", "\n", "* Idea: keep track of three points $a < x_1 < b$ such that $f(x_1) < \\min\\{f(a), f(b)\\}$\n", "\n", "* Choose a new point $x_0$ such that $x_0$ belongs to the longer of intervals $(a, x_1)$ and $(x_1, b)$.\n", "\n", "* WLOG, suppose $a < x_0 < x_1$.\n", " - If $f(x_0) > f(x_1)$, then the minimum lies in the interval $(x_0, b)$. Hence set $(a, x_1, b) := (x_0, x_1, b)$.\n", " - Otherwise, the minimum lies in the interval $(a, x_1)$. Hence set $(a, x_1, b) := (a, x_0, x_1)$. \n", "\n", " \n", "\n", "\n", "\n", "* Above rules do not say how to choose $x_0$. A possibility is to keep the lengths of the two possible candidate intervals the same: $b - x_0 = x_1 - a$, and the relative position of the intermediate points ($x_1$ to $(x_0, b)$; $x_0$ to $(a, x_1)$) the same. \n", "\n", "* Let $\\alpha = \\frac{x_0 - a}{b - a}$ and $y = \\frac{x_1 - x_0}{b -a}$. 
Then we require $\\frac{b - x_1}{b - a} = \\alpha$ and\n", "$$\n", " 1:\\alpha = y + \\alpha : y, \n", " \\quad\n", " \\alpha + y + \\alpha = 1\n", " .\n", "$$\n", "By solving this equation, we have $\\alpha = \\frac{3 - \\sqrt{5}}{2}$ or $1 - \\alpha = \\frac{\\sqrt{5} - 1}{2}$, the golden section.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "goldensection <- function(f, a, b, tol=1e-6) {\n", " alpha <- (3 - sqrt(5)) * 0.5\n", " f\n", " iter <- 0\n", " while ( b - a > tol ) {\n", " x0 <- a + alpha * (b - a)\n", " x1 <- b - alpha * (b - a)\n", " print(c(a, x1, b))\n", " if ( f(x0) > f(x1) ) {\n", " a <- x0\n", " } else { # if ( f(x0) < f(x1) ) {\n", " b <- x1\n", " x1 <- x0\n", " } #else {\n", " # if ( f(b) < f(a) ) {\n", " # a <- x0\n", " # } else {\n", " # b <- x1\n", " # x1 <- x0 \n", " # }\n", " #}\n", " iter <- iter + 1\n", " }\n", " ret <- list(val = (a + b) / 2, iter = iter)\n", "}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 0.0100000 0.6156733 0.9900000\n", "[1] 0.3843267 0.7586534 0.9900000\n", "[1] 0.6156733 0.8470199 0.9900000\n", "[1] 0.6156733 0.7586534 0.8470199\n", "[1] 0.6156733 0.7040399 0.7586534\n", "[1] 0.6702868 0.7249004 0.7586534\n", "[1] 0.6702868 0.7040399 0.7249004\n", "[1] 0.6911473 0.7120079 0.7249004\n", "[1] 0.6911473 0.7040399 0.7120079\n", "[1] 0.6911473 0.6991154 0.7040399\n", "[1] 0.6960718 0.7009963 0.7040399\n", "[1] 0.6960718 0.6991154 0.7009963\n", "[1] 0.6979528 0.6998338 0.7009963\n", "[1] 0.6991154 0.7002779 0.7009963\n", "[1] 0.6991154 0.6998338 0.7002779\n", "[1] 0.6995594 0.7000034 0.7002779\n", "[1] 0.6998338 0.7001083 0.7002779\n", "[1] 0.6998338 0.7000034 0.7001083\n", "[1] 0.6999387 0.7000435 0.7001083\n", "[1] 0.6999387 0.7000034 0.7000435\n", "[1] 0.6999787 0.7000187 0.7000435\n", "[1] 0.6999787 0.7000034 0.7000187\n", "[1] 0.6999940 0.7000093 0.7000187\n", "[1] 0.6999940 0.7000034 0.7000093\n", "[1] 0.6999940 0.6999998 0.7000034\n", "[1] 0.6999976 0.7000012 0.7000034\n", "[1] 0.6999976 0.6999998 0.7000012\n", "[1] 0.6999990 0.7000004 0.7000012\n", "[1] 0.6999990 0.6999998 0.7000004\n", "[1] 0.6999995 0.7000000 0.7000004\n", "[1] 0.6999998 0.7000002 0.7000004\n", "[1] 0.6999998 0.7000000 0.7000002\n", "[1] 0.7000000 0.7000001 0.7000002\n", "[1] 0.7000000 0.7000000 0.7000001\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n", "[1] 0.7 0.7 0.7\n" ] } ], "source": [ "# binomial log-likelihood of 7 successes and 3 failures\n", "\n", "F <- function(x) { -7 * log(x) - 3 * log(1 - x) }\n", "goldensection(F, 0.01, 0.99, tol=1e-8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convergence of golden section search is slow but sure to find the global minimum (if the assumptions are met)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton's Method\n", "\n", "* First-order optimality condition\n", "$$\n", "\\nabla f(\\mathbf{x}^\\star) = 0.\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Apply Newton-Raphson to $g(\\mathbf{x})=\\nabla f(\\mathbf{x})$.\n", "\n", "\n", "* Idea: iterative quadratic approximation. 
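\n",
"\n",
"* As a quick preview of the update derived below, Newton's iteration applied to the binomial example from the golden section search above (a hedged sketch, not part of the original notes; `dF` and `d2F` are the first two derivatives of `F`, written out here for illustration):\n",
"```r\n",
"dF  <- function(x) -7 / x + 3 / (1 - x)      # F'(x)\n",
"d2F <- function(x)  7 / x^2 + 3 / (1 - x)^2  # F''(x), positive on (0, 1)\n",
"x <- 0.5\n",
"for (i in 1:5) x <- x - dF(x) / d2F(x)       # Newton update x <- x - F'(x) / F''(x)\n",
"x                                            # lands on the minimizer 0.7 within a few iterations\n",
"```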
\n", "\n", "* Second-order Taylor expansion of the objective function around the current iterate $\\mathbf{x}^{(t)}$\n", "$$\n", "\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)}) + \\frac {1}{2} (\\mathbf{x} - \\mathbf{x}^{(t)})^T [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)})\n", "$$\n", "and then minimize the quadratic approximation.\n", "\n", "* To maximize the quadratic appriximation function, we equate its gradient to zero\n", "$$\n", "\t\\nabla f(\\mathbf{x}^{(t)}) + [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)}) = \\mathbf{0},\n", "$$\n", "which suggests the next iterate\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\mathbf{x}^{(t+1)} &=& \\mathbf{x}^{(t)} - [\\nabla^2 f(\\mathbf{x}^{(t)})]^{-1} \\nabla f(\\mathbf{x}^{(t)}),\n", "\\end{eqnarray*}\n", "$$\n", "a complete analogue of the univariate Newton-Raphson for solving $g(x)=0$.\n", "\n", "* Considered the gold standard for its fast (quadratic) convergence: $\\mathbf{x}^{(t)} \\to \\mathbf{x}^{\\star}$ and\n", "$$\n", "\t\\frac{\\|\\mathbf{x}^{(t+1)} - \\mathbf{x}^{\\star}\\|}{\\|\\mathbf{x}^{(t)} - \\mathbf{x}^{\\star}\\|^2} \\to \\text{constant}.\n", "$$\n", "In words, the estimate gets accurate by two decimal digits per iteration.\n", "\n", "\n", "We call this **naive Newton's method**.\n", "\n", "* **Stability issue**: naive Newton's iterate is **not** guaranteed to be a descent algorithm, i.e., that ensures $f(\\mathbf{x}^{(t+1)}) \\le f(\\mathbf{x}^{(t)})$. It's equally happy to head uphill or downhill. Following example shows that the Newton iterate converges to a local maximum, converges to a local minimum, or diverges depending on starting points." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "purenewton <- function(f, df, d2f, x0, maxiter=10, tol=1e-6) {\n", " xold <- x0\n", " stop <- FALSE\n", " iter <- 1\n", " x <- x0\n", " while ((!stop) && (iter < maxiter)) {\n", " x <- x - df(x) / d2f(x)\n", " print(x)\n", " xdiff <- x - xold\n", " if (abs(xdiff) < tol) stop <- TRUE \n", " xold <- x\n", " iter <- iter + 1\n", " }\n", " return(list(val=x, iter=iter))\n", "}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "f <- function(x) sin(x) # objective function\n", "df <- function(x) cos(x) # gradient\n", "d2f <- function(x) -sin(x) # hessian" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 1.542342\n", "[1] 1.570804\n", "[1] 1.570796\n", "[1] 1.570796\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
1.5707963267949
\n", "\t
$iter
\n", "\t\t
5
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 1.5707963267949\n", "\\item[\\$iter] 5\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 1.5707963267949\n", "$iter\n", ": 5\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 1.570796\n", "\n", "$iter\n", "[1] 5\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 0.328211\n", "[1] 3.264834\n", "[1] 11.33789\n", "[1] 10.98155\n", "[1] 10.99558\n", "[1] 10.99557\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
10.9955742875643
\n", "\t
$iter
\n", "\t\t
7
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 10.9955742875643\n", "\\item[\\$iter] 7\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 10.9955742875643\n", "$iter\n", ": 7\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 10.99557\n", "\n", "$iter\n", "[1] 7\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 2.75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 4.863691\n", "[1] 4.711224\n", "[1] 4.712389\n", "[1] 4.712389\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
4.71238898038469
\n", "\t
$iter
\n", "\t\t
5
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] 4.71238898038469\n", "\\item[\\$iter] 5\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 4.71238898038469\n", "$iter\n", ": 5\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 4.712389\n", "\n", "$iter\n", "[1] 5\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "purenewton(f, df, d2f, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical Newton\n", "\n", "* A remedy for the instability issue:\n", " 1. approximate $\\nabla^2 f(\\mathbf{x}^{(t)})$ by a positive definite $\\mathbf{H}^{(t)}$ (if it's not), **and** \n", " 2. line search (backtracking) to ensure the descent property. \n", "\n", "* Why insist on a _positive definite_ approximation of Hessian? First define the **Newton direction**:\n", "$$\n", "\\Delta \\mathbf{x}^{(t)} = [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}).\n", "$$\n", "By first-order Taylor expansion,\n", "$$\n", "\\begin{align*}\n", " f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)}) \n", " &= \\nabla f(\\mathbf{x}^{(t)})^T s \\Delta \\mathbf{x}^{(t)} + o(s) \\\\\n", " &= - s \\cdot \\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) + o(s).\n", "\\end{align*}\n", "$$\n", "For $s$ sufficiently small, $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)})$ is strictly negative if $\\mathbf{H}^{(t)}$ is positive definite. \n", "\\\n", "The quantity $\\{\\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)})\\}^{1/2}$ is termed the **Newton decrement**.\n", "\n", "* In summary, a **practical Newton-type algorithm** iterates according to\n", "$$\n", "\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - s [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) \n", "\t= \\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)} }\n", "$$\n", "where $\\mathbf{H}^{(t)}$ is a positive definite approximation to $\\nabla^2 f(\\mathbf{x}^{(t)})$ and $s$ is a step size.\n", "\n", "* For strictly convex $f$, $\\nabla^2 f(\\mathbf{x}^{(t)})$ is always positive definite. In this case, the above algorithm is called **damped Newton**. However, *line search* is still needed (at least for a finite number of times) to guarantee convergence.\n", "\n", "### Line search\n", "\n", "* Line search: compute the Newton direction and search $s$ such that $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)})$ is minimized.\n", "\n", "* Note the Newton direction only needs to be calculated once. Cost of line search mainly lies in objective function evaluation.\n", "\n", "* Full line search: $s = \\arg\\min_{\\alpha} f(\\mathbf{x}^{(t)} + \\alpha \\Delta \\mathbf{x}^{(t)})$. May use golden section search.\n", "\n", "* Approximate line search: step-halving ($s=1,1/2,\\ldots$), Amijo rule, ...\n", "\n", "\n", "* Backtracking line search (Armijo rule)\n", "```r\n", " # Backtracking line search\n", " # given: descent direction ∆x, x ∈ domf, α ∈ (0,0.5), β ∈ (0,1).\n", " t <- 1.0\n", " while (f(x + t * delx) > f(x) + alpha * t * sum(gradf(x) * delx) {\n", " t <- beta * t\n", " }\n", "```\n", "\n", "\n", "\n", "> The lower dashed line shows the linear extrapolation of $f$, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that $f$ lies below the upper dashed line, i.e., $0 \\le t \\le t_0$.\n", "\n", "* How to approximate $\\nabla^2 f(\\mathbf{x})$? More of an art than science. 
"\n",
"* Taking $\\mathbf{H}^{(t)} = \\mathbf{I}$ leads to the **gradient descent method** (see below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Fisher scoring\n",
"\n",
"* Consider MLE in which $f(\\mathbf{x}) = -\\ell(\\boldsymbol{\\theta})$, where $\\ell(\\boldsymbol{\\theta})$ is the log-likelihood of parameter $\\boldsymbol{\\theta}$.\n",
"\n",
"* **Fisher scoring method**: replace $- \\nabla^2 \\ell(\\boldsymbol{\\theta})$ by the expected Fisher information matrix\n",
"$$\n",
"\t\\mathbf{I}(\\boldsymbol{\\theta}) = \\mathbf{E}[-\\nabla^2\\ell(\\boldsymbol{\\theta})] = \\mathbf{E}[\\nabla \\ell(\\boldsymbol{\\theta}) \\nabla \\ell(\\boldsymbol{\\theta})^T] \\succeq \\mathbf{0},\n",
"$$\n",
"which holds when the order of expectation and differentiation can be exchanged (true for most common distributions).\n",
"\n",
"    Therefore we set $\\mathbf{H}^{(t)}=\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})$ and obtain the Fisher scoring algorithm: \n",
"$$\n",
"\t\\boxed{ \\boldsymbol{\\theta}^{(t+1)} = \\boldsymbol{\\theta}^{(t)} + s [\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})]^{-1} \\nabla \\ell(\\boldsymbol{\\theta}^{(t)})}.\n",
"$$\n",
"\n",
"* Combined with line search, a descent algorithm can be devised." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"### Example: logistic regression\n",
"\n",
"* Binary data: response $y_i \\in \\{0,1\\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$.  \n",
"\n",
"* Model: $y_i \\sim $Bernoulli$(p_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = p_i &=& g^{-1}(\\eta_i) = \\frac{e^{\\eta_i}}{1+ e^{\\eta_i}} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(p_i) = \\log \\left( \\frac{p_i}{1-p_i} \\right) \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\\\\n",
"\t&=& e^{y_i \\log p_i + (1-y_i) \\log (1-p_i)} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\frac{p_i}{1-p_i} + \\log (1-p_i)\\right).\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log p_i + (1-y_i) \\log (1-p_i) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right] \\\\\n",
"\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+e^{\\mathbf{x}_i^T \\beta}} \\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - p_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})\t\\\\\n",
"\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n p_i(1-p_i) \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} = \\text{diag}(w_1,\\ldots,w_n), ~ w_i = p_i (1-p_i) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n",
"\\end{eqnarray*}\n",
"    (why the last line?)\n",
"\n",
"* Therefore for this problem **Newton's method == Fisher scoring**: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is equivalent to solving a *weighted* least squares problem \n",
"$$\n",
"\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n",
"$$\n",
"which we already know how to solve well.\n",
"Thus the name **IRLS (iteratively re-weighted least squares)**.\n",
"\n",
"* Implication: if a weighted least squares solver is at hand, then logistic regression models can be fitted.\n",
"    - IRLS == Fisher scoring == Newton's method" ] },
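{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged IRLS sketch for logistic regression, mirroring the derivation above (not part of the original notes; `logisticreg_irls` is a made-up name, `X` is assumed to contain an intercept column, and `lm()` with `weights` plays the role of the weighted least squares solver):\n", "```r\n", "logisticreg_irls <- function(X, y, maxiter = 25, tol = 1e-8) {\n", "    beta <- rep(0, ncol(X))\n", "    for (iter in 1:maxiter) {\n", "        eta <- drop(X %*% beta)\n", "        p   <- 1 / (1 + exp(-eta))    # mean function\n", "        w   <- p * (1 - p)            # IRLS weights, assumed to stay positive\n", "        z   <- eta + (y - p) / w      # working responses (step size s = 1)\n", "        betanew <- coef(lm(z ~ X - 1, weights = w))   # weighted least squares\n", "        done <- sum(abs(betanew - beta)) < tol\n", "        beta <- betanew\n", "        if (done) break\n", "    }\n", "    beta\n", "}\n", "```\n", "With a design matrix `X <- cbind(1, x1)` and a binary response `y`, the result should agree with `coef(glm(y ~ x1, family = binomial()))` up to the convergence tolerance." ] },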
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Example: Poisson regression\n",
"\n",
"* Count data: response $y_i \\in \\{0, 1, 2, \\dotsc \\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$.  \n",
"\n",
"* Model: $y_i \\sim $Poisson$(\\lambda_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = \\lambda_i &=& g^{-1}(\\eta_i) = e^{\\eta_i} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(\\lambda_i) = \\log \\lambda_i \\quad \\text{(log link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|\\lambda_i) &=& e^{-\\lambda_i}\\frac{\\lambda_i^{y_i}}{y_i!} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) \\right)\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) 
\\right] \\\\\n", "\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n", "\t&=& \\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\t\\\\\n", "\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n \\lambda_i \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n", "\t\\text{where } \\mathbf{W} = \\text{diag}(\\lambda_1,\\ldots,\\lambda_n) \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n", "\\end{eqnarray*}\n", " (why the last line?)\n", "\n", "* Therefore for this problem **Newton's method == Fisher scoring**: \n", "$$\n", "\\begin{eqnarray*}\n", "\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n", "\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\\\\n", "\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\right] \\\\\n", "\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n", "\\end{eqnarray*}\n", "$$\n", "where \n", "$$\n", "\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)})\n", "$$ \n", "are the working responses. A Newton iteration is again equivalent to solving a *weighed* least squares problem \n", "$$\n", "\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n", "$$\n", "which leads to the IRLS.\n", "\n", "* Implication: if a weighted least squares solver is at hand, then Poisson regression models can be fitted.\n", " - IRLS == Fisher scoring == Newton's method" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1\n", "\\item 2\n", "\\item 3\n", "\\item 4\n", "\\item 5\n", "\\item 6\n", "\\item 7\n", "\\item 8\n", "\\item 9\n", "\\item 10\n", "\\item 11\n", "\\item 12\n", "\\item 13\n", "\\item 14\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1\n", "2. 2\n", "3. 3\n", "4. 4\n", "5. 5\n", "6. 6\n", "7. 7\n", "8. 8\n", "9. 9\n", "10. 10\n", "11. 11\n", "12. 12\n", "13. 13\n", "14. 14\n", "\n", "\n" ], "text/plain": [ " [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Quarterly count of AIDS deaths in Australia (from Dobson, 1990)\n", "deaths <- c(0, 1, 2, 3, 1, 4, 9, 18, 23, 31, 20, 25, 37, 45)\n", "(quarters <- seq_along(deaths))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "# Poisson regression using Fisher scoring (IRLS) and step halving\n", "# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n", "poissonreg <- function(x, y, maxiter=10, tol=1e-6) {\n", " beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n", " betaold <- beta0\n", " stop <- FALSE\n", " iter <- 1\n", " inneriter <- rep(0, maxiter) # to count no. step halving\n", " beta <- beta0\n", " lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n", " likold <- lik(betaold)\n", " while ((!stop) && (iter < maxiter)) {\n", " eta <- beta[1] + x * beta[2]\n", " w <- exp(eta) # lambda\n", " # line search by step halving\n", " s <- 1.0\n", " for (i in 1:5) {\n", " z <- eta + s * (y / w - 1) # working response\n", " m <- lm(z ~ x, weights=w) # weighted least squares\n", " beta <- as.matrix(coef(m))\n", " curlik <- lik(beta)\n", " if (curlik > likold) break\n", " s <- s * 0.5\n", " inneriter[iter] <- inneriter[iter] + 1\n", " }\n", " print(c(as.numeric(beta), inneriter[iter], curlik))\n", " betadiff <- beta - betaold\n", " if (norm(betadiff, \"F\") < tol) stop <- TRUE \n", " likold <- curlik\n", " betaold <- beta\n", " iter <- iter + 1\n", " }\n", " return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n", "}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] -1.3076923 0.4184066 3.0000000 443.4763188\n", "[1] 0.6456032 0.2401380 0.0000000 469.9257845\n", "[1] 0.3743785 0.2541525 0.0000000 472.0483495\n", "[1] 0.3400344 0.2564929 0.0000000 472.0625465\n", "[1] 0.3396340 0.2565236 0.0000000 472.0625479\n", "[1] 0.3396339 0.2565236 0.0000000 472.0625479\n" ] }, { "data": { "text/html": [ "
\n", "\t
$val
\n", "\t\t
\n", "
  1. 0.339633920708135
  2. 0.256523593717915
\n", "
\n", "\t
$iter
\n", "\t\t
7
\n", "\t
$inneriter
\n", "\t\t
\n", "
  1. 3
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
\n", "
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$val] \\begin{enumerate*}\n", "\\item 0.339633920708135\n", "\\item 0.256523593717915\n", "\\end{enumerate*}\n", "\n", "\\item[\\$iter] 7\n", "\\item[\\$inneriter] \\begin{enumerate*}\n", "\\item 3\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\item 0\n", "\\end{enumerate*}\n", "\n", "\\end{description}\n" ], "text/markdown": [ "$val\n", ": 1. 0.339633920708135\n", "2. 0.256523593717915\n", "\n", "\n", "\n", "$iter\n", ": 7\n", "$inneriter\n", ": 1. 3\n", "2. 0\n", "3. 0\n", "4. 0\n", "5. 0\n", "6. 0\n", "7. 0\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "$val\n", "[1] 0.3396339 0.2565236\n", "\n", "$iter\n", "[1] 7\n", "\n", "$inneriter\n", "[1] 3 0 0 0 0 0 0\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "poissonreg(quarters, deaths)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[(Intercept)] 0.339633920708136\n", "\\item[quarters] 0.256523593717915\n", "\\end{description*}\n" ], "text/markdown": [ "(Intercept)\n", ": 0.339633920708136quarters\n", ": 0.256523593717915\n", "\n" ], "text/plain": [ "(Intercept) quarters \n", " 0.3396339 0.2565236 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m <- glm(deaths ~ quarters, family = poisson())\n", "coef(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Homework: repeat this using the Armijo rule.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generalized Linear Models (GLM)\n", "\n", "That IRLS == Fisher scoring == Newton's method for both logistic and Poisson regression is not a coincidence. Let's consider a more general class of generalized linear models (GLM).\n", "\n", "#### Exponential families\n", "\n", "* Random variable $Y$ belongs to an exponential family if the density\n", "$$\n", "\tp(y|\\eta,\\phi) = \\exp \\left\\{ \\frac{y\\eta - b(\\eta)}{a(\\phi)} + c(y,\\phi) \\right\\}.\n", "$$\n", " * $\\eta$: natural parameter. \n", " * $\\phi>0$: dispersion parameter. \n", " * Mean: $\\mu= b'(\\eta)$. When $b'(\\cdot)$ is invertible, function $g(\\cdot)=[b']^{-1}(\\cdot)$ is called the canonical link function.\n", " * Variance $\\mathbf{Var}{Y}=b''(\\eta)a(\\phi)$.\n", "\n", "\n", "* For example, if $Y \\sim \\text{Ber}(\\mu)$, then\n", "$$\n", "p(y|\\eta,\\phi) = \\exp\\left( y \\log \\frac{\\mu}{1-\\mu} + \\log (1-\\mu)\\right).\n", "$$\n", "Hence\n", "$$\n", "\\eta = \\log \\frac{\\mu}{1-\\mu}, \\quad\n", "\\mu = \\frac{e^{\\eta}}{1+e^{\\eta}},\n", "\\quad\n", "b(\\eta) = -\\log (1-\\mu) = \\log(1+e^{\\eta})\n", "$$\n", "Hence\n", "$$\n", "b'(\\eta) = \\frac{e^{\\eta}}{1+e^{\\eta}} = g^{-1}(\\eta).\n", "$$\n", "as above.\n", "\n", "\n", "| Family | Canonical Link | Variance Function |\n", "|------------------|------------------------------------------------|-------------------|\n", "| Normal (unit variance) | $\\eta=\\mu$ | 1 |\n", "| Poisson | $\\eta=\\log \\mu$ | $\\mu$ |\n", "| Binomial | $\\eta=\\log \\left(\\frac{\\mu}{1 - \\mu} \\right)$ | $\\mu (1 - \\mu)$ |\n", "| Gamma | $\\eta = \\mu^{-1}$ | $\\mu^2$ |\n", "| Inverse Gaussian | $\\eta = \\mu^{-2}$ | $\\mu^3$ |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generalized linear models\n", "\n", "GLM models the conditional distribution of $Y$ given predictors $\\mathbf{x}$ through\n", "the conditional mean $\\mu = \\mathbf{E}(Y|\\mathbf{x})$ via a strictly increasing link function\n", "$$\n", "\tg(\\mu) = \\mathbf{x}^T \\beta, \\quad \\mu = g^{-1}(\\mathbf{x}^T\\beta) = b'(\\eta)\n", "$$\n", "\n", "From these relations we have (assuming no overdispersion, i.e., $a(\\phi)\\equiv 1$)\n", "$$\n", "\\mathbf{x}^T d\\beta = g'(\\mu)d\\mu,\n", "\\quad\n", "d\\mu = b''(\\eta)d\\eta, \n", "\\quad\n", "b''(\\eta) = \\mathbf{Var}[Y] = \\sigma^2,\n", "\\quad\n", "d\\eta = \\frac{1}{b''(\\eta)}d\\mu = \\frac{1}{b''(\\eta)g'(\\mu)}\\mathbf{x}^T d\\beta.\n", "$$\n", "Therefore\n", "$$\n", " d\\mu = \\frac{1}{g'(\\mu)}\\mathbf{x}^T d\\beta\n", " .\n", "$$\n", "\n", "Then, after some workout with matrix calculus, we have for $n$ samples:\n", "\n", "* Score, Hessian, information\n", "\n", "\\begin{eqnarray*}\n", " \t\\nabla\\ell(\\beta) &=& \\sum_{i=1}^n \\frac{(y_i-\\mu_i) [1/g'(\\mu_i)]}{\\sigma_i^2} \\mathbf{x}_i, \\quad \\mu_i = g^{-1}(\\mathbf{x}_i^T\\beta), ~\\sigma_i^2 = b''(\\eta_i), \\\\\n", "\t- \\nabla^2 \\ell(\\beta) &=& 
\\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T - \\sum_{i=1}^n \\frac{(y_i - \\mu_i)[b'''(\\eta_i)/[b''(\\eta_i)(g'(\\mu_i)^2] - g''(\\mu_i)/[g'(\\mu_i)]^3]}{\\sigma^2} \\mathbf{x}_i \\mathbf{x}_i^T, \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}.\n", "\\end{eqnarray*} \n", "\n", "\n", "* Fisher scoring method:\n", "$$\n", " \t\\beta^{(t+1)} = \\beta^{(t)} + s [\\mathbf{I}(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\n", "$$\n", "IRLS with weights $w_i = [1/g'(\\mu_i)]^2/\\sigma_i^2$ and some working responses $z_i$.\n", "\n", "* For *canonical link*, $\\mathbf{x}^T\\beta = g(\\mu) =[b']^{-1}(\\mu) = \\eta$. The second term of Hessian vanishes because $d\\eta=\\mathbf{x}^Td\\beta$ and $b''(\\eta)=1/g'(\\mu)$. The Hessian coincides with Fisher information matrix. **IRLS == Fisher scoring == Newton's method**. Hence MLE is a *convex* optimization problem.\n", "\n", " \n", "* Non-canonical link, **Fisher scoring != Newton's method**, and MLE is in general a *non-convex* optimization problem.\n", "\n", " Example: Probit regression (binary response with probit link).\n", "\\begin{eqnarray*}\n", " y_i &\\sim& \\text{Ber}(p_i) \\\\\n", " p_i &=& \\Phi(\\mathbf{x}_i^T \\beta) \\\\\n", " \\eta_i &=& \\log\\left(\\frac{p_i}{1-p_i}\\right) \\neq \\mathbf{x}_i^T \\beta = \\Phi^{-1}(p_i).\n", "\\end{eqnarray*}\n", " where $\\Phi(\\cdot)$ is the cdf of a standard normal. Exceptionally, this problem is a convex optimization problem.\n", " \n", "* R implements the Fisher scoring method, aka IRLS, for GLMs in function `glm()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nonlinear regression - Gauss-Newton method\n", "\n", "* Now we finally get to the problem Gauss faced in 1801! 
\n", "Relocate the dwarf planet Ceres by fitting 24 observations to a 6-parameter (nonlinear) orbit.\n", " - In 1801, Jan 1 -- Feb 11 (41 days), astronomer Piazzi discovered Ceres, which was lost behind the Sun after observing its orbit 24 times.\n", " - Aug -- Sep, futile search by top astronomers; Laplace claimed it unsolvable.\n", " - Oct -- Nov, Gauss did calculations by method of least squares, sent his results to astronomer von Zach.\n", " - Dec 31, von Zach relocated Ceres according to Gauss’ calculation.\n", "\n", "* Nonlinear least squares (curve fitting): \n", "$$\n", "\t\\text{minimize} \\,\\, f(\\beta) = \\frac{1}{2} \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i, \\beta)]^2\n", "$$\n", "For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve\n", "$$\n", "\t\\mu(x, \\beta_1,\\beta_2,\\beta_3) = \\frac{\\beta_3}{1 + \\exp(-\\beta_1 - \\beta_2 x)}.\n", "$$\n", "\n", "\n", "\n", "* If $\\mu_i$ is a linear function of $\\beta$, i.e., $\\mathbf{x}_i^T\\beta$, then NLLS reduces to the usual least squares.\n", "\n", "* \"Score\" and \"information matrices\"\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\nabla f(\\beta) &=& - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\t\\nabla^2 f(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\t\\mathbf{I}(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T = \\mathbf{J}(\\beta)^T \\mathbf{J}(\\beta),\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{J}(\\beta)^T = [\\nabla \\mu_1(\\mathbf{x}_1,\\beta), \\ldots, \\nabla \\mu_n(\\mathbf{x}_n,\\beta)] \\in \\mathbb{R}^{p \\times n}$.\n", "\n", "* **Gauss-Newton** (= \"Fisher scoring method\") uses $\\mathbf{I}(\\beta)$, which is always positive semidefinite.\n", "$$\n", "\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)})]^{-1} \\nabla f(\\beta^{(t)}) }\n", "$$\n", "\n", "* Justification\n", " 1. Residuals $y_i - \\mu_i(\\mathbf{x}_i, \\beta)$ are small, or $\\mu_i$ is nearly linear;\n", " 2. Model the data as $y_i \\sim N(\\mu_i(\\mathbf{x}_i, \\beta), \\sigma^2)$, where $\\sigma^2$ is assumed to be known. Then, \n", "\\begin{align*}\n", " \\ell(\\beta) &= -\\frac{1}{2\\sigma^2}\\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n", " \\nabla\\ell(\\beta) &= -\\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n", " -\\nabla^2 \\ell(\\beta) &= \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n", "\\end{align*}\n", " Thus\n", " $$\n", " \\mathbf{E}[-\\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\nabla \\mu_i(\\beta) \\nabla \\mu_i(\\beta)^T.\n", " $$\n", " \n", "* In fact Gauss invented the Gaussian (normal) distribution to justify this method!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent method\n", "\n", "* Iteration:\n", "$$\n", "\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - \\gamma_t\\nabla f(\\mathbf{x}^{(t)}) }\n", "$$\n", "for some step size $\\gamma_t$. 
This is a special case of the Netwon-type algorithms with $\\mathbf{H}_t = \\mathbf{I}$ and $\\Delta\\mathbf{x} = \\nabla f(\\mathbf{x}^{(t)})$.\n", "\n", "* Idea: iterative linear approximation (first-order Taylor series expansion). \n", "$$\n", "\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)})\n", "$$\n", "and then minimize the linear approximation within a compact set: let $\\Delta\\mathbf{x} = \\mathbf{x}^{(t+1)}-\\mathbf{x}^{(t)}$ to choose. Solve\n", "$$\n", " \\min_{\\|\\Delta\\mathbf{x}\\|_2 \\le 1} \\nabla f(\\mathbf{x}^{(t)})^T\\Delta\\mathbf{x}\n", "$$\n", "to obtain $\\Delta\\mathbf{x} = -\\nabla f(\\mathbf{x}^{(t)})/\\|\\nabla f(\\mathbf{x}^{(t)})\\|_2 \\propto -\\nabla f(\\mathbf{x}^{(t)})$.\n", "\n", "\n", "\n", "* Step sizes are chosen so that the descent property is maintained (e.g., line search).\n", " - Step size *must* be chosen at any time, since unlike Netwon's method, first-order approximation is always unbounded below.\n", "\n", "* Pros\n", " - Each iteration is inexpensive.\n", " - No need to derive, compute, store and invert Hessians; attractive in large scale problems. \n", " \n", "* Cons\n", " - Slow convergence (zigzagging).\n", "\n", " \n", " \n", " - Do not work for non-smooth problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convergence\n", "\n", "* If \n", " 1. $f$ is convex and differentiable over $\\mathbb{R}^d$;\n", " 2. $f$ has $L$-Lipschitz gradients, i.e., $\\|\\nabla f(\\mathbf{x}) - \\nabla f(\\mathbf{y})\\|_2 \\le L\\|\\mathbf{x} - \\mathbf{y} \\|_2$;\n", " 3. $p^{\\star} = \\inf_{\\mathbf{x}\\in\\mathbb{R}^d} f(x) > -\\infty$ and attained at $\\mathbf{x}^{\\star}$,\n", " \n", " then \n", "$$\n", " f(\\mathbf{x}^{(t)}) - p^{\\star} \\le \\frac{1}{2\\gamma t}\\|\\mathbf{x}^{(0)} - \\mathbf{x}^{\\star}\\|_2^2 = O(1/t).\n", "$$\n", " for constant step size ($\\gamma_t = \\gamma \\in (0, 1/L]$), a *sublinear* convergence. Similar upper bound with line search.\n", "\n", "* In general, the best we can obtain from gradient descent is linear convergence (cf. quadratic convergence of Newton)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "* Least squares: $f(\\beta) = \\frac{1}{2}\\|\\mathbf{X}\\beta - \\mathbf{y}\\|_2^2 = \\frac{1}{2}\\beta^T\\mathbf{X}^T\\mathbf{X}\\beta - \\beta^T\\mathbf{X}^T\\mathbf{y} + \\frac{1}{2}\\mathbf{y}^T\\mathbf{y}$\n", "\n", " - Gradient $\\nabla f(\\beta) = \\mathbf{X}^T\\mathbf{X}\\beta - \\mathbf{X}^T\\mathbf{y}$ is $\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz. Hence $\\gamma \\in (0, 1/\\sigma_{\\max}(\\mathbf{X})^2]$ guarantees descent property and convergence.\n", "\n", "* Logistic regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right]$\n", "\n", " - Gradient $\\nabla f(\\beta) = - \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})$ is $\\frac{1}{4}\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz.\n", "\n", " - May be unbounded below, e.g., all the data are of the same class.\n", "\n", " - Adding a ridge penalty $\\frac{\\rho}{2}\\|\\beta\\|_2^2$ makes the problem identifiable.\n", " \n", "* Poisson regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\exp(\\mathbf{x}_i^T \\beta) - \\log(y_i!) 
\\right]$\n", " - Gradient\n", "\\begin{eqnarray*}\n", "\t\\nabla f(\\beta) &=& -\\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n", "\t&=& -\\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = -\\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\n", "\\end{eqnarray*}\n", " is *not* Lipschitz (HW). What is the consequence?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "# Poisson regression using gradient descent\n", "# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n", "poissonreg_grad <- function(x, y, maxiter=10, tol=1e-6) {\n", " beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n", " betaold <- beta0\n", " iter <- 1\n", " inneriter <- rep(0, maxiter) # to count no. step halving\n", " betamat <- matrix(NA, nrow=maxiter, ncol=length(beta0) + 2)\n", " colnames(betamat) <- c(\"beta0\", \"beta1\", \"inneriter\", \"lik\")\n", " beta <- beta0\n", " lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n", " likold <- lik(betaold)\n", " while (iter < maxiter) {\n", " eta <- beta[1] + x * beta[2]\n", " lam <- exp(eta) # lambda\n", " grad <- as.matrix(c(sum(lam - y), sum((lam - y) * x)))\n", " # line search by step halving\n", " #s <- 0.00001\n", " s <- 0.00005\n", " for (i in 1:5) {\n", " beta <- beta - s * grad\n", " curlik <- lik(beta)\n", " if (curlik > likold) break\n", " s <- s * 0.5\n", " inneriter[iter] <- inneriter[iter] + 1\n", " }\n", " #print(c(as.numeric(beta), inneriter[iter], curlik))\n", " betamat[iter,] <- c(as.numeric(beta), inneriter[iter], curlik)\n", " betadiff <- beta - betaold\n", " if (norm(betadiff, \"F\") < tol) break\n", " likold <- curlik\n", " betaold <- beta\n", " iter <- iter + 1\n", " }\n", " #return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n", " return(betamat[1:iter,])\n", "}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A matrix: 10 × 4 of type dbl\n", "\\begin{tabular}{llll}\n", " beta0 & beta1 & inneriter & lik\\\\\n", "\\hline\n", "\t 0.01025000 & 0.1149500 & 0 & 241.3754\\\\\n", "\t 0.01933954 & 0.2178634 & 0 & 423.2101\\\\\n", "\t 0.02505110 & 0.2826001 & 0 & 471.3151\\\\\n", "\t 0.02532401 & 0.2830995 & 0 & 471.3176\\\\\n", "\t 0.02553404 & 0.2828481 & 0 & 471.3190\\\\\n", "\t 0.02577206 & 0.2829335 & 0 & 471.3201\\\\\n", "\t 0.02599725 & 0.2828674 & 0 & 471.3212\\\\\n", "\t 0.02622798 & 0.2828694 & 0 & 471.3222\\\\\n", "\t 0.02645600 & 0.2828408 & 0 & 471.3233\\\\\n", "\t 0.02668501 & 0.2828259 & 0 & 471.3243\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A matrix: 10 × 4 of type dbl\n", "\n", "| beta0 | beta1 | inneriter | lik |\n", "|---|---|---|---|\n", "| 0.01025000 | 0.1149500 | 0 | 241.3754 |\n", "| 0.01933954 | 0.2178634 | 0 | 423.2101 |\n", "| 0.02505110 | 0.2826001 | 0 | 471.3151 |\n", "| 0.02532401 | 0.2830995 | 0 | 471.3176 |\n", "| 0.02553404 | 0.2828481 | 0 | 471.3190 |\n", "| 0.02577206 | 0.2829335 | 0 | 471.3201 |\n", "| 0.02599725 | 0.2828674 | 0 | 471.3212 |\n", "| 0.02622798 | 0.2828694 | 0 | 471.3222 |\n", "| 0.02645600 | 0.2828408 | 0 | 471.3233 |\n", "| 0.02668501 | 0.2828259 | 0 | 471.3243 |\n", "\n" ], "text/plain": [ " beta0 beta1 inneriter lik \n", " [1,] 0.01025000 0.1149500 0 241.3754\n", " [2,] 0.01933954 0.2178634 0 423.2101\n", " [3,] 0.02505110 0.2826001 0 471.3151\n", " [4,] 0.02532401 0.2830995 0 471.3176\n", " [5,] 0.02553404 0.2828481 0 471.3190\n", " [6,] 0.02577206 0.2829335 0 471.3201\n", " [7,] 0.02599725 0.2828674 0 471.3212\n", " [8,] 0.02622798 0.2828694 0 471.3222\n", " [9,] 0.02645600 0.2828408 0 471.3233\n", "[10,] 0.02668501 0.2828259 0 471.3243" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A matrix: 10 × 4 of type dbl\n", "\\begin{tabular}{r|llll}\n", " & beta0 & beta1 & inneriter & lik\\\\\n", "\\hline\n", "\t{[}12951,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12952,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12953,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12954,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12955,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12956,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12957,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12958,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12959,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\t{[}12960,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A matrix: 10 × 4 of type dbl\n", "\n", "| | beta0 | beta1 | inneriter | lik |\n", "|---|---|---|---|---|\n", "| [12951,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12952,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12953,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12954,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12955,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12956,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12957,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n", "| [12958,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "| [12959,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "| [12960,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n", "\n" ], "text/plain": [ " beta0 beta1 inneriter lik \n", "[12951,] 0.3396212 0.2565247 0 472.0625\n", "[12952,] 0.3396212 0.2565247 0 472.0625\n", "[12953,] 0.3396212 0.2565247 0 472.0625\n", "[12954,] 0.3396212 0.2565247 0 472.0625\n", "[12955,] 0.3396212 0.2565247 0 472.0625\n", "[12956,] 0.3396212 0.2565247 0 472.0625\n", "[12957,] 0.3396212 0.2565247 0 472.0625\n", "[12958,] 0.3396213 0.2565247 0 472.0625\n", "[12959,] 0.3396213 0.2565247 0 472.0625\n", "[12960,] 0.3396213 0.2565247 0 472.0625" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Australian AIDS data from the IRLS example above\n", "pm <- poissonreg_grad(quarters, deaths, tol=1e-8, maxiter=20000)\n", "head(pm, 10)\n", "tail(pm, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[(Intercept)] 0.339633920708136\n", "\\item[quarters] 0.256523593717915\n", "\\end{description*}\n" ], "text/markdown": [ "(Intercept)\n", ": 0.339633920708136quarters\n", ": 0.256523593717915\n", "\n" ], "text/plain": [ "(Intercept) quarters \n", " 0.3396339 0.2565236 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m <- glm(deaths ~ quarters, family = poisson())\n", "coef(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Coordinate descent\n", "\n", "* Idea: coordinate-wise minimization is often easier than the full-dimensional minimization\n", "$$\n", " x_j^{(t+1)} = \\arg\\min_{x_j} f(x_1^{(t+1)}, \\dotsc, x_{j-1}^{(t+1)}, x_j, x_{j+1}^{(t)}, \\dotsc, x_d^{(t)})\n", "$$\n", "\n", "\n", "\n", "* Similar to the Gauss-Seidel method for solving linear equations. \n", "\n", "* *Block descent*: a vector version of coordinate descent\n", "$$\n", " \\mathbf{x}_j^{(t+1)} = \\arg\\min_{\\mathbf{x}_j} f(\\mathbf{x}_1^{(t+1)}, \\dotsc, \\mathbf{x}_{j-1}^{(t+1)}, \\mathbf{x}_j, \\mathbf{x}_{j+1}^{(t)}, \\dotsc, \\mathbf{x}_d^{(t)})\n", "$$\n", "\n", "* Q: why objective value converges?\n", " - A: descent property\n", "\n", "* Caution: even if the objective function is convex, coordinate descent may *not* converge to the global minimum and converge to a suboptimal point.\n", "\n", "" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "153px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "441.3333435058594px", "left": "0px", "right": "903.3333129882813px", "top": "140.6666717529297px", "width": "166px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }