{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"data": {
"text/plain": [
"R version 4.2.2 (2022-10-31)\n",
"Platform: x86_64-apple-darwin17.0 (64-bit)\n",
"Running under: macOS Catalina 10.15.7\n",
"\n",
"Matrix products: default\n",
"BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib\n",
"LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib\n",
"\n",
"locale:\n",
"[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n",
"\n",
"attached base packages:\n",
"[1] stats graphics grDevices utils datasets methods base \n",
"\n",
"loaded via a namespace (and not attached):\n",
" [1] fansi_1.0.4 crayon_1.5.2 digest_0.6.31 utf8_1.2.2 \n",
" [5] IRdisplay_1.1 repr_1.1.5 lifecycle_1.0.3 jsonlite_1.8.4 \n",
" [9] evaluate_0.20 pillar_1.8.1 rlang_1.0.6 cli_3.6.0 \n",
"[13] uuid_1.1-0 vctrs_0.5.2 IRkernel_1.3.2 tools_4.2.2 \n",
"[17] glue_1.6.2 fastmap_1.1.0 compiler_4.2.2 base64enc_0.1-3\n",
"[21] pbdZMQ_0.3-9 htmltools_0.5.4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sessionInfo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optimization\n",
"\n",
"### Unconstrained optimization \n",
"\n",
"* Problem\n",
"$$\n",
"\\min_{\\mathbf{x}\\in \\mathbb{R}^d} f(\\mathbf{x})\n",
"$$\n",
"\n",
"* Maximization: $\\min_{\\mathbf{x}\\in\\mathbb{R}^d} -f(\\mathbf{x})$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Golden section search\n",
"\n",
"* $d=1$.\n",
"\n",
"* Applicable to functions defined on an interval of the real line.\n",
"\n",
"* Assumption: $f$ is continuous and *strictly unimodal* on the interval\n",
"\\\n",
"\n",
"\n",
" A function $f: [a,b] \\to \\mathbb{R}$ is strictly unimodal if \n",
" $$\n",
" f(\\lambda x + (1-\\lambda) y) < \\max\\{ f(x), f(y) \\},\n",
" \\quad\n",
" x, y \\in [a, b],~x\\neq y, ~\\lambda \\in (0, 1).\n",
" $$\n",
"\n",
"* Idea: keep track of three points $a < x_1 < b$ such that $f(x_1) < \\min\\{f(a), f(b)\\}$\n",
"\n",
"* Choose a new point $x_0$ such that $x_0$ belongs to the longer of intervals $(a, x_1)$ and $(x_1, b)$.\n",
"\n",
"* WLOG, suppose $a < x_0 < x_1$.\n",
" - If $f(x_0) > f(x_1)$, then the minimum lies in the interval $(x_0, b)$. Hence set $(a, x_1, b) := (x_0, x_1, b)$.\n",
" - Otherwise, the minimum lies in the interval $(a, x_1)$. Hence set $(a, x_1, b) := (a, x_0, x_1)$. \n",
"\n",
" \n",
"\n",
"\n",
"\n",
"* Above rules do not say how to choose $x_0$. A possibility is to keep the lengths of the two possible candidate intervals the same: $b - x_0 = x_1 - a$, and the relative position of the intermediate points ($x_1$ to $(x_0, b)$; $x_0$ to $(a, x_1)$) the same. \n",
"\n",
"* Let $\\alpha = \\frac{x_0 - a}{b - a}$ and $y = \\frac{x_1 - x_0}{b -a}$. Then we require $\\frac{b - x_1}{b - a} = \\alpha$ and\n",
"$$\n",
" 1:\\alpha = y + \\alpha : y, \n",
" \\quad\n",
" \\alpha + y + \\alpha = 1\n",
" .\n",
"$$\n",
"By solving this equation, we have $\\alpha = \\frac{3 - \\sqrt{5}}{2}$ or $1 - \\alpha = \\frac{\\sqrt{5} - 1}{2}$, the golden section.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
},
"tags": []
},
"outputs": [],
"source": [
"goldensection <- function(f, a, b, tol=1e-6) {\n",
" alpha <- (3 - sqrt(5)) * 0.5\n",
" f\n",
" iter <- 0\n",
" while ( b - a > tol ) {\n",
" x0 <- a + alpha * (b - a)\n",
" x1 <- b - alpha * (b - a)\n",
" print(c(a, x1, b))\n",
" if ( f(x0) > f(x1) ) {\n",
" a <- x0\n",
" } else { # if ( f(x0) < f(x1) ) {\n",
" b <- x1\n",
" x1 <- x0\n",
" } #else {\n",
" # if ( f(b) < f(a) ) {\n",
" # a <- x0\n",
" # } else {\n",
" # b <- x1\n",
" # x1 <- x0 \n",
" # }\n",
" #}\n",
" iter <- iter + 1\n",
" }\n",
" ret <- list(val = (a + b) / 2, iter = iter)\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 0.0100000 0.6156733 0.9900000\n",
"[1] 0.3843267 0.7586534 0.9900000\n",
"[1] 0.6156733 0.8470199 0.9900000\n",
"[1] 0.6156733 0.7586534 0.8470199\n",
"[1] 0.6156733 0.7040399 0.7586534\n",
"[1] 0.6702868 0.7249004 0.7586534\n",
"[1] 0.6702868 0.7040399 0.7249004\n",
"[1] 0.6911473 0.7120079 0.7249004\n",
"[1] 0.6911473 0.7040399 0.7120079\n",
"[1] 0.6911473 0.6991154 0.7040399\n",
"[1] 0.6960718 0.7009963 0.7040399\n",
"[1] 0.6960718 0.6991154 0.7009963\n",
"[1] 0.6979528 0.6998338 0.7009963\n",
"[1] 0.6991154 0.7002779 0.7009963\n",
"[1] 0.6991154 0.6998338 0.7002779\n",
"[1] 0.6995594 0.7000034 0.7002779\n",
"[1] 0.6998338 0.7001083 0.7002779\n",
"[1] 0.6998338 0.7000034 0.7001083\n",
"[1] 0.6999387 0.7000435 0.7001083\n",
"[1] 0.6999387 0.7000034 0.7000435\n",
"[1] 0.6999787 0.7000187 0.7000435\n",
"[1] 0.6999787 0.7000034 0.7000187\n",
"[1] 0.6999940 0.7000093 0.7000187\n",
"[1] 0.6999940 0.7000034 0.7000093\n",
"[1] 0.6999940 0.6999998 0.7000034\n",
"[1] 0.6999976 0.7000012 0.7000034\n",
"[1] 0.6999976 0.6999998 0.7000012\n",
"[1] 0.6999990 0.7000004 0.7000012\n",
"[1] 0.6999990 0.6999998 0.7000004\n",
"[1] 0.6999995 0.7000000 0.7000004\n",
"[1] 0.6999998 0.7000002 0.7000004\n",
"[1] 0.6999998 0.7000000 0.7000002\n",
"[1] 0.7000000 0.7000001 0.7000002\n",
"[1] 0.7000000 0.7000000 0.7000001\n",
"[1] 0.7 0.7 0.7\n",
"[1] 0.7 0.7 0.7\n",
"[1] 0.7 0.7 0.7\n",
"[1] 0.7 0.7 0.7\n",
"[1] 0.7 0.7 0.7\n"
]
}
],
"source": [
"# binomial log-likelihood of 7 successes and 3 failures\n",
"\n",
"F <- function(x) { -7 * log(x) - 3 * log(1 - x) }\n",
"goldensection(F, 0.01, 0.99, tol=1e-8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convergence of golden section search is slow but sure to find the global minimum (if the assumptions are met)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Newton's Method\n",
"\n",
"* First-order optimality condition\n",
"$$\n",
"\\nabla f(\\mathbf{x}^\\star) = 0.\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Apply Newton-Raphson to $g(\\mathbf{x})=\\nabla f(\\mathbf{x})$.\n",
"\n",
"\n",
"* Idea: iterative quadratic approximation. \n",
"\n",
"* Second-order Taylor expansion of the objective function around the current iterate $\\mathbf{x}^{(t)}$\n",
"$$\n",
"\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)}) + \\frac {1}{2} (\\mathbf{x} - \\mathbf{x}^{(t)})^T [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)})\n",
"$$\n",
"and then minimize the quadratic approximation.\n",
"\n",
"* To maximize the quadratic appriximation function, we equate its gradient to zero\n",
"$$\n",
"\t\\nabla f(\\mathbf{x}^{(t)}) + [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x} - \\mathbf{x}^{(t)}) = \\mathbf{0},\n",
"$$\n",
"which suggests the next iterate\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{x}^{(t+1)} &=& \\mathbf{x}^{(t)} - [\\nabla^2 f(\\mathbf{x}^{(t)})]^{-1} \\nabla f(\\mathbf{x}^{(t)}),\n",
"\\end{eqnarray*}\n",
"$$\n",
"a complete analogue of the univariate Newton-Raphson for solving $g(x)=0$.\n",
"\n",
"* Considered the gold standard for its fast (quadratic) convergence: $\\mathbf{x}^{(t)} \\to \\mathbf{x}^{\\star}$ and\n",
"$$\n",
"\t\\frac{\\|\\mathbf{x}^{(t+1)} - \\mathbf{x}^{\\star}\\|}{\\|\\mathbf{x}^{(t)} - \\mathbf{x}^{\\star}\\|^2} \\to \\text{constant}.\n",
"$$\n",
"In words, the estimate gets accurate by two decimal digits per iteration.\n",
"\n",
"\n",
"We call this **naive Newton's method**.\n",
"\n",
"* **Stability issue**: naive Newton's iterate is **not** guaranteed to be a descent algorithm, i.e., that ensures $f(\\mathbf{x}^{(t+1)}) \\le f(\\mathbf{x}^{(t)})$. It's equally happy to head uphill or downhill. Following example shows that the Newton iterate converges to a local maximum, converges to a local minimum, or diverges depending on starting points."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"purenewton <- function(f, df, d2f, x0, maxiter=10, tol=1e-6) {\n",
" xold <- x0\n",
" stop <- FALSE\n",
" iter <- 1\n",
" x <- x0\n",
" while ((!stop) && (iter < maxiter)) {\n",
" x <- x - df(x) / d2f(x)\n",
" print(x)\n",
" xdiff <- x - xold\n",
" if (abs(xdiff) < tol) stop <- TRUE \n",
" xold <- x\n",
" iter <- iter + 1\n",
" }\n",
" return(list(val=x, iter=iter))\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"f <- function(x) sin(x) # objective function\n",
"df <- function(x) cos(x) # gradient\n",
"d2f <- function(x) -sin(x) # hessian"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 1.542342\n",
"[1] 1.570804\n",
"[1] 1.570796\n",
"[1] 1.570796\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\t- $val
\n",
"\t\t- 1.5707963267949
\n",
"\t- $iter
\n",
"\t\t- 5
\n",
"
\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$val] 1.5707963267949\n",
"\\item[\\$iter] 5\n",
"\\end{description}\n"
],
"text/markdown": [
"$val\n",
": 1.5707963267949\n",
"$iter\n",
": 5\n",
"\n",
"\n"
],
"text/plain": [
"$val\n",
"[1] 1.570796\n",
"\n",
"$iter\n",
"[1] 5\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"purenewton(f, df, d2f, 2.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 0.328211\n",
"[1] 3.264834\n",
"[1] 11.33789\n",
"[1] 10.98155\n",
"[1] 10.99558\n",
"[1] 10.99557\n"
]
},
{
"data": {
"text/html": [
"\n",
"\t- $val
\n",
"\t\t- 10.9955742875643
\n",
"\t- $iter
\n",
"\t\t- 7
\n",
"
\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$val] 10.9955742875643\n",
"\\item[\\$iter] 7\n",
"\\end{description}\n"
],
"text/markdown": [
"$val\n",
": 10.9955742875643\n",
"$iter\n",
": 7\n",
"\n",
"\n"
],
"text/plain": [
"$val\n",
"[1] 10.99557\n",
"\n",
"$iter\n",
"[1] 7\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"purenewton(f, df, d2f, 2.75)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 4.863691\n",
"[1] 4.711224\n",
"[1] 4.712389\n",
"[1] 4.712389\n"
]
},
{
"data": {
"text/html": [
"\n",
"\t- $val
\n",
"\t\t- 4.71238898038469
\n",
"\t- $iter
\n",
"\t\t- 5
\n",
"
\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$val] 4.71238898038469\n",
"\\item[\\$iter] 5\n",
"\\end{description}\n"
],
"text/markdown": [
"$val\n",
": 4.71238898038469\n",
"$iter\n",
": 5\n",
"\n",
"\n"
],
"text/plain": [
"$val\n",
"[1] 4.712389\n",
"\n",
"$iter\n",
"[1] 5\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"purenewton(f, df, d2f, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practical Newton\n",
"\n",
"* A remedy for the instability issue:\n",
" 1. approximate $\\nabla^2 f(\\mathbf{x}^{(t)})$ by a positive definite $\\mathbf{H}^{(t)}$ (if it's not), **and** \n",
" 2. line search (backtracking) to ensure the descent property. \n",
"\n",
"* Why insist on a _positive definite_ approximation of Hessian? First define the **Newton direction**:\n",
"$$\n",
"\\Delta \\mathbf{x}^{(t)} = [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}).\n",
"$$\n",
"By first-order Taylor expansion,\n",
"$$\n",
"\\begin{align*}\n",
" f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)}) \n",
" &= \\nabla f(\\mathbf{x}^{(t)})^T s \\Delta \\mathbf{x}^{(t)} + o(s) \\\\\n",
" &= - s \\cdot \\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) + o(s).\n",
"\\end{align*}\n",
"$$\n",
"For $s$ sufficiently small, $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)}) - f(\\mathbf{x}^{(t)})$ is strictly negative if $\\mathbf{H}^{(t)}$ is positive definite. \n",
"\\\n",
"The quantity $\\{\\nabla f(\\mathbf{x}^{(t)})^T [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)})\\}^{1/2}$ is termed the **Newton decrement**.\n",
"\n",
"* In summary, a **practical Newton-type algorithm** iterates according to\n",
"$$\n",
"\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - s [\\mathbf{H}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}) \n",
"\t= \\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)} }\n",
"$$\n",
"where $\\mathbf{H}^{(t)}$ is a positive definite approximation to $\\nabla^2 f(\\mathbf{x}^{(t)})$ and $s$ is a step size.\n",
"\n",
"* For strictly convex $f$, $\\nabla^2 f(\\mathbf{x}^{(t)})$ is always positive definite. In this case, the above algorithm is called **damped Newton**. However, *line search* is still needed (at least for a finite number of times) to guarantee convergence.\n",
"\n",
"### Line search\n",
"\n",
"* Line search: compute the Newton direction and search $s$ such that $f(\\mathbf{x}^{(t)} + s \\Delta \\mathbf{x}^{(t)})$ is minimized.\n",
"\n",
"* Note the Newton direction only needs to be calculated once. Cost of line search mainly lies in objective function evaluation.\n",
"\n",
"* Full line search: $s = \\arg\\min_{\\alpha} f(\\mathbf{x}^{(t)} + \\alpha \\Delta \\mathbf{x}^{(t)})$. May use golden section search.\n",
"\n",
"* Approximate line search: step-halving ($s=1,1/2,\\ldots$), Amijo rule, ...\n",
"\n",
"\n",
"* Backtracking line search (Armijo rule)\n",
"```r\n",
" # Backtracking line search\n",
" # given: descent direction ∆x, x ∈ domf, α ∈ (0,0.5), β ∈ (0,1).\n",
" t <- 1.0\n",
" while (f(x + t * delx) > f(x) + alpha * t * sum(gradf(x) * delx) {\n",
" t <- beta * t\n",
" }\n",
"```\n",
"\n",
"\n",
"\n",
"> The lower dashed line shows the linear extrapolation of $f$, and the upper dashed line has a slope a factor of α smaller. The backtracking condition is that $f$ lies below the upper dashed line, i.e., $0 \\le t \\le t_0$.\n",
"\n",
"* How to approximate $\\nabla^2 f(\\mathbf{x})$? More of an art than science. Often requires problem specific analysis. \n",
"\n",
"* Taking $\\mathbf{H}^{(t)} = \\mathbf{I}$ leads to the **gradient descent method** (see below)."
]
},
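{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal, self-contained sketch (not the lecture's reference code) of a damped Newton iteration with backtracking line search for a univariate objective; the name `newton_backtrack` and the defaults `alpha=0.25`, `beta=0.5` are illustrative choices.\n",
"\n",
"```r\n",
"# Damped Newton with backtracking (Armijo) line search -- a sketch\n",
"# f, df, d2f: objective, gradient, Hessian (univariate for simplicity)\n",
"newton_backtrack <- function(f, df, d2f, x0, alpha=0.25, beta=0.5,\n",
"                             maxiter=100, tol=1e-8) {\n",
"    x <- x0\n",
"    for (iter in 1:maxiter) {\n",
"        g <- df(x)\n",
"        h <- max(d2f(x), 1e-8)          # crude positive surrogate for the Hessian\n",
"        delx <- -g / h                  # Newton direction\n",
"        if (abs(g * delx) < tol) break  # squared Newton decrement small => stop\n",
"        s <- 1.0\n",
"        # backtrack until the trial point is admissible and decreases f enough\n",
"        while (!is.finite(f(x + s * delx)) ||\n",
"               f(x + s * delx) > f(x) + alpha * s * g * delx) {\n",
"            s <- beta * s\n",
"        }\n",
"        x <- x + s * delx\n",
"    }\n",
"    list(val = x, iter = iter)\n",
"}\n",
"\n",
"# e.g., the negated binomial log-likelihood from the golden section example:\n",
"# newton_backtrack(function(x) -7 * log(x) - 3 * log(1 - x),\n",
"#                  function(x) -7 / x + 3 / (1 - x),\n",
"#                  function(x)  7 / x^2 + 3 / (1 - x)^2, 0.5)\n",
"```"
]
},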
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fisher scoring\n",
"\n",
"* Consider MLE in which $f(\\mathbf{x}) = -\\ell(\\boldsymbol{\\theta})$, where $\\ell(\\boldsymbol{\\theta})$ is the log-likelihood of parameter $\\boldsymbol{\\theta}$.\n",
"\n",
"* **Fisher scoring method**: replace $- \\nabla^2 \\ell(\\boldsymbol{\\theta})$ by the expected Fisher information matrix\n",
"$$\n",
"\t\\mathbf{I}(\\theta) = \\mathbf{E}[-\\nabla^2\\ell(\\boldsymbol{\\theta})] = \\mathbf{E}[\\nabla \\ell(\\boldsymbol{\\theta}) \\nabla \\ell(\\boldsymbol{\\theta})^T] \\succeq \\mathbf{0},\n",
"$$\n",
"which is true under exchangeability of tne expectation and the differentiation (true for most common distributions).\n",
"\n",
" Therefore we set $\\mathbf{H}^{(t)}=\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})$ and obtain the Fisher scoring algorithm: \n",
"$$\n",
"\t\\boxed{ \\boldsymbol{\\theta}^{(t+1)} = \\boldsymbol{\\theta}^{(t)} + s [\\mathbf{I}(\\boldsymbol{\\theta}^{(t)})]^{-1} \\nabla \\ell(\\boldsymbol{\\theta}^{(t)})}.\n",
"$$\n",
"\n",
"* Combined with line search, a descent algorithm can be devised."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: logistic regression\n",
"\n",
"* Binary data: response $y_i \\in \\{0,1\\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$. \n",
"\n",
"* Model: $y_i \\sim $Bernoulli$(p_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = p_i &=& g^{-1}(\\eta_i) = \\frac{e^{\\eta_i}}{1+ e^{\\eta_i}} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(p_i) = \\log \\left( \\frac{p_i}{1-p_i} \\right) \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|p_i) &=& p_i^{y_i} (1-p_i)^{1-y_i} \\\\\n",
"\t&=& e^{y_i \\log p_i + (1-y_i) \\log (1-p_i)} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\frac{p_i}{1-p_i} + \\log (1-p_i)\\right).\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log p_i + (1-y_i) \\log (1-p_i) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right] \\\\\n",
"\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\frac{e^{\\mathbf{x}_i^T \\beta}}{1+e^{\\mathbf{x}_i^T \\beta}} \\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - p_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})\t\\\\\n",
"\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n p_i(1-p_i) \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} &=& \\text{diag}(w_1,\\ldots,w_n), w_i = p_i (1-p_i) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n",
"\\end{eqnarray*}\n",
" (why the last line?)\n",
"\n",
"* Therefore for this problem **Newton's method == Fisher scoring**: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\mathbf{p}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is equivalent to solving a *weighed* least squares problem \n",
"$$\n",
"\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n",
"$$\n",
"for which we know how to solve well.\n",
"Thus the name **IRLS (iteratively re-weighted least squares)**.\n",
"\n",
"* Implication: if a weighted least squares solver is at hand, then logistic regression models can be fitted.\n",
" - IRLS == Fisher scoring == Newton's method"
]
},
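{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal IRLS sketch for logistic regression mirroring the derivation above, run on simulated data; the name `logisticreg_irls`, the simulated `x` and `y`, and the use of `plogis()` are illustrative choices, not part of the lecture code.\n",
"\n",
"```r\n",
"# IRLS (= Fisher scoring = Newton) for logistic regression -- a sketch on simulated data\n",
"set.seed(123)\n",
"n <- 200\n",
"x <- rnorm(n)\n",
"y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))\n",
"\n",
"logisticreg_irls <- function(x, y, maxiter=25, tol=1e-8) {\n",
"    beta <- c(0, 0)\n",
"    for (iter in 1:maxiter) {\n",
"        eta <- beta[1] + beta[2] * x\n",
"        p <- plogis(eta)               # p_i\n",
"        w <- p * (1 - p)               # w_i = p_i (1 - p_i)\n",
"        z <- eta + (y - p) / w         # working response (step size s = 1)\n",
"        m <- lm(z ~ x, weights = w)    # weighted least squares step\n",
"        betanew <- as.numeric(coef(m))\n",
"        if (sum(abs(betanew - beta)) < tol) { beta <- betanew; break }\n",
"        beta <- betanew\n",
"    }\n",
"    list(coef = beta, iter = iter)\n",
"}\n",
"\n",
"logisticreg_irls(x, y)$coef\n",
"# compare with: coef(glm(y ~ x, family = binomial()))\n",
"```"
]
},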
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: Poisson regression\n",
"\n",
"* Count data: response $y_i \\in \\{0, 1, 2, \\dotsc \\}$, predictor $\\mathbf{x}_i \\in \\mathbb{R}^{p}$. \n",
"\n",
"* Model: $y_i \\sim $Poisson$(\\lambda_i)$, where\n",
"\\begin{eqnarray*}\n",
"\t\\mathbf{E} (y_i) = \\lambda_i &=& g^{-1}(\\eta_i) = e^{\\eta_i} \\quad \\text{(mean function, inverse link function)} \\\\\n",
"\t\\eta_i = \\mathbf{x}_i^T \\beta &=& g(p_i) = \\log \\lambda_i \\quad \\text{(logit link function)}.\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* MLE: density\n",
"\\begin{eqnarray*}\n",
"\tf(y_i|\\lambda_i) &=& e^{-\\lambda_i}\\frac{\\lambda_i^{y_i}}{y_i!} \\\\\n",
"\t&=& \\exp\\left( y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) \\right)\n",
"\\end{eqnarray*}\n",
"\n",
"\n",
"* Log likelihood of the data $(y_i,\\mathbf{x}_i)$, $i=1,\\ldots,n$, and its derivatives are\n",
"\\begin{eqnarray*}\n",
"\t\\ell(\\beta) &=& \\sum_{i=1}^n \\left[ y_i \\log \\lambda_i - \\lambda_i - \\log(y_i!) \\right] \\\\\n",
"\t&=& \\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\exp(\\mathbf{x}_i^T \\beta) - \\log(y_i!) \\right] \\\\\n",
"\t\\nabla \\ell(\\beta) &=& \\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n",
"\t&=& \\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\t\\\\\n",
"\t- \\nabla^2\\ell(\\beta) &=& \\sum_{i=1}^n \\lambda_i \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}, \\quad\n",
"\t\\text{where } \\mathbf{W} = \\text{diag}(\\lambda_1,\\ldots,\\lambda_n) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2\\ell(\\beta)] = \\mathbf{X}^T \\mathbf{W} \\mathbf{X} = - \\nabla^2\\ell(\\beta) \\quad \\text{!!!}\n",
"\\end{eqnarray*}\n",
" (why the last line?)\n",
"\n",
"* Therefore for this problem **Newton's method == Fisher scoring**: \n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\beta^{(t+1)} &=& \\beta^{(t)} + s[-\\nabla^2 \\ell(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\t\\\\\n",
"\t&=& \\beta^{(t)} + s(\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\left[ \\mathbf{X} \\beta^{(t)} + s(\\mathbf{W}^{(t)})^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)}) \\right] \\\\\n",
"\t&=& (\\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{X})^{-1} \\mathbf{X}^T \\mathbf{W}^{(t)} \\mathbf{z}^{(t)},\n",
"\\end{eqnarray*}\n",
"$$\n",
"where \n",
"$$\n",
"\t\\mathbf{z}^{(t)} = \\mathbf{X} \\beta^{(t)} + s[\\mathbf{W}^{(t)}]^{-1} (\\mathbf{y} - \\boldsymbol{\\lambda}^{(t)})\n",
"$$ \n",
"are the working responses. A Newton iteration is again equivalent to solving a *weighed* least squares problem \n",
"$$\n",
"\\min_{\\beta} \\sum_{i=1}^n w_i (z_i - \\mathbf{x}_i^T \\beta)^2\n",
"$$\n",
"which leads to the IRLS.\n",
"\n",
"* Implication: if a weighted least squares solver is at hand, then Poisson regression models can be fitted.\n",
" - IRLS == Fisher scoring == Newton's method"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1\n",
"\\item 2\n",
"\\item 3\n",
"\\item 4\n",
"\\item 5\n",
"\\item 6\n",
"\\item 7\n",
"\\item 8\n",
"\\item 9\n",
"\\item 10\n",
"\\item 11\n",
"\\item 12\n",
"\\item 13\n",
"\\item 14\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1\n",
"2. 2\n",
"3. 3\n",
"4. 4\n",
"5. 5\n",
"6. 6\n",
"7. 7\n",
"8. 8\n",
"9. 9\n",
"10. 10\n",
"11. 11\n",
"12. 12\n",
"13. 13\n",
"14. 14\n",
"\n",
"\n"
],
"text/plain": [
" [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Quarterly count of AIDS deaths in Australia (from Dobson, 1990)\n",
"deaths <- c(0, 1, 2, 3, 1, 4, 9, 18, 23, 31, 20, 25, 37, 45)\n",
"(quarters <- seq_along(deaths))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"# Poisson regression using Fisher scoring (IRLS) and step halving\n",
"# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n",
"poissonreg <- function(x, y, maxiter=10, tol=1e-6) {\n",
" beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n",
" betaold <- beta0\n",
" stop <- FALSE\n",
" iter <- 1\n",
" inneriter <- rep(0, maxiter) # to count no. step halving\n",
" beta <- beta0\n",
" lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n",
" likold <- lik(betaold)\n",
" while ((!stop) && (iter < maxiter)) {\n",
" eta <- beta[1] + x * beta[2]\n",
" w <- exp(eta) # lambda\n",
" # line search by step halving\n",
" s <- 1.0\n",
" for (i in 1:5) {\n",
" z <- eta + s * (y / w - 1) # working response\n",
" m <- lm(z ~ x, weights=w) # weighted least squares\n",
" beta <- as.matrix(coef(m))\n",
" curlik <- lik(beta)\n",
" if (curlik > likold) break\n",
" s <- s * 0.5\n",
" inneriter[iter] <- inneriter[iter] + 1\n",
" }\n",
" print(c(as.numeric(beta), inneriter[iter], curlik))\n",
" betadiff <- beta - betaold\n",
" if (norm(betadiff, \"F\") < tol) stop <- TRUE \n",
" likold <- curlik\n",
" betaold <- beta\n",
" iter <- iter + 1\n",
" }\n",
" return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] -1.3076923 0.4184066 3.0000000 443.4763188\n",
"[1] 0.6456032 0.2401380 0.0000000 469.9257845\n",
"[1] 0.3743785 0.2541525 0.0000000 472.0483495\n",
"[1] 0.3400344 0.2564929 0.0000000 472.0625465\n",
"[1] 0.3396340 0.2565236 0.0000000 472.0625479\n",
"[1] 0.3396339 0.2565236 0.0000000 472.0625479\n"
]
},
{
"data": {
"text/html": [
"\n",
"\t- $val
\n",
"\t\t- \n",
"
- 0.339633920708135
- 0.256523593717915
\n",
" \n",
"\t- $iter
\n",
"\t\t- 7
\n",
"\t- $inneriter
\n",
"\t\t- \n",
"
- 3
- 0
- 0
- 0
- 0
- 0
- 0
\n",
" \n",
"
\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$val] \\begin{enumerate*}\n",
"\\item 0.339633920708135\n",
"\\item 0.256523593717915\n",
"\\end{enumerate*}\n",
"\n",
"\\item[\\$iter] 7\n",
"\\item[\\$inneriter] \\begin{enumerate*}\n",
"\\item 3\n",
"\\item 0\n",
"\\item 0\n",
"\\item 0\n",
"\\item 0\n",
"\\item 0\n",
"\\item 0\n",
"\\end{enumerate*}\n",
"\n",
"\\end{description}\n"
],
"text/markdown": [
"$val\n",
": 1. 0.339633920708135\n",
"2. 0.256523593717915\n",
"\n",
"\n",
"\n",
"$iter\n",
": 7\n",
"$inneriter\n",
": 1. 3\n",
"2. 0\n",
"3. 0\n",
"4. 0\n",
"5. 0\n",
"6. 0\n",
"7. 0\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"$val\n",
"[1] 0.3396339 0.2565236\n",
"\n",
"$iter\n",
"[1] 7\n",
"\n",
"$inneriter\n",
"[1] 3 0 0 0 0 0 0\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"poissonreg(quarters, deaths)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"data": {
"text/html": [
"- (Intercept)
- 0.339633920708136
- quarters
- 0.256523593717915
\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[(Intercept)] 0.339633920708136\n",
"\\item[quarters] 0.256523593717915\n",
"\\end{description*}\n"
],
"text/markdown": [
"(Intercept)\n",
": 0.339633920708136quarters\n",
": 0.256523593717915\n",
"\n"
],
"text/plain": [
"(Intercept) quarters \n",
" 0.3396339 0.2565236 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"m <- glm(deaths ~ quarters, family = poisson())\n",
"coef(m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Homework: repeat this using the Armijo rule.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generalized Linear Models (GLM)\n",
"\n",
"That IRLS == Fisher scoring == Newton's method for both logistic and Poisson regression is not a coincidence. Let's consider a more general class of generalized linear models (GLM).\n",
"\n",
"#### Exponential families\n",
"\n",
"* Random variable $Y$ belongs to an exponential family if the density\n",
"$$\n",
"\tp(y|\\eta,\\phi) = \\exp \\left\\{ \\frac{y\\eta - b(\\eta)}{a(\\phi)} + c(y,\\phi) \\right\\}.\n",
"$$\n",
" * $\\eta$: natural parameter. \n",
" * $\\phi>0$: dispersion parameter. \n",
" * Mean: $\\mu= b'(\\eta)$. When $b'(\\cdot)$ is invertible, function $g(\\cdot)=[b']^{-1}(\\cdot)$ is called the canonical link function.\n",
" * Variance $\\mathbf{Var}{Y}=b''(\\eta)a(\\phi)$.\n",
"\n",
"\n",
"* For example, if $Y \\sim \\text{Ber}(\\mu)$, then\n",
"$$\n",
"p(y|\\eta,\\phi) = \\exp\\left( y \\log \\frac{\\mu}{1-\\mu} + \\log (1-\\mu)\\right).\n",
"$$\n",
"Hence\n",
"$$\n",
"\\eta = \\log \\frac{\\mu}{1-\\mu}, \\quad\n",
"\\mu = \\frac{e^{\\eta}}{1+e^{\\eta}},\n",
"\\quad\n",
"b(\\eta) = -\\log (1-\\mu) = \\log(1+e^{\\eta})\n",
"$$\n",
"Hence\n",
"$$\n",
"b'(\\eta) = \\frac{e^{\\eta}}{1+e^{\\eta}} = g^{-1}(\\eta).\n",
"$$\n",
"as above.\n",
"\n",
"\n",
"| Family | Canonical Link | Variance Function |\n",
"|------------------|------------------------------------------------|-------------------|\n",
"| Normal (unit variance) | $\\eta=\\mu$ | 1 |\n",
"| Poisson | $\\eta=\\log \\mu$ | $\\mu$ |\n",
"| Binomial | $\\eta=\\log \\left(\\frac{\\mu}{1 - \\mu} \\right)$ | $\\mu (1 - \\mu)$ |\n",
"| Gamma | $\\eta = \\mu^{-1}$ | $\\mu^2$ |\n",
"| Inverse Gaussian | $\\eta = \\mu^{-2}$ | $\\mu^3$ |"
]
},
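{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numeric sanity check of the Bernoulli relation $b'(\\eta) = e^{\\eta}/(1+e^{\\eta})$ (a sketch, not part of the lecture; the finite-difference step is an arbitrary choice):\n",
"\n",
"```r\n",
"# check b'(eta) = e^eta / (1 + e^eta) numerically for the Bernoulli family\n",
"b <- function(eta) log(1 + exp(eta))\n",
"eta <- 0.7\n",
"(b(eta + 1e-6) - b(eta - 1e-6)) / 2e-6   # central-difference derivative of b\n",
"exp(eta) / (1 + exp(eta))                # g^{-1}(eta), the logistic mean function\n",
"```"
]
},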
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generalized linear models\n",
"\n",
"GLM models the conditional distribution of $Y$ given predictors $\\mathbf{x}$ through\n",
"the conditional mean $\\mu = \\mathbf{E}(Y|\\mathbf{x})$ via a strictly increasing link function\n",
"$$\n",
"\tg(\\mu) = \\mathbf{x}^T \\beta, \\quad \\mu = g^{-1}(\\mathbf{x}^T\\beta) = b'(\\eta)\n",
"$$\n",
"\n",
"From these relations we have (assuming no overdispersion, i.e., $a(\\phi)\\equiv 1$)\n",
"$$\n",
"\\mathbf{x}^T d\\beta = g'(\\mu)d\\mu,\n",
"\\quad\n",
"d\\mu = b''(\\eta)d\\eta, \n",
"\\quad\n",
"b''(\\eta) = \\mathbf{Var}[Y] = \\sigma^2,\n",
"\\quad\n",
"d\\eta = \\frac{1}{b''(\\eta)}d\\mu = \\frac{1}{b''(\\eta)g'(\\mu)}\\mathbf{x}^T d\\beta.\n",
"$$\n",
"Therefore\n",
"$$\n",
" d\\mu = \\frac{1}{g'(\\mu)}\\mathbf{x}^T d\\beta\n",
" .\n",
"$$\n",
"\n",
"Then, after some workout with matrix calculus, we have for $n$ samples:\n",
"\n",
"* Score, Hessian, information\n",
"\n",
"\\begin{eqnarray*}\n",
" \t\\nabla\\ell(\\beta) &=& \\sum_{i=1}^n \\frac{(y_i-\\mu_i) [1/g'(\\mu_i)]}{\\sigma_i^2} \\mathbf{x}_i, \\quad \\mu_i = g^{-1}(\\mathbf{x}_i^T\\beta), ~\\sigma_i^2 = b''(\\eta_i), \\\\\n",
"\t- \\nabla^2 \\ell(\\beta) &=& \\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T - \\sum_{i=1}^n \\frac{(y_i - \\mu_i)[b'''(\\eta_i)/[b''(\\eta_i)(g'(\\mu_i)^2] - g''(\\mu_i)/[g'(\\mu_i)]^3]}{\\sigma^2} \\mathbf{x}_i \\mathbf{x}_i^T, \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\mathbf{E} [- \\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\frac{[1/g'(\\mu_i)]^2}{\\sigma_i^2} \\mathbf{x}_i \\mathbf{x}_i^T = \\mathbf{X}^T \\mathbf{W} \\mathbf{X}.\n",
"\\end{eqnarray*} \n",
"\n",
"\n",
"* Fisher scoring method:\n",
"$$\n",
" \t\\beta^{(t+1)} = \\beta^{(t)} + s [\\mathbf{I}(\\beta^{(t)})]^{-1} \\nabla \\ell(\\beta^{(t)})\n",
"$$\n",
"IRLS with weights $w_i = [1/g'(\\mu_i)]^2/\\sigma_i^2$ and some working responses $z_i$.\n",
"\n",
"* For *canonical link*, $\\mathbf{x}^T\\beta = g(\\mu) =[b']^{-1}(\\mu) = \\eta$. The second term of Hessian vanishes because $d\\eta=\\mathbf{x}^Td\\beta$ and $b''(\\eta)=1/g'(\\mu)$. The Hessian coincides with Fisher information matrix. **IRLS == Fisher scoring == Newton's method**. Hence MLE is a *convex* optimization problem.\n",
"\n",
" \n",
"* Non-canonical link, **Fisher scoring != Newton's method**, and MLE is in general a *non-convex* optimization problem.\n",
"\n",
" Example: Probit regression (binary response with probit link).\n",
"\\begin{eqnarray*}\n",
" y_i &\\sim& \\text{Ber}(p_i) \\\\\n",
" p_i &=& \\Phi(\\mathbf{x}_i^T \\beta) \\\\\n",
" \\eta_i &=& \\log\\left(\\frac{p_i}{1-p_i}\\right) \\neq \\mathbf{x}_i^T \\beta = \\Phi^{-1}(p_i).\n",
"\\end{eqnarray*}\n",
" where $\\Phi(\\cdot)$ is the cdf of a standard normal. Exceptionally, this problem is a convex optimization problem.\n",
" \n",
"* R implements the Fisher scoring method, aka IRLS, for GLMs in function `glm()`."
]
},
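{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, `glm()` fits a binary model with the canonical logit link or a non-canonical probit link through the `family`/`link` arguments; the simulated data and names below are illustrative (a sketch):\n",
"\n",
"```r\n",
"# Fisher scoring / IRLS in R's glm(): canonical vs non-canonical link (a sketch)\n",
"set.seed(1)\n",
"xx <- rnorm(500)\n",
"yy <- rbinom(500, 1, pnorm(0.3 + 0.8 * xx))              # data generated under a probit model\n",
"coef(glm(yy ~ xx, family = binomial()))                  # logit link (canonical)\n",
"coef(glm(yy ~ xx, family = binomial(link = \"probit\")))   # probit link (Fisher scoring != Newton)\n",
"```"
]
},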
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Nonlinear regression - Gauss-Newton method\n",
"\n",
"* Now we finally get to the problem Gauss faced in 1801! \n",
"Relocate the dwarf planet Ceres by fitting 24 observations to a 6-parameter (nonlinear) orbit.\n",
" - In 1801, Jan 1 -- Feb 11 (41 days), astronomer Piazzi discovered Ceres, which was lost behind the Sun after observing its orbit 24 times.\n",
" - Aug -- Sep, futile search by top astronomers; Laplace claimed it unsolvable.\n",
" - Oct -- Nov, Gauss did calculations by method of least squares, sent his results to astronomer von Zach.\n",
" - Dec 31, von Zach relocated Ceres according to Gauss’ calculation.\n",
"\n",
"* Nonlinear least squares (curve fitting): \n",
"$$\n",
"\t\\text{minimize} \\,\\, f(\\beta) = \\frac{1}{2} \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i, \\beta)]^2\n",
"$$\n",
"For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve\n",
"$$\n",
"\t\\mu(x, \\beta_1,\\beta_2,\\beta_3) = \\frac{\\beta_3}{1 + \\exp(-\\beta_1 - \\beta_2 x)}.\n",
"$$\n",
"\n",
"\n",
"\n",
"* If $\\mu_i$ is a linear function of $\\beta$, i.e., $\\mathbf{x}_i^T\\beta$, then NLLS reduces to the usual least squares.\n",
"\n",
"* \"Score\" and \"information matrices\"\n",
"$$\n",
"\\begin{eqnarray*}\n",
"\t\\nabla f(\\beta) &=& - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n",
"\t\\nabla^2 f(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\mathbf{x}_i,\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n",
"\t\\mathbf{I}(\\beta) &=& \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T = \\mathbf{J}(\\beta)^T \\mathbf{J}(\\beta),\n",
"\\end{eqnarray*}\n",
"$$\n",
"where $\\mathbf{J}(\\beta)^T = [\\nabla \\mu_1(\\mathbf{x}_1,\\beta), \\ldots, \\nabla \\mu_n(\\mathbf{x}_n,\\beta)] \\in \\mathbb{R}^{p \\times n}$.\n",
"\n",
"* **Gauss-Newton** (= \"Fisher scoring method\") uses $\\mathbf{I}(\\beta)$, which is always positive semidefinite.\n",
"$$\n",
"\t\\boxed{ \\beta^{(t+1)} = \\beta^{(t)} - s [\\mathbf{I} (\\beta^{(t)})]^{-1} \\nabla f(\\beta^{(t)}) }\n",
"$$\n",
"\n",
"* Justification\n",
" 1. Residuals $y_i - \\mu_i(\\mathbf{x}_i, \\beta)$ are small, or $\\mu_i$ is nearly linear;\n",
" 2. Model the data as $y_i \\sim N(\\mu_i(\\mathbf{x}_i, \\beta), \\sigma^2)$, where $\\sigma^2$ is assumed to be known. Then, \n",
"\\begin{align*}\n",
" \\ell(\\beta) &= -\\frac{1}{2\\sigma^2}\\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n",
" \\nabla\\ell(\\beta) &= -\\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i, \\beta) \\\\\n",
" -\\nabla^2 \\ell(\\beta) &= \\sum_{i=1}^n \\nabla \\mu_i(\\mathbf{x}_i,\\beta) \\nabla \\mu_i(\\mathbf{x}_i,\\beta)^T - \\sum_{i=1}^n [y_i - \\mu_i(\\beta)] \\nabla^2 \\mu_i(\\mathbf{x}_i,\\beta) \\\\\n",
"\\end{align*}\n",
" Thus\n",
" $$\n",
" \\mathbf{E}[-\\nabla^2 \\ell(\\beta)] = \\sum_{i=1}^n \\nabla \\mu_i(\\beta) \\nabla \\mu_i(\\beta)^T.\n",
" $$\n",
" \n",
"* In fact Gauss invented the Gaussian (normal) distribution to justify this method!"
]
},
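{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Gauss-Newton sketch for the 3-parameter growth curve above, on synthetic data with a simple step-halving safeguard; `gaussnewton`, the simulated `xg`, `yg`, and the starting values are illustrative choices, not the lecture's code.\n",
"\n",
"```r\n",
"# Gauss-Newton for mu(x; b1, b2, b3) = b3 / (1 + exp(-b1 - b2 * x)) -- a sketch\n",
"set.seed(2)\n",
"xg <- seq(1, 15, by = 0.5)\n",
"yg <- 20 / (1 + exp(4 - 0.8 * xg)) + rnorm(length(xg), sd = 0.5)  # true beta = (-4, 0.8, 20)\n",
"\n",
"gaussnewton <- function(x, y, beta, maxiter=50, tol=1e-8) {\n",
"    rss <- function(b) sum((y - b[3] / (1 + exp(-b[1] - b[2] * x)))^2)\n",
"    for (iter in 1:maxiter) {\n",
"        u  <- exp(-beta[1] - beta[2] * x)\n",
"        mu <- beta[3] / (1 + u)\n",
"        # Jacobian of mu with respect to (b1, b2, b3)\n",
"        J  <- cbind(beta[3] * u / (1 + u)^2,\n",
"                    beta[3] * x * u / (1 + u)^2,\n",
"                    1 / (1 + u))\n",
"        r  <- y - mu\n",
"        delta <- as.numeric(solve(crossprod(J), crossprod(J, r)))  # (J'J)^{-1} J'r\n",
"        s <- 1  # step halving keeps the residual sum of squares from increasing\n",
"        while (rss(beta + s * delta) > rss(beta) && s > 1e-10) s <- s / 2\n",
"        beta <- beta + s * delta\n",
"        if (sqrt(sum((s * delta)^2)) < tol) break\n",
"    }\n",
"    list(coef = beta, iter = iter)\n",
"}\n",
"\n",
"gaussnewton(xg, yg, beta = c(-1, 0.5, max(yg)))\n",
"# compare with: nls(yg ~ b3 / (1 + exp(-b1 - b2 * xg)),\n",
"#                   start = list(b1 = -1, b2 = 0.5, b3 = max(yg)))\n",
"```"
]
},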
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient descent method\n",
"\n",
"* Iteration:\n",
"$$\n",
"\t\\boxed{ \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - \\gamma_t\\nabla f(\\mathbf{x}^{(t)}) }\n",
"$$\n",
"for some step size $\\gamma_t$. This is a special case of the Netwon-type algorithms with $\\mathbf{H}_t = \\mathbf{I}$ and $\\Delta\\mathbf{x} = \\nabla f(\\mathbf{x}^{(t)})$.\n",
"\n",
"* Idea: iterative linear approximation (first-order Taylor series expansion). \n",
"$$\n",
"\tf(\\mathbf{x}) \\approx f(\\mathbf{x}^{(t)}) + \\nabla f(\\mathbf{x}^{(t)})^T (\\mathbf{x} - \\mathbf{x}^{(t)})\n",
"$$\n",
"and then minimize the linear approximation within a compact set: let $\\Delta\\mathbf{x} = \\mathbf{x}^{(t+1)}-\\mathbf{x}^{(t)}$ to choose. Solve\n",
"$$\n",
" \\min_{\\|\\Delta\\mathbf{x}\\|_2 \\le 1} \\nabla f(\\mathbf{x}^{(t)})^T\\Delta\\mathbf{x}\n",
"$$\n",
"to obtain $\\Delta\\mathbf{x} = -\\nabla f(\\mathbf{x}^{(t)})/\\|\\nabla f(\\mathbf{x}^{(t)})\\|_2 \\propto -\\nabla f(\\mathbf{x}^{(t)})$.\n",
"\n",
"\n",
"\n",
"* Step sizes are chosen so that the descent property is maintained (e.g., line search).\n",
" - Step size *must* be chosen at any time, since unlike Netwon's method, first-order approximation is always unbounded below.\n",
"\n",
"* Pros\n",
" - Each iteration is inexpensive.\n",
" - No need to derive, compute, store and invert Hessians; attractive in large scale problems. \n",
" \n",
"* Cons\n",
" - Slow convergence (zigzagging).\n",
"\n",
" \n",
" \n",
" - Do not work for non-smooth problems."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convergence\n",
"\n",
"* If \n",
" 1. $f$ is convex and differentiable over $\\mathbb{R}^d$;\n",
" 2. $f$ has $L$-Lipschitz gradients, i.e., $\\|\\nabla f(\\mathbf{x}) - \\nabla f(\\mathbf{y})\\|_2 \\le L\\|\\mathbf{x} - \\mathbf{y} \\|_2$;\n",
" 3. $p^{\\star} = \\inf_{\\mathbf{x}\\in\\mathbb{R}^d} f(x) > -\\infty$ and attained at $\\mathbf{x}^{\\star}$,\n",
" \n",
" then \n",
"$$\n",
" f(\\mathbf{x}^{(t)}) - p^{\\star} \\le \\frac{1}{2\\gamma t}\\|\\mathbf{x}^{(0)} - \\mathbf{x}^{\\star}\\|_2^2 = O(1/t).\n",
"$$\n",
" for constant step size ($\\gamma_t = \\gamma \\in (0, 1/L]$), a *sublinear* convergence. Similar upper bound with line search.\n",
"\n",
"* In general, the best we can obtain from gradient descent is linear convergence (cf. quadratic convergence of Newton)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examples\n",
"\n",
"* Least squares: $f(\\beta) = \\frac{1}{2}\\|\\mathbf{X}\\beta - \\mathbf{y}\\|_2^2 = \\frac{1}{2}\\beta^T\\mathbf{X}^T\\mathbf{X}\\beta - \\beta^T\\mathbf{X}^T\\mathbf{y} + \\frac{1}{2}\\mathbf{y}^T\\mathbf{y}$\n",
"\n",
" - Gradient $\\nabla f(\\beta) = \\mathbf{X}^T\\mathbf{X}\\beta - \\mathbf{X}^T\\mathbf{y}$ is $\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz. Hence $\\gamma \\in (0, 1/\\sigma_{\\max}(\\mathbf{X})^2]$ guarantees descent property and convergence.\n",
"\n",
"* Logistic regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\log (1 + e^{\\mathbf{x}_i^T \\beta}) \\right]$\n",
"\n",
" - Gradient $\\nabla f(\\beta) = - \\mathbf{X}^T (\\mathbf{y} - \\mathbf{p})$ is $\\frac{1}{4}\\|\\mathbf{X}^T\\mathbf{X}\\|_2$-Lipschitz.\n",
"\n",
" - May be unbounded below, e.g., all the data are of the same class.\n",
"\n",
" - Adding a ridge penalty $\\frac{\\rho}{2}\\|\\beta\\|_2^2$ makes the problem identifiable.\n",
" \n",
"* Poisson regression: $f(\\beta) = -\\sum_{i=1}^n \\left[ y_i \\mathbf{x}_i^T \\beta - \\exp(\\mathbf{x}_i^T \\beta) - \\log(y_i!) \\right]$\n",
" - Gradient\n",
"\\begin{eqnarray*}\n",
"\t\\nabla f(\\beta) &=& -\\sum_{i=1}^n \\left( y_i \\mathbf{x}_i - \\exp(\\mathbf{x}_i^T \\beta)\\mathbf{x}_i \\right) \\\\\n",
"\t&=& -\\sum_{i=1}^n (y_i - \\lambda_i) \\mathbf{x}_i = -\\mathbf{X}^T (\\mathbf{y} - \\boldsymbol{\\lambda})\n",
"\\end{eqnarray*}\n",
" is *not* Lipschitz (HW). What is the consequence?"
]
},
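{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal gradient descent sketch for the least squares case, using the constant step size $1/\\sigma_{\\max}(\\mathbf{X})^2$ from the first bullet above; the simulated `X`, `y` and all names are illustrative.\n",
"\n",
"```r\n",
"# Gradient descent for least squares with constant step 1 / sigma_max(X)^2 -- a sketch\n",
"set.seed(3)\n",
"n <- 100; p <- 5\n",
"X <- matrix(rnorm(n * p), n, p)\n",
"y <- X %*% rnorm(p) + rnorm(n)\n",
"\n",
"L <- max(svd(X)$d)^2    # Lipschitz constant of the gradient: ||X'X||_2\n",
"gam <- 1 / L            # constant step size in (0, 1/L]\n",
"beta <- rep(0, p)\n",
"for (iter in 1:5000) {\n",
"    grad <- as.numeric(crossprod(X, X %*% beta - y))   # X'(X beta - y)\n",
"    beta <- beta - gam * grad\n",
"}\n",
"max(abs(beta - as.numeric(solve(crossprod(X), crossprod(X, y)))))  # distance to the LS solution\n",
"```"
]
},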
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"# Poisson regression using gradient descent\n",
"# Model: lambda = exp(beta0 + beta1 * quarter), or deaths ~ quarter\n",
"poissonreg_grad <- function(x, y, maxiter=10, tol=1e-6) {\n",
" beta0 <- matrix(0, nrow=2, ncol=1) # initial point\n",
" betaold <- beta0\n",
" iter <- 1\n",
" inneriter <- rep(0, maxiter) # to count no. step halving\n",
" betamat <- matrix(NA, nrow=maxiter, ncol=length(beta0) + 2)\n",
" colnames(betamat) <- c(\"beta0\", \"beta1\", \"inneriter\", \"lik\")\n",
" beta <- beta0\n",
" lik <- function(bet) {eta <- bet[1] + bet[2] * x; sum( y * eta - exp(eta) ) } # log-likelihood\n",
" likold <- lik(betaold)\n",
" while (iter < maxiter) {\n",
" eta <- beta[1] + x * beta[2]\n",
" lam <- exp(eta) # lambda\n",
" grad <- as.matrix(c(sum(lam - y), sum((lam - y) * x)))\n",
" # line search by step halving\n",
" #s <- 0.00001\n",
" s <- 0.00005\n",
" for (i in 1:5) {\n",
" beta <- beta - s * grad\n",
" curlik <- lik(beta)\n",
" if (curlik > likold) break\n",
" s <- s * 0.5\n",
" inneriter[iter] <- inneriter[iter] + 1\n",
" }\n",
" #print(c(as.numeric(beta), inneriter[iter], curlik))\n",
" betamat[iter,] <- c(as.numeric(beta), inneriter[iter], curlik)\n",
" betadiff <- beta - betaold\n",
" if (norm(betadiff, \"F\") < tol) break\n",
" likold <- curlik\n",
" betaold <- beta\n",
" iter <- iter + 1\n",
" }\n",
" #return(list(val=as.numeric(beta), iter=iter, inneriter=inneriter[1:iter]))\n",
" return(betamat[1:iter,])\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"A matrix: 10 × 4 of type dbl\n",
"\n",
"\tbeta0 | beta1 | inneriter | lik |
\n",
"\n",
"\n",
"\t0.01025000 | 0.1149500 | 0 | 241.3754 |
\n",
"\t0.01933954 | 0.2178634 | 0 | 423.2101 |
\n",
"\t0.02505110 | 0.2826001 | 0 | 471.3151 |
\n",
"\t0.02532401 | 0.2830995 | 0 | 471.3176 |
\n",
"\t0.02553404 | 0.2828481 | 0 | 471.3190 |
\n",
"\t0.02577206 | 0.2829335 | 0 | 471.3201 |
\n",
"\t0.02599725 | 0.2828674 | 0 | 471.3212 |
\n",
"\t0.02622798 | 0.2828694 | 0 | 471.3222 |
\n",
"\t0.02645600 | 0.2828408 | 0 | 471.3233 |
\n",
"\t0.02668501 | 0.2828259 | 0 | 471.3243 |
\n",
"\n",
"
\n"
],
"text/latex": [
"A matrix: 10 × 4 of type dbl\n",
"\\begin{tabular}{llll}\n",
" beta0 & beta1 & inneriter & lik\\\\\n",
"\\hline\n",
"\t 0.01025000 & 0.1149500 & 0 & 241.3754\\\\\n",
"\t 0.01933954 & 0.2178634 & 0 & 423.2101\\\\\n",
"\t 0.02505110 & 0.2826001 & 0 & 471.3151\\\\\n",
"\t 0.02532401 & 0.2830995 & 0 & 471.3176\\\\\n",
"\t 0.02553404 & 0.2828481 & 0 & 471.3190\\\\\n",
"\t 0.02577206 & 0.2829335 & 0 & 471.3201\\\\\n",
"\t 0.02599725 & 0.2828674 & 0 & 471.3212\\\\\n",
"\t 0.02622798 & 0.2828694 & 0 & 471.3222\\\\\n",
"\t 0.02645600 & 0.2828408 & 0 & 471.3233\\\\\n",
"\t 0.02668501 & 0.2828259 & 0 & 471.3243\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A matrix: 10 × 4 of type dbl\n",
"\n",
"| beta0 | beta1 | inneriter | lik |\n",
"|---|---|---|---|\n",
"| 0.01025000 | 0.1149500 | 0 | 241.3754 |\n",
"| 0.01933954 | 0.2178634 | 0 | 423.2101 |\n",
"| 0.02505110 | 0.2826001 | 0 | 471.3151 |\n",
"| 0.02532401 | 0.2830995 | 0 | 471.3176 |\n",
"| 0.02553404 | 0.2828481 | 0 | 471.3190 |\n",
"| 0.02577206 | 0.2829335 | 0 | 471.3201 |\n",
"| 0.02599725 | 0.2828674 | 0 | 471.3212 |\n",
"| 0.02622798 | 0.2828694 | 0 | 471.3222 |\n",
"| 0.02645600 | 0.2828408 | 0 | 471.3233 |\n",
"| 0.02668501 | 0.2828259 | 0 | 471.3243 |\n",
"\n"
],
"text/plain": [
" beta0 beta1 inneriter lik \n",
" [1,] 0.01025000 0.1149500 0 241.3754\n",
" [2,] 0.01933954 0.2178634 0 423.2101\n",
" [3,] 0.02505110 0.2826001 0 471.3151\n",
" [4,] 0.02532401 0.2830995 0 471.3176\n",
" [5,] 0.02553404 0.2828481 0 471.3190\n",
" [6,] 0.02577206 0.2829335 0 471.3201\n",
" [7,] 0.02599725 0.2828674 0 471.3212\n",
" [8,] 0.02622798 0.2828694 0 471.3222\n",
" [9,] 0.02645600 0.2828408 0 471.3233\n",
"[10,] 0.02668501 0.2828259 0 471.3243"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"A matrix: 10 × 4 of type dbl\n",
"\n",
"\t | beta0 | beta1 | inneriter | lik |
\n",
"\n",
"\n",
"\t[12951,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12952,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12953,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12954,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12955,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12956,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12957,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12958,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12959,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |
\n",
"\t[12960,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |
\n",
"\n",
"
\n"
],
"text/latex": [
"A matrix: 10 × 4 of type dbl\n",
"\\begin{tabular}{r|llll}\n",
" & beta0 & beta1 & inneriter & lik\\\\\n",
"\\hline\n",
"\t{[}12951,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12952,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12953,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12954,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12955,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12956,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12957,{]} & 0.3396212 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12958,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12959,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n",
"\t{[}12960,{]} & 0.3396213 & 0.2565247 & 0 & 472.0625\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"A matrix: 10 × 4 of type dbl\n",
"\n",
"| | beta0 | beta1 | inneriter | lik |\n",
"|---|---|---|---|---|\n",
"| [12951,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12952,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12953,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12954,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12955,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12956,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12957,] | 0.3396212 | 0.2565247 | 0 | 472.0625 |\n",
"| [12958,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n",
"| [12959,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n",
"| [12960,] | 0.3396213 | 0.2565247 | 0 | 472.0625 |\n",
"\n"
],
"text/plain": [
" beta0 beta1 inneriter lik \n",
"[12951,] 0.3396212 0.2565247 0 472.0625\n",
"[12952,] 0.3396212 0.2565247 0 472.0625\n",
"[12953,] 0.3396212 0.2565247 0 472.0625\n",
"[12954,] 0.3396212 0.2565247 0 472.0625\n",
"[12955,] 0.3396212 0.2565247 0 472.0625\n",
"[12956,] 0.3396212 0.2565247 0 472.0625\n",
"[12957,] 0.3396212 0.2565247 0 472.0625\n",
"[12958,] 0.3396213 0.2565247 0 472.0625\n",
"[12959,] 0.3396213 0.2565247 0 472.0625\n",
"[12960,] 0.3396213 0.2565247 0 472.0625"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Australian AIDS data from the IRLS example above\n",
"pm <- poissonreg_grad(quarters, deaths, tol=1e-8, maxiter=20000)\n",
"head(pm, 10)\n",
"tail(pm, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [
{
"data": {
"text/html": [
"- (Intercept)
- 0.339633920708136
- quarters
- 0.256523593717915
\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[(Intercept)] 0.339633920708136\n",
"\\item[quarters] 0.256523593717915\n",
"\\end{description*}\n"
],
"text/markdown": [
"(Intercept)\n",
": 0.339633920708136quarters\n",
": 0.256523593717915\n",
"\n"
],
"text/plain": [
"(Intercept) quarters \n",
" 0.3396339 0.2565236 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"m <- glm(deaths ~ quarters, family = poisson())\n",
"coef(m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Coordinate descent\n",
"\n",
"* Idea: coordinate-wise minimization is often easier than the full-dimensional minimization\n",
"$$\n",
" x_j^{(t+1)} = \\arg\\min_{x_j} f(x_1^{(t+1)}, \\dotsc, x_{j-1}^{(t+1)}, x_j, x_{j+1}^{(t)}, \\dotsc, x_d^{(t)})\n",
"$$\n",
"\n",
"\n",
"\n",
"* Similar to the Gauss-Seidel method for solving linear equations. \n",
"\n",
"* *Block descent*: a vector version of coordinate descent\n",
"$$\n",
" \\mathbf{x}_j^{(t+1)} = \\arg\\min_{\\mathbf{x}_j} f(\\mathbf{x}_1^{(t+1)}, \\dotsc, \\mathbf{x}_{j-1}^{(t+1)}, \\mathbf{x}_j, \\mathbf{x}_{j+1}^{(t)}, \\dotsc, \\mathbf{x}_d^{(t)})\n",
"$$\n",
"\n",
"* Q: why objective value converges?\n",
" - A: descent property\n",
"\n",
"* Caution: even if the objective function is convex, coordinate descent may *not* converge to the global minimum and converge to a suboptimal point.\n",
"\n",
""
]
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.2.2"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "153px",
"width": "252px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"threshold": 4,
"toc_cell": true,
"toc_position": {
"height": "441.3333435058594px",
"left": "0px",
"right": "903.3333129882813px",
"top": "140.6666717529297px",
"width": "166px"
},
"toc_section_display": "block",
"toc_window_display": true,
"widenNotebook": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}