{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Quasi-Newton Method (KL 14.9)
    1.1  Introduction
    1.2  Quasi-Newton
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quasi-Newton Method (KL 14.9)\n", "\n", "## Introduction\n", "\n", "* Quasi-Newton is one of the most successful \"black-box\" NLP optimizers in use. Examples:\n", " * `Optim.jl`, `NLopt.jl`, `Ipopt.jl` packages in Julia.\n", " * `optim` in R.\n", "\n", "* Consider the practical Newton scheme for minimizing $f(\\mathbf{x})$, $\\mathbf{x} \\in {\\cal \\mathbf{X}} \\subset \\mathbb{R}^p$: \n", "$$\n", " \\mathbf{x}^{(t+1)} = \\mathbf{x}^{(t)} - s [\\mathbf{A}^{(t)}]^{-1} \\nabla f(\\mathbf{x}^{(t)}),\n", "$$\n", "where $\\mathbf{A}^{(t)}$ a pd approximation of the Hessian $\\nabla^2 f(\\mathbf{x}^{(t)})$. \n", " * Pros: fast convergence\n", " * Cons: compute and store Hessian at each iteration (usually $O(np^2)$ cost in statistical problems), solving a linear system ($O(p^3)$ cost in general), human efforts (derive and implement gradient and Hessian, pd approximation, ...)\n", "\n", "* Any pd $\\mathbf{A}$ gives a descent algorithm - tradeoff between convergence rate and cost per iteration.\n", "\n", "* Setting $\\mathbf{A} = \\mathbf{I}$ leads to the **steepest descent** algorithm. Picture: slow convergence (zig-zagging) of steepest descent (with exact line search) for minimizing a convex quadratic function (ellipse).\n", "\n", " \n", "How many steps does the Newton's method using the Hessian take for a convex quadratic $f$?\n", "\n", "## Quasi-Newton\n", "\n", "* Idea of Quasi-Newton: No analytical Hessian is required (still need gradient). Bootstrap Hessian from gradient! \n", "\n", "* Update approximate Hessian $\\mathbf{A}$ according to the **secant condition**\n", "$$\n", "\\begin{eqnarray*}\n", "\t\\nabla f(\\mathbf{x}^{(t)}) - \\nabla f(\\mathbf{x}^{(t-1)}) \\approx [\\nabla^2 f(\\mathbf{x}^{(t)})] (\\mathbf{x}^{(t)} - \\mathbf{x}^{(t-1)}).\n", "\\end{eqnarray*}\n", "$$\n", "Instead of computing $\\mathbf{A}$ from scratch at each iteration, we update an approximation $\\mathbf{A}$ to $\\nabla^2 f(\\mathbf{x})$ that satisfies\n", " 1. p.d., \n", " 2. the secant condition, and \n", " 3. closest to the previous approximation.\n", "\n", "* Super-linear convergence, compared to the quadratic convergence of Newton's method. But each iteration only takes $O(p^2)$!\n", "\n", "* **Davidon-Fletcher-Powell (DFP)** rank-2 update. Solve \n", "$$\n", "\\begin{eqnarray*}\n", "\t&\\text{minimize}& \\, \\|\\mathbf{A} - \\mathbf{A}^{(t)}\\|_{\\text{F}} \\\\\n", "\t&\\text{subject to}& \\, \\mathbf{A} = \\mathbf{A}^T \\\\\n", "\t& & \\mathbf{A} (\\mathbf{x}^{(t)}-\\mathbf{x}^{(t-1)}) = \\nabla f(\\mathbf{x}^{(t)}) - \\nabla f(\\mathbf{x}^{(t-1)})\n", "\\end{eqnarray*}\n", "$$\n", "for the next approximation $\\mathbf{A}^{(t+1)}$. The solution is a low rank (rank 1 or rank 2) update of $\\mathbf{A}^{(t)}$. The inverse is too thanks to the Sherman-Morrison-Woodbury formula! $O(p^2)$ operations. Need to store a $p$-by-$p$ dense matrix.\n", "\n", "* **Broyden-Fletcher-Goldfarb-Shanno (BFGS)** rank 2 update is considered by many the most effective among all quasi-Newton updates. BFGS imposes secant condition on the inverse of Hessian $\\mathbf{H} = \\mathbf{A}^{-1}$. 
\n", "$$\n", "\\begin{eqnarray*}\n", "\t&\\text{minimize}& \\, \\|\\mathbf{H} - \\mathbf{H}^{(t)}\\|_{\\text{F}} \\\\\n", "\t&\\text{subject to}& \\, \\mathbf{H} = \\mathbf{H}^T \\\\\n", "\t& & \\mathbf{H} [ \\nabla f(\\mathbf{x}^{(t)}) - \\nabla f(\\mathbf{x}^{(t-1)})] = \\mathbf{x}^{(t)}-\\mathbf{x}^{(t-1)}.\n", "\\end{eqnarray*}\n", "$$\n", "Again $\\mathbf{H}^{(t+1)}$ is a rank two update of $\\mathbf{H}^{(t)}$. $O(p^2)$ operations. Still need to store a dense $p$-by-$p$ matrix.\n", "\n", "* **Limited-memory BFGS (L-BFGS)**. Only store the secant pairs. No need to store the $p$-by-$p$ approximate inverse Hessian. Particularly useful for large scale optimization.\n", "\n", "* How to set the initial approximation $\\mathbf{A}^{(0)}$? Identity or Hessian (if pd) or Fisher information matrix at starting point.\n", "\n", "* History: **Davidon** was a physicist at Argonne National Lab in 1950s and proposed the very first idea of quasi-Newton method in 1959. Later studied, implemented, and popularized by Fletcher and Powell. Davidon's original paper was not accepted for publication 😞; 30 years later it appeared as the first article in the inaugural issue of [SIAM Journal of Optimization](http://epubs.siam.org/doi/abs/10.1137/0801001?journalCode=sjope8).\n", "\n", "\n", "\n", "" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "65px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }