{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

# Gradient Descent

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
(C) Nikolai Nowaczyk, Jörg Kienitz 2019-2021
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib widget\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import ipywidgets as wd\n", "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib import cm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $f:\\mathbb{R}^m \\to \\mathbb{R}$ be a differentiable function. Gradient descent is a class of methods to find the minimum\n", "$$ \\min_{x \\in \\mathbb{R}^m}{f(x)}$$\n", "\n", "One method to find a minimum is to find a zero of the gradient $\\operatorname{grad} f$, for example via Newton's method. However, this requires the computation of the Hessian of $f$, which can be quite costly. Gradient descent is an alternative method, which only requires the computation of the gradient of $f$, but not the Hessian." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Algorithm\n", "The basic idea of gradient descent is simple: Start with an arbitrary guess $x_0 \\in \\mathbb{R}^m$ and then recursively descent into the direction negative gradient:\n", "$$ x_{n+1} = x_n - \\alpha_n \\operatorname{grad} f(x_n)$$\n", "Here, the $\\alpha_n \\in \\mathbb{R}$ are a choice of *step sizes*. This method is motivated by the below mathematical background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mathematical Background\n", "The key principles behind gradient descent rely on the following simple insights:\n", "\n", "**Lemma**: Let $f:\\mathbb{R}^m \\to \\mathbb{R}$ be a differentiable function and $x_0 \\in \\mathbb{R}^m$. \n", "1. For any $v \\in \\mathbb{R}^m$, the directional derivative $\\partial _v f$ satisfies\n", "\\begin{align*}\n", " \\lim_{t \\searrow 0}\\frac{f(x_0 + tv)}{t} = \\partial_v f(x_0) =\\nabla f(x_0) v = \\langle \\nabla f(x_0)^{\\top}, v \\rangle \n", " = \\langle \\operatorname{grad} f(x_0), v \\rangle \n", "\\end{align*}\n", "2. The negative gradient is the direction of steepest descent, i.e.\n", "\\begin{align*}\n", " \\underset{\\|v\\|=1}{\\operatorname{argmin}}\\partial_v f(x_0) = - \\frac{\\nabla f(x_0)}{\\|\\nabla f(x_0)\\|},\n", "\\end{align*}\n", "provided $\\nabla f(x_0) \\neq 0$.\n", "\n", "**Proof**: The first claim is the definition of directional derivative and an application of the chain rule. To see the second claim, denote by $\\vartheta$ the angle between $\\nabla f(x_0)$ and a $v \\in \\mathbb{R}^m$, $\\|v\\|=1$. By definition of the angle\n", "$$ \\cos(\\vartheta) = \\langle \\nabla f(x_0)^{\\top}, v \\rangle \\|\\nabla f(x_0)\\| ,$$\n", "thus\n", "$$\\partial_v f(x_0) = \\langle \\nabla f(x_0)^{\\top}, v \\rangle = \\frac{\\cos(\\vartheta)}{\\|\\nabla f(x_0)\\|}, $$\n", "which becomes smallest when $\\cos(\\vartheta)=-1$, i.e. when $v = -\\frac{\\nabla f(x_0)}{\\|\\nabla f(x_0)\\|}$.\n", "\n", "**Interpretation:** This means that at any point $x_0 \\in \\mathbb{R}^m$ descending an infinitesimal step into the direction of $-\\operatorname{grad} f(x_0)$ decreases $f$ most. The problem with that view is that 1) in reality, one needs to chose a finite step size and 2) this only yields a local minimum of $f$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def f(x):\n", "    return x[0]**2 + x[1]**2\n", "\n", "def df(x):\n", "    # gradient of f: grad f(x) = (2*x[0], 2*x[1])\n", "    return np.array([2*x[0], 2*x[1]])\n", "\n", "def gradient_descent_sequence(x0, f, df, N, alpha=0.1):\n", "    # computes the iterates x_0, ..., x_N of the update x_{n+1} = x_n - alpha * grad f(x_n)\n", "    X = np.zeros((N+1, x0.shape[0]))\n", "    X[0] = x0\n", "    for n in range(N):\n", "        X[n+1] = X[n] - alpha * df(X[n])\n", "    return X" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4bae68a51a404ae08486c7669289a11c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ec07f46d7b5c4f9d8a9da13719ecc23b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=1, description='N', max=20, min=1), Output()), _dom_classes=('widget-int…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.linspace(-1, 1, 20)\n", "y = np.linspace(-1, 1, 20)\n", "Z = np.array([[f(np.array([xx, yy])) for xx in x] for yy in y])\n", "\n", "X, Y = np.meshgrid(x, y)\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "\n", "@wd.interact(N=wd.IntSlider(min=1, max=20, value=1))\n", "def plot_gradient_descent(N):\n", "    # redraw the surface and overlay the first N gradient descent iterates starting at (1, 1)\n", "    ax.clear()\n", "    ax.plot_wireframe(X, Y, Z, alpha=0.8, cmap=cm.coolwarm)\n", "    Xgd = gradient_descent_sequence(np.array([1, 1]), f, df, N)\n", "    ax.plot(Xgd[:, 0], Xgd[:, 1], np.array([f(xx) for xx in Xgd]), marker='x', color='k')\n", "    plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "heroku", "language": "python", "name": "heroku" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }