{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

# Gradient Descent

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
(C) Nikolai Nowaczyk, Jörg Kienitz 2019-2021
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib widget\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import ipywidgets as wd\n", "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib import cm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $f:\\mathbb{R}^m \\to \\mathbb{R}$ be a differentiable function. Gradient descent is a class of methods to find the minimum\n", "$$ \\min_{x \\in \\mathbb{R}^m}{f(x)}$$\n", "\n", "One method to find a minimum is to find a zero of the gradient $\\operatorname{grad} f$, for example via Newton's method. However, this requires the computation of the Hessian of $f$, which can be quite costly. Gradient descent is an alternative method, which only requires the computation of the gradient of $f$, but not the Hessian." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Algorithm\n", "The basic idea of gradient descent is simple: Start with an arbitrary guess $x_0 \\in \\mathbb{R}^m$ and then recursively descent into the direction negative gradient:\n", "$$ x_{n+1} = x_n - \\alpha_n \\operatorname{grad} f(x_n)$$\n", "Here, the $\\alpha_n \\in \\mathbb{R}$ are a choice of *step sizes*. This method is motivated by the below mathematical background." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mathematical Background\n", "The key principles behind gradient descent rely on the following simple insights:\n", "\n", "**Lemma**: Let $f:\\mathbb{R}^m \\to \\mathbb{R}$ be a differentiable function and $x_0 \\in \\mathbb{R}^m$. \n", "1. For any $v \\in \\mathbb{R}^m$, the directional derivative $\\partial _v f$ satisfies\n", "\\begin{align*}\n", " \\lim_{t \\searrow 0}\\frac{f(x_0 + tv)}{t} = \\partial_v f(x_0) =\\nabla f(x_0) v = \\langle \\nabla f(x_0)^{\\top}, v \\rangle \n", " = \\langle \\operatorname{grad} f(x_0), v \\rangle \n", "\\end{align*}\n", "2. The negative gradient is the direction of steepest descent, i.e.\n", "\\begin{align*}\n", " \\underset{\\|v\\|=1}{\\operatorname{argmin}}\\partial_v f(x_0) = - \\frac{\\nabla f(x_0)}{\\|\\nabla f(x_0)\\|},\n", "\\end{align*}\n", "provided $\\nabla f(x_0) \\neq 0$.\n", "\n", "**Proof**: The first claim is the definition of directional derivative and an application of the chain rule. To see the second claim, denote by $\\vartheta$ the angle between $\\nabla f(x_0)$ and a $v \\in \\mathbb{R}^m$, $\\|v\\|=1$. By definition of the angle\n", "$$ \\cos(\\vartheta) = \\langle \\nabla f(x_0)^{\\top}, v \\rangle \\|\\nabla f(x_0)\\| ,$$\n", "thus\n", "$$\\partial_v f(x_0) = \\langle \\nabla f(x_0)^{\\top}, v \\rangle = \\frac{\\cos(\\vartheta)}{\\|\\nabla f(x_0)\\|}, $$\n", "which becomes smallest when $\\cos(\\vartheta)=-1$, i.e. when $v = -\\frac{\\nabla f(x_0)}{\\|\\nabla f(x_0)\\|}$.\n", "\n", "**Interpretation:** This means that at any point $x_0 \\in \\mathbb{R}^m$ descending an infinitesimal step into the direction of $-\\operatorname{grad} f(x_0)$ decreases $f$ most. The problem with that view is that 1) in reality, one needs to chose a finite step size and 2) this only yields a local minimum of $f$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def f(x):\n", "    return x[0]**2 + x[1]**2\n", "\n", "def df(x):\n", "    # gradient of f: grad f(x) = (2*x[0], 2*x[1])\n", "    return np.array([2*x[0], 2*x[1]])\n", "\n", "def gradient_descent_sequence(x0, f, df, N, alpha=0.1):\n", "    # computes the iterates x_0, ..., x_N of the update x_{n+1} = x_n - alpha * grad f(x_n)\n", "    X = np.zeros((N+1, x0.shape[0]))\n", "    X[0] = x0\n", "    for n in range(N):\n", "        X[n+1] = X[n] - alpha * df(X[n])\n", "    return X" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4bae68a51a404ae08486c7669289a11c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ec07f46d7b5c4f9d8a9da13719ecc23b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=1, description='N', max=20, min=1), Output()), _dom_classes=('widget-int…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "x = np.linspace(-1, 1, 20)\n", "y = np.linspace(-1, 1, 20)\n", "Z = np.array([[f(np.array([xx, yy])) for xx in x] for yy in y])\n", "\n", "X, Y = np.meshgrid(x, y)\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "\n", "@wd.interact(N=wd.IntSlider(min=1, max=20, value=1))\n", "def plot_gradient_descent(N):\n", "    # redraw the surface and overlay the first N gradient descent iterates starting at (1, 1)\n", "    ax.clear()\n", "    ax.plot_wireframe(X, Y, Z, alpha=0.8, cmap=cm.coolwarm)\n", "    Xgd = gradient_descent_sequence(np.array([1, 1]), f, df, N)\n", "    ax.plot(Xgd[:, 0], Xgd[:, 1], np.array([f(xx) for xx in Xgd]), marker='x', color='k')\n", "    plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "heroku", "language": "python", "name": "heroku" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }