{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Description**: Introduction to automatic differentiation and the ``DualNumbers`` package. Prepared for MIT course 15.S60 \"Software Tools for Operations Research\", January 2015.\n", "\n", "**Author**: Miles Lubin\n", "\n", "**License**: \"Creative
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing derivatives for nonlinear optimization: Forward mode automatic differentiation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider a general constrained nonlinear optimization problem:\n", "$$\n", "\\begin{align}\n", "\\min \\quad&f(x)\\\\\n", "\\text{s.t.} \\quad& g(x) = 0, \\\\\n", "& h(x) \\leq 0,\n", "\\end{align}\n", "$$\n", "where $f : \\mathbb{R}^n \\to \\mathbb{R}, g : \\mathbb{R}^n \\to \\mathbb{R}^r$, and $h: \\mathbb{R}^n \\to \\mathbb{R}^s$.\n", "\n", "When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise, we can typically only ask for a locally optimal solution.\n", "\n", "What approaches can we use to solve this?\n", " - When $r=0$ and $s = 0$ (unconstrained) and $f$ is differentiable, the most classical approach is [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent), along with fancier methods like [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method) and quasi-Newton methods like [BFGS](http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm).\n", " - When $f$ is differentiable and $g$ and $h$ are linear, [gradient projection methods](http://neos-guide.org/content/gradient-projection-methods).\n", " - When $f$, $g$, and $h$ are twice differentiable, [interior-point methods](http://en.wikipedia.org/wiki/Interior_point_method) and [sequential quadratic programming](http://www.neos-guide.org/content/sequential-quadratic-programming).\n", " - When derivatives are \"not available\", [derivative-free optimization](http://rd.springer.com/article/10.1007/s10898-012-9951-y).\n", " \n", "This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How are derivatives computed?\n", "\n", "- Hand-written by applying the chain rule\n", "- Finite difference approximation $\\frac{\\partial f}{\\partial x_i} \\approx \\frac{f(x+h e_i)-f(x)}{h}$ for small $h > 0$\n", "- **Automatic differentiation**\n", " - Idea: Automatically transform code that computes a function into code that computes its derivatives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dual Numbers\n", "\n", "Consider numbers of the form $x + y\\epsilon$ with $x,y \\in \\mathbb{R}$. We *define* $\\epsilon^2 = 0$, so\n", "$$\n", "(x_1 + y_1\\epsilon)(x_2+y_2\\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\\epsilon.\n", "$$\n", "These are called the *dual numbers*. Think of $\\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).\n", "\n", "If we are given an infinitely differentiable function in Taylor-expanded form\n", "$$\n", "f(x) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k,\n", "$$\n", "it follows that\n", "$$\n", "f(x+y\\epsilon) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a+y\\epsilon)^k = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k + y\\epsilon\\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!}\\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\\epsilon,\n", "$$\n", "where the middle equality uses $\\epsilon^2 = 0$ to drop all terms with higher powers of $y\\epsilon$ in the binomial expansion of $(x-a+y\\epsilon)^k$.\n", "\n",
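"As a sanity check, take $f(x) = x^2$: the algebra rule gives $f(x+y\\epsilon) = (x+y\\epsilon)^2 = x^2 + 2xy\\epsilon + y^2\\epsilon^2 = x^2 + 2xy\\epsilon$, which is exactly $f(x) + yf'(x)\\epsilon$.\n", "\n",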
"Let's unpack what's going on here. We started with a function $f : \\mathbb{R} \\to \\mathbb{R}$. Dual numbers are *not* real numbers, so it doesn't even make sense to ask for the value $f(x+y\\epsilon)$ given $x+y\\epsilon \\in \\mathbb{D}$ (the set of dual numbers). Nevertheless, we plugged the dual number into the Taylor expansion, and by using the algebra rule $\\epsilon^2 = 0$ we found that $f(x+y\\epsilon)$ must equal $f(x) + yf'(x)\\epsilon$ if we use the Taylor expansion as the definition of $f : \\mathbb{D} \\to \\mathbb{D}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, for any once differentiable function $f : \\mathbb{R} \\to \\mathbb{R}$, we can *define* its extension to the dual numbers as\n", "$$\n", "f(x+y\\epsilon) = f(x) + yf'(x)\\epsilon.\n", "$$\n", "This is essentially equivalent to the previous definition.\n", "\n", "Let's verify a very basic property, the chain rule, using this definition.\n", "\n", "Suppose $h(x) = f(g(x))$. Then,\n", "$$\n", "h(x+y\\epsilon) = f(g(x+y\\epsilon)) = f(g(x) + yg'(x)\\epsilon) = f(g(x)) + yg'(x)f'(g(x))\\epsilon = h(x) + yh'(x)\\epsilon.\n", "$$\n", "\n", "Maybe that's not too surprising, but it's quite a useful observation: the extension rule propagates exact derivatives through compositions of functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation\n", "\n", "Dual numbers are implemented in the [DualNumbers](https://github.com/JuliaDiff/DualNumbers.jl) package in Julia." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "using DualNumbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You construct $x + y\\epsilon$ with ``Dual(x,y)``. The real and epsilon components are accessed as ``realpart(d)`` and ``epsilon(d)``:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.0 + 1.0ɛ" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = Dual(2.0,1.0)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DualNumbers.Dual{Float64}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(d)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "realpart(d)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epsilon(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How is addition of dual numbers defined?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "@which d+Dual(3.0,4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clicking on the link, we'll see:\n", "```julia\n", "+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiplication?\n",
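"\n", "From the multiplication rule, we expect $(2+2\\epsilon)(3+4\\epsilon) = 2\\cdot 3 + (2\\cdot 4 + 3\\cdot 2)\\epsilon = 6 + 14\\epsilon$:"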
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6.0 + 14.0ɛ" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dual(2.0,2.0)*Dual(3.0,4.0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "@which Dual(2.0,2.0)*Dual(3.0,4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code is:\n", "```julia\n", "*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basic univariate functions?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.6931471805599453 + 0.5ɛ" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log(Dual(2.0,1.0))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1/2.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How is this implemented?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273\n", ":(begin \n", " nothing\n", " x = (DualNumbers.value)(z) # line 274:\n", " xp = (DualNumbers.epsilon)(z) # line 275:\n", " return (DualNumbers.Dual)((DualNumbers.log)(x),xp * (1 / x))\n", " end)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@code_lowered log(Dual(2.0,1.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trig functions?" 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273\n", ":(begin \n", " nothing\n", " x = (DualNumbers.value)(z) # line 274:\n", " xp = (DualNumbers.epsilon)(z) # line 275:\n", " return (DualNumbers.Dual)((DualNumbers.sin)(x),xp * (DualNumbers.cos)(x))\n", " end)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@code_lowered sin(Dual(2.0,1.0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing derivatives of functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define a function in Julia as:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: Method definition f(Any) in module Main at In[11]:1 overwritten at In[11]:4.\n" ] }, { "data": { "text/plain": [ "f (generic function with 1 method)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f(x) = x^2 - log(x)\n", "# Or equivalently\n", "function f(x)\n", " return x^2 - log(x)\n", "end" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.8465735902799727 - 2.220446049250313e-16ɛ" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f(Dual(1/sqrt(2),1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">**\\[Exercise\\]**: Differentiate it!\n", "\n", "> 1. Evaluate $f$ at $1 + \\epsilon$. What are $f(1)$ and $f'(1)$?\n", "> 2. Evaluate $f$ at $\\frac{1}{\\sqrt{2}} + \\epsilon$. What are $f(\\frac{1}{\\sqrt{2}})$ and $f'(\\frac{1}{\\sqrt{2}})$?\n", "> 3. Define a new function ``fprime`` which returns the derivative of ``f`` by using ``DualNumbers``.\n", "> 3. Use the finite difference formula $$\n", "f'(x) \\approx \\frac{f(x+h)-f(x)}{h}\n", "$$\n", "to evaluate $f'(\\frac{1}{\\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using ``@manipulate``, plots, or both!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How general is it?\n", "\n", "Recall [Newton's iterative method](http://en.wikipedia.org/wiki/Newton%27s_method) for finding zeros:\n", "$$\n", "x \\leftarrow x - \\frac{f(x)}{f'(x)}\n", "$$\n", "until $f(x) \\approx 0$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use this method to compute $\\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$.\n", "So $f'(z) = 2z$, and we can implement the algorithm as:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "squareroot (generic function with 1 method)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function squareroot(x)\n", " z = x # Initial starting point\n", " while abs(z*z - x) > 1e-13\n", " z = z - (z*z-x)/(2z)\n", " end\n", " return z\n", "end" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "squareroot(100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we differentiate this code? 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Can we differentiate this code? **Yes!** Every operation used inside ``squareroot`` has a method for dual numbers, so we can simply call it with a ``Dual`` argument." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "10.0 + 0.049999999999999996ɛ" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = squareroot(Dual(100.0,1.0))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.049999999999999996" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epsilon(d) # Computed derivative" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.05" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1/(2*sqrt(100)) # The exact derivative" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6.938893903907228e-18" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abs(epsilon(d)-1/(2*sqrt(100)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multivariate functions?\n", "\n", "Dual numbers can be used to compute the gradient of a function $f: \\mathbb{R}^n \\to \\mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ coordinate directions. We won't get into the details, but this procedure is [implemented](https://github.com/JuliaOpt/Optim.jl/blob/583907676b5b99cdb2d4cba37f6026a3fe620a49/src/autodiff.jl) in [Optim](https://github.com/JuliaOpt/Optim.jl) with the ``autodiff=true`` keyword." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "using Optim\n", "rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2\n", "optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When $n$ is large, there's an alternative procedure called [reverse-mode automatic differentiation](http://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation) which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in [ReverseDiffSparse](https://github.com/mlubin/ReverseDiffSparse.jl))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions\n", "\n", "- We can compute numerically exact derivatives of any differentiable function that is implemented as a sequence of basic operations.\n", "- In Julia it's very easy to use dual numbers for this!\n", "- Reconsider when derivatives are \"not available.\"\n", "\n", "This was just an introduction to one technique from the area of automatic differentiation. For more references, see [autodiff.org](http://www.autodiff.org/?module=Introduction&submenu=Selected%20Books)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Julia 0.5.0", "language": "julia", "name": "julia-0.5" }, "keywords": "DualNumbers,AD,derivatives", "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }