{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Description**: Introduction to automatic differentiation and the ``DualNumbers`` package. Prepared for MIT course 15.S60 \"Software Tools for Operations Research\", January 2015.\n",
"\n",
"**Author**: Miles Lubin\n",
"\n",
"**License**:
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Computing derivatives for nonlinear optimization: Forward mode automatic differentiation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider a general constrained nonlinear optimization problem:\n",
"$$\n",
"\\begin{align}\n",
"\\min \\quad&f(x)\\\\\n",
"\\text{s.t.} \\quad& g(x) = 0, \\\\\n",
"& h(x) \\leq 0.\n",
"\\end{align}\n",
"$$\n",
"where $f : \\mathbb{R}^n \\to \\mathbb{R}, g : \\mathbb{R}^n \\to \\mathbb{R}^r$, and $h: \\mathbb{R}^n \\to \\mathbb{R}^s$.\n",
"\n",
"When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution, otherwise typically we can only ask for a locally optimal solution.\n",
"\n",
"What approaches can we use to solve this?\n",
" - When $r=0$ and $s = 0$ (unconstrained), and $f$ differentiable, most classical approach is [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent), also fancier methods like [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method) and quasi-newton methods like [BFGS](http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm).\n",
" - When $f$ differentiable and $g$ and $h$ linear, [gradient projection](http://neos-guide.org/content/gradient-projection-methods)\n",
" - When $f$, $g$, and $h$ twice differentiable, [interior-point methods](http://en.wikipedia.org/wiki/Interior_point_method), [sequential quadratic programming](http://www.neos-guide.org/content/sequential-quadratic-programming)\n",
" - When derivatives \"not available\", [derivative-free optimization](http://rd.springer.com/article/10.1007/s10898-012-9951-y)\n",
" \n",
"This is not meant to be an exhaustive list, see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How are derivatives computed?\n",
"\n",
"- Hand-written by applying chain rule\n",
"- Finite difference approximation $\\frac{\\partial f}{\\partial x_i} = \\lim_{h\\to 0} \\frac{f(x+h e_i)-f(x)}{h}$\n",
"- **Automatic differentiation**\n",
" - Idea: Automatically transform code to compute a function into code to compute its derivatives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dual Numbers\n",
"\n",
"Consider numbers of the form $x + y\\epsilon$ with $x,y \\in \\mathbb{R}$. We *define* $\\epsilon^2 = 0$, so\n",
"$$\n",
"(x_1 + y_1\\epsilon)(x_2+y_2\\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\\epsilon.\n",
"$$\n",
"These are called the *dual numbers*. Think of $\\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).\n",
"\n",
"If we are given an infinitely differentiable function in Taylor expanded form\n",
"$$\n",
"f(x) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k\n",
"$$\n",
"it follows that \n",
"$$\n",
"f(x+y\\epsilon) = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a+y\\epsilon)^k = \\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!} (x-a)^k + y\\epsilon\\sum_{k=0}^{\\infty} \\frac{f^{(k)}(a)}{k!}\\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\\epsilon\n",
"$$\n",
"\n",
"Let's unpack what's going on here. We started with a function $f : \\mathbb{R} \\to \\mathbb{R}$. Dual numbers are *not* real numbers, so it doesn't even make sense to ask for the value $f(x+y\\epsilon)$ given $x+y\\epsilon \\in \\mathbb{D}$ (the set of dual numbers). But anyway we plugged the dual number into the Taylor expansion, and by using the algebra rule $\\epsilon^2 = 0$ we found that $f(x+y\\epsilon)$ must be equal to $f(x) + yf'(x)\\epsilon$ if we use the Taylor expansion as the definition of $f : \\mathbb{D} \\to \\mathbb{R}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, for any once differentiable function $f : \\mathbb{R} \\to \\mathbb{R}$, we can *define* its extension to the dual numbers as\n",
"$$\n",
"f(x+y\\epsilon) = f(x) + yf'(x)\\epsilon.\n",
"$$\n",
"This is essentially equivalent to the previous definition.\n",
"\n",
"Let's verify a very basic property, the chain rule, using this definition.\n",
"\n",
"Suppose $h(x) = f(g(x))$. Then,\n",
"$$\n",
"h(x+y\\epsilon) = f(g(x+y\\epsilon)) = f(g(x) + yg'(x)\\epsilon) = f(g(x)) + yg'(x)f'(g(x))\\epsilon = h(x) + yh'(x)\\epsilon.\n",
"$$\n",
"\n",
"Maybe that's not too surprising, but it's actually a quite useful observation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementation\n",
"\n",
"Dual numbers are implemented in the [DualNumbers](https://github.com/JuliaDiff/DualNumbers.jl) package in Julia."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using DualNumbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You construct $x + y\\epsilon$ with ``Dual(x,y)``. The real and epsilon components are accessed as ``real(d)`` and ``epsilon(d)``:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0 + 1.0ɛ"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = Dual(2.0,1.0)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"DualNumbers.Dual{Float64}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"typeof(d)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"realpart(d)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"epsilon(d)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How is addition of dual numbers defined?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"@which d+Dual(3.0,4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clicking on the link, we'll see:\n",
"```julia\n",
"+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Multiplication?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"6.0 + 14.0ɛ"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Dual(2.0,2.0)*Dual(3.0,4.0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"@which Dual(2.0,2.0)*Dual(3.0,4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code is:\n",
"```julia\n",
"*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basic univariate functions?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6931471805599453 + 0.5ɛ"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log(Dual(2.0,1.0))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1/2.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How is this implemented?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273\n",
":(begin \n",
" nothing\n",
" x = (DualNumbers.value)(z) # line 274:\n",
" xp = (DualNumbers.epsilon)(z) # line 275:\n",
" return (DualNumbers.Dual)((DualNumbers.log)(x),xp * (1 / x))\n",
" end)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@code_lowered log(Dual(2.0,1.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trig functions?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273\n",
":(begin \n",
" nothing\n",
" x = (DualNumbers.value)(z) # line 274:\n",
" xp = (DualNumbers.epsilon)(z) # line 275:\n",
" return (DualNumbers.Dual)((DualNumbers.sin)(x),xp * (DualNumbers.cos)(x))\n",
" end)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@code_lowered sin(Dual(2.0,1.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing derivatives of functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can define a function in Julia as:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Method definition f(Any) in module Main at In[11]:1 overwritten at In[11]:4.\n"
]
},
{
"data": {
"text/plain": [
"f (generic function with 1 method)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f(x) = x^2 - log(x)\n",
"# Or equivalently\n",
"function f(x)\n",
" return x^2 - log(x)\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8465735902799727 - 2.220446049250313e-16ɛ"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f(Dual(1/sqrt(2),1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
">**\\[Exercise\\]**: Differentiate it!\n",
"\n",
"> 1. Evaluate $f$ at $1 + \\epsilon$. What are $f(1)$ and $f'(1)$?\n",
"> 2. Evaluate $f$ at $\\frac{1}{\\sqrt{2}} + \\epsilon$. What are $f(\\frac{1}{\\sqrt{2}})$ and $f'(\\frac{1}{\\sqrt{2}})$?\n",
"> 3. Define a new function ``fprime`` which returns the derivative of ``f`` by using ``DualNumbers``.\n",
"> 3. Use the finite difference formula $$\n",
"f'(x) \\approx \\frac{f(x+h)-f(x)}{h}\n",
"$$\n",
"to evaluate $f'(\\frac{1}{\\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using ``@manipulate``, plots, or both!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How general is it?\n",
"\n",
"Recall [Newton's iterative method](http://en.wikipedia.org/wiki/Newton%27s_method) for finding zeros:\n",
"$$\n",
"x \\leftarrow x - \\frac{f(x)}{f'(x)}\n",
"$$\n",
"until $f(x) \\approx 0$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use this method to compute $\\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$.\n",
"So $f'(z) = 2z$, and we can implement the algorithm as:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"squareroot (generic function with 1 method)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"function squareroot(x)\n",
" z = x # Initial starting point\n",
" while abs(z*z - x) > 1e-13\n",
" z = z - (z*z-x)/(2z)\n",
" end\n",
" return z\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"squareroot(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can we differentiate this code? **Yes!**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10.0 + 0.049999999999999996ɛ"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = squareroot(Dual(100.0,1.0))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.049999999999999996"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"epsilon(d) # Computed derivative"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.05"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1/(2*sqrt(100)) # The exact derivative"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"6.938893903907228e-18"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"abs(epsilon(d)-1/(2*sqrt(100)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multivariate functions?\n",
"\n",
"Dual numbers can be used to compute the gradient of a function $f: \\mathbb{R}^n \\to \\mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is [implemented](https://github.com/JuliaOpt/Optim.jl/blob/583907676b5b99cdb2d4cba37f6026a3fe620a49/src/autodiff.jl) in [Optim](https://github.com/JuliaOpt/Optim.jl) with the ``autodiff=true`` keyword."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using Optim\n",
"rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2\n",
"optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When $n$ is large, there's an alternative procedure called [reverse-mode automatic differentiation](http://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation) which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in [ReverseDiffSparse](https://github.com/mlubin/ReverseDiffSparse.jl))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"- We can compute numerically exact derivatives of any differentiable function which is implemented by using a sequence of basic operations.\n",
"- In Julia it's very easy to use dual numbers for this!\n",
"- Reconsider when derivatives are \"not available.\"\n",
"\n",
"This was just an introduction to one technique from the area of automatic differentiation. For more references, see [autodiff.org](http://www.autodiff.org/?module=Introduction&submenu=Selected%20Books)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Julia 0.5.0",
"language": "julia",
"name": "julia-0.5"
},
"keywords": "DualNumbers,AD,derivatives",
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}