{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "97d3c80864403311b3342877098cef5b", "grade": false, "grade_id": "cell-f38ec230aeb63f9d", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "# Part 3: Reverse Mode Automatic Differentiation with PyTorch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "529f94d77025a0b9b2a4aa44ed0b1f1b", "grade": false, "grade_id": "cell-2afa7eb0f5479ff9", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "# Execute this code block to install dependencies when running on colab\n", "try:\n", " import torch\n", "except:\n", " from os.path import exists\n", " from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag\n", " platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())\n", " cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\\.\\([0-9]*\\)\\.\\([0-9]*\\)$/cu\\1\\2/'\n", " accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'\n", "\n", " !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "8c13a801a6e49369b91bda4e906cb1cf", "grade": false, "grade_id": "cell-3d5d3b3ff9e2ac9f", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "PyTorch implements Dynamic Reverse Mode Automatic Differentiation, much like we did in the previous exercise. There is one really major difference in what PyTorch provides over our simple example: it works directly with matrices (`Tensor`s) rather than with scalars (although obviously a matrix can represent a scalar).\n", "\n", "In this tutorial, we'll explore PyTorch's AD implementation. 
"\n", "We'll start with the simple example we tried earlier in the code block below:\n", "\n", "__Task:__ Run the following code and verify that the solution is correct." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "0d5aa67ad79542449beb79e8b65a50bc", "grade": false, "grade_id": "cell-a7740c83bfcaa983", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "import torch\n", "\n", "# set up the problem\n", "x = torch.tensor(0.5, requires_grad=True)\n", "y = torch.tensor(4.2, requires_grad=True)\n", "z = x * y + torch.sin(x)\n", "\n", "print(\"z = \" + str(z.item()))\n", "\n", "z.backward()  # this goes through the computation graph and accumulates the gradients in the cached .grad attributes\n", "print(\"dz/dx = \" + str(x.grad.item()))\n", "print(\"dz/dy = \" + str(y.grad.item()))" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "aac08494fe446009596079eba3016ddf", "grade": false, "grade_id": "cell-cb3a2586467dd0ad", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "As with our own AD implementation, PyTorch lets us differentiate through an algorithm.\n", "\n", "__Task__: Use the block below to compute the gradient $\\partial z/\\partial x$ of the following pseudocode algorithm and store the result in the `dzdx` variable:\n", "\n", "    x = 0.5\n", "    z = 1\n", "    i = 0\n", "    while i<2:\n", "        z = (z + i) * x * x\n", "        i = i + 1" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "b689f4aaf1507a7f532c6059896ad02e", "grade": false, "grade_id": "cell-62e1f0838236eddf", "locked": false, "schema_version": 3, "solution": true } }, "outputs": [], "source": [ "dzdx = None\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "bc79555ad292cd1026a9db951f9f8e47", "grade": true, "grade_id": "cell-cbe71552690f710b", "locked": true, "points": 4, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "assert dzdx\n" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "d999d1c4c84b066fb15be1972dcaaad8", "grade": false, "grade_id": "cell-824ecf46b4875b5f", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## PyTorch limitations: in-place operations and aliasing\n", "\n", "PyTorch will throw an error at runtime if an in-place operation modifies a tensor whose value is needed to compute gradients during the backward pass.\n", "\n", "__Task__: Run the following code to see this in action."
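, "\n", "\n", "(Aside: not every in-place operation breaks autograd; the problem only arises when the operation overwrites a value that the backward pass needs. Here, `tanh`'s backward uses its *output*, which `add_(3)` clobbers. As a minimal sketch, the following variant should run without error, because `mul`'s backward does not need its output:)\n", "\n", "```python\n", "x = torch.tensor(1.0, requires_grad=True)\n", "y = x * 2      # mul's backward only needs the constant 2, not y\n", "y.add_(3)      # in-place, but nothing saved for backward is overwritten\n", "y.backward()   # works: x.grad is 2\n", "print(x.grad)\n", "```"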
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "eb07b791962399cd736ede26b99309e8", "grade": false, "grade_id": "cell-813341e9825354bf", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "x = torch.tensor(1.0, requires_grad=True)\n", "y = x.tanh()\n", "y.add_(3)  # in-place addition\n", "y.backward()" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "bfd84ec33f62c537a0ea6ea4f87291de", "grade": false, "grade_id": "cell-b9986c88256e8468", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "Aliasing, where one tensor is a view that shares storage with another (for example a slice), also can't be differentiated through when combined with in-place modification, and it results in a slightly more cryptic error.\n", "\n", "__Task__: Run the following code to see this in action. If you don't understand what this code does, add some `print` statements to show the values of `x` and `y` at various points." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.tensor([1, 2, 3, 4], requires_grad=True, dtype=torch.float)\n", "y = x[:1]   # y is a view (alias) of the first element of x\n", "y.add_(3)   # in-place modification of the view\n", "y.backward()" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4e749620afa15537eddb1f61565c927e", "grade": false, "grade_id": "cell-76e1b1d7cbc02388", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "## Dealing with multiple outputs\n", "\n", "PyTorch can deal with the case where there are multiple output variables if we can formulate the expression in terms of tensor operations. Consider the example from the presentation:\n", "\n", "$$\\begin{cases}\n", "    z = 2x + \\sin x\\\\\n", "    v = 4x + \\cos x\n", "\\end{cases}$$\n", "\n", "We could formulate this as:\n", "\n", "$$\n", "\\begin{bmatrix}z \\\\ v\\end{bmatrix} = \\begin{bmatrix}2 \\\\ 4\\end{bmatrix} \\odot \\bar{x} + \\begin{bmatrix}1 \\\\ 0\\end{bmatrix} \\odot \\sin\\bar x + \\begin{bmatrix}0 \\\\ 1\\end{bmatrix} \\odot \\cos\\bar x\n", "$$\n", "\n", "where\n", "\n", "$$\n", "\\bar x = \\begin{bmatrix}x \\\\ x\\end{bmatrix}\n", "$$\n", "\n", "and $\\odot$ represents the Hadamard or element-wise product. This is demonstrated using PyTorch in the following code block.\n", "\n", "__Task:__ Run the code below."
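, "\n", "\n", "(A note on the weight tensor passed to `backward` in the cell below: if $J$ is the Jacobian of the outputs with respect to the inputs and $v$ is the weight vector, then `backward(v)` accumulates $v^\\top J$ into `x.grad`. A one-hot weight vector should therefore pick out a single output's gradient. The following sketch is a hypothetical variation on the cell below, not part of the task:)\n", "\n", "```python\n", "x = torch.tensor([[1.0], [1.0]], requires_grad=True)\n", "zv = (torch.tensor([[2.0], [4.0]]) * x +\n", "      torch.tensor([[1.0], [0.0]]) * torch.sin(x) +\n", "      torch.tensor([[0.0], [1.0]]) * torch.cos(x))\n", "zv.backward(torch.tensor([[1.0], [0.0]]))  # weight z by 1 and v by 0\n", "print(x.grad)  # first element is dz/dx = 2 + cos(1); second element is 0\n", "```"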
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5ab1d7ea5f6b1a017f7949fa0dd1c8f1", "grade": false, "grade_id": "cell-58a1e0676df73645", "locked": true, "schema_version": 3, "solution": false } }, "outputs": [], "source": [ "x = torch.tensor([[1.0], [1.0]], requires_grad=True)\n", "\n", "zv = (torch.tensor([[2.0], [4.0]]) * x +\n", "      torch.tensor([[1.0], [0.0]]) * torch.sin(x) +\n", "      torch.tensor([[0.0], [1.0]]) * torch.cos(x))\n", "\n", "zv.backward(torch.tensor([[1.0], [1.0]]))  # Note: as we have \"multiple outputs\" we must pass in a tensor of weights of the correct size\n", "\n", "print(x.grad)" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "68158fbb8633636d8fdeaab2b8066186", "grade": false, "grade_id": "cell-5253f44d76ef4079", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "__Task:__ Use the following box to write down the derivative of the expression for $\\begin{bmatrix}z \\\\ v\\end{bmatrix}$ and verify that the gradients $\\partial z/\\partial x$ and $\\partial v/\\partial x$ are correct for $x=1$." ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1e10ee0f8f8395414cb70aa241376003", "grade": true, "grade_id": "cell-b85be7c90ef1f408", "locked": false, "points": 3, "schema_version": 3, "solution": true } }, "source": [ "YOUR ANSWER HERE" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent & gradients with respect to a vector\n", "\n", "Let's put everything together and use automatically computed gradients to find the minimum of a function by taking steps down the gradient from an initial position. Rather than explicitly creating each input variable as a scalar, as in the previous examples, we'll use a vector instead (so our gradients will be with respect to each element of the vector).\n", "\n", "__Task:__ Work through the following example to see how taking gradients with respect to a vector works and how simple gradient descent optimisation can be implemented." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is our starting vector\n", "initial = [[2.0], [1.0], [10.0]]\n", "\n", "# This is the function we will optimise (feel free to work out the analytic minimum!)\n", "def function(x):\n", "    return x[0]**2 + x[1]**2 + x[2]**2\n", "\n", "x = torch.tensor(initial, requires_grad=True, dtype=torch.float)\n", "for i in range(0, 100):\n", "    # manually dispose of the gradient (in reality it would be better to detach and zero it to reuse memory)\n", "    x.grad = None\n", "    # evaluate the function\n", "    J = function(x)\n", "    # auto-compute the gradients at the previously evaluated point x\n", "    J.backward()\n", "    # compute the update\n", "    x.data = x - x.grad * 0.1\n", "\n", "    if i % 10 == 0:\n", "        print((x, function(x).item()))\n" ] },
{ "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "1a6fb5c11feed5015f3836b04ed25d65", "grade": false, "grade_id": "cell-fe38c3ff8580d8e1", "locked": true, "schema_version": 3, "solution": false } }, "source": [ "__Task__: Answer the following question in the box below: Why must the update in the code above be assigned to `x.data` rather than `x`?"
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "2b3a76629b34fb1c8a80c90d9df28dd9", "grade": true, "grade_id": "cell-855b4580505f3b5c", "locked": false, "points": 3, "schema_version": 3, "solution": true } }, "source": [ "YOUR ANSWER HERE" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Differentiating through random operations\n", "\n", "We'll end with an example that will be important later in the course: differentiating with respect to the parameters of a random number generator.\n", "\n", "Assume that, as part of a differentiable program that we write, we wish to incorporate a random element where we sample values $z$ from a Normal distribution: $z \\sim \\mathcal{N}(\\mu,\\sigma^2)$. We want to learn the parameters of the generator, $\\mu$ and $\\sigma^2$, but how can we do this? In a differentiable program setting we want to differentiate with respect to these parameters, but at first glance it isn't at all obvious what this means, as the generator _just_ produces numbers, which themselves don't have gradients.\n", "\n", "The answer is often called the _reparameterisation trick_: sampling from a Normal distribution with a specific mean and variance is equivalent to drawing numbers from a standard Normal distribution and then scaling and shifting them: $z \\sim \\mathcal{N}(\\mu,\\sigma^2) \\equiv z \\sim \\mu + \\sigma\\mathcal{N}(0,1) \\equiv z = \\mu + \\sigma\\zeta, \\text{ where } \\zeta\\sim\\mathcal{N}(0,1)$. With this reparameterisation the gradients with respect to the parameters are obvious.\n", "\n", "The following code block demonstrates this in practice; each of the gradients can be interpreted as how much an infinitesimal change in $\\mu$ or $\\sigma$ would change $z$ if we could repeat the sampling operation with the same value of `torch.randn(1)` being produced:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = torch.tensor(1.0, requires_grad=True)\n", "sigma = torch.tensor(1.0, requires_grad=True)\n", "\n", "for i in range(10):\n", "    mu.grad = None\n", "    sigma.grad = None\n", "\n", "    z = mu + sigma * torch.randn(1)\n", "    z.backward()\n", "    print(\"z:\", z.item(), \"\\tdzdmu:\", mu.grad.item(), \"\\tdzdsigma:\", sigma.grad.item())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }