{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fully connected model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Installing packages:\n", "\t.package(path: \"/home/jupyter/notebooks/swift/FastaiNotebook_01a_fastai_layers\")\n", "\t\tFastaiNotebook_01a_fastai_layers\n", "With SwiftPM flags: []\n", "Working in: /tmp/tmpvtt2ci6k/swift-install\n", "[1/4] Compiling FastaiNotebook_01a_fastai_layers 01_matmul.swift\n", "[2/4] Compiling FastaiNotebook_01a_fastai_layers 00_load_data.swift\n", "[3/4] Compiling FastaiNotebook_01a_fastai_layers 01a_fastai_layers.swift\n", "[4/5] Merging module FastaiNotebook_01a_fastai_layers\n", "[5/6] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift\n", "[6/7] Merging module jupyterInstalledPackages\n", "[7/7] Linking libjupyterInstalledPackages.so\n", "Initializing Swift...\n", "Installation complete!\n" ] } ], "source": [ "%install-location $cwd/swift-install\n", "%install '.package(path: \"$cwd/FastaiNotebook_01a_fastai_layers\")' FastaiNotebook_01a_fastai_layers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//export\n", "import Path\n", "import TensorFlow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import FastaiNotebook_01a_fastai_layers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The forward and backward passes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typing `Tensor` all the time is tedious. The S4TF team expects to make `Float` be the default so we can just say `Tensor`. Until that happens though, we can define our own alias." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// export\n", "public typealias TF=Tensor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need to normalize our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// export\n", "public func normalize(_ x:TF, mean:TF, std:TF) -> TF {\n", " return (x-mean)/std\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var (xTrain, yTrain, xValid, yValid) = loadMNIST(path: mnistPath, flat: true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalize the training and validation sets with the training set statistics." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.13066047 0.30810782\r\n" ] } ], "source": [ "let trainMean = xTrain.mean()\n", "let trainStd = xTrain.std()\n", "print(trainMean, trainStd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xTrain = normalize(xTrain, mean: trainMean, std: trainStd)\n", "xValid = normalize(xValid, mean: trainMean, std: trainStd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test everything is going well:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//export\n", "public func testNearZero(_ a: TF, tolerance: Float = 1e-3) {\n", " assert(abs(a) < tolerance, \"Near zero: \\(a)\")\n", "}\n", "\n", "public func testSame(_ a: TF, _ b: TF) {\n", " // Check shapes match so broadcasting doesn't hide shape errors.\n", " assert(a.shape == b.shape)\n", " testNearZero(a-b)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "testNearZero(xTrain.mean())\n", "testNearZero(xTrain.std() - 1.0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "60000 784 10\r\n" ] } ], "source": [ "let (n,m) = (xTrain.shape[0],xTrain.shape[1])\n", "let c = yTrain.max()+1\n", "print(n, m, c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Foundations version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic architecture" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//num hidden\n", "let nh = 50" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// simplified kaiming init / he init\n", "let w1 = TF(randomNormal: [m, nh]) / sqrt(Float(m))\n", "let b1 = TF(zeros: [nh])\n", "let w2 = TF(randomNormal: [nh,1]) / sqrt(Float(nh))\n", "let b2 = TF(zeros: [1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "testNearZero(w1.mean())\n", "testNearZero(w1.std()-1/sqrt(Float(m)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0060178037 1.0077\r\n" ] } ], "source": [ "// This should be ~ (0,1) (mean,std)...\n", "print(xValid.mean(), xValid.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Swift `@` is spelled `•`, which is option-8 on Mac or compose-.-= elsewhere. Or just use the `matmul()` function we've seen already." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func lin(_ x: TF, _ w: TF, _ b: TF) -> TF { return x•w+b }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let t = lin(xValid, w1, b1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.040110435 1.0424675\r\n" ] } ], "source": [ "//...so should this, because we used kaiming init, which is designed to do this\n", "print(t.mean(), t.std())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func myRelu(_ x:TF) -> TF { return max(x, 0) }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let t = myRelu(lin(xValid, w1, b1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.43306303 0.6227423\r\n" ] } ], "source": [ "//...actually it really should be this!\n", "print(t.mean(),t.std())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// kaiming init / he init for relu\n", "let w1 = TF(randomNormal: [m,nh]) * sqrt(2.0/Float(m))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-0.00023605586 0.05074154\r\n" ] } ], "source": [ "print(w1.mean(), w1.std())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.6191642 0.91126376\r\n" ] } ], "source": [ "let t = myRelu(lin(xValid, w1, b1))\n", "print(t.mean(), t.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a simple basic model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func model(_ xb: TF) -> TF {\n", " let l1 = lin(xb, w1, b1)\n", " let l2 = myRelu(l1)\n", " let l3 = lin(l2, w2, b2)\n", " return l3\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average: 0.32438059999999996 ms, min: 0.270078 ms, max: 0.430408 ms\r\n" ] } ], "source": [ "time(repeating: 10) { _ = model(xValid) }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loss function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We begin with the mean squared error to have easier gradient computations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let preds = model(xTrain)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// export\n", "public func mse(_ out: TF, _ targ: TF) -> TF {\n", " return (out.squeezingShape(at: -1) - targ).squared().mean()\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One more step compared to Python, we have to make sure our labels are properly converted to floats." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Convert these to Float dtype.\n", "var yTrainF = TF(yTrain)\n", "var yValidF = TF(yValid)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21.76267\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mse(preds, yTrainF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradients and backward pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we should how to calculate gradients for a simple model the hard way, manually.\n", "\n", "To store the gradients a bit like in PyTorch we introduce a `TFGrad` class that has two attributes: the original tensor and the gradient. We choose a class to easily replicate the Python notebook: classes are reference types (which means they are mutable) while structures are value types.\n", "\n", "In fact, since this is the first time we're discovering Swift classes, let's jump into a [sidebar discussion about Value Semantics vs Reference Semantics](https://docs.google.com/presentation/d/1dc6o2o-uYGnJeCeyvgsgyk05dBMneArxdICW5vF75oU/edit#slide=id.g5669969ead_0_145) since it is a pretty fundamental part of the programming model and a huge step forward that Swift takes.\n", "\n", "When we get back, we'll keep charging on, even though this is very non-idiomatic Swift code!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "/// WARNING: This is designed to be similar to the PyTorch 02_fully_connected lesson,\n", "/// this isn't idiomatic Swift code.\n", "class TFGrad {\n", " var inner, grad: TF\n", " \n", " init(_ x: TF) {\n", " inner = x\n", " grad = TF(zeros: x.shape)\n", " } \n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Redefine our functions on TFGrad.\n", "func lin(_ x: TFGrad, _ w: TFGrad, _ b: TFGrad) -> TFGrad {\n", " return TFGrad(x.inner • w.inner + b.inner)\n", "}\n", "func myRelu(_ x: TFGrad) -> TFGrad {\n", " return TFGrad(max(x.inner, 0))\n", "}\n", "func mse(_ inp: TFGrad, _ targ: TF) -> TF {\n", " //grad of loss with respect to output of previous layer\n", " return (inp.inner.squeezingShape(at: -1) - targ).squared().mean()\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Define our gradient functions.\n", "func mseGrad(_ inp: TFGrad, _ targ: TF) {\n", " //grad of loss with respect to output of previous layer\n", " inp.grad = 2.0 * (inp.inner.squeezingShape(at: -1) - targ).expandingShape(at: -1) / Float(inp.inner.shape[0])\n", "}\n", "\n", "func reluGrad(_ inp: TFGrad, _ out: TFGrad) {\n", " //grad of relu with respect to input activations\n", " inp.grad = out.grad.replacing(with: TF(zeros: inp.inner.shape), where: (inp.inner .< 0))\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is our python version (we've renamed the python `g` to `grad` for consistency):\n", "\n", "```python\n", "def lin_grad(inp, out, w, b):\n", " inp.grad = out.grad @ w.t()\n", " w.grad = (inp.unsqueeze(-1) * out.grad.unsqueeze(1)).sum(0)\n", " b.grad = out.grad.sum(0)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func linGrad(_ inp:TFGrad, _ out:TFGrad, _ w:TFGrad, _ b:TFGrad){\n", " // grad of linear layer with respect to input activations, weights and bias\n", " inp.grad = out.grad • 
w.inner.transposed()\n", " w.grad = inp.inner.transposed() • out.grad\n", " b.grad = out.grad.sum(squeezingAxes: 0)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let w1a = TFGrad(w1)\n", "let b1a = TFGrad(b1)\n", "let w2a = TFGrad(w2)\n", "let b2a = TFGrad(b2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func forwardAndBackward(_ inp:TFGrad, _ targ:TF){\n", " // forward pass:\n", " let l1 = lin(inp, w1a, b1a)\n", " let l2 = myRelu(l1)\n", " let out = lin(l2, w2a, b2a)\n", " //we don't actually need the loss in backward!\n", " let loss = mse(out, targ)\n", " \n", " // backward pass:\n", " mseGrad(out, targ)\n", " linGrad(l2, out, w2a, b2a)\n", " reluGrad(l1, l2)\n", " linGrad(inp, l1, w1a, b1a)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let inp = TFGrad(xTrain)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forwardAndBackward(inp, yTrainF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic Differentiation in Swift\n", "\n", "There are a few challenges with the code above:\n", "\n", " * It doesn't follow the principle of value semantics, because TensorGrad is a class. Mutating a tensor would produce the incorrect results.\n", " * It doesn't compose very well - we need to keep track of values passed in the forward pass and also pass them in the backward pass.\n", " * It is fully dynamic, keeping track of gradients at runtime. This interferes with the compiler's ability to perform fusion and other advanced optimizations.\n", " \n", "We want something that is simple, consistent and easy to use, like this:\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\r\n", "0.2\r\n", "0.4\r\n", "0.6000000000000001\r\n", "0.8\r\n", "1.0\r\n", "1.2000000000000002\r\n", "1.4000000000000001\r\n", "1.6\r\n", "1.8\r\n" ] } ], "source": [ "let gradF = gradient { (x : Double) in x*x }\n", "\n", "for x in stride(from: 0.0, to: 1, by: 0.1) {\n", " print(gradF(x)) \n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how we're working with simple doubles here, not having to use tensors. Other than that, you can use it basically the way PyTorch autodiff works.\n", "\n", "You can get the gradients of functions, and do everything else you'd expect:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-3.0501623\r\n" ] } ], "source": [ "func doThing(_ x: Float) -> Float {\n", " return sin(x*x) + cos(x*x)\n", "}\n", "\n", "print(gradient(at: 3.14, in: doThing))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Autodiff the Functional Way\n", "\n", "Swift for TensorFlow's autodiff is built on value semantics and functional programming ideas.\n", "\n", "Each differentiable function gets an associated \"chainer\" (described below) that defines its gradient. 
When you write a function that, like `model`, calls a bunch of these in sequence, the compiler collects the pullback of each call and stitches the pullbacks together using the chain rule from calculus.\n", "\n", "Let's remember the chain rule - it is written:\n", "\n", "$$\\frac{d}{dx}\\left[f\\left(g(x)\\right)\\right] = f'\\left(g(x)\\right)g'(x)$$\n", "\n", "Notice how the chain rule requires mixing together expressions from both the forward pass (`g()`) and the backward pass (`f'()` and `g'()`) of a computation to get the derivative. While it is possible to calculate all the forward versions of a computation, then recompute everything needed again on the backward pass, this would be incredibly inefficient - it makes more sense to save intermediate values from the forward pass and reuse them on the backward pass.\n", "\n", "The Swift language provides the atoms we need to express this: we can represent math with function calls, and the pullback can be represented with a closure. This works out well because closures provide a natural way to capture interesting values from the forward pass." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic expressions in MSE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To explore this, let's look at a really simple example: the inner computation of MSE. The full body of MSE looks like this:\n", "\n", "```swift\n", "func mse(_ inp: TF, _ targ: TF) -> TF {\n", " //grad of loss with respect to output of previous layer\n", " return (inp.squeezingShape(at: -1) - targ).squared().mean()\n", "}\n", "```\n", "\n", "For the purposes of our example, we're going to keep it super super simple and just focus on the `x.squared().mean()` part of the computation, which we'll write as `mseInner(x) = mean(square(x))` to align better with function composition notation. We want a way to visualize what functions get called, so let's define a little helper that prints the name of its caller whenever it is called. To do this we use a [literal expression](https://docs.swift.org/swift-book/ReferenceManual/Expressions.html#ID390) `#function` that contains the name of the function we are in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "foo(a:b:)\r\n", "bar(x:)\r\n" ] }, { "data": { "text/plain": [ "731\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// This function prints out the calling function's name. This \n", "// is useful to see what is going on in your program.\n", "func trace(function: String = #function) {\n", " print(function)\n", "}\n", "\n", "// Try out the trace helper function.\n", "func foo(a: Int, b: Int) -> Int {\n", " trace()\n", " return a+b\n", "}\n", "func bar(x: Int) -> Int {\n", " trace()\n", " return x*42+17\n", "}\n", "\n", "foo(a: 1, b: 2)\n", "bar(x: 17)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Ok, given that, we start by writing the implementation and gradients of these functions, and we put print statements in them so we can tell when they are called. This looks like:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func square(_ x: TF) -> TF {\n", " trace() \n", " return x * x\n", "}\n", "func 𝛁square(_ x: TF) -> TF {\n", " trace()\n", " return 2 * x\n", "}\n", "\n", "func mean(_ x: TF) -> TF {\n", " trace()\n", " return x.mean() // this is a reduction. 
(can someone write this out longhand?)\n", "}\n", "func 𝛁mean(_ x: TF) -> TF {\n", " trace()\n", " return TF(ones: x.shape) / Float(x.shape[0])\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given these definitions we can now compute the forward and derivative of the `mseInner` function that composes `square` and `mean`, using the chain rule:\n", "\n", "$$\\frac{d}{dx}\\left[f\\left(g(x)\\right)\\right] = f'\\left(g(x)\\right)g'(x)$$\n", "\n", "where `f` is `mean` and `g` is `square`. This gives us:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func mseInner(_ x: TF) -> TF {\n", " return mean(square(x))\n", "}\n", "\n", "func 𝛁mseInner(_ x: TF) -> TF {\n", " return 𝛁mean(square(x)) * 𝛁square(x)\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is all simple, but we have a small problem if (in the common case for deep nets) we want to calculate both the forward and the gradient computation at the same time: we end up redundantly computing `square(x)` in both the forward and backward paths!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "square(_:)\n", "mean(_:)\n", "square(_:)\n", "𝛁mean(_:)\n", "𝛁square(_:)\n", "\n", "result: 7.5\n", "gradient: [0.5, 1.0, 1.5, 2.0]\n" ] } ], "source": [ "func mseInnerAndGrad(_ x: TF) -> (TF, TF) {\n", " return (mseInner(x), 𝛁mseInner(x)) \n", "}\n", "\n", "let exampleData = TF([1, 2, 3, 4])\n", "\n", "let (mseInnerResult1, mseInnerGrad1) = mseInnerAndGrad(exampleData)\n", "print()\n", "\n", "print(\"result:\", mseInnerResult1)\n", "print(\"gradient:\", mseInnerGrad1)\n", "\n", "// Check that our gradient matches builtin S4TF's autodiff.\n", "let builtinGrad = gradient(at: exampleData) { x in (x*x).mean() }\n", "testSame(mseInnerGrad1, builtinGrad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note above how `square` got called two times: once in the forward function and once in the gradient. In more complicated cases, this can be an incredible amount of redundant computation, which would make performance unacceptably slow.\n", "\n", "**Exercise:** take a look what happens when you use the same techniques to implement more complex functions.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reducing recomputation with Chainers and the ValueAndChainer pattern\n", "\n", "We can fix this by refactoring our code. We want to preserve the linear structure of `mseInner` that calls `square` and then `mean`, but we want to make it so the ultimate *user* of the computation can choose whether they want the gradient computation (or not) and if so, we want to minimize computation. To do this, we have to slightly generalize our derivative functions. While it is true that the derivative of $square(x)$ is `2*x`, this is only true for a given point `x`.\n", "\n", "If we generalize the derivative of `square` to work with an arbitrary **function**, instead of point, then we need to remember that $\\frac{d}{dx}x^2 = 2x\\frac{d}{dx}$, and therefore the derivative for `square` needs to get $\\frac{d}{dx}$ passed in from its nested function. \n", "\n", "This form of gradient is known by the academic term \"Vector Jacobian Product\" (vjp) or the technical term \"pullback\", but we will refer to it as a 𝛁Chain because it implements the gradient chain rule for the operation. 
We can write it like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// The chainer for the gradient of square(x).\n", "func square𝛁Chain(x: TF, ddx: TF) -> TF {\n", " trace()\n", " return ddx * 2*x\n", "}\n", "\n", "// The chainer for the gradient of mean(x).\n", "func mean𝛁Chain(x: TF, ddx: TF) -> TF {\n", " trace()\n", " return ddx * TF(ones: x.shape) / Float(x.shape[0])\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given this very general way of describing gradients, we now want to pull them together in a single bundle that we can keep track of: we do this by changing each atom of computation to return both a normal value and a 𝛁Chain closure that produces a piece of the gradient given the chained input.\n", "\n", "We refer to this as a \"Value With 𝛁Chain\" function (since that is what it is) and abbreviate this mouthful to \"VWC\". This is also an excuse to use labels in tuples, a Swift feature that is very useful for return values like this.\n", "\n", "They look like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Returns x*x and the chain for the gradient of x*x.\n", "func squareVWC(_ x: TF) -> (value: TF, \n", " chain: (TF) -> TF) {\n", " trace()\n", " return (value: x*x,\n", " chain: { ddx in square𝛁Chain(x: x, ddx: ddx) }) \n", "}\n", "\n", "// Returns the mean of x and the chain for the mean.\n", "func meanVWC(_ x: TF) -> (value: TF,\n", " chain: (TF) -> TF) {\n", " trace()\n", " return (value: x.mean(),\n", " chain: { ddx in mean𝛁Chain(x: x, ddx: ddx) })\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given this, we can now implement `mseInner` in the same way. Notice how our use of named tuple results makes the code nice and tidy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// We implement mean(square(x)) by calling each of the VWCs in turn.\n", "func mseInnerVWC(_ x: TF) -> (value: TF, \n", " chain: (TF) -> TF) {\n", "\n", " // square and mean are tuples that carry the value/chain for each step.\n", " let square = squareVWC(x)\n", " let mean = meanVWC(square.value)\n", "\n", " // The result is the combination of the results and the pullbacks.\n", " return (mean.value,\n", " // The mseInner pullback calls the functions in reverse order.\n", " { v in square.chain(mean.chain(v)) })\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can choose to evaluate just the forward computation, or we can choose to run both:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calling the forward function:\n", "squareVWC(_:)\n", "meanVWC(_:)\n", "\n", "Calling the backward function:\n", "mean𝛁Chain(x:ddx:)\n", "square𝛁Chain(x:ddx:)\n", "\n", "[0.5, 1.0, 1.5, 2.0]\n" ] } ], "source": [ "print(\"Calling the forward function:\")\n", "let mseInner2 = mseInnerVWC(exampleData)\n", "print()\n", "\n", "testSame(mseInner2.value, mseInnerResult1)\n", "\n", "\n", "print(\"Calling the backward function:\")\n", "let mseInnerGrad2 = mseInner2.chain(TF(1))\n", "print()\n", "\n", "print(mseInnerGrad2)\n", "// Check that we get the same result.\n", "testSame(mseInnerGrad2, builtinGrad)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, great - we only ran each piece of the computation once, and we gained a single conceptual abstraction that bundles everything 
we need together.\n", "\n", "Now we have all of the infrastructure and scaffolding necessary to define and compose computations and figure out their backwards versions from the chain rule. Let's jump up a level to define Jeremy's example using the VWC form of the computation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementing Relu, MSE, and Lin with the Value With 𝛁Chain pattern\n", "\n", "Lets come back to our earlier examples and define pullbacks for our primary functions in the simple model function example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func reluVWC(_ x: TF) -> (value: TF, chain: (TF) -> TF) {\n", " return (value: max(x, 0),\n", " // Pullback for max(x, 0)\n", " chain: { 𝛁out -> TF in\n", " 𝛁out.replacing(with: TF(zeros: x.shape), where: x .< 0)\n", " })\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```swift\n", "func lin(_ x: TFGrad, _ w: TFGrad, _ b: TFGrad) -> TFGrad {\n", " return TFGrad(x.inner • w.inner + b.inner)\n", "}\n", "func linGrad(_ inp:TFGrad, _ out:TFGrad, _ w:TFGrad, _ b:TFGrad){\n", " inp.grad = out.grad • w.inner.transposed()\n", " w.grad = inp.inner.transposed() • out.grad\n", " b.grad = out.grad.sum(squeezingAxes: 0)\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func linVWC(_ inp: TF, _ w: TF, _ b: TF) -> (value: TF,\n", " chain: (TF) -> (TF, TF, TF)) {\n", " return (value: inp • w + b,\n", " // Pullback for inp • w + b. Three results because 'lin' has three args.\n", " chain: { 𝛁out in\n", " (𝛁out • w.transposed(), \n", " inp.transposed() • 𝛁out,\n", " 𝛁out.unbroadcasted(to: b.shape))\n", " })\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func mseVWC(_ inp: TF, _ targ: TF) -> (value: TF,\n", " chain: (TF) -> (TF)) {\n", " let tmp = inp.squeezingShape(at: -1) - targ\n", " \n", " // We already wrote a VWC for x.square().mean(), so we can reuse it.\n", " let mseInner = mseInnerVWC(tmp)\n", " \n", " // Return the result, and a pullback that expands back out to\n", " // the input shape.\n", " return (mseInner.value, \n", " { v in mseInner.chain(v).expandingShape(at: -1) })\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then our forward and backward can be refactored in:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func forwardAndBackward(_ inp: TF, _ targ: TF) -> (TF, TF, TF, TF, TF) {\n", " // Forward pass:\n", " let l1 = linVWC(inp, w1, b1)\n", " let l2 = reluVWC(l1.value)\n", " let out = linVWC(l2.value, w2, b2)\n", " //we don't actually need the loss in backward, but we need the pullback.\n", " let loss = mseVWC(out.value, targ)\n", " \n", " // Backward pass:\n", " let 𝛁loss = TF(1) // We don't really need it but the gradient of the loss with respect to itself is 1\n", " let 𝛁out = loss.chain(𝛁loss)\n", " let (𝛁l2, 𝛁w2, 𝛁b2) = out.chain(𝛁out)\n", " let 𝛁l1 = l2.chain(𝛁l2)\n", " let (𝛁inp, 𝛁w1, 𝛁b1) = l1.chain(𝛁l1)\n", " return (𝛁inp, 𝛁w1, 𝛁b1, 𝛁w2, 𝛁b2)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n" ] } ], "source": [ "let (𝛁xTrain, 𝛁w1, 𝛁b1, 𝛁w2, 𝛁b2) = forwardAndBackward(xTrain, yTrainF)\n", "// Check this is still all correct\n", "testSame(inp.grad, 
𝛁xTrain)\n", "testSame(w1a.grad, 𝛁w1)\n", "testSame(b1a.grad, 𝛁b1)\n", "testSame(w2a.grad, 𝛁w2)\n", "testSame(b2a.grad, 𝛁b2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, this is pretty nice - we get composition, we get value semantics, and everything just stacks up nicely. We have a problem though, which is that this is a real pain to write and it is very easy to make simple mistakes. This is also very mechanical - and thus boring.\n", "\n", "This is where Swift's autodiff system comes to the rescue!\n", "\n", "# Automatically generating 𝛁Chains and VWCs\n", "\n", "When you define a function with `@differentiable` you're asking the compiler to make it differentiable by composing the VWCs of other functions, just like we did above manually. As it turns out, all of the methods on `Tensor` are marked up with `@differentiable` attributes until you get down to the atoms of the raw ops. For example, this is how the `Tensor.squared` method is [defined in Ops.swift in the TensorFlow module](https://github.com/apple/swift/blob/tensorflow/stdlib/public/TensorFlow/Ops.swift#L960):\n", "\n", "```swift\n", "// slightly simplified for clarity\n", "public extension Tensor {\n", " @differentiable(vjp: _vjpSquared()) // VWCs are called \"VJPs\" by S4TF \n", " func squared() -> Tensor {\n", " return Raw.square(self)\n", " }\n", "}\n", "```\n", " \n", "The Value with 𝛁Chain function is [defined in Gradients.swift](https://github.com/apple/swift/blob/tensorflow/stdlib/public/TensorFlow/Gradients.swift#L470):\n", "\n", "```swift\n", "public extension Tensor {\n", " func _vjpSquared() -> (Tensor, (Tensor) -> Tensor) {\n", " return (squared(), { 2 * self * $0 })\n", " }\n", "}\n", "```\n", "\n", "This tells the compiler that `squared()` has a manually written VJP that is implemented as we already saw. Now, anything that calls `squared()` can have its own VJP synthesized out of it. For example, we can write our `mseInner` function the trivial way, and we can get low level access to the 𝛁Chain (which S4TF calls a \"pullback\") if we want:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(Tensor) -> Tensor\r\n" ] } ], "source": [ "@differentiable\n", "func mseInnerForAD(_ x: TF) -> TF {\n", " return x.squared().mean()\n", "}\n", "\n", "let mseInner𝛁Chain = pullback(at: exampleData, in: mseInnerForAD)\n", "print(type(of: mseInner𝛁Chain))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the compiler knows the VWCs for the `squared` and `mean` functions, it can synthesize them as we need them. Most often though, you don't use the 𝛁Chain function directly. You can instead ask for both the value and the gradient of a function at a specific point, which is the most typical thing you'd use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "value: 7.5, grad: [0.5, 1.0, 1.5, 2.0]\r\n" ] } ], "source": [ "let (value, grad) = valueWithGradient(at: exampleData, in: mseInnerForAD)\n", "\n", "print(\"value: \\(value), grad: \\(grad)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also ask for just the gradient. Of course, we can also use trailing closures, which work very nicely with these functions."
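] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, `valueWithGradient` also accepts a trailing closure (a small illustrative variant of the cell above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "let (lossValue, lossGrad) = valueWithGradient(at: exampleData) { x in (x * x).mean() }\n", "print(\"value: \\(lossValue), grad: \\(lossGrad)\")"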
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0.5, 1.0, 1.5, 2.0]\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gradient(at: exampleData) { ($0*$0).mean() }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The @differentiable attribute is normally optional in a S4TF standalone environment, but is currently required in Jupyter notebooks. The S4TF team is planning to relax this limitation when time permits." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bundling up a model into an aggregate value\n", "\n", "When we work with models and individual layers, we often want to bundle up a bunch of differentiable variables into one value, so we don't have to pass a ton of arguments around. When we get to building our whole model, it is mathematically just a struct that contains a bunch of differentiable values embedded into it. It is more convenient to think of a model as a function that takes one value and returns one value rather than something that can take an unbounded number of inputs: our simple model has 4 parameters, and two normal inputs!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@differentiable\n", "func forward(_ inp: TF, _ targ: TF, w1: TF, b1: TF, w2: TF, b2: TF) -> TF {\n", " // FIXME: use lin\n", " let l1 = matmul(inp, w1) + b1\n", " let l2 = relu(l1)\n", " let l3 = matmul(l2, w2) + b2\n", " return (l3.squeezingShape(at: -1) - targ).squared().mean()\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try refactoring our single linear model to use a `struct` to simplify this. We start by defining a structure to contain all the fields we need. 
We mark the structure as `: Differentiable` so the compiler knows we want it to be differentiable (not discrete):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "struct MyModel: Differentiable {\n", " public var w1, b1, w2, b2: TF\n", "}\n", "\n", "// Create an instance of our model with all the individual parameters we initialized.\n", "let model = MyModel(w1: w1, b1: b1, w2: w2, b2: b2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now define our forward function as a method on this model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "extension MyModel {\n", " @differentiable\n", " func forward(_ input: TF, _ target: TF) -> TF {\n", " // FIXME: use lin\n", " let l1 = matmul(input, w1) + b1\n", " let l2 = relu(l1)\n", " let l3 = matmul(l2, w2) + b2\n", " // use mse\n", " return (l3.squeezingShape(at: -1) - target).squared().mean()\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given this, we can now get the gradient of our entire loss w.r.t to the input and the expected labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Grads is a struct with one gradient per parameter.\n", "let grads = gradient(at: model) { model in model.forward(xTrain, yTrainF) }\n", "\n", "// Check that this still calculates the same thing.\n", "testSame(𝛁w1, grads.w1)\n", "testSame(𝛁b1, grads.b1)\n", "testSame(𝛁w2, grads.w2)\n", "testSame(𝛁b2, grads.b2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In terms of timing our implementation gives:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "squareVWC(_:)\r\n", "meanVWC(_:)\r\n", "mean𝛁Chain(x:ddx:)\r\n", "square𝛁Chain(x:ddx:)\r\n", "average: 5.6533444 ms, min: 5.298 ms, max: 5.80675 ms\r\n" ] } ], "source": [ "time(repeating: 10) { _ = forwardAndBackward(xTrain, yTrainF) }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "average: 3.9664487 ms, min: 3.588866 ms, max: 4.454641 ms\r\n" ] } ], "source": [ "time(repeating: 10) {\n", " _ = valueWithGradient(at: model) { \n", " model in model.forward(xTrain, yTrainF)\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# More about AutoDiff\n", "\n", "There are lots of cool things you can do with Swift autodiff. 
One of the great things about understanding how the system fits together is that you can do a lot of interesting things by customizing gradients with S4TF. This can be useful for lots of reasons, for example:\n", "\n", " * you want a faster approximation of an expensive gradient\n", " * you want to improve the numerical stability of a gradient\n", " * you want to pursue exotic techniques like learned gradients\n", " * you want to work around a limitation of the current implementation\n", " \n", "In fact, we've had to do that in `11_imagenette` where we've built a `SwitchableLayer` with a custom gradient. Let's go take a look.\n", "\n", "To find out more, check out [this nice tutorial in Colab on custom autodiff](https://colab.research.google.com/github/tensorflow/swift/blob/master/docs/site/tutorials/custom_differentiation.ipynb).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "success\r\n" ] } ], "source": [ "import NotebookExport\n", "let exporter = NotebookExport(Path.cwd/\"02_fully_connected.ipynb\")\n", "print(exporter.export(usingPrefix: \"FastaiNotebook_\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Swift", "language": "swift", "name": "swift" } }, "nbformat": 4, "nbformat_minor": 2 }