{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing packages:\n",
      "\t.package(path: \"/home/jupyter/notebooks/swift/FastaiNotebook_00_load_data\")\n",
      "\t\tFastaiNotebook_00_load_data\n",
      "With SwiftPM flags: []\n",
      "Working in: /tmp/tmp602xg52k/swift-install\n",
      "[1/2] Merging module Path\n",
      "[2/3] Merging module NotebookExport\n",
      "[3/4] Merging module Just\n",
      "[4/5] Merging module FastaiNotebook_00_load_data\n",
      "[5/6] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift\n",
      "[6/7] Merging module jupyterInstalledPackages\n",
      "Initializing Swift...\n",
      "Installation complete!\n"
     ]
    }
   ],
   "source": [
    "%install-location $cwd/swift-install\n",
    "%install '.package(path: \"$cwd/FastaiNotebook_00_load_data\")' FastaiNotebook_00_load_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "//export\n",
    "import Path\n",
    "import TensorFlow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import FastaiNotebook_00_load_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get some Tensors to play with"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can initialize a tensor in lots of different ways because in swift, two functions with the same name can coexist as long as they don't have the same signatures. Different named arguments give different signatures, so all of those are different `init` functions of `Tensor`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "let zeros = Tensor<Float>(zeros: [1,4,5])\n",
    "let ones  = Tensor<Float>(ones: [12,4,5])\n",
    "let twos  = Tensor<Float>(repeating: 2.0, shape: [2,3,4,5])\n",
    "let range = Tensor<Int32>(rangeFrom: 0, to: 32, stride: 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those are just some examples and there are many more! Here we grab random numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[  0.015628124,   0.026147088, -0.0083014425,    0.06369277,   -0.06701048,  -0.026953917,\r\n",
      "   0.014180449,  0.0067227157,   0.005665138,   0.010936922]\r\n"
     ]
    }
   ],
   "source": [
    "let xTrain = Tensor<Float>(randomNormal: [5, 784])\n",
    "var weights = Tensor<Float>(randomNormal: [784, 10]) / sqrt(784)\n",
    "print(weights[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building Matmul"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, now that we know how floating point types and arrays work, we can finally build our own matmul from scratch, using a few loops.  We will take the two input matrices as single dimensional arrays so we can show manual indexing into them, the hard way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "// a and b are the flattened array elements, aDims/bDims are the #rows/columns of the arrays.\n",
    "func swiftMatmul(a: [Float], b: [Float], aDims: (Int,Int), bDims: (Int,Int)) -> [Float] {\n",
    "    assert(aDims.1 == bDims.0, \"matmul shape mismatch\")\n",
    "    \n",
    "    var res = Array(repeating: Float(0.0), count: aDims.0 * bDims.1)\n",
    "    for i in 0 ..< aDims.0 {\n",
    "        for j in 0 ..< bDims.1 {\n",
    "            for k in 0 ..< aDims.1 {\n",
    "                res[i*bDims.1+j] += a[i*aDims.1+k] * b[k*bDims.1+j]\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "    return res\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To try this out, we extract the scalars out of our MNIST data as an array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "let flatA = xTrain[0..<5].scalars\n",
    "let flatB = weights.scalars\n",
    "let (aDims,bDims) = ((5, 784), (784, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've got everything together, we can try it out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var resultArray = swiftMatmul(a: flatA, b: flatB, aDims: aDims, bDims: bDims)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 0.12000220999999996 ms,   min: 0.110746 ms,   max: 0.149986 ms\r\n"
     ]
    }
   ],
   "source": [
    "time(repeating: 100) {\n",
    "    _ = swiftMatmul(a: flatA, b: flatB, aDims: aDims, bDims: bDims)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome, that is pretty fast - compare that to **835 ms** with Python!\n",
    "\n",
    "You might be wondering what that `time(repeating:)` builtin is.  As you might guess, this is actually a Swift function - one that is using \"trailing closure\" syntax to specify the body of the timing block.  Trailing closures are passed as arguments to the function, and in this case, the function was defined in our ✅**00_load_data** workbook.  Let's take a look!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting the performance of C 💯"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This performance is pretty great, but we can do better.  Swift is a memory safe language (like Python), which means it has to do array bounds checks and some other stuff.  Fortunately, Swift is a pragmatic language that allows you to drop through this to get peak performance - check out Jeremy's article [High Performance Numeric Programming with Swift: Explorations and Reflections](https://www.fast.ai/2019/01/10/swift-numerics/) for a deep dive.\n",
    "\n",
    "One thing you can do is use `UnsafePointer` (which is basically a raw C pointer) instead of using a bounds checked array.  This isn't memory safe, but gives us about a 2x speedup in this case!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "// a and b are the flattened array elements, aDims/bDims are the #rows/columns of the arrays.\n",
    "func swiftMatmulUnsafe(a: UnsafePointer<Float>, b: UnsafePointer<Float>, aDims: (Int,Int), bDims: (Int,Int)) -> [Float] {\n",
    "    assert(aDims.1 == bDims.0, \"matmul shape mismatch\")\n",
    "    \n",
    "    var res = Array(repeating: Float(0.0), count: aDims.0 * bDims.1)\n",
    "    res.withUnsafeMutableBufferPointer { res in \n",
    "        for i in 0 ..< aDims.0 {\n",
    "            for j in 0 ..< bDims.1 {\n",
    "                for k in 0 ..< aDims.1 {\n",
    "                    res[i*bDims.1+j] += a[i*aDims.1+k] * b[k*bDims.1+j]\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "    return res\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 0.059201270000000035 ms,   min: 0.052995 ms,   max: 0.084454 ms\r\n"
     ]
    }
   ],
   "source": [
    "time(repeating: 100) {\n",
    "    _ = swiftMatmulUnsafe(a: flatA, b: flatB, aDims: aDims, bDims: bDims)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the other cool things about this is that we can provide a nice idiomatic API to the caller of this, and keep all the unsafe shenanigans inside the implementation of this function.\n",
    "\n",
    "If you really want to fall down the rabbit hole, you can look at the [implementation of `UnsafePointer`](https://github.com/apple/swift/blob/tensorflow/stdlib/public/core/UnsafePointer.swift), which is of written in Swift wrapping LLVM pointer operations.  This means you can literally get the performance of C code directly in Swift, while providing easy to use high level APIs!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Swift 💖 C APIs too: you get the full utility of the C ecosystem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Swift even lets you transparently work with C APIs, just like it does with Python.  This can be used for both good and evil.  For example, here we directly call the `malloc` function, dereference the uninitialized pointer, and print it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "☠️☠️ Uninitialized garbage = 160\r\n"
     ]
    }
   ],
   "source": [
    "import Glibc\n",
    "\n",
    "let ptr : UnsafeMutableRawPointer = malloc(42)\n",
    "\n",
    "print(\"☠️☠️ Uninitialized garbage =\", ptr.load(as: UInt8.self))\n",
    "\n",
    "free(ptr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An `UnsafeMutableRawPointer` ([implementation](https://github.com/apple/swift/blob/tensorflow/stdlib/public/core/UnsafeRawPointer.swift)) isn't something you should use lightly, but when you work with C APIs, you'll see various types like that in the function signatures.\n",
    "\n",
    "Calling `malloc` and `free` directly aren't recommended in Swift, but is useful and important when you're working with C APIs that expect to get malloc'd memory, which comes up when you're written a safe Swift wrapper for some existing code.\n",
    "\n",
    "Speaking of existing code, let's take a look at that **Python interop** we touched on before:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fatal error: 'try!' expression unexpectedly raised an error: Python exception: No module named 'numpy': file /swift-base/swift/stdlib/public/Python/Python.swift, line 683\r\n",
      "Current stack trace:\r\n",
      "0    libswiftCore.so                    0x00007ffb425a58b0 swift_reportError + 50\r\n",
      "1    libswiftCore.so                    0x00007ffb42614aa0 _swift_stdlib_reportFatalErrorInFile + 115\r\n",
      "2    libswiftCore.so                    0x00007ffb4253cace <unavailable> + 3738318\r\n",
      "3    libswiftCore.so                    0x00007ffb4253cc47 <unavailable> + 3738695\r\n",
      "4    libswiftCore.so                    0x00007ffb4230ac4d <unavailable> + 1436749\r\n",
      "5    libswiftCore.so                    0x00007ffb42511a78 <unavailable> + 3562104\r\n",
      "6    libswiftCore.so                    0x00007ffb42334795 <unavailable> + 1607573\r\n",
      "7    libswiftPython.so                  0x00007ffb42f9862b <unavailable> + 67115\r\n"
     ]
    },
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "Current stack trace:",
      "\tframe #3: 0x00007ffae8008060 $__lldb_expr96`main at <Cell 13>:2:17"
     ]
    }
   ],
   "source": [
    "import Python\n",
    "let np = Python.import(\"numpy\")\n",
    "let pickle = Python.import(\"pickle\")\n",
    "let sys = Python.import(\"sys\")\n",
    "\n",
    "print(\"🐍list =    \", pickle.dumps([1, 2, 3]))\n",
    "print(\"🐍ndarray = \", pickle.dumps(np.array([[1, 2], [3, 4]])))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course this is [all written in Swift](https://github.com/apple/swift/tree/tensorflow/stdlib/public/Python) as well.  You can probably guess how this works now: `PythonObject` is a Swift struct that wraps a pointer to the Python interpreter's notion of a Python object.\n",
    "\n",
    "```swift\n",
    "@dynamicCallable\n",
    "@dynamicMemberLookup\n",
    "public struct PythonObject {\n",
    "  var reference: PyReference\n",
    "  ...\n",
    "}\n",
    "```\n",
    "\n",
    "The `@dynamicMemberLookup` attribute allows it to dynamically handle all member lookups (like `x.y`) by calling into [the `PyObject_GetAttrString` runtime call](https://github.com/apple/swift/blob/tensorflow/stdlib/public/Python/Python.swift#L427).  Similarly, the `@dynamicCallable` attribute allows the type to intercept all calls to a PythonObject (like `x()`), which it implements using the [`PyObject_Call` runtime call](https://github.com/apple/swift/blob/tensorflow/stdlib/public/Python/Python.swift#L324).  \n",
    "\n",
    "Because Swift has such simple and transparent access to C, it allows building very nice first-class Swift APIs that talk directly to the lower level implementation, and these implementations can have very little overhead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with Tensor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets get back into matmul and explore more of the `Tensor` type as provided by the TensorFlow module.  You can see all things `Tensor` can do in the [official documentation](https://www.tensorflow.org/swift/api_docs/Structs/Tensor).\n",
    "\n",
    "Here are some highlights. We saw how you can get zeros or random data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var bias = Tensor<Float>(zeros: [10])\n",
    "\n",
    "let m1 = Tensor<Float>(randomNormal: [5, 784])\n",
    "let m2 = Tensor<Float>(randomNormal: [784, 10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tensors carry data and a shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "m1:  [5, 784]\r\n",
      "m2:  [784, 10]\r\n"
     ]
    }
   ],
   "source": [
    "print(\"m1: \", m1.shape)\n",
    "print(\"m2: \", m2.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Tensor` type provides all the normal stuff you'd expect as methods.  Including arithmetic, convolutions, etc and this includes full support for broadcasting:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔢2x2:\r\n",
      " [[1.0, 2.0],\r\n",
      " [3.0, 4.0]]\r\n"
     ]
    }
   ],
   "source": [
    "let small = Tensor<Float>([[1, 2],\n",
    "                           [3, 4]])\n",
    "\n",
    "print(\"🔢2x2:\\n\", small)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**MatMul Operator:** In addition to using the global `matmul(a, b)` function, you can also use the `a • b` operator to matmul together two things.  This is just like the `@` operator in Python.  You can get it with the <kbd>option</kbd>-<kbd>8</kbd> on Mac or <kbd>compose</kbd>-<kbd>.</kbd>-<kbd>=</kbd> elsewhere. Or if you prefer, just use the `matmul()` function we've seen already."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "⊞ matmul:\r\n",
      " [[ 7.0, 10.0],\r\n",
      " [15.0, 22.0]]\r\n",
      "\r\n",
      "⊞ again:\r\n",
      " [[ 7.0, 10.0],\r\n",
      " [15.0, 22.0]]\r\n"
     ]
    }
   ],
   "source": [
    "print(\"⊞ matmul:\\n\",  matmul(small, small))\n",
    "print(\"\\n⊞ again:\\n\", small • small)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reshaping** works the way you'd expect:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1.0, 2.0, 3.0],\r\n",
      " [4.0, 5.0, 6.0],\r\n",
      " [7.0, 8.0, 9.0]]\r\n"
     ]
    }
   ],
   "source": [
    "var m = Tensor([1.0, 2, 3, 4, 5, 6, 7, 8, 9]).reshaped(to: [3, 3])\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have the basic mathematical functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16.881943016134134\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sqrt((m * m).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Elementwise ops and comparisons"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Standard math operators (`+`,`-`,`*`,`/`) are all element-wise, and there are a bunch of standard math functions like `sqrt` and `pow`.  Here are some examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "▿ 2 elements\n",
       "  - .0 : [10.0,  6.0, -4.0]\n",
       "  - .1 : [2.0, 8.0, 7.0]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "var a = Tensor([10.0, 6, -4])\n",
    "var b = Tensor([2.0, 8, 7])\n",
    "(a,b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "add:   [12.0, 14.0,  3.0]\r\n",
      "mul:   [ 20.0,  48.0, -28.0]\r\n",
      "sqrt:  [3.1622776601683795,  2.449489742783178,               -nan]\r\n",
      "pow:   [             100.0, 1679616.0000000002,           -16384.0]\r\n"
     ]
    }
   ],
   "source": [
    "print(\"add:  \", a + b)\n",
    "print(\"mul:  \", a * b)\n",
    "print(\"sqrt: \", sqrt(a))\n",
    "print(\"pow:  \", pow(a, b))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**Comparison operators** (`>`,`<`,`==`,`!=`,...) in Swift are supposed to return a single `Bool` value, so they are `true` if all the elements of the tensors satisfy the comparison.\n",
    "\n",
    "Elementwise versions have the `.` prefix, which is read as \"pointwise\": `.>`, `.<`, `.==`, etc.  You can merge a tensor of bools into a single Bool with the `any()` and `all()` methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "false\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a < b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[false,  true,  true]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a .< b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "false\r\n"
     ]
    }
   ],
   "source": [
    "print((a .> 0).all())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "true\r\n"
     ]
    }
   ],
   "source": [
    "print((a .> 0).any())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Broadcasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Broadcasting with a scalar works just like in Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var a = Tensor([10.0, 6.0, -4.0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[11.0,  7.0, -3.0]\r\n"
     ]
    }
   ],
   "source": [
    "print(a+1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[ 2.0,  4.0,  6.0],\n",
       " [ 8.0, 10.0, 12.0],\n",
       " [14.0, 16.0, 18.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "2 * m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Broadcasting a vector with a matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "let c = Tensor([10.0,20.0,30.0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, broadcasting is done by adding 1 dimensions to the beginning until dimensions of both objects match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[11.0, 22.0, 33.0],\n",
       " [14.0, 25.0, 36.0],\n",
       " [17.0, 28.0, 39.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m + c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[11.0, 22.0, 33.0],\n",
       " [14.0, 25.0, 36.0],\n",
       " [17.0, 28.0, 39.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c + m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To broadcast on the other dimensions, one has to use `expandingShape` to add the dimension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[11.0, 12.0, 13.0],\n",
       " [24.0, 25.0, 26.0],\n",
       " [37.0, 38.0, 39.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m + c.expandingShape(at: 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[10.0],\n",
       " [20.0],\n",
       " [30.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.expandingShape(at: 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Broadcasting rules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 3]\r\n"
     ]
    }
   ],
   "source": [
    "print(c.expandingShape(at: 0).shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[3, 1]\r\n"
     ]
    }
   ],
   "source": [
    "print(c.expandingShape(at: 1).shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[100.0, 200.0, 300.0],\n",
       " [200.0, 400.0, 600.0],\n",
       " [300.0, 600.0, 900.0]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.expandingShape(at: 0) * c.expandingShape(at: 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[false,  true,  true],\n",
       " [false, false,  true],\n",
       " [false, false, false]]\n"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.expandingShape(at: 0) .> c.expandingShape(at: 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Matmul using `Tensor`\n",
    "\n",
    "Coming back to our matmul algorithm, we can implement exactly what we had before by using subscripting into a tensor, instead of subscripting into an array.  Let's see how that works:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "func tensorMatmul(_ a: Tensor<Float>, _ b: Tensor<Float>) -> Tensor<Float> {\n",
    "    var res = Tensor<Float>(zeros: [a.shape[0], b.shape[1]])\n",
    "\n",
    "    for i in 0 ..< a.shape[0] {\n",
    "        for j in 0 ..< b.shape[1] {\n",
    "            for k in 0 ..< a.shape[1] {\n",
    "                res[i, j] += a[i, k] * b[k, j]\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "    return res\n",
    "}\n",
    "\n",
    "_ = tensorMatmul(m1, m2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 5100.505422 ms,   min: 5100.505422 ms,   max: 5100.505422 ms\r\n"
     ]
    }
   ],
   "source": [
    "time { \n",
    "    let tmp = tensorMatmul(m1, m2)\n",
    "    \n",
    "    // Copy a scalar back to the host to force a GPU sync.\n",
    "    _ = tmp[0, 0].scalar\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What, what just happened?? We used to be less than a **tenth of a millisecond**, now we're taking **multiple seconds**.  It turns out that Tensor's are very good at bulk data processing, but they are not good at doing one float at a time.  Make sure to use the coarse-grained operations.  We can make this faster by vectorizing each loop in turn.\n",
    "\n",
    "**Slides:** [Granularity of Tensor Operations](https://docs.google.com/presentation/d/1dc6o2o-uYGnJeCeyvgsgyk05dBMneArxdICW5vF75oU/edit#slide=id.g58253914c1_0_380). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectorize the inner loop into a multiply + sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "func elementWiseMatmul(_ a:Tensor<Float>, _ b:Tensor<Float>) -> Tensor<Float>{\n",
    "    let (ar, ac) = (a.shape[0], a.shape[1])\n",
    "    let (br, bc) = (b.shape[0], b.shape[1])\n",
    "    var res = Tensor<Float>(zeros: [ac, br])\n",
    "    \n",
    "    for i in 0 ..< ar {\n",
    "        let row = a[i]\n",
    "        for j in 0 ..< bc {\n",
    "            res[i, j] = (row * b.slice(lowerBounds: [0,j], upperBounds: [ac,j+1]).squeezingShape(at: 1)).sum()\n",
    "        }\n",
    "    }\n",
    "    return res\n",
    "}\n",
    "\n",
    "_ = elementWiseMatmul(m1, m2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 361.954593 ms,   min: 361.954593 ms,   max: 361.954593 ms\r\n"
     ]
    }
   ],
   "source": [
    "time { \n",
    "    let tmp = elementWiseMatmul(m1, m2)\n",
    "\n",
    "    // Copy a scalar back to the host to force a GPU sync.\n",
    "    _ = tmp[0, 0].scalar\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectorize the inner two loops with broadcasting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "func broadcastMatmult(_ a:Tensor<Float>, _ b:Tensor<Float>) -> Tensor<Float>{\n",
    "    var res = Tensor<Float>(zeros: [a.shape[0], b.shape[1]])\n",
    "    for i in 0..<a.shape[0] {\n",
    "        res[i] = (a[i].expandingShape(at: 1) * b).sum(squeezingAxes: 0)\n",
    "    }\n",
    "    return res\n",
    "}\n",
    "\n",
    "_ = broadcastMatmult(m1, m2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 0.70941021 ms,   min: 0.61099 ms,   max: 1.239529 ms\r\n"
     ]
    }
   ],
   "source": [
    "time(repeating: 100) {\n",
    "    let tmp = broadcastMatmult(m1, m2)\n",
    "\n",
    "    // Copy a scalar back to the host to force a GPU sync.\n",
    "    _ = tmp[0, 0].scalar\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vectorize the whole thing with one Tensorflow op"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average: 0.0204796 ms,   min: 0.018915 ms,   max: 0.040889 ms\r\n"
     ]
    }
   ],
   "source": [
    "time(repeating: 100) { _ = m1 • m2 }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, now that we have matmul, we can continue to build out our framework.\n",
    "\n",
    "To complete today's lesson, let's jump way way up the stack to see **Workbook 11**.\n",
    "\n",
    "\n",
    "\n",
    "## Tensorflow vectorizes, parallelizes, and scales\n",
    "\n",
    "The reason that TensorFlow works in practice is that it can scale way up to large matrices, for example, lets try some thing a bit larger:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "1x1:\n",
      "  ⏰average: 0.18814999999999998 ms,   min: 0.143182 ms,   max: 0.31991 ms\n",
      "\n",
      "10x10:\n",
      "  ⏰average: 0.1964562 ms,   min: 0.168568 ms,   max: 0.306457 ms\n",
      "\n",
      "100x100:\n",
      "  ⏰average: 0.18293589999999998 ms,   min: 0.155465 ms,   max: 0.289976 ms\n",
      "\n",
      "1000x1000:\n",
      "  ⏰average: 0.36083699999999996 ms,   min: 0.336118 ms,   max: 0.413469 ms\n",
      "\n",
      "5000x5000:\n",
      "  ⏰average: 20.0013405 ms,   min: 19.084045 ms,   max: 20.446258 ms\n"
     ]
    }
   ],
   "source": [
    "func timeMatmulTensor(size: Int) {\n",
    "    var matrix = Tensor<Float>(randomNormal: [size, size])\n",
    "    print(\"\\n\\(size)x\\(size):\\n  ⏰\", terminator: \"\")\n",
    "    time(repeating: 10) { \n",
    "        let matrix = matrix • matrix \n",
    "        _ = matrix[0, 0].scalar\n",
    "    }\n",
    "}\n",
    "\n",
    "timeMatmulTensor(size: 1)     // Tiny\n",
    "timeMatmulTensor(size: 10)    // Bigger\n",
    "timeMatmulTensor(size: 100)   // Even Bigger\n",
    "timeMatmulTensor(size: 1000)  // Biggerest\n",
    "timeMatmulTensor(size: 5000)  // Even Biggerest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In constrast, our simple CPU implementation takes a lot longer to do the same work.  For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "1x1:\n",
      "  ⏰average: 0.0002532 ms,   min: 0.00022 ms,   max: 0.000383 ms\n",
      "\n",
      "10x10:\n",
      "  ⏰average: 0.0032415 ms,   min: 0.003208 ms,   max: 0.003293 ms\n",
      "\n",
      "100x100:\n",
      "  ⏰average: 1.4835639 ms,   min: 1.38322 ms,   max: 1.565449 ms\n",
      "\n",
      "1000x1000:\n",
      "  ⏰average: 1698.665057 ms,   min: 1698.665057 ms,   max: 1698.665057 ms\n",
      "\n",
      "5000x5000: skipped, it takes tooo long!\n"
     ]
    }
   ],
   "source": [
    "func timeMatmulSwift(size: Int, repetitions: Int = 10) {\n",
    "    var matrix = Tensor<Float>(randomNormal: [size, size])\n",
    "    let matrixFlatArray = matrix.scalars\n",
    "\n",
    "    print(\"\\n\\(size)x\\(size):\\n  ⏰\", terminator: \"\")\n",
    "    time(repeating: repetitions) { \n",
    "       _ = swiftMatmulUnsafe(a: matrixFlatArray, b: matrixFlatArray, aDims: (size,size), bDims: (size,size))\n",
    "    }\n",
    "}\n",
    "\n",
    "timeMatmulSwift(size: 1)     // Tiny\n",
    "timeMatmulSwift(size: 10)    // Bigger\n",
    "timeMatmulSwift(size: 100)   // Even Bigger\n",
    "timeMatmulSwift(size: 1000, repetitions: 1)  // Biggerest\n",
    "\n",
    "print(\"\\n5000x5000: skipped, it takes tooo long!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why is TensorFlow *so so so* much faster than our CPU implementation?  Well there are two reasons: the first of which is that it uses GPU hardware, which is much faster for math like this.  That said, there are a ton of tricks (involving memory hierarchies, cache blocking, and other tricks) that make matrix multiplications go fast on CPUs and other hardware.\n",
    "\n",
    "For example, try using TensorFlow on the CPU to do the same computation as above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "1x1:\n",
      "  ⏰average: 0.030079900000000003 ms,   min: 0.02664 ms,   max: 0.040804 ms\n",
      "\n",
      "10x10:\n",
      "  ⏰average: 0.0340013 ms,   min: 0.031712 ms,   max: 0.044604 ms\n",
      "\n",
      "100x100:\n",
      "  ⏰average: 0.115587 ms,   min: 0.092782 ms,   max: 0.133166 ms\n",
      "\n",
      "1000x1000:\n",
      "  ⏰average: 6.0490596 ms,   min: 5.553612 ms,   max: 6.476949 ms\n",
      "\n",
      "5000x5000:\n",
      "  ⏰average: 852.6646403999999 ms,   min: 651.327758 ms,   max: 927.424344 ms\n"
     ]
    }
   ],
   "source": [
    "withDevice(.cpu) {\n",
    "    timeMatmulTensor(size: 1)     // Tiny\n",
    "    timeMatmulTensor(size: 10)    // Bigger\n",
    "    timeMatmulTensor(size: 100)   // Even Bigger\n",
    "    timeMatmulTensor(size: 1000)  // Biggerest\n",
    "    timeMatmulTensor(size: 5000)  // Even Biggerest\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a pretty big difference.  On my hardware, it takes 2287ms for Swift to do a 1000x1000 multiply on the CPU, it takes TensorFlow 6.7ms to do the same work on the CPU, and takes TensorFlow 0.49ms to do it on a GPU.\n",
    "\n",
    "# Hardware Accelerators vs Flexibility\n",
    "\n",
    "One of the big challenges with machine learning frameworks today is that they provide a fixed set of \"ops\" that you can use with high performance.  There is a lot of work underway to fix this.  The [XLA compiler in TensorFlow](https://www.tensorflow.org/xla) is an important piece of this, which allows more flexibility in the programming model while still providing high performance by using compilers to target the hardware accelerator.  If you're interested in the details, there is a [great video by the creator of Halide](https://www.youtube.com/watch?v=3uiEyEKji0M) explaining why this is challenging.\n",
    "\n",
    "TensorFlow internals are undergoing [significant changes (slide)](https://docs.google.com/presentation/d/1dc6o2o-uYGnJeCeyvgsgyk05dBMneArxdICW5vF75oU/edit#slide=id.g58253914c1_3_0) including the introduction of the XLA compiler, and the introduction of [MLIR compiler technology](https://github.com/tensorflow/mlir).\n",
    "\n",
    "\n",
    "# Tensor internals and Raw TensorFlow operations\n",
    "\n",
    "TensorFlow provides hundreds of different operators, and they sort of grew organically over time.  This means that there are some deprecated operators, they aren't particularly consistent, and there are other oddities.  As such, the `Tensor` type provides a curated set of these operators as methods.\n",
    "\n",
    "Whereas `Int` and `Float` are syntactic sugar for LLVM, and `PythonObject` is syntactic sugar for the Python interpreter, `Tensor` ends up being syntactic sugar for the TensorFlow operator set.  You can dive in and see its implementation in Swift in [the S4TF `TensorFlow` module](https://github.com/apple/swift/blob/tensorflow/stdlib/public/TensorFlow/Tensor.swift), e.g.:\n",
    "\n",
    "```swift\n",
    "public struct Tensor<Scalar : TensorFlowScalar> : TensorProtocol {\n",
    "  /// The underlying `TensorHandle`.\n",
    "  /// - Note: `handle` is public to allow user defined ops, but should not\n",
    "  /// normally be used otherwise.\n",
    "  public let handle: TensorHandle<Scalar>\n",
    "  ... \n",
    "}\n",
    "```\n",
    "\n",
    "Here we see the internal implementation details of `Tensor`, which stores a `TensorHandle` - the internal implementation detail of the TensorFlow Eager runtime.\n",
    "\n",
    "Methods are defined on Tensor just like you'd expect, here [is the basic addition operator](https://github.com/apple/swift/blob/tensorflow/stdlib/public/TensorFlow/Ops.swift#L88), defined over all numeric tensors (i.e., not tensors of `Bool`):\n",
    "\n",
    "```swift\n",
    "extension Tensor : AdditiveArithmetic where Scalar : Numeric {\n",
    "  /// Adds two tensors and produces their sum.\n",
    "  /// - Note: `+` supports broadcasting.\n",
    "  public static func + (lhs: Tensor, rhs: Tensor) -> Tensor {\n",
    "    return Raw.add(lhs, rhs)\n",
    "  }\n",
    "}\n",
    "```\n",
    "\n",
    "But wait, what is this Raw thing?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw TensorFlow ops\n",
    "\n",
    "TensorFlow has a database of the operators it defines, which gets encoded into a [protocol buffer](https://developers.google.com/protocol-buffers/).  From this protobuf, *all* of the operators automatically get a Raw operator (implemented in terms of a lower level `#tfop` primitive).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0, 0.0, 0.0]\r\n"
     ]
    }
   ],
   "source": [
    "// Explore the contents of the Raw namespace by typing Raw.<tab>\n",
    "print(Raw.zerosLike(c))\n",
    "\n",
    "// Raw."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is an [entire tutorial on Raw operators](https://colab.research.google.com/github/tensorflow/swift/blob/master/docs/site/tutorials/raw_tensorflow_operators.ipynb) on github/TensorFlow/swift.  The key thing to know is that TensorFlow can do almost anything, so if there is no obvious method on `Tensor` to do what you need it is worth checking out the tutorial to see how to do this.\n",
    "\n",
    "As one example, later parts of the tutorial need the ability to load files and decode JPEGs.  Swift for TensorFlow doesn't have these as methods on `StringTensor` yet, but we can add them like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "//export\n",
    "public extension StringTensor {\n",
    "    // Read a file into a Tensor.\n",
    "    init(readFile filename: String) {\n",
    "        self.init(readFile: StringTensor(filename))\n",
    "    }\n",
    "    init(readFile filename: StringTensor) {\n",
    "        self = Raw.readFile(filename: filename)\n",
    "    }\n",
    "\n",
    "    // Decode a StringTensor holding a JPEG file into a Tensor<UInt8>.\n",
    "    func decodeJpeg(channels: Int = 0) -> Tensor<UInt8> {\n",
    "        return Raw.decodeJpeg(contents: self, channels: Int64(channels), dctMethod: \"\") \n",
    "    }\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Export"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "success\r\n"
     ]
    }
   ],
   "source": [
    "import NotebookExport\n",
    "let exporter = NotebookExport(Path.cwd/\"01_matmul.ipynb\")\n",
    "print(exporter.export(usingPrefix: \"FastaiNotebook_\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Swift",
   "language": "swift",
   "name": "swift"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}