{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "4972f311d33e889babafe6f6e44edc5f",
     "grade": false,
     "grade_id": "cell-8115527bd0e08e63",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "# Part 4: Using GPU acceleration with PyTorch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "8ef6029eb23fe884594de09e1cd97769",
     "grade": false,
     "grade_id": "cell-2e8abb75fa5d4222",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "# Execute this code block to install dependencies when running on colab\n",
    "try:\n",
    "    import torch\n",
    "except:\n",
    "    from os.path import exists\n",
    "    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag\n",
    "    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())\n",
    "    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\\.\\([0-9]*\\)\\.\\([0-9]*\\)$/cu\\1\\2/'\n",
    "    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'\n",
    "\n",
    "    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision\n",
    "\n",
    "try: \n",
    "    import torchbearer\n",
    "except:\n",
    "    !pip install torchbearer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "5ad3d98e7a666f0550c48db3d40c9fb6",
     "grade": false,
     "grade_id": "cell-56a116e085aef83c",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Manual use of `.cuda()`\n",
    "\n",
    "Now the magic of PyTorch comes in. So far, we've only been using the CPU to do computation. When we want to scale to a bigger problem, that won't be feasible for very long.\n",
    "|\n",
    "PyTorch makes it really easy to use the GPU for accelerating computation. Consider the following code that computes the element-wise product of two large matrices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "t1 = torch.randn(1000, 1000)\n",
    "t2 = torch.randn(1000, 1000)\n",
    "t3 = t1*t2\n",
    "print(t3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "6af792ab02ecca981f5c12685789f471",
     "grade": false,
     "grade_id": "cell-6849616c01cf9f25",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "By sending all the tensors that we are using to the GPU, all the operations on them will also run on the GPU without having to change anything else. If you're running a non-cuda enabled version of PyTorch the following will throw an error; if you have cuda available the following will create the input matrices, copy them to the GPU and perform the multiplication on the GPU itself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t1 = torch.randn(1000, 1000).cuda()\n",
    "t2 = torch.randn(1000, 1000).cuda()\n",
    "t3 = t1*t2\n",
    "print(t3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "c771877e9beb32f8a49585373534dad1",
     "grade": false,
     "grade_id": "cell-2bca345a3c01999c",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "If you're running this workbook in colab, now enable GPU acceleration (`Runtime->Runtime Type` and add a `GPU` in the hardware accelerator pull-down). You'll then need to re-run all cells to this point.\n",
    "\n",
    "If you were able to run the above with hardware acceleration, the print-out of the result tensor would show that it was an instance of `cuda.FloatTensor` type on the the `(GPU 0)` GPU device. If your wanted to copy the tensor back to the CPU, you would use the `.cpu()` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "c064698f174abe4509b23c34a7912f44",
     "grade": false,
     "grade_id": "cell-e308141abb8d0e0c",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Writing platform agnostic code\n",
    "\n",
    "Most of the time you'd like to write code that is device agnostic; that is it will run on a GPU if one is available, and otherwise it would fall back to the CPU. The recommended way to do this is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
    "t1 = torch.randn(1000, 1000).to(device)\n",
    "t2 = torch.randn(1000, 1000).to(device)\n",
    "t3 = t1*t2\n",
    "print(t3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "b31bddc27c2bdeb593ee2498dfbd7e10",
     "grade": false,
     "grade_id": "cell-24f03f8a35648313",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Accelerating neural net training\n",
    "\n",
    "If you wanted to accelerate the training of a neural net using raw PyTorch, you would have to copy both the model and the training data to the GPU. Unless you were using a really small dataset like MNIST, you would typically _stream_ the batches of training data to the GPU as you used them in the training loop:\n",
    "\n",
    "```python\n",
    "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
    "model = BaselineModel(784, 784, 10).to(device)\n",
    "\n",
    "loss_function = ...\n",
    "optimiser = ...\n",
    "\n",
    "for epoch in range(10):\n",
    "    for data in trainloader:\n",
    "        inputs, labels = data\n",
    "        inputs, labels = inputs.to(device), labels.to(device)\n",
    "\n",
    "        optimiser.zero_grad()\n",
    "        outputs = model(inputs)\n",
    "        loss = loss_function(outputs, labels)\n",
    "        loss.backward()\n",
    "        optimiser.step()\n",
    "```\n",
    "\n",
    "Using Torchbearer, this becomes much simpler - you just tell the `Trial` to run on the GPU and that's it!:\n",
    "\n",
    "```python\n",
    "model = BetterCNN()\n",
    "\n",
    "loss_function = ...\n",
    "optimiser = ...\n",
    "\n",
    "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
    "trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy']).to(device)\n",
    "trial.with_generators(trainloader)\n",
    "trial.run(epochs=10)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "064768c85834aa40d82396f9f3cccfac",
     "grade": false,
     "grade_id": "cell-cf0444554770e817",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Multiple GPUs\n",
    "\n",
    "Using multiple GPUs is beyond the scope of the lab, but if you have multiple cuda devices, they can be referred to by index: `cuda:0`, `cuda:1`, `cuda:2`, etc. You have to be careful not to mix operations on different devices, and would need how to carefully orchestrate moving of data between the devices (which can really slow down your code to the point at which using the CPU would actually be faster)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "d698b1536be57d852780abaf08fcad68",
     "grade": false,
     "grade_id": "cell-f0f058c0af885275",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Questions\n",
    "\n",
    "__Answer the following questions (enter the answer in the box below each one):__\n",
    "\n",
    "__1.__ What features of GPUs allow them to perform computations faster than a typically CPU?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "b1dc585b11c0e499f09e409017b6cb06",
     "grade": true,
     "grade_id": "cell-76fcc457388a8223",
     "locked": false,
     "points": 1,
     "schema_version": 1,
     "solution": true
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "374801c6d757e48f93fe93618435c685",
     "grade": false,
     "grade_id": "cell-68f765cc2155e517",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "__2.__ What is the biggest limiting factor for training large models with current generation GPUs?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "964d60951aa496908969f2ca35be3104",
     "grade": true,
     "grade_id": "cell-a147457afc2c4c40",
     "locked": false,
     "points": 1,
     "schema_version": 1,
     "solution": true
    }
   },
   "source": [
    "YOUR ANSWER HERE"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}