{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hardware Ecosystem\n", "\n", "### [Nic Lane](http://niclane.org/)\n", "\n", "### 2022-02-03" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: This lecture will look at the changes in hardware that\n", "enabled neural networks to be efficient and how neural network models\n", "are deployed on hardware." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "## DeepNN\n", "\n", "## Plan for the Day\n", "\n", "- **Introduction**\n", " - **How did we get here?**\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Hardware at Deep Learning's birth\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "

\n", "\n", "New York Times (1958)\n", "\n", "

\n", "
\n", "
\n", "
\n", "

\n", "\n", "Eniac, 1950s SoTA Hardware\n", "\n", "

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## How did we get here? Deep Learning requires *peta* FLOPS\n", "\n", "
\n", "\n", "0.01 PFLOP (left) = $10^{13}$ FLOPS (right)\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Our World in\n", "Data](https://ourworldindata.org/technological-progress)\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- **Hardware Foundation**\n", " - **Internal organisation of processors**\n", " - **A typical organisation of a DL system**\n", " - **Two pillars: Data Movement & Parallelism**\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Internal Organisation of Processors\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "## Central Processing Unit (CPU)\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "- General-purpose processor (in use since mid-1950s)\n", "- CPU is composed of cores, each of which consists of several threads.\n", "- Example high-end performance:\n", " - AMD Ryzen 9 5950X\n", " - No. Cores:    **16**\n", " - No. Threads:   **32**\n", " - Clock speed:   **3.4GHz**, boost up to **4.9GHz**\n", " - L2 cache:     **8 MB**\n", " - L3 cache:     **64 MB**\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "``` python\n", "from custom_imports import *\n", "\n", "our_custom_net = BasicFCModel()\n", "our_custom_net.cpu()\n", "# OR\n", "device = torch.device('cpu')\n", "our_custom_net.to(device)\n", "```\n", "\n", "## Graphics Processing Unit\n", "\n", "\n", "\n", "`\n", " `{=html} ` `{=html} ` `{=html}
\n", "\n", "- Parallelism-exploiting Accelerator\n", "- Originally used for graphics processing (in use since 1970s)\n", "- GPU is composed of a large number of threads organised into blocks\n", " (cores)\n", "- Example high-end performance:\n", " - NVIDIA GEFORCE RTX 3090\n", " - No. Threads:   **10496**\n", " - Clock speed:   **1.4GHz**, boost up to **1.7GHz**\n", " - L2 cache:     **24 GB** `{=html} `\n", " \n", " `{=html}
`\n", "\n", "``` python\n", "if torch.cuda.is_available():\n", " our_custom_net.cuda()\n", " # OR\n", " device = torch.device('cuda:0')\n", " our_custom_net.to(device)\n", "# Remember to do the same for all inputs to the network\n", "```\n", "\n", "## Graphics Processing Unit\n", "\n", "\n", "\n", "` `{=html} ` `{=html} ` `{=html}
\n", "\n", "- Register (per thread)\n", " - An automatic variable in kernel function\n", " - Low latency, high bandwidth\n", "- Local Memory (per thread)\n", " - Variable in a kernel but can not be fitted in register\n", "- Shared Memory (between thread blocks)\n", " - All threads faster than local and global memory\n", " - Use for inter-thread communication\n", " - physically shared with L1 cache\n", "- Constant memory\n", " - Per Device Read-only memory\n", "- Texture Memory\n", " - Per SM, read-only cache, optimized for 2D spatial locality\n", "- Global Memory `{=html} `\n", " \n", " `{=html}
`\n", "\n", "## A typical organisation of a DL system\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "- Processors\n", " - CPU sits at the centre of the system\n", "- **Accelerators**\n", " - GPUs, TPUs, Eyeriss, other specialised\n", " - Specialised hardware can be designed with exploiting\n", " **parallelism** in mind\n", "- **Memory hierarchy**\n", " - Caches - smallest and fastest\n", " - Random Access Memory (RAM) - largest and slowest\n", " - Disk / SSD - storage\n", " - Stores the dataset; in crisis it supplements RAM up to Swap\n", " - **Bandwidth** can be serious a bottleneck\n", " - System, memory, and I/O buses\n", " - Closer to processor - faster\n", " - Designed to transport fixed-size data chunks\n", " - Word size is a key system parameter 4 bytes (32 bit) or 8\n", " bytes (64 bit) \n", " - Auxiliary hardware\n", " - Mouse, keyboard, display\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## Data Movement & Parallelism\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Memory and bandwidth: memory hierarchy\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Memory and bandwidth: data movement\n", "\n", "- Energy and latency are commensurate\n", "- Accessing RAM is 3 to 4 orders of magnitude slower than\n", " executing MAC\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Processor comparison based on memory and bandwidth\n", "\n", "\n", "\n", "- CPU has faster I/O bus than GPU, but it has lower bandwidth than\n", " GPU. CPU can fetch small pieces of data very fast, GPU fetches\n", " them slower but in bulk.\n", "- GPU has more lower-level memory than CPU. Even though each\n", " individual thread and thread block have less memory than the\n", " CPU threads and cores do, there are just so much more threads in\n", " the GPU that **taken as a whole** they have much\n", " more lower-level memory. This is memory inversion.\n", "\n", "## The case for parallelism - Moore's law is slowing down\n", "\n", "- *Moore's law fuelled the prosperity of the past 50 years.*\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Our World in\n", "Data](https://ourworldindata.org/technological-progress)\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- **Parallelism Leveraging**\n", " - **Parallelism in Deep Learning**\n", " - **Leveraging Deep Learning parallelism**\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## The case for parallelism - Moore’s law is slowing down\n", "\n", "- *As it slows, programmers and hardware designers are searching for\n", " alternative drivers of performance growth.*\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Karl\n", "Rupp](https://github.com/karlrupp/microprocessor-trend-data)\n", "\n", "## Processor comparison based on parallelism\n", "\n", "``` python\n", "print(\"CPU matrix multiplication\")\n", "a, b = [torch.rand(2**10, 2**10) for _ in range(2)]\n", "start = time()\n", "a * b\n", "print(f'CPU took {time() - start} seconds')\n", "print(\"GPU matrix multiplication\")\n", "start = time()\n", "a * b\n", "print(f'GPU took {time() - start} seconds')\n", "```\n", "\n", " CPU matrix multiplication\n", " CPU took 0.0005156993865966797 seconds\n", " GPU matrix multiplication\n", " GPU took 0.0002989768981933594 seconds\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- **Parallelism Leveraging**\n", " - **Parallelism in Deep Learning**\n", " - **Leveraging Deep Learning parallelism**\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Parallelism in Deep Learning training\n", "\n", "- Minibatch model update:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- where $\\theta^{s}_{l,i}$ is an $i$'s parameter at layer $l$ value at\n", " step $s$ of the training process; $r$ is the learning rate; $B$ is\n", " the batch size; and $g^{s}_{l,i,b}$ is the $s$-th training step\n", " gradient coming from $b$-th training example for parameter update of\n", " $i$-th parameter at layer $l$.\n", "\n", "## DL parallelism: parallelize backprop through an example\n", "\n", "- The matrix multiplications in the forward and backward passes can be\n", " parallelized:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Fast inference is unthinkable without parallel matrix\n", " multiplication.\n", "- Frequent synchronization is needed - at each layer the parallel\n", " threads need to sync up.\n", "\n", "## DL parallelism: parallelize gradient sample computation\n", "\n", "- Gradients for individual training examples can be computed in\n", " parallel:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Synchronization is needed only at the point where we sum the\n", " individual gradients across the batch.\n", "\n", "## DL parallelism: parallelize update iterations\n", "\n", "- Gradient updates from separate batches can be computed in parallel:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Imagine computing $N$ batches at the same time in parallel:\n", " - We can see this as using $N-1$ outdated gradient when making\n", " update based on the second batch.\n", " - We can see this as using $N$ gradient estimates in place of the\n", " usual $1$ that SGD is based on.\n", "\n", "## DL parallelism: parallelize the training of multiple models\n", "\n", "- In the course of solving a given DL problem one would often train\n", " competing models because of:\n", " - The choice of hyperparameters such as architecture,\n", " initialization, dropout and learning rates, regularization, ...\n", " - The desire to build a model ensemble.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- The models are independent of each other and thus can be computed in\n", " parallel.\n", "\n", "## Leveraging Deep Learning parallelism\n", "\n", "- CPUs, GPUs, multi-GPU, and multi-machine each offer unique\n", " opportunities to leverage the four sources of parallelism.\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "## CPU training\n", "\n", "The most a CPU can do for this setup is to:\n", "\n", "- Run through the batches *sequentially*\n", "- Run through the model *sequentially*\n", "- Run through the batch *sequentially*\n", "- Potentially, *parallelize* each layer computation between its cores\n", " - Best case scenario: individual cores can tackle separate nodes /\n", " channels as these are independent of each other\n", "- *Parallelize* matrix multiplication\n", " - Best case scenario: Matrix multiplication is split between\n", " separate cores and threads. The degree of parallelism is,\n", " however, negligent.\n", "\n", "## CPU training\n", "\n", "Consequently:\n", "\n", "- Overpowered CPU threads are scrambling to juggle the many nodes /\n", " channels they need to compute.\n", "- The CPU is slowed down considerably by the fact that it needs to\n", " access its own L3 cache many more times than the GPU would, due to\n", " its lower memory access bandwidth.\n", "\n", "``` python\n", "print(\"CPU training code\")\n", "print(\"CPU training of the above-defined model short example of how long it takes\")\n", "our_custom_net.cpu()\n", "start = time()\n", "train(lenet, MNIST_trainloader)\n", "print(f'CPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " CPU training code\n", " CPU training of the above-defined model short example of how long it takes\n", "\n", " Epoch 1, iter 469, loss 1.980: : 469it [00:02, 181.77it/s]\n", " Epoch 2, iter 469, loss 0.932: : 469it [00:02, 182.58it/s]\n", "\n", " CPU took 5.22 seconds\n", "\n", "## GPU training\n", "\n", "The GPU, on the other hand, can:\n", "\n", "- Run through the batches *sequentially*\n", "- Run through the model *sequentially*\n", " - The model and the batch size just fit once in the memory of the\n", " GPU we chose.\n", "- In the best case scenario run through the training examples in a\n", " batch in *parallel*\n", " - For most GPUs the computation is, however, *sequential* if their\n", " memory is not big enough to hold the entire batch of training\n", " examples.\n", "- *Parallelize* each layer computation between its cores\n", " - Groups of several cores are assigned to separate network layers\n", " / channels. Cores in the group need not be physically close to\n", " each other.\n", "- *Parallelize* matrix multiplication\n", " - The matrix multiplication needed to compute a given node /\n", " channel is split between the threads in the group that was\n", " assigned to it. 
Each thread computes separate sector of the\n", " input.\n", "\n", "## GPU training\n", "\n", "Consequently:\n", "\n", "- GPU cores are engaged at all times as they sequentially push through\n", " the training examples all at the same time.\n", "- All threads need to sync-up at the end of each layer computation so\n", " that their outputs can become the inputs to the next layer.\n", "\n", "``` python\n", "print(\"GPU training\")\n", "print(\"GPU training of the same example as in CPU\")\n", "lenet.cuda()\n", "\n", "batch_size = 512\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " GPU training\n", " GPU training of the same example as in CPU\n", "\n", " Epoch 1, iter 118, iter loss 0.786: : 118it [00:02, 52.62it/s]\n", " Epoch 2, iter 118, iter loss 0.760: : 118it [00:02, 57.48it/s]\n", "\n", " GPU took 4.37 seconds\n", "\n", "## GPU parallelism: matrix multiplication example\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "GPU\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*12 thread blocks, each with 16 threads.*\n", "\n", "
\n", "
\n", "
\n", "\n", "Naive implementation\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*Each **thread** reads one row of A, one
column of B and returns one\n", "element of C.*\n", "\n", "
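\n", "A small NumPy sketch of the work a single thread does in the naive kernel (a dataflow illustration only, not GPU code; NumPy is assumed to be available):\n", "\n", "``` python\n", "import numpy as np\n", "\n", "def naive_thread(A, B, row, col):\n", "    # One GPU thread: read one row of A and one column of B,\n", "    # accumulate their dot product, and produce one element of C.\n", "    return sum(A[row, k] * B[k, col] for k in range(A.shape[1]))\n", "\n", "A, B = np.random.rand(3, 4), np.random.rand(4, 2)\n", "C = np.array([[naive_thread(A, B, i, j) for j in range(B.shape[1])]\n", "              for i in range(A.shape[0])])\n", "assert np.allclose(C, A @ B)\n", "```\n", "\n", "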
\n", "
\n", "
\n", "\n", "Shared memory implementation\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*Each **thread block** is computing
one square sub-matrix.*\n", "\n", "
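\n", "A matching sketch of the tiled (shared-memory) scheme, in which each thread block computes one square sub-matrix of C (again a dataflow illustration only; the matrix size is assumed to be divisible by the tile size):\n", "\n", "``` python\n", "import numpy as np\n", "\n", "def tiled_matmul(A, B, tile=2):\n", "    # Each thread block owns one (i, j) tile of C; at every step it loads one\n", "    # tile of A and one tile of B into shared memory and accumulates.\n", "    n = A.shape[0]\n", "    C = np.zeros((n, n))\n", "    for i in range(0, n, tile):\n", "        for j in range(0, n, tile):\n", "            for k in range(0, n, tile):\n", "                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]\n", "    return C\n", "\n", "A, B = np.random.rand(4, 4), np.random.rand(4, 4)\n", "assert np.allclose(tiled_matmul(A, B), A @ B)\n", "```\n", "\n", "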
\n", "
\n", "\n", "## GPU parallelism: matrix multiplication example\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Multi-GPU training\n", "\n", "With multiple GPUs we can choose one of the following:\n", "\n", "- *Distribute* the training examples of a batch between GPUs\n", " - When individual GPUs can not hold the whole batch in the memory,\n", " this can be distributed between multiple cards.\n", " - The computation has to sync-up for each Batch-Norm.\n", "- *Parallelize* the model computation\n", " - Separate layers or groups of layers are handled by separate\n", " GPUs.\n", " - Computation syncs between pairs of GPU cards - as the one's\n", " outputs are the other's inputs.\n", " - This creates a flow-through system that will keep all GPUs busy\n", " at all times during the training.\n", " - Batch is processed sequentially, all GPUs sync up after each\n", " batch - either dead time or outdated gradients.\n", "- *Parallelize* the gradient computation\n", " - Each GPU can be given its own batch if we accept outdated model\n", " in gradient computations.\n", "\n", "## Multi-GPU training\n", "\n", "``` python\n", "print(\"multi-GPU training\")\n", "print(\"GPU training of the same example as in single GPU but with two GPUs\")\n", "our_custom_net_dp = lenet\n", "our_custom_net_dp.cuda()\n", "our_custom_net_dp = nn.DataParallel(our_custom_net_dp, device_ids=[0, 1])\n", "batch_size = 1024\n", "multigpu_trainloader = make_MNIST_loader(batch_size=batch_size)\n", "start = time()\n", "gpu_train(our_custom_net_dp, multigpu_trainloader)\n", "print(f'2 GPUs took {time()-start:.2f} seconds')\n", "```\n", "\n", " multi-GPU training\n", " GPU training of the same example as in single GPU but with two GPUs\n", "\n", " Epoch 1, iter 59, iter loss 0.745: : 59it [00:02, 21.24it/s]\n", " Epoch 2, iter 59, iter loss 0.736: : 59it [00:01, 31.70it/s]\n", "\n", " 2 GPUs took 4.72 seconds\n", "\n", "## Multi-Machine training\n", "\n", "In principle the same options as in multi-GPU:\n", "\n", "- *Distribute* the training examples of a batch between GPUs\n", " - This is rarely if ever needed on the scale of multi-Machine\n", "- *Parallelize* the model computation\n", " - Same principles as in multi-GPU.\n", "- *Parallelize* the gradient computation\n", " - Same principles as in multi-GPU.\n", "\n", "In practice we would either take advantage of the latter two. In extreme\n", "examples one might do a combination of multiple options.\n", "\n", "## Parallelism summary: model and data parallelism\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## Parallelism bottlenecks: Synchronization & Communication\n", "\n", "\n", "\n", "- DL-training hardware needs to synchronize and communicate very\n", " frequently\n", " - Model nodes are heavily interconnected at each model layer.\n", " - Data nodes are interconnected by batch-norm-style layers\n", " - Data nodes are interconnected at gradient computation\n", "- This communication occurs between\n", " - Threads in a core (CPU and GPU)\n", " - Cores within a chip\n", " - Pieces of hardware *example: SLI bridge is a connector and a\n", " protocol for such a communication*\n", "\n", "## Bottlenecks beyond parallelism\n", "\n", "- DL training and inference do not take place solely on the\n", " accelerator.\n", " - The accelerator accelerates the gradient computations and\n", " updates.\n", " - The CPU will still need to be loading the data (model, train\n", " set) and saving the model (checkpointing).\n", "- The accelerator starves if it waits idly for its inputs due for\n", " example to slow CPU, I/O buses, or storage interface (SATA, SSD,\n", " NVMe).\n", "\n", "``` python\n", "print(\"starving GPUs\")\n", "print(\"show in-code what starving GPU looks like\")\n", "# Deliberately slow down data flow into the gpu \n", "# Do you have any suggestions how to do this in a more realistic way than just to force waiting?\n", "print('Using only 1 worker for the dataloader, the time the GPU takes increases.')\n", "lenet.cuda()\n", "batch_size = 64\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " starving GPUs\n", " show in-code what starving GPU looks like\n", " Using only 1 worker for the dataloader, the time the GPU takes increases.\n", "\n", " Epoch 1, iter 938, iter loss 0.699: : 938it [00:04, 214.02it/s]\n", " Epoch 2, iter 938, iter loss 0.619: : 938it [00:04, 208.96it/s]\n", "\n", " GPU took 8.92 seconds\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- **Data Movement and Bandwidth Pressures**\n", " - **Deep Learning working set**\n", " - **Mapping Deep Learning onto hardware**\n", " - **Addressing memory pressure**\n", "- Closing messages\n", "\n", "## Deep Learning resource characterisation\n", "\n", "``` python\n", "print(\"profiling demo\")\n", "print(\"in-house DL training resource profiling code & output - based on the above model and training loop\")\n", "#for both of the below produce one figure for inference and one for training\n", "#MACs profiling - first slide; show as piechard\n", "\n", "lenet.cpu()\n", "profile_ops(lenet, shape=(1,1,28,28))\n", "```\n", "\n", " profiling demo\n", " in-house DL training resource profiling code & output - based on the above model and training loop\n", " Operation OPS \n", " ------------------------------------- ------- \n", " LeNet/Conv2d[conv1]/onnx::Conv 89856 \n", " LeNet/ReLU[relu1]/onnx::Relu 6912 \n", " LeNet/MaxPool2d[pool1]/onnx::MaxPool 2592 \n", " LeNet/Conv2d[conv2]/onnx::Conv 154624 \n", " LeNet/ReLU[relu2]/onnx::Relu 2048 \n", " LeNet/MaxPool2d[pool2]/onnx::MaxPool 768 \n", " LeNet/Linear[fc1]/onnx::Gemm 30720 \n", " LeNet/ReLU[relu3]/onnx::Relu 240 \n", " LeNet/Linear[fc2]/onnx::Gemm 7200 \n", " LeNet/ReLU[relu4]/onnx::Relu 120 \n", " LeNet/Linear[fc3]/onnx::Gemm 600 \n", " LeNet/ReLU[relu5]/onnx::Relu 20 \n", " ------------------------------------ ------ \n", " Input size: (1, 1, 28, 28)\n", " 
295,700 FLOPs or approx. 0.00 GFLOPs\n", "\n", "## Deep Learning working set\n", "\n", "- Working set - a collection of all elements needed for executing a\n", " given DL layer\n", " - Input and output activations\n", " - Parameters (weights & biases)\n", "\n", "``` python\n", "print(\"working set profiling\")\n", "# compute the per-layer required memory:\n", "# memory to load weights, to load inputs, to save oputputs\n", "# visualize as a per-layer bar chart, each bar consists of three sections - the inputs, outputs, weights\n", "\n", "profile_layer_mem(lenet)\n", "```\n", "\n", " working set profiling\n", "\n", "## Working Set requirement exceeding RAM\n", "\n", "``` python\n", "print(\"exceeding RAM+Swap demo\")\n", "print(\"exceeding working set experiment - see the latency spike over a couple of bytes of working set\")\n", "# sample* a training speed of a model whose layer working sets just first in the memory\n", "# bump up layer dimensions which are far from reaching the RAM limit - see that the effect on latency is limited\n", "# bump up the layer(s) that are at the RAM limit - observe the latency spike rapidly\n", "# add profiling graphs for each of the cases, print out latency numbers.\n", "\n", "# *train for an epoch or two, give the latency & give a reasonable estimate of how long would the full training take (assuming X epochs)\n", "estimate_training_for(LeNet, 1000)\n", "```\n", "\n", " exceeding RAM+Swap demo\n", " exceeding working set experiment - see the latency spike over a couple of bytes of working set\n", " Using 128 hidden nodes took 2.42 seconds, training for 1000 epochs would take ~2423.7449169158936s\n", " Using 256 hidden nodes took 2.31 seconds, training for 1000 epochs would take ~2311.570882797241s\n", " Using 512 hidden nodes took 2.38 seconds, training for 1000 epochs would take ~2383.8846683502197s\n", " Using 1024 hidden nodes took 2.56 seconds, training for 1000 epochs would take ~2559.4213008880615s\n", " Using 2048 hidden nodes took 3.10 seconds, training for 1000 epochs would take ~3098.113536834717s\n", " Using 4096 hidden nodes took 7.20 seconds, training for 1000 epochs would take ~7196.521997451782s\n", " Using 6144 hidden nodes took 13.21 seconds, training for 1000 epochs would take ~13207.558155059814s\n", "\n", "## Working Set requirement exceeding RAM + Swap\n", "\n", "``` python\n", "print(\"OOM - massive images\")\n", "print(\"show in-code how this can hapen - say massive images; maybe show error message\")\n", "# How could we do this without affecting the recording process?\n", "print('Loading too many images at once causes errors.')\n", "lenet.cuda()\n", "batch_size = 6000\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " OOM - massive images\n", " show in-code how this can hapen - say massive images; maybe show error message\n", " Loading too many images at once causes errors.\n", "\n", " Epoch 1, iter 10, iter loss 0.596: : 10it [00:03, 2.78it/s]\n", " Epoch 2, iter 2, iter loss 0.592: : 2it [00:01, 1.69it/s]\n", "\n", "## Mapping Deep Models to hardware: Systolic Arrays\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "**Core principle**\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "**Systolic system matrix multiplication**\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "## Mapping Deep Models to hardware: weight, input, and output stationarity\n", "\n", "**Weight stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "**Input stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "**Output stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Systolic array example: weight stationary Google Tensor Processing Unit (TPU)\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- **Closing messages**\n", " - **Deep Learning stack**\n", " - **Deep Learning and accelerator co-design**\n", " - **The Hardware and the Software Lottery**\n", "\n", "## Deep Learning stack\n", "\n", "\n", "\n", "## Beyond hardware methods\n", "\n", "- Sparsity leveraging\n", " - Sparsity-inducing compression\n", " - Sparsity leveraging hardware\n", "- Numerical representation\n", " - Low precision\n", " - bfloat16\n", " - Quantization\n", "- Low-level implementations\n", " - GEMM\n", " - cuDNN\n", "\n", "## Deep Learning and accelerator co-design\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## AlexNet: how GPU memory defined its architecture\n", "\n", "- Alex Krizhevsky used two GTX 580 GPUs, each with 3GB of memory.\n", "- Theoretical AlexNet (without mid-way split) working set profiling:\n", "\n", "``` python\n", "print(\"profile AlexNet layers - show memory requirements\")\n", "print(\"per-layer profiling of AlexNet - connects to the preceding slide\")\n", "from torchvision.models import alexnet as net\n", "anet = net()\n", "profile_layer_alexnet(anet)\n", "```\n", "\n", " profile AlexNet layers - show memory requirements\n", " per-layer profiling of AlexNet - connects to the preceding slide\n", "\n", "## The actual AlexNet architecture\n", "\n", "AlexNet's architecture had to be split down the middle to accommodate\n", "the 3GB limit per unit in its two GPUs.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Beyond hardware methods\n", "\n", "- Sparsity leveraging\n", " - Sparsity-inducing compression\n", " - Sparsity leveraging hardware\n", "- Numerical representation\n", " - Low precision\n", " - bfloat16\n", " - Quantization\n", "- Low-level implementations\n", " - GEMM\n", " - cuDNN\n", "\n", "## The Hardware and the Software Lotteries\n", "\n", "
\n", "\n", "**The software and hardware lottery describes the success of a software\n", "or a piece of hardware resulting not from its universal superiority,\n", "but, rather, from its fit to the broader hardware and software\n", "ecosystem.**\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "Eniac (1950s)\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "All-optical NN (2019)\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "## Summary of the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "# Thank you for your attention!\n", "\n", "## Deep Learning resource characterisation\n", "\n", "``` python\n", "# memory requirements profiling - second slide; show as piechard\n", "# Show proportion of data required for input, parameters and outputs\n", "\n", "profile_mem(lenet)\n", "```" ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }