{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hardware Ecosystem\n", "\n", "### [Nic Lane](http://niclane.org/)\n", "\n", "### 2022-02-03" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: This lecture will look at the changes in hardware that\n", "enabled neural networks to be efficient and how neural network models\n", "are deployed on hardware." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "## DeepNN\n", "\n", "## Plan for the Day\n", "\n", "- **Introduction**\n", " - **How did we get here?**\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Hardware at Deep Learning's birth\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "

\n", "\n", "New York Times (1958)\n", "\n", "

\n", "
\n", "
\n", "
\n", "

\n", "\n", "Eniac, 1950s SoTA Hardware\n", "\n", "

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## How did we get here? Deep Learning requires *peta* FLOPS\n", "\n", "
\n", "\n", "0.01 PFLOP (left) = $10^{13}$ FLOPS (right)\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Our World in\n", "Data](https://ourworldindata.org/technological-progress)\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- **Hardware Foundation**\n", " - **Internal organisation of processors**\n", " - **A typical organisation of a DL system**\n", " - **Two pillars: Data Movement & Parallelism**\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Internal Organisation of Processors\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "## Central Processing Unit (CPU)\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "- General-purpose processor (in use since mid-1950s)\n", "- CPU is composed of cores, each of which consists of several threads.\n", "- Example high-end performance:\n", " - AMD Ryzen 9 5950X\n", " - No. Cores:    **16**\n", " - No. Threads:   **32**\n", " - Clock speed:   **3.4GHz**, boost up to **4.9GHz**\n", " - L2 cache:     **8 MB**\n", " - L3 cache:     **64 MB**\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "``` python\n", "from custom_imports import *\n", "\n", "our_custom_net = BasicFCModel()\n", "our_custom_net.cpu()\n", "# OR\n", "device = torch.device('cpu')\n", "our_custom_net.to(device)\n", "```\n", "\n", "## Graphics Processing Unit\n", "\n", "\n", "\n", "`\n", " `{=html} ` `{=html} ` `{=html}
\n", "\n", "- Parallelism-exploiting Accelerator\n", "- Originally used for graphics processing (in use since 1970s)\n", "- GPU is composed of a large number of threads organised into blocks\n", " (cores)\n", "- Example high-end performance:\n", " - NVIDIA GEFORCE RTX 3090\n", " - No. Threads:   **10496**\n", " - Clock speed:   **1.4GHz**, boost up to **1.7GHz**\n", " - L2 cache:     **24 GB** `{=html} `\n", " \n", " `{=html}
`\n", "\n", "``` python\n", "if torch.cuda.is_available():\n", " our_custom_net.cuda()\n", " # OR\n", " device = torch.device('cuda:0')\n", " our_custom_net.to(device)\n", "# Remember to do the same for all inputs to the network\n", "```\n", "\n", "## Graphics Processing Unit\n", "\n", "\n", "\n", "` `{=html} ` `{=html} ` `{=html}
\n", "\n", "- Register (per thread)\n", " - An automatic variable in kernel function\n", " - Low latency, high bandwidth\n", "- Local Memory (per thread)\n", " - Variable in a kernel but can not be fitted in register\n", "- Shared Memory (between thread blocks)\n", " - All threads faster than local and global memory\n", " - Use for inter-thread communication\n", " - physically shared with L1 cache\n", "- Constant memory\n", " - Per Device Read-only memory\n", "- Texture Memory\n", " - Per SM, read-only cache, optimized for 2D spatial locality\n", "- Global Memory `{=html} `\n", " \n", " `{=html}
`\n", "\n", "## A typical organisation of a DL system\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "- Processors\n", " - CPU sits at the centre of the system\n", "- **Accelerators**\n", " - GPUs, TPUs, Eyeriss, other specialised\n", " - Specialised hardware can be designed with exploiting\n", " **parallelism** in mind\n", "- **Memory hierarchy**\n", " - Caches - smallest and fastest\n", " - Random Access Memory (RAM) - largest and slowest\n", " - Disk / SSD - storage\n", " - Stores the dataset; in crisis it supplements RAM up to Swap\n", " - **Bandwidth** can be serious a bottleneck\n", " - System, memory, and I/O buses\n", " - Closer to processor - faster\n", " - Designed to transport fixed-size data chunks\n", " - Word size is a key system parameter 4 bytes (32 bit) or 8\n", " bytes (64 bit) \n", " - Auxiliary hardware\n", " - Mouse, keyboard, display\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## Data Movement & Parallelism\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Memory and bandwidth: memory hierarchy\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Memory and bandwidth: data movement\n", "\n", "- Energy and latency are commensurate\n", "- Accessing RAM is 3 to 4 orders of magnitude slower than\n", " executing MAC\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Processor comparison based on memory and bandwidth\n", "\n", "\n", "\n", "- CPU has faster I/O bus than GPU, but it has lower bandwidth than\n", " GPU. CPU can fetch small pieces of data very fast, GPU fetches\n", " them slower but in bulk.\n", "- GPU has more lower-level memory than CPU. Even though each\n", " individual thread and thread block have less memory than the\n", " CPU threads and cores do, there are just so much more threads in\n", " the GPU that **taken as a whole** they have much\n", " more lower-level memory. This is memory inversion.\n", "\n", "## The case for parallelism - Moore's law is slowing down\n", "\n", "- *Moore's law fuelled the prosperity of the past 50 years.*\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Our World in\n", "Data](https://ourworldindata.org/technological-progress)\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- **Parallelism Leveraging**\n", " - **Parallelism in Deep Learning**\n", " - **Leveraging Deep Learning parallelism**\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## The case for parallelism - Moore’s law is slowing down\n", "\n", "- *As it slows, programmers and hardware designers are searching for\n", " alternative drivers of performance growth.*\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "*Credit*: [Karl\n", "Rupp](https://github.com/karlrupp/microprocessor-trend-data)\n", "\n", "## Processor comparison based on parallelism\n", "\n", "``` python\n", "print(\"CPU matrix multiplication\")\n", "a, b = [torch.rand(2**10, 2**10) for _ in range(2)]\n", "start = time()\n", "a * b\n", "print(f'CPU took {time() - start} seconds')\n", "print(\"GPU matrix multiplication\")\n", "start = time()\n", "a * b\n", "print(f'GPU took {time() - start} seconds')\n", "```\n", "\n", " CPU matrix multiplication\n", " CPU took 0.0005156993865966797 seconds\n", " GPU matrix multiplication\n", " GPU took 0.0002989768981933594 seconds\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- **Parallelism Leveraging**\n", " - **Parallelism in Deep Learning**\n", " - **Leveraging Deep Learning parallelism**\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Parallelism in Deep Learning training\n", "\n", "- Minibatch model update:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- where $\\theta^{s}_{l,i}$ is an $i$'s parameter at layer $l$ value at\n", " step $s$ of the training process; $r$ is the learning rate; $B$ is\n", " the batch size; and $g^{s}_{l,i,b}$ is the $s$-th training step\n", " gradient coming from $b$-th training example for parameter update of\n", " $i$-th parameter at layer $l$.\n", "\n", "## DL parallelism: parallelize backprop through an example\n", "\n", "- The matrix multiplications in the forward and backward passes can be\n", " parallelized:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Fast inference is unthinkable without parallel matrix\n", " multiplication.\n", "- Frequent synchronization is needed - at each layer the parallel\n", " threads need to sync up.\n", "\n", "## DL parallelism: parallelize gradient sample computation\n", "\n", "- Gradients for individual training examples can be computed in\n", " parallel:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Synchronization is needed only at the point where we sum the\n", " individual gradients across the batch.\n", "\n", "## DL parallelism: parallelize update iterations\n", "\n", "- Gradient updates from separate batches can be computed in parallel:\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- Imagine computing $N$ batches at the same time in parallel:\n", " - We can see this as using $N-1$ outdated gradient when making\n", " update based on the second batch.\n", " - We can see this as using $N$ gradient estimates in place of the\n", " usual $1$ that SGD is based on.\n", "\n", "## DL parallelism: parallelize the training of multiple models\n", "\n", "- In the course of solving a given DL problem one would often train\n", " competing models because of:\n", " - The choice of hyperparameters such as architecture,\n", " initialization, dropout and learning rates, regularization, ...\n", " - The desire to build a model ensemble.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "- The models are independent of each other and thus can be computed in\n", " parallel.\n", "\n", "## Leveraging Deep Learning parallelism\n", "\n", "- CPUs, GPUs, multi-GPU, and multi-machine each offer unique\n", " opportunities to leverage the four sources of parallelism.\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "XXXX\n", "\n", "## CPU training\n", "\n", "The most a CPU can do for this setup is to:\n", "\n", "- Run through the batches *sequentially*\n", "- Run through the model *sequentially*\n", "- Run through the batch *sequentially*\n", "- Potentially, *parallelize* each layer computation between its cores\n", " - Best case scenario: individual cores can tackle separate nodes /\n", " channels as these are independent of each other\n", "- *Parallelize* matrix multiplication\n", " - Best case scenario: Matrix multiplication is split between\n", " separate cores and threads. The degree of parallelism is,\n", " however, negligent.\n", "\n", "## CPU training\n", "\n", "Consequently:\n", "\n", "- Overpowered CPU threads are scrambling to juggle the many nodes /\n", " channels they need to compute.\n", "- The CPU is slowed down considerably by the fact that it needs to\n", " access its own L3 cache many more times than the GPU would, due to\n", " its lower memory access bandwidth.\n", "\n", "``` python\n", "print(\"CPU training code\")\n", "print(\"CPU training of the above-defined model short example of how long it takes\")\n", "our_custom_net.cpu()\n", "start = time()\n", "train(lenet, MNIST_trainloader)\n", "print(f'CPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " CPU training code\n", " CPU training of the above-defined model short example of how long it takes\n", "\n", " Epoch 1, iter 469, loss 1.980: : 469it [00:02, 181.77it/s]\n", " Epoch 2, iter 469, loss 0.932: : 469it [00:02, 182.58it/s]\n", "\n", " CPU took 5.22 seconds\n", "\n", "## GPU training\n", "\n", "The GPU, on the other hand, can:\n", "\n", "- Run through the batches *sequentially*\n", "- Run through the model *sequentially*\n", " - The model and the batch size just fit once in the memory of the\n", " GPU we chose.\n", "- In the best case scenario run through the training examples in a\n", " batch in *parallel*\n", " - For most GPUs the computation is, however, *sequential* if their\n", " memory is not big enough to hold the entire batch of training\n", " examples.\n", "- *Parallelize* each layer computation between its cores\n", " - Groups of several cores are assigned to separate network layers\n", " / channels. Cores in the group need not be physically close to\n", " each other.\n", "- *Parallelize* matrix multiplication\n", " - The matrix multiplication needed to compute a given node /\n", " channel is split between the threads in the group that was\n", " assigned to it. 
Each thread computes separate sector of the\n", " input.\n", "\n", "## GPU training\n", "\n", "Consequently:\n", "\n", "- GPU cores are engaged at all times as they sequentially push through\n", " the training examples all at the same time.\n", "- All threads need to sync-up at the end of each layer computation so\n", " that their outputs can become the inputs to the next layer.\n", "\n", "``` python\n", "print(\"GPU training\")\n", "print(\"GPU training of the same example as in CPU\")\n", "lenet.cuda()\n", "\n", "batch_size = 512\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " GPU training\n", " GPU training of the same example as in CPU\n", "\n", " Epoch 1, iter 118, iter loss 0.786: : 118it [00:02, 52.62it/s]\n", " Epoch 2, iter 118, iter loss 0.760: : 118it [00:02, 57.48it/s]\n", "\n", " GPU took 4.37 seconds\n", "\n", "## GPU parallelism: matrix multiplication example\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "GPU\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*12 thread blocks, each with 16 threads.*\n", "\n", "
\n", "
\n", "
\n", "\n", "Naive implementation\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*Each **thread** reads one row of A, one
column of B and returns one\n", "element of C.*\n", "\n", "
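\n", "A small NumPy sketch of the work a single thread does in the naive kernel (a dataflow illustration only, not GPU code; NumPy is assumed to be available):\n", "\n", "``` python\n", "import numpy as np\n", "\n", "def naive_thread(A, B, row, col):\n", "    # One GPU thread: read one row of A and one column of B,\n", "    # accumulate their dot product, and produce one element of C.\n", "    return sum(A[row, k] * B[k, col] for k in range(A.shape[1]))\n", "\n", "A, B = np.random.rand(3, 4), np.random.rand(4, 2)\n", "C = np.array([[naive_thread(A, B, i, j) for j in range(B.shape[1])]\n", "              for i in range(A.shape[0])])\n", "assert np.allclose(C, A @ B)\n", "```\n", "\n", "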
\n", "
\n", "
\n", "\n", "Shared memory implementation\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "*Each **thread block** is computing
one square sub-matrix.*\n", "\n", "
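\n", "A matching sketch of the tiled (shared-memory) scheme, in which each thread block computes one square sub-matrix of C (again a dataflow illustration only; the matrix size is assumed to be divisible by the tile size):\n", "\n", "``` python\n", "import numpy as np\n", "\n", "def tiled_matmul(A, B, tile=2):\n", "    # Each thread block owns one (i, j) tile of C; at every step it loads one\n", "    # tile of A and one tile of B into shared memory and accumulates.\n", "    n = A.shape[0]\n", "    C = np.zeros((n, n))\n", "    for i in range(0, n, tile):\n", "        for j in range(0, n, tile):\n", "            for k in range(0, n, tile):\n", "                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]\n", "    return C\n", "\n", "A, B = np.random.rand(4, 4), np.random.rand(4, 4)\n", "assert np.allclose(tiled_matmul(A, B), A @ B)\n", "```\n", "\n", "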
\n", "
\n", "\n", "## GPU parallelism: matrix multiplication example\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Multi-GPU training\n", "\n", "With multiple GPUs we can choose one of the following:\n", "\n", "- *Distribute* the training examples of a batch between GPUs\n", " - When individual GPUs can not hold the whole batch in the memory,\n", " this can be distributed between multiple cards.\n", " - The computation has to sync-up for each Batch-Norm.\n", "- *Parallelize* the model computation\n", " - Separate layers or groups of layers are handled by separate\n", " GPUs.\n", " - Computation syncs between pairs of GPU cards - as the one's\n", " outputs are the other's inputs.\n", " - This creates a flow-through system that will keep all GPUs busy\n", " at all times during the training.\n", " - Batch is processed sequentially, all GPUs sync up after each\n", " batch - either dead time or outdated gradients.\n", "- *Parallelize* the gradient computation\n", " - Each GPU can be given its own batch if we accept outdated model\n", " in gradient computations.\n", "\n", "## Multi-GPU training\n", "\n", "``` python\n", "print(\"multi-GPU training\")\n", "print(\"GPU training of the same example as in single GPU but with two GPUs\")\n", "our_custom_net_dp = lenet\n", "our_custom_net_dp.cuda()\n", "our_custom_net_dp = nn.DataParallel(our_custom_net_dp, device_ids=[0, 1])\n", "batch_size = 1024\n", "multigpu_trainloader = make_MNIST_loader(batch_size=batch_size)\n", "start = time()\n", "gpu_train(our_custom_net_dp, multigpu_trainloader)\n", "print(f'2 GPUs took {time()-start:.2f} seconds')\n", "```\n", "\n", " multi-GPU training\n", " GPU training of the same example as in single GPU but with two GPUs\n", "\n", " Epoch 1, iter 59, iter loss 0.745: : 59it [00:02, 21.24it/s]\n", " Epoch 2, iter 59, iter loss 0.736: : 59it [00:01, 31.70it/s]\n", "\n", " 2 GPUs took 4.72 seconds\n", "\n", "## Multi-Machine training\n", "\n", "In principle the same options as in multi-GPU:\n", "\n", "- *Distribute* the training examples of a batch between GPUs\n", " - This is rarely if ever needed on the scale of multi-Machine\n", "- *Parallelize* the model computation\n", " - Same principles as in multi-GPU.\n", "- *Parallelize* the gradient computation\n", " - Same principles as in multi-GPU.\n", "\n", "In practice we would either take advantage of the latter two. In extreme\n", "examples one might do a combination of multiple options.\n", "\n", "## Parallelism summary: model and data parallelism\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "## Parallelism bottlenecks: Synchronization & Communication\n", "\n", "\n", "\n", "- DL-training hardware needs to synchronize and communicate very\n", " frequently\n", " - Model nodes are heavily interconnected at each model layer.\n", " - Data nodes are interconnected by batch-norm-style layers\n", " - Data nodes are interconnected at gradient computation\n", "- This communication occurs between\n", " - Threads in a core (CPU and GPU)\n", " - Cores within a chip\n", " - Pieces of hardware *example: SLI bridge is a connector and a\n", " protocol for such a communication*\n", "\n", "## Bottlenecks beyond parallelism\n", "\n", "- DL training and inference do not take place solely on the\n", " accelerator.\n", " - The accelerator accelerates the gradient computations and\n", " updates.\n", " - The CPU will still need to be loading the data (model, train\n", " set) and saving the model (checkpointing).\n", "- The accelerator starves if it waits idly for its inputs due for\n", " example to slow CPU, I/O buses, or storage interface (SATA, SSD,\n", " NVMe).\n", "\n", "``` python\n", "print(\"starving GPUs\")\n", "print(\"show in-code what starving GPU looks like\")\n", "# Deliberately slow down data flow into the gpu \n", "# Do you have any suggestions how to do this in a more realistic way than just to force waiting?\n", "print('Using only 1 worker for the dataloader, the time the GPU takes increases.')\n", "lenet.cuda()\n", "batch_size = 64\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " starving GPUs\n", " show in-code what starving GPU looks like\n", " Using only 1 worker for the dataloader, the time the GPU takes increases.\n", "\n", " Epoch 1, iter 938, iter loss 0.699: : 938it [00:04, 214.02it/s]\n", " Epoch 2, iter 938, iter loss 0.619: : 938it [00:04, 208.96it/s]\n", "\n", " GPU took 8.92 seconds\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- **Data Movement and Bandwidth Pressures**\n", " - **Deep Learning working set**\n", " - **Mapping Deep Learning onto hardware**\n", " - **Addressing memory pressure**\n", "- Closing messages\n", "\n", "## Deep Learning resource characterisation\n", "\n", "``` python\n", "print(\"profiling demo\")\n", "print(\"in-house DL training resource profiling code & output - based on the above model and training loop\")\n", "#for both of the below produce one figure for inference and one for training\n", "#MACs profiling - first slide; show as piechard\n", "\n", "lenet.cpu()\n", "profile_ops(lenet, shape=(1,1,28,28))\n", "```\n", "\n", " profiling demo\n", " in-house DL training resource profiling code & output - based on the above model and training loop\n", " Operation OPS \n", " ------------------------------------- ------- \n", " LeNet/Conv2d[conv1]/onnx::Conv 89856 \n", " LeNet/ReLU[relu1]/onnx::Relu 6912 \n", " LeNet/MaxPool2d[pool1]/onnx::MaxPool 2592 \n", " LeNet/Conv2d[conv2]/onnx::Conv 154624 \n", " LeNet/ReLU[relu2]/onnx::Relu 2048 \n", " LeNet/MaxPool2d[pool2]/onnx::MaxPool 768 \n", " LeNet/Linear[fc1]/onnx::Gemm 30720 \n", " LeNet/ReLU[relu3]/onnx::Relu 240 \n", " LeNet/Linear[fc2]/onnx::Gemm 7200 \n", " LeNet/ReLU[relu4]/onnx::Relu 120 \n", " LeNet/Linear[fc3]/onnx::Gemm 600 \n", " LeNet/ReLU[relu5]/onnx::Relu 20 \n", " ------------------------------------ ------ \n", " Input size: (1, 1, 28, 28)\n", " 
295,700 FLOPs or approx. 0.00 GFLOPs\n", "\n", "## Deep Learning working set\n", "\n", "- Working set - a collection of all elements needed for executing a\n", " given DL layer\n", " - Input and output activations\n", " - Parameters (weights & biases)\n", "\n", "``` python\n", "print(\"working set profiling\")\n", "# compute the per-layer required memory:\n", "# memory to load weights, to load inputs, to save oputputs\n", "# visualize as a per-layer bar chart, each bar consists of three sections - the inputs, outputs, weights\n", "\n", "profile_layer_mem(lenet)\n", "```\n", "\n", " working set profiling\n", "\n", "## Working Set requirement exceeding RAM\n", "\n", "``` python\n", "print(\"exceeding RAM+Swap demo\")\n", "print(\"exceeding working set experiment - see the latency spike over a couple of bytes of working set\")\n", "# sample* a training speed of a model whose layer working sets just first in the memory\n", "# bump up layer dimensions which are far from reaching the RAM limit - see that the effect on latency is limited\n", "# bump up the layer(s) that are at the RAM limit - observe the latency spike rapidly\n", "# add profiling graphs for each of the cases, print out latency numbers.\n", "\n", "# *train for an epoch or two, give the latency & give a reasonable estimate of how long would the full training take (assuming X epochs)\n", "estimate_training_for(LeNet, 1000)\n", "```\n", "\n", " exceeding RAM+Swap demo\n", " exceeding working set experiment - see the latency spike over a couple of bytes of working set\n", " Using 128 hidden nodes took 2.42 seconds, training for 1000 epochs would take ~2423.7449169158936s\n", " Using 256 hidden nodes took 2.31 seconds, training for 1000 epochs would take ~2311.570882797241s\n", " Using 512 hidden nodes took 2.38 seconds, training for 1000 epochs would take ~2383.8846683502197s\n", " Using 1024 hidden nodes took 2.56 seconds, training for 1000 epochs would take ~2559.4213008880615s\n", " Using 2048 hidden nodes took 3.10 seconds, training for 1000 epochs would take ~3098.113536834717s\n", " Using 4096 hidden nodes took 7.20 seconds, training for 1000 epochs would take ~7196.521997451782s\n", " Using 6144 hidden nodes took 13.21 seconds, training for 1000 epochs would take ~13207.558155059814s\n", "\n", "## Working Set requirement exceeding RAM + Swap\n", "\n", "``` python\n", "print(\"OOM - massive images\")\n", "print(\"show in-code how this can hapen - say massive images; maybe show error message\")\n", "# How could we do this without affecting the recording process?\n", "print('Loading too many images at once causes errors.')\n", "lenet.cuda()\n", "batch_size = 6000\n", "gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)\n", "start = time()\n", "gpu_train(lenet, gpu_trainloader)\n", "print(f'GPU took {time()-start:.2f} seconds')\n", "```\n", "\n", " OOM - massive images\n", " show in-code how this can hapen - say massive images; maybe show error message\n", " Loading too many images at once causes errors.\n", "\n", " Epoch 1, iter 10, iter loss 0.596: : 10it [00:03, 2.78it/s]\n", " Epoch 2, iter 2, iter loss 0.592: : 2it [00:01, 1.69it/s]\n", "\n", "## Mapping Deep Models to hardware: Systolic Arrays\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "**Core principle**\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "**Systolic system matrix multiplication**\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", "## Mapping Deep Models to hardware: weight, input, and output stationarity\n", "\n", "**Weight stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "**Input stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "**Output stationary design**\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Systolic array example: weight stationary Google Tensor Processing Unit (TPU)\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Plan for the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- **Closing messages**\n", " - **Deep Learning stack**\n", " - **Deep Learning and accelerator co-design**\n", " - **The Hardware and the Software Lottery**\n", "\n", "## Deep Learning stack\n", "\n", "\n", "\n", "## Beyond hardware methods\n", "\n", "- Sparsity leveraging\n", " - Sparsity-inducing compression\n", " - Sparsity leveraging hardware\n", "- Numerical representation\n", " - Low precision\n", " - bfloat16\n", " - Quantization\n", "- Low-level implementations\n", " - GEMM\n", " - cuDNN\n", "\n", "## Deep Learning and accelerator co-design\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## AlexNet: how GPU memory defined its architecture\n", "\n", "- Alex Krizhevsky used two GTX 580 GPUs, each with 3GB of memory.\n", "- Theoretical AlexNet (without mid-way split) working set profiling:\n", "\n", "``` python\n", "print(\"profile AlexNet layers - show memory requirements\")\n", "print(\"per-layer profiling of AlexNet - connects to the preceding slide\")\n", "from torchvision.models import alexnet as net\n", "anet = net()\n", "profile_layer_alexnet(anet)\n", "```\n", "\n", " profile AlexNet layers - show memory requirements\n", " per-layer profiling of AlexNet - connects to the preceding slide\n", "\n", "## The actual AlexNet architecture\n", "\n", "AlexNet's architecture had to be split down the middle to accommodate\n", "the 3GB limit per unit in its two GPUs.\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "## Beyond hardware methods\n", "\n", "- Sparsity leveraging\n", " - Sparsity-inducing compression\n", " - Sparsity leveraging hardware\n", "- Numerical representation\n", " - Low precision\n", " - bfloat16\n", " - Quantization\n", "- Low-level implementations\n", " - GEMM\n", " - cuDNN\n", "\n", "## The Hardware and the Software Lotteries\n", "\n", "
\n", "\n", "**The software and hardware lottery describes the success of a software\n", "or a piece of hardware resulting not from its universal superiority,\n", "but, rather, from its fit to the broader hardware and software\n", "ecosystem.**\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "Eniac (1950s)\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\n", "All-optical NN (2019)\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "\n", "## Summary of the Day\n", "\n", "- Introduction\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "# Thank you for your attention!\n", "\n", "## Deep Learning resource characterisation\n", "\n", "``` python\n", "# memory requirements profiling - second slide; show as piechard\n", "# Show proportion of data required for input, parameters and outputs\n", "\n", "profile_mem(lenet)\n", "```" ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": {} }