{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hardware Ecosystem\n", "\n", "### [Nic Lane](http://niclane.org/)\n", "\n", "### 2022-02-03" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abstract**: This lecture will look at the changes in hardware that\n", "enabled neural networks to be efficient and how neural network models\n", "are deployed on hardware." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DeepNN\n", "\n", "## Plan for the Day\n", "\n", "- **Introduction**\n", "  - **How did we get here?**\n", "- Hardware Foundation\n", "- Parallelism Leveraging\n", "- Data Movement and Bandwidth Pressures\n", "- Closing messages\n", "\n", "## Hardware at Deep Learning's birth\n", "\n", "
\n",
"\n", "\n", "New York Times (1958)\n", "\n", "\n", " | \n",
"\n",
"\n", "\n", "ENIAC, 1950s SoTA Hardware\n", "\n", "\n", " | \n",
"
\n",
"
\n", "\n", "- General-purpose processor (in use since the mid-1950s)\n", "- A CPU is composed of cores, each of which runs one or more hardware threads.\n", "- Example high-end performance:\n", "  - AMD Ryzen 9 5950X\n", "    - No. Cores: **16**\n", "    - No. Threads: **32**\n", "    - Clock speed: **3.4 GHz**, boost up to **4.9 GHz**\n", "    - L2 cache: **8 MB**\n", "    - L3 cache: **64 MB**\n", "\n", " | \n", "\n", "\n", "\n", "\n", " | \n", "
\n", "\n", "- Parallelism-exploiting accelerator\n", "- Originally used for graphics processing (in use since the 1970s)\n", "- A GPU is composed of a large number of threads organised into blocks\n", "  (cores)\n", "- Example high-end performance:\n", "  - NVIDIA GeForce RTX 3090\n", "    - No. Threads (CUDA cores): **10496**\n", "    - Clock speed: **1.4 GHz**, boost up to **1.7 GHz**\n", "    - Memory (GDDR6X): **24 GB**\n", "    - L2 cache: **6 MB**\n", "\n", "
\n", "\n", "- Register (per thread)\n", "  - Holds automatic variables declared in a kernel function\n", "  - Low latency, high bandwidth\n", "- Local Memory (per thread)\n", "  - Holds kernel variables that cannot fit in registers\n", "- Shared Memory (per thread block)\n", "  - Visible to all threads in a block; faster than local and global memory\n", "  - Used for inter-thread communication\n", "  - Physically shared with the L1 cache\n", "- Constant Memory\n", "  - Per-device, read-only memory\n", "- Texture Memory\n", "  - Per-SM, read-only cache, optimized for 2D spatial locality\n", "- Global Memory\n", "\n", "
\n", "\n", "- Processors\n", "  - CPU sits at the centre of the system\n", "- **Accelerators**\n", "  - GPUs, TPUs, Eyeriss, and other specialised chips\n", "  - Specialised hardware can be designed with\n", "    **parallelism** in mind\n", "- **Memory hierarchy**\n", "  - Caches - smallest and fastest\n", "  - Random Access Memory (RAM) - larger and slower\n", "  - Disk / SSD - storage\n", "    - Stores the dataset; under memory pressure it supplements RAM as swap space\n", "  - **Bandwidth** can be a serious bottleneck\n", "    - System, memory, and I/O buses\n", "    - Closer to processor - faster\n", "    - Designed to transport fixed-size data chunks\n", "    - Word size is a key system parameter: 4 bytes (32-bit) or 8\n", "      bytes (64-bit)\n", "  - Auxiliary hardware\n", "    - Mouse, keyboard, display\n", "\n", " | \n", "\n", "\n", "\n", "\n", " | \n", "
\n",
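The bullets above note that buses move fixed-size chunks and that word size is a key parameter. A minimal sketch of that arithmetic in Python follows. The helper name `bus_transfers` and the example matrix size are illustrative assumptions, not figures from the lecture, and real buses use burst transfers and caching that this ignores.

```python
import struct


def bus_transfers(num_bytes: int, word_bytes: int) -> int:
    """Word-sized transfers needed to move num_bytes (hypothetical helper)."""
    if word_bytes <= 0:
        raise ValueError("word_bytes must be positive")
    # Round up: a partially filled word still costs one full transfer.
    return (num_bytes + word_bytes - 1) // word_bytes


# Pointer size of the machine running this script: 4 or 8 bytes on common systems.
native_word_bytes = struct.calcsize("P")

# Assumed example buffer: a 1024 x 1024 matrix of 32-bit floats (4 bytes each).
matrix_bytes = 1024 * 1024 * 4

transfers_32 = bus_transfers(matrix_bytes, 4)  # 32-bit words
transfers_64 = bus_transfers(matrix_bytes, 8)  # 64-bit words
print(native_word_bytes, transfers_32, transfers_64)
```

Doubling the word size halves the number of transfers for the same buffer, which is one reason word size matters when bandwidth is the bottleneck.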
" | \n",
"\n",
"*[…] column of B and returns one\n", "element of C.*\n", "\n", " | \n",
"\n",
"*[…] one square sub-matrix.*\n", "\n", " | \n",
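The two figure captions contrast producing one element of C at a time with producing one square sub-matrix at a time. The sketch below emulates both index patterns with plain Python loops. The names `matmul_naive`, `matmul_tiled` and the `tile` parameter are assumptions made for illustration. Python runs the loops sequentially, so this shows only which data each "thread" or "block" touches, not any GPU speed-up.

```python
def matmul_naive(a, b):
    """One output element per (row, col) pair, as a single GPU thread would compute it."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for row in range(n):  # each (row, col) pair plays the role of one thread
        for col in range(m):
            # Reads one full row of A and one full column of B.
            c[row][col] = sum(a[row][i] * b[i][col] for i in range(k))
    return c


def matmul_tiled(a, b, tile=2):
    """One square tile of C per 'block'; the inputs are consumed tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for r0 in range(0, n, tile):  # block row
        for c0 in range(0, m, tile):  # block column
            for k0 in range(0, k, tile):  # step along the shared dimension
                # On a GPU, these tiles of A and B would be staged in shared memory.
                for row in range(r0, min(r0 + tile, n)):
                    for col in range(c0, min(c0 + tile, m)):
                        acc = 0.0
                        for i in range(k0, min(k0 + tile, k)):
                            acc += a[row][i] * b[i][col]
                        c[row][col] += acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_naive(a, b))
print(matmul_tiled(a, b, tile=2))
```

In the tiled version each small tile of A and B is reused by every element of the output tile, which is the reuse that shared memory is meant to capture.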
"
\n",
"
\n",
"