{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GPU Computing in Julia\n",
    "\n",
    "This session introduces GPU computing in Julia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Julia Version 0.6.4\n",
      "Commit 9d11f62bcb (2018-07-09 19:09 UTC)\n",
      "Platform Info:\n",
      "  OS: macOS (x86_64-apple-darwin14.5.0)\n",
      "  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n",
      "  WORD_SIZE: 64\n",
      "  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)\n",
      "  LAPACK: libopenblas64_\n",
      "  LIBM: libopenlibm\n",
      "  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)\n"
     ]
    }
   ],
   "source": [
    "versioninfo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPGPU"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GPUs are ubiquitous in modern computers. Following are GPUs today's typical computer systems.\n",
    "\n",
    "| NVIDIA GPUs         | Tesla K80                            | GTX 1080                                 | GT 650M                              |\n",
    "|---------------------|----------------------------------------|-----------------------------------------|--------------------------------------|\n",
    "|                     | ![Tesla M2090](nvidia_k80.jpg) | ![GTX 580](nvidia_gtx1080.jpg)    | ![GT 650M](nvidia_gt650m.jpg) |\n",
    "| Computers           | servers, cluster                       | desktop                                 | laptop                               |\n",
    "|                     | ![Server](gpu_server.jpg)       | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)  |\n",
    "| Main usage          | scientific computing                   | daily work, gaming                      | daily work                           |\n",
    "| Memory              | 24 GB                                    | 8 GB                                   | 1GB                                  |\n",
    "| Memory bandwidth    | 480 GB/sec                              | 320 GB/sec                               | 80GB/sec                             |\n",
    "| Number of cores     | 4992                                    | 2560                                     | 384                                  |\n",
    "| Processor clock     | 562 MHz                                 | 1.6 GHz                                  | 0.9GHz                               |\n",
    "| Peak DP performance | 2.91 TFLOPS                              | 257 GFLOPS                                        |                                      |\n",
    "| Peak SP performance | 8.73 TFLOPS                            | 8228 GFLOPS                              | 691Gflops                            |"
   ]
  },
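  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The peak single-precision figures in the table can be roughly reproduced from the core count and the clock rate. The sketch below assumes each core completes one fused multiply-add (2 flops) per cycle; the helper name `peak_gflops` is hypothetical, and the 875 MHz boost clock used for the Tesla K80 is an assumption that is not listed in the table.\n",
    "\n",
    "```julia\n",
    "# hypothetical helper: peak GFLOPS = 2 flops per cycle per core × cores × clock in GHz\n",
    "peak_gflops(cores, ghz) = 2 * cores * ghz\n",
    "\n",
    "peak_gflops(384, 0.9)     # GT 650M: about 691 GFLOPS, as in the table\n",
    "peak_gflops(2560, 1.6)    # GTX 1080: 8192 GFLOPS, close to the listed 8228\n",
    "peak_gflops(4992, 0.562)  # Tesla K80 at the listed clock: about 5611 GFLOPS\n",
    "peak_gflops(4992, 0.875)  # Tesla K80 at the assumed boost clock: 8736 GFLOPS\n",
    "```\n",
    "\n",
    "Under this assumption, the listed 8.73 TFLOPS for the K80 corresponds to the boost clock rather than the 562 MHz clock in the table."
   ]
  },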
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GPU architecture vs CPU architecture.  \n",
    "* GPUs contain 100s of processing cores on a single card; several cards can fit in a desktop PC  \n",
    "* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm  \n",
    "* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly\n",
    "\n",
    "| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |\n",
    "|----------------------------------|------------------------------------|\n",
    "| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |"
   ]
  },
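  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The SPMD idea can be sketched with Julia's broadcasting (dot) syntax. The example below runs on the CPU and only illustrates the programming model, not GPU execution; the function name `f` is hypothetical.\n",
    "\n",
    "```julia\n",
    "# one scalar program ...\n",
    "f(x) = log(1 + sin(x))\n",
    "\n",
    "# ... applied independently to many data elements\n",
    "x = [0.0, 0.5, 1.0]\n",
    "y = f.(x)\n",
    "\n",
    "# each output element depends only on the matching input element,\n",
    "# so the evaluations could run in any order, or all at once on separate cores\n",
    "y == [f(xi) for xi in x]   # true\n",
    "```"
   ]
  },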
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPGPU in Julia\n",
    "\n",
    "GPU support by Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages. \n",
    "\n",
    "There are at least three paradigms to program GPU in Julia.\n",
    "\n",
    "- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolve, CuDNN, ...\n",
    "\n",
    "  The [CuArray.jl](https://github.com/JuliaGPU/CuArrays.jl) package allows defining arrays on Nvidia GPUs and overloads many common operations. CuArrays.jl supports Julia v1.0+.\n",
    "\n",
    "- **OpenCL** is a standard supported multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but lacks some libraries essential for statistical computing.\n",
    "\n",
    "  The [CLArray.jl](https://github.com/JuliaGPU/CLArrays.jl) package allows defining arrays on OpenCL devices and overloads many common operations. **Currently CLArrays.jl only supports Julia v0.6**.\n",
    "\n",
    "- [**ArrayFire**](https://arrayfire.com) is a high performance library that works on both CUDA or OpenCL framework.\n",
    "\n",
    "  The [ArrayFire.jl](https://github.com/JuliaGPU/ArrayFire.jl) package wraps the library for julia.\n",
    "\n",
    "- **Warning:** Most recent Apple operating system iOS 10.14 (Mojave) does **not** support CUDA yet.\n",
    "\n",
    "Because my laptop has an AMD Radeon GPU, I'll illustrate using OpenCL on Julia **v0.6.4**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query GPU devices in the system"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3-element Array{OpenCL.cl.Device,1}:\n",
       " OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500)                 \n",
       " OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00)        \n",
       " OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using CLArrays\n",
    "\n",
    "# check available devices on this machine\n",
    "CLArrays.devices()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenCL context with:\n",
       "CL version: OpenCL 1.2 \n",
       "Device: CL AMD Radeon Pro 460 Compute Engine\n",
       "            threads: 256\n",
       "             blocks: (256, 256, 256)\n",
       "      global_memory: 4294.967296 mb\n",
       " free_global_memory: NaN mb\n",
       "       local_memory: 0.032768 mb\n"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# use the AMD Radeon Pro 460 GPU\n",
    "dev = CLArrays.devices()[2]\n",
    "CLArrays.init(dev)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate arrays on GPU devices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 5×3 Array{Float32,2}:\n",
       " 0.935426   0.656262   0.145963 \n",
       " 0.255108   0.0626027  0.565825 \n",
       " 0.0110094  0.679819   0.767408 \n",
       " 0.640347   0.114715   0.0996476\n",
       " 0.607266   0.332567   0.492061 "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# generate GPU arrays\n",
    "xd = rand(CLArray{Float32}, 5, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 5×3 Array{Float32,2}:\n",
       " 1.0  1.0  1.0\n",
       " 1.0  1.0  1.0\n",
       " 1.0  1.0  1.0\n",
       " 1.0  1.0  1.0\n",
       " 1.0  1.0  1.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "yd = ones(CLArray{Float32}, 5, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transfer data between main memory and GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 5×3 Array{Float64,2}:\n",
       "  0.0621414  -0.905539    0.795695\n",
       " -0.440687    0.427526   -0.235945\n",
       "  0.304403    0.0362422  -0.260491\n",
       "  1.58347    -0.492863   -0.076589\n",
       "  0.193085    0.351561   -0.857394"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transfer data from main memory to GPU\n",
    "x = randn(5, 3)\n",
    "xd = CLArray(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5×3 Array{Float64,2}:\n",
       "  0.0621414  -0.905539    0.795695\n",
       " -0.440687    0.427526   -0.235945\n",
       "  0.304403    0.0362422  -0.260491\n",
       "  1.58347    -0.492863   -0.076589\n",
       "  0.193085    0.351561   -0.857394"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transfer data from main memory to GPU\n",
    "x = collect(xd)"
   ]
  },
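  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arrays generated directly on the GPU above are `Float32`, while `CLArray(x)` keeps the `Float64` element type of `x`. Because the GPUs in the table at the top have a higher peak performance in single precision than in double precision, one option is to convert on the CPU before transferring. This is a minimal CPU-side sketch; the name `x32` is hypothetical, and the transfer itself is left as a comment because it needs an OpenCL device.\n",
    "\n",
    "```julia\n",
    "x = randn(5, 3)          # Float64 by default\n",
    "x32 = Float32.(x)        # convert elementwise to single precision\n",
    "\n",
    "eltype(x32)              # Float32\n",
    "sizeof(x32), sizeof(x)   # (60, 120): half as many bytes to transfer\n",
    "\n",
    "# xd = CLArray(x32)      # then transfer as in the cells above (requires CLArrays)\n",
    "```"
   ]
  },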
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Elementiwise operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 5×3 Array{Float64,2}:\n",
       "  0.0602494  -1.54533     0.539034 \n",
       " -0.556103    0.346862   -0.266262 \n",
       "  0.262152    0.0355933  -0.297806 \n",
       "  0.693107   -0.640839   -0.0795998\n",
       "  0.175538    0.295921   -1.41116  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zd = log.(yd .+ sin.(xd))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 5×3 Array{Float64,2}:\n",
       "  0.0621414  -0.905539    0.795695\n",
       " -0.440687    0.427526   -0.235945\n",
       "  0.304403    0.0362422  -0.260491\n",
       "  1.55813    -0.492863   -0.076589\n",
       "  0.193085    0.351561   -0.857394"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# getting back x\n",
    "asin.(exp.(zd) .- yd)"
   ]
  },
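  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One entry of the recovered array differs from `x`: 1.58347 comes back as 1.55813. The CPU-only sketch below reproduces this with scalars. `asin` returns values in [-π/2, π/2], so an input slightly above π/2 is mapped to π minus that input. The names `z` and `xback` are hypothetical.\n",
    "\n",
    "```julia\n",
    "x = 1.58347                # the (4, 1) entry of x above, slightly larger than π/2\n",
    "z = log(1 + sin(x))        # forward map used above, with y = 1\n",
    "xback = asin(exp(z) - 1)   # inverse map; asin returns the principal value\n",
    "\n",
    "xback ≈ π - x              # true: about 1.55813, the reflection of x about π/2\n",
    "```"
   ]
  },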
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear algebra"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GPU: 3×3 Array{Float32,2}:\n",
       "  1.70241    1.70241    1.70241 \n",
       " -0.583072  -0.583072  -0.583072\n",
       " -0.634723  -0.634723  -0.634723"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zd = zeros(CLArray{Float32}, 3, 3)\n",
    "At_mul_B!(zd, xd, yd)"
   ]
  },
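  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`At_mul_B!` and `A_mul_B!` are Julia 0.6 names. In Julia 1.0 and later they were replaced by `mul!` from the `LinearAlgebra` standard library. The sketch below shows the equivalent call on ordinary CPU arrays, assuming Julia 1.0 or later (it will not run on the 0.6 kernel used in this notebook).\n",
    "\n",
    "```julia\n",
    "using LinearAlgebra\n",
    "\n",
    "x = randn(5, 3)\n",
    "y = ones(5, 3)\n",
    "z = zeros(3, 3)\n",
    "\n",
    "# z = transpose(x) * y, computed in place; Julia 0.6 spelling: At_mul_B!(z, x, y)\n",
    "mul!(z, transpose(x), y)\n",
    "\n",
    "# since y is all ones, every column of z holds the column sums of x\n",
    "z ≈ x' * y   # true\n",
    "```"
   ]
  },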
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "base64 binary data: G1s5MW1FUlJPUiAodW5oYW5kbGVkIHRhc2sgZmFpbHVyZSk6IBtbOTFtT3BlbkNMIEVycm9yOiBPcGVuQ0wuQ29udGV4dCBlcnJvcjoggMMiBYB/G1szOW0KU3RhY2t0cmFjZToKIFsxXSAbWzFtcmFpc2VfY29udGV4dF9lcnJvchtbMjJtG1syMm0bWzFtKBtbMjJtG1syMm06OlN0cmluZywgOjpTdHJpbmcbWzFtKRtbMjJtG1syMm0gYXQgG1sxbS9Vc2Vycy9odWF6aG91Ly5qdWxpYS92MC42L09wZW5DTC9zcmMvY29udGV4dC5qbDoxMDkbWzIybRtbMjJtCiBbMl0gG1sxbW1hY3JvIGV4cGFuc2lvbhtbMjJtG1syMm0gYXQgG1sxbS9Vc2Vycy9odWF6aG91Ly5qdWxpYS92MC42L09wZW5DTC9zcmMvY29udGV4dC5qbDoxNDgbWzIybRtbMjJtIFtpbmxpbmVkXQogWzNdIBtbMW0oOjpPcGVuQ0wuY2wuIyM0MyM0NCkbWzIybRtbMjJtG1sxbSgbWzIybRtbMjJtG1sxbSkbWzIybRtbMjJtIGF0IBtbMW0uL3Rhc2suamw6MzM1G1syMm0bWzIybQobWzM5bQ==\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  2.86 KiB\n",
       "  allocs estimate:  96\n",
       "  --------------\n",
       "  minimum time:     18.158 μs (0.00% GC)\n",
       "  median time:      23.084 μs (0.00% GC)\n",
       "  mean time:        26.206 μs (3.11% GC)\n",
       "  maximum time:     17.951 ms (45.33% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using BenchmarkTools\n",
    "\n",
    "n = 512\n",
    "xd = rand(CLArray{Float32}, n, n)\n",
    "yd = rand(CLArray{Float32}, n, n)\n",
    "zd = zeros(CLArray{Float32}, n, n)\n",
    "\n",
    "# SP matrix multiplication on GPU\n",
    "@benchmark A_mul_B!($zd, $xd, $yd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     890.801 μs (0.00% GC)\n",
       "  median time:      1.207 ms (0.00% GC)\n",
       "  mean time:        1.237 ms (0.00% GC)\n",
       "  maximum time:     2.917 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          4028\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = rand(Float32, n, n)\n",
    "y = rand(Float32, n, n)\n",
    "z = zeros(Float32, n, n)\n",
    "\n",
    "# SP matrix multiplication on CPU\n",
    "@benchmark A_mul_B!($z, $x, $y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We ses ~50 fold speedup in this matrix multiplication example."
   ]
  }
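  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the timings above, a dense n×n matrix product takes about 2n³ flops. The sketch below converts the reported median times into a speedup and an effective throughput. The times are copied from the two benchmark outputs; the helper name `gflops` is hypothetical. The implied GPU throughput exceeds the peak single-precision figure of every GPU in the table at the top, which suggests the GPU benchmark may be timing only an asynchronous kernel launch rather than the complete multiplication.\n",
    "\n",
    "```julia\n",
    "# median times copied from the two benchmark outputs above, in seconds\n",
    "t_gpu = 23.084e-6\n",
    "t_cpu = 1.207e-3\n",
    "n = 512\n",
    "\n",
    "# hypothetical helper: effective GFLOPS of an n×n matrix product that takes t seconds\n",
    "gflops(n, t) = 2 * n^3 / t / 1e9\n",
    "\n",
    "speedup = t_cpu / t_gpu   # about 52\n",
    "gflops(n, t_cpu)          # about 222 GFLOPS on the CPU\n",
    "gflops(n, t_gpu)          # about 11600 GFLOPS implied for the GPU\n",
    "```"
   ]
  }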
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 0.6.4",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.4"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "30px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": true,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}