{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GPU Computing in Julia\n", "\n", "This session introduces GPU computing in Julia." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 0.6.4\n", "Commit 9d11f62bcb (2018-07-09 19:09 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)\n", " LAPACK: libopenblas64_\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-3.9.1 (ORCJIT, skylake)\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPGPU" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GPUs are ubiquitous in modern computers. Following are GPUs today's typical computer systems.\n", "\n", "| NVIDIA GPUs | Tesla K80 | GTX 1080 | GT 650M |\n", "|---------------------|----------------------------------------|-----------------------------------------|--------------------------------------|\n", "| | ![Tesla M2090](nvidia_k80.jpg) | ![GTX 580](nvidia_gtx1080.jpg) | ![GT 650M](nvidia_gt650m.jpg) |\n", "| Computers | servers, cluster | desktop | laptop |\n", "| | ![Server](gpu_server.jpg) | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png) |\n", "| Main usage | scientific computing | daily work, gaming | daily work |\n", "| Memory | 24 GB | 8 GB | 1GB |\n", "| Memory bandwidth | 480 GB/sec | 320 GB/sec | 80GB/sec |\n", "| Number of cores | 4992 | 2560 | 384 |\n", "| Processor clock | 562 MHz | 1.6 GHz | 0.9GHz |\n", "| Peak DP performance | 2.91 TFLOPS | 257 GFLOPS | |\n", "| Peak SP performance | 8.73 TFLOPS | 8228 GFLOPS | 691Gflops |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GPU architecture vs CPU architecture. \n", "* GPUs contain 100s of processing cores on a single card; several cards can fit in a desktop PC \n", "* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm \n", "* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly\n", "\n", "| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |\n", "|----------------------------------|------------------------------------|\n", "| ![Einstein](einstein.png) | ![Rain man](rainman.png) |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPGPU in Julia\n", "\n", "GPU support by Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages. \n", "\n", "There are at least three paradigms to program GPU in Julia.\n", "\n", "- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolve, CuDNN, ...\n", "\n", " The [CuArray.jl](https://github.com/JuliaGPU/CuArrays.jl) package allows defining arrays on Nvidia GPUs and overloads many common operations. CuArrays.jl supports Julia v1.0+.\n", "\n", "- **OpenCL** is a standard supported multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but lacks some libraries essential for statistical computing.\n", "\n", " The [CLArray.jl](https://github.com/JuliaGPU/CLArrays.jl) package allows defining arrays on OpenCL devices and overloads many common operations. **Currently CLArrays.jl only supports Julia v0.6**.\n", "\n", "- [**ArrayFire**](https://arrayfire.com) is a high performance library that works on both CUDA or OpenCL framework.\n", "\n", " The [ArrayFire.jl](https://github.com/JuliaGPU/ArrayFire.jl) package wraps the library for julia.\n", "\n", "- **Warning:** Most recent Apple operating system iOS 10.14 (Mojave) does **not** support CUDA yet.\n", "\n", "Because my laptop has an AMD Radeon GPU, I'll illustrate using OpenCL on Julia **v0.6.4**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query GPU devices in the system" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Array{OpenCL.cl.Device,1}:\n", " OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500) \n", " OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00) \n", " OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using CLArrays\n", "\n", "# check available devices on this machine\n", "CLArrays.devices()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OpenCL context with:\n", "CL version: OpenCL 1.2 \n", "Device: CL AMD Radeon Pro 460 Compute Engine\n", " threads: 256\n", " blocks: (256, 256, 256)\n", " global_memory: 4294.967296 mb\n", " free_global_memory: NaN mb\n", " local_memory: 0.032768 mb\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use the AMD Radeon Pro 460 GPU\n", "dev = CLArrays.devices()[2]\n", "CLArrays.init(dev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate arrays on GPU devices" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 5×3 Array{Float32,2}:\n", " 0.935426 0.656262 0.145963 \n", " 0.255108 0.0626027 0.565825 \n", " 0.0110094 0.679819 0.767408 \n", " 0.640347 0.114715 0.0996476\n", " 0.607266 0.332567 0.492061 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# generate GPU arrays\n", "xd = rand(CLArray{Float32}, 5, 3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 5×3 Array{Float32,2}:\n", " 1.0 1.0 1.0\n", " 1.0 1.0 1.0\n", " 1.0 1.0 1.0\n", " 1.0 1.0 1.0\n", " 1.0 1.0 1.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "yd = ones(CLArray{Float32}, 5, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transfer data between main memory and GPU" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 5×3 Array{Float64,2}:\n", " 0.0621414 -0.905539 0.795695\n", " -0.440687 0.427526 -0.235945\n", " 0.304403 0.0362422 -0.260491\n", " 1.58347 -0.492863 -0.076589\n", " 0.193085 0.351561 -0.857394" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transfer data from main memory to GPU\n", "x = randn(5, 3)\n", "xd = CLArray(x)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5×3 Array{Float64,2}:\n", " 0.0621414 -0.905539 0.795695\n", " -0.440687 0.427526 -0.235945\n", " 0.304403 0.0362422 -0.260491\n", " 1.58347 -0.492863 -0.076589\n", " 0.193085 0.351561 -0.857394" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# transfer data from main memory to GPU\n", "x = collect(xd)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Elementiwise operations" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 5×3 Array{Float64,2}:\n", " 0.0602494 -1.54533 0.539034 \n", " -0.556103 0.346862 -0.266262 \n", " 0.262152 0.0355933 -0.297806 \n", " 0.693107 -0.640839 -0.0795998\n", " 0.175538 0.295921 -1.41116 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zd = log.(yd .+ sin.(xd))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 5×3 Array{Float64,2}:\n", " 0.0621414 -0.905539 0.795695\n", " -0.440687 0.427526 -0.235945\n", " 0.304403 0.0362422 -0.260491\n", " 1.55813 -0.492863 -0.076589\n", " 0.193085 0.351561 -0.857394" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# getting back x\n", "asin.(exp.(zd) .- yd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear algebra" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GPU: 3×3 Array{Float32,2}:\n", " 1.70241 1.70241 1.70241 \n", " -0.583072 -0.583072 -0.583072\n", " -0.634723 -0.634723 -0.634723" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zd = zeros(CLArray{Float32}, 3, 3)\n", "At_mul_B!(zd, xd, yd)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "base64 binary data: G1s5MW1FUlJPUiAodW5oYW5kbGVkIHRhc2sgZmFpbHVyZSk6IBtbOTFtT3BlbkNMIEVycm9yOiBPcGVuQ0wuQ29udGV4dCBlcnJvcjoggMMiBYB/G1szOW0KU3RhY2t0cmFjZToKIFsxXSAbWzFtcmFpc2VfY29udGV4dF9lcnJvchtbMjJtG1syMm0bWzFtKBtbMjJtG1syMm06OlN0cmluZywgOjpTdHJpbmcbWzFtKRtbMjJtG1syMm0gYXQgG1sxbS9Vc2Vycy9odWF6aG91Ly5qdWxpYS92MC42L09wZW5DTC9zcmMvY29udGV4dC5qbDoxMDkbWzIybRtbMjJtCiBbMl0gG1sxbW1hY3JvIGV4cGFuc2lvbhtbMjJtG1syMm0gYXQgG1sxbS9Vc2Vycy9odWF6aG91Ly5qdWxpYS92MC42L09wZW5DTC9zcmMvY29udGV4dC5qbDoxNDgbWzIybRtbMjJtIFtpbmxpbmVkXQogWzNdIBtbMW0oOjpPcGVuQ0wuY2wuIyM0MyM0NCkbWzIybRtbMjJtG1sxbSgbWzIybRtbMjJtG1sxbSkbWzIybRtbMjJtIGF0IBtbMW0uL3Rhc2suamw6MzM1G1syMm0bWzIybQobWzM5bQ==\n" ] }, { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 2.86 KiB\n", " allocs estimate: 96\n", " --------------\n", " minimum time: 18.158 μs (0.00% GC)\n", " median time: 23.084 μs (0.00% GC)\n", " mean time: 26.206 μs (3.11% GC)\n", " maximum time: 17.951 ms (45.33% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "n = 512\n", "xd = rand(CLArray{Float32}, n, n)\n", "yd = rand(CLArray{Float32}, n, n)\n", "zd = zeros(CLArray{Float32}, n, n)\n", "\n", "# SP matrix multiplication on GPU\n", "@benchmark A_mul_B!($zd, $xd, $yd)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 890.801 μs (0.00% GC)\n", " median time: 1.207 ms (0.00% GC)\n", " mean time: 1.237 ms (0.00% GC)\n", " maximum time: 2.917 ms (0.00% GC)\n", " --------------\n", " samples: 4028\n", " evals/sample: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = rand(Float32, n, n)\n", "y = rand(Float32, n, n)\n", "z = zeros(Float32, n, n)\n", "\n", "# SP matrix multiplication on CPU\n", "@benchmark A_mul_B!($z, $x, $y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We ses ~50 fold speedup in this matrix multiplication example." ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 0.6.4", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.4" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "30px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }