---
title: GPU Computing in Julia
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia (8 threads) 1.8.5
    language: julia
    name: julia-_8-threads_-1.8
---

This lecture introduces GPU computing in Julia.

## GPGPU

GPUs are ubiquitous in modern computers. The following are NVIDIA GPUs found in typical computer systems today.

| NVIDIA GPUs         | H100 PCIe                 | RTX 6000                         | RTX 5000                        |
|---------------------|---------------------------|----------------------------------|---------------------------------|
|                     | ![H100](nvidia_h100.png)  | ![RTX 6000](nvidia_rtx6000.png)  | ![RTX 5000](nvidia_rtx5000.png) |
| Computers           | servers, cluster          | desktop                          | laptop                          |
|                     | ![Server](gpu_server.jpg) | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)    |
| Main usage          | scientific computing      | daily work, gaming               | daily work                      |
| Memory              | 80 GB                     | 48 GB                            | 16 GB                           |
| Memory bandwidth    | 2 TB/sec                  | 960 GB/sec                       | 576 GB/sec                      |
| Number of cores     | ???                       | ???                              | ???                             |
| Processor clock     | ??? GHz                   | ??? GHz                          | ??? GHz                         |
| Peak DP performance | 26 TFLOPS                 | ??? TFLOPS                       | ??? TFLOPS                      |
| Peak SP performance | 51 TFLOPS                 | 91.1 TFLOPS                      | 42.6 TFLOPS                     |

## GPU architecture vs CPU architecture

* GPUs contain thousands of processing cores on a single card; several cards can fit in a desktop PC
* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm
* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly

| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |
|---------------------------|-----------------------------|
| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |

## GPGPU in Julia

GPU support in Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages.

There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.

- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, ... The [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package allows defining arrays on **Nvidia GPUs** and overloads many common operations.

- The [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package allows defining arrays on **AMD GPUs** and overloads many common operations.

- The [Metal.jl](https://github.com/JuliaGPU/Metal.jl) package allows defining arrays on **Apple Silicon** GPUs and overloads many common operations.

- The [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl) package allows defining arrays on **Intel GPUs** and overloads many common operations.

I'll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.3.1, which has an Apple M2 Max chip with 38 GPU cores.
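All four packages share the same array-programming model: wrap data in a device array type, and ordinary Julia array code runs on the GPU. Here is a minimal sketch of that model (the sizes are arbitrary; on other hardware, one would substitute `CuArray`, `ROCArray`, or `oneArray` for `MtlArray`):

```julia
using Metal                 # CUDA.jl, AMDGPU.jl, or oneAPI.jl on other hardware

a = MtlArray(rand(Float32, 1024, 1024))   # copy host data to the device
b = MtlArray(ones(Float32, 1024, 1024))   # another device array
c = a * b .+ 1f0            # matrix multiply and broadcast execute on the GPU
c_host = collect(c)         # copy the result back to main memory
```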
```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Query GPU devices in the system

```{julia}
using Metal

Metal.versioninfo()
```

## Transfer data between main memory and GPU

```{julia}
# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
```

```{julia}
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
```

```{julia}
# collect data from GPU to CPU
x = collect(xd)
```

## Linear algebra

```{julia}
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)
n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);
```

### Dot product

```{julia}
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
```

```{julia}
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

### Broadcast

```{julia}
# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
```

```{julia}
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

### Matrix multiplication

```{julia}
# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
```

For this problem size on this machine, the GPU achieves a staggering **~9 TFLOPS** throughput in single precision, as computed below:

```{julia}
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
```

```{julia}
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
```

```{julia}
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
```

We see a >10x speedup from the GPU in this matrix multiplication example.

```{julia}
# Cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
```

```{julia}
# xtx = collect(xtxd)
# @benchmark cholesky($(Symmetric(xtx)))
```

GPU speedup of Cholesky seems unavailable at the moment.

## Evaluation of elementary and special functions on GPU

```{julia}
# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
```

```{julia}
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

The GPU brings a great speedup (>50x) to the massive evaluation of elementary math functions.
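One caveat to keep in mind, and the reason every example in this lecture uses `Float32`: Apple GPUs do not support double-precision arithmetic, so `Float64` data must be downcast before transfer. A minimal sketch:

```julia
x64  = rand(3, 3)               # Float64 on the CPU
xd32 = MtlArray(Float32.(x64))  # downcast to single precision, then transfer
eltype(xd32)                    # Float32
```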
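Finally, broadcast expressions like the ones benchmarked above are compiled into GPU kernels automatically, but Metal.jl also lets you write kernels by hand. A minimal sketch, adapted from the Metal.jl README (the kernel name `vadd` and the vector length are arbitrary choices here):

```julia
# a hand-written GPU kernel: elementwise vector addition
function vadd(a, b, c)
    i = thread_position_in_grid_1d()  # global thread index (1-based)
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

a = MtlArray(rand(Float32, 1024))
b = MtlArray(rand(Float32, 1024))
c = similar(a)
# launch one GPU thread per element
@metal threads=length(c) vadd(a, b, c)
```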