---
title: GPU Computing in Julia
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia (8 threads) 1.8.5
    language: julia
    name: julia-_8-threads_-1.8
---

This lecture introduces GPU computing in Julia.

## GPGPU

GPUs are ubiquitous in modern computers. The following are NVIDIA GPUs found in typical computer systems today.

| NVIDIA GPUs         | H100 PCIe                 | RTX 6000                         | RTX 5000                        |
|---------------------|---------------------------|----------------------------------|---------------------------------|
|                     | ![H100](nvidia_h100.png)  | ![RTX 6000](nvidia_rtx6000.png)  | ![RTX 5000](nvidia_rtx5000.png) |
| Computers           | servers, cluster          | desktop                          | laptop                          |
|                     | ![Server](gpu_server.jpg) | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)    |
| Main usage          | scientific computing      | daily work, gaming               | daily work                      |
| Memory              | 80 GB                     | 48 GB                            | 16 GB                           |
| Memory bandwidth    | 2 TB/sec                  | 960 GB/sec                       | 576 GB/sec                      |
| Number of cores     | ???                       | ???                              | ???                             |
| Processor clock     | ??? GHz                   | ??? GHz                          | ??? GHz                         |
| Peak DP performance | 26 TFLOPS                 | ??? TFLOPS                       | ??? TFLOPS                      |
| Peak SP performance | 51 TFLOPS                 | 91.1 TFLOPS                      | 42.6 TFLOPS                     |

## GPU architecture vs CPU architecture

* GPUs contain thousands of processing cores on a single card; several cards can fit in a desktop PC
* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm
* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly

| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |
|---------------------------|-----------------------------|
| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |

## GPGPU in Julia

GPU support in Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages.

There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.

- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, ... The [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package allows defining arrays on **Nvidia GPUs** and overloads many common operations.

- The [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package allows defining arrays on **AMD GPUs** and overloads many common operations.

- The [Metal.jl](https://github.com/JuliaGPU/Metal.jl) package allows defining arrays on **Apple Silicon** GPUs and overloads many common operations.

- The [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl) package allows defining arrays on **Intel GPUs** and overloads many common operations.

I'll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.3.1, which has an Apple M2 Max chip with 38 GPU cores.
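All four packages share the same array-programming model: wrap data in a device array type, and ordinary Julia array code runs on the GPU. Here is a minimal sketch of that model (the sizes are arbitrary; on other hardware, one would substitute `CuArray`, `ROCArray`, or `oneArray` for `MtlArray`):

```julia
using Metal                 # CUDA.jl, AMDGPU.jl, or oneAPI.jl on other hardware

a = MtlArray(rand(Float32, 1024, 1024))   # copy host data to the device
b = MtlArray(ones(Float32, 1024, 1024))   # another device array
c = a * b .+ 1f0            # matrix multiply and broadcast execute on the GPU
c_host = collect(c)         # copy the result back to main memory
```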
```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Query GPU devices in the system

```{julia}
using Metal

Metal.versioninfo()
```

## Transfer data between main memory and GPU

```{julia}
# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
```

```{julia}
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
```

```{julia}
# collect data from GPU to CPU
x = collect(xd)
```

## Linear algebra

```{julia}
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)
n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);
```

### Dot product

```{julia}
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
```

```{julia}
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

### Broadcast

```{julia}
# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
```

```{julia}
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

### Matrix multiplication

```{julia}
# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
```

For this problem size on this machine, the GPU achieves a staggering **~9 TFLOPS** throughput in single precision, as computed below:

```{julia}
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
```

```{julia}
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
```

```{julia}
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
```

We see a >10x speedup from the GPU in this matrix multiplication example.

```{julia}
# Cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
```

```{julia}
# xtx = collect(xtxd)
# @benchmark cholesky($(Symmetric(xtx)))
```

GPU speedup of Cholesky seems unavailable at the moment.

## Evaluation of elementary and special functions on GPU

```{julia}
# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
```

```{julia}
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
```

```{julia}
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
```

The GPU brings a great speedup (>50x) to the massive evaluation of elementary math functions.
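One caveat to keep in mind, and the reason every example in this lecture uses `Float32`: Apple GPUs do not support double-precision arithmetic, so `Float64` data must be downcast before transfer. A minimal sketch:

```julia
x64  = rand(3, 3)               # Float64 on the CPU
xd32 = MtlArray(Float32.(x64))  # downcast to single precision, then transfer
eltype(xd32)                    # Float32
```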
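Finally, broadcast expressions like the ones benchmarked above are compiled into GPU kernels automatically, but Metal.jl also lets you write kernels by hand. A minimal sketch, adapted from the Metal.jl README (the kernel name `vadd` and the vector length are arbitrary choices here):

```julia
# a hand-written GPU kernel: elementwise vector addition
function vadd(a, b, c)
    i = thread_position_in_grid_1d()  # global thread index (1-based)
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

a = MtlArray(rand(Float32, 1024))
b = MtlArray(rand(Float32, 1024))
c = similar(a)
# launch one GPU thread per element
@metal threads=length(c) vadd(a, b, c)
```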