{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Estimating $\\pi$ on the GPU\n",
"\n",
"\n",
"\n",
"In this Jupyter notebook we show how to use the GPU seamlessly in a scripting mode with Alea GPU V3. We demonstrate the new highly efficient GPU primitives that will come with our next release 3.1 of Alea GPU, such as highly optimized parallel sort, scan, reduce, copy if, stream compactification, merge, join, etc. We use the parallel-for and parallel aggretation method to implement a Monte Carlo simulation algorithm, which approximates the value of $\\pi$.\n",
"\n",
"## Background\n",
"\n",
"Monte Carlo simulation methods are computational algorithms which rely on repeated random sampling to estimate a results. Monte Carlo simulation can be used to calculate the value of $\\pi$ as follows\n",
"\n",
"$$\n",
"\\dfrac{\\text{area of unit circle}}{\\text{area of } [-1,1]^2} = \\dfrac{\\pi}{4} = \n",
" \\dfrac{\\text{hits in unit circle}}{\\text{randomly generated points in } [-1,1]^2}.\n",
"$$\n",
"\n",
"More precisely if $A \\subset \\mathbb{R}^n$, $f : A \\rightarrow \\mathbb{R}$ is an integrable function and $x^{(i)} \\in A$, $i=1, \\ldots, n$ are uniformly distributed points then\n",
"\n",
"$$\n",
"\\int_A f(x) dx \\approx \\frac{\\mathrm{vol}(A)}{n} \\sum_{i=1}^n f(x^{(i)}).\n",
"\n",
"To find the formula stated in the first equation apply the above equation with $A = [-1,1]^2$ and $f(x) = \\mathbb{1}_{\\{x_1^2 + x_2^2 \\leq 1\\}}$ the indicator function of the unit circle. \n",
"\n",
"\n",
"## Let's Get Started!\n",
"\n",
"Before you continue you should install the [F# kernel for Jupyter](https://github.com/fsprojects/IfSharp).\n",
"\n",
"We use Paket to get the newest version 3.1 beta NuGet packages of Alea GPU. You can run Alea GPU free on any CUDA capable GeForce or Quadro GPU. In case you want to run it on an enterprise GPU such as a Tesla K80 or the brand new NVIDIA P100 you need a license. \n",
"\n",
"Unfortunately as of now (January 2017), it is not possible to run this notebook on [Azure Notebooks](https://notebooks.azure.com) with server side execution because the Azure Notebooks service does not yet provide GPU instances."
]
},
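{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the estimator at work before involving the GPU, here is a minimal sequential F# sketch of the same idea. It is a sanity check only: it uses `System.Random` instead of cuRAND, and the name `estimatePiSequential` is ours, not part of Alea GPU.\n",
"\n",
"```fsharp\n",
"let estimatePiSequential n =\n",
"    let rng = System.Random 42\n",
"    let mutable hits = 0\n",
"    for _ = 1 to n do\n",
"        // sample a point uniformly in [-1,1]^2 and test membership in the unit circle\n",
"        let x = rng.NextDouble() * 2.0 - 1.0\n",
"        let y = rng.NextDouble() * 2.0 - 1.0\n",
"        if x * x + y * y <= 1.0 then hits <- hits + 1\n",
"    // hits / n approximates pi / 4\n",
"    4.0 * float hits / float n\n",
"```\n",
"\n",
"With a few million samples this typically lands within a few digits of $\\pi$; the GPU versions below use the same estimator at a much larger scale."
]
},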
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#load \"Paket.fsx\"\n",
"Paket.Version [ (\"Alea\", \"3.0.3-beta2\") ]\n",
"Paket.Package [ \"NUnit\" ]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#load \"packages/Alea/Alea.fsx\"\n",
"#r \"packages/Alea/lib/net45/Alea.Parallel.dll\"\n",
"#r \"packages/NUnit/lib/net45/nunit.framework.dll\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\r\n",
"\r\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#load \"XPlot.Plotly.Paket.fsx\"\n",
"#load \"XPlot.Plotly.fsx\"\n",
"open XPlot.Plotly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspect the GPU\n",
"\n",
"We first check what GPU is available. As we only need one GPU for our computations, we select the default GPU."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"GPU is [0|6.0|Tesla P100-PCIE-16GB], Number of Cores 3584, GPU DRAM is 15.879 GB, Process is 64 bit"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open System\n",
"open System.Threading.Tasks;\n",
"open Alea\n",
"open Alea.CSharp\n",
"open Alea.Parallel\n",
"\n",
"let gpu = Gpu.Default\n",
"let s = sprintf \"GPU is %A, Number of Cores %A, GPU DRAM is %.3f GB, Process is %d bit\"\n",
" gpu\n",
" gpu.Device.Cores\n",
" ((float) gpu.Device.TotalMemory / 1024. / 1024. / 1024.)\n",
" (IntPtr.Size * 8)\n",
" \n",
"{ Html = s }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## GPU Implementation with Parallel-For and Parallel Aggregate\n",
"\n",
"Monte Carlo simulation experiments often rely on **independent samples** and are therefore embarassingly parallel. This means that we can implement them in two stages:\n",
"\n",
" - **Parallel-for**: Generate random numbers and transform them to the samples \n",
" - **Parallel aggregate**: Calculate the mean value of the samples\n",
" \n",
"The function `estimatePi` follows this paradigm and applies a technique called batch sampling. Generating random numbers in a parallel computing context requires some care: in order to get correct results it is important to assure that the generated random numbers do not overlap. This is achieved by jumping to the right position in the random stream with an offset \n",
"```\n",
"let offset = (uint64 batchSize) * (uint64 iBatch)\n",
"session.RandomUniform(points, offset)\n",
"```\n",
"Not every random number generator is designed for parallel execution. The interested reader can find more details in [this paper](https://people.maths.ox.ac.uk/gilesm/files/gems_rng.pdf). \n",
"\n",
"After all the batches are completed and stored their mean values in the array `pis` we copy this array to the CPU and perform a final average on the CPU."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"let estimatePiTwoStage verbose rng seed batchSize batchs =\n",
" let t = System.Diagnostics.Stopwatch.StartNew()\n",
" use session = new Session(gpu)\n",
" session.UsingPseudoRandom(rngType=rng, seed=seed)\n",
" let points = session.Allocate batchSize\n",
" let values = session.Allocate batchSize\n",
" let pis = session.Allocate batchs\n",
"\n",
" for iBatch = 1 to batchs do\n",
" session.RandomUniform(points)\n",
" \n",
" session.For(0, batchSize, (fun i ->\n",
" let point = points.[i]\n",
" let d = point.x * point.x + point.y * point.y\n",
" values.[i] <- if d < 1.0 then 4.0 else 0.0))\n",
"\n",
" session.Aggregate(batchSize,\n",
" (fun i -> values.[i]),\n",
" (fun value -> pis.[iBatch - 1] <- value / (float batchSize)),\n",
" (fun a b -> a + b))\n",
"\n",
" let pis = Gpu.CopyToHost pis\n",
" let pi = pis |> Array.average\n",
" t.Stop()\n",
"\n",
" if verbose then\n",
" printfn \"%A\" pis\n",
" printfn \"PI = %f, %A\" pi t.Elapsed\n",
"\n",
" pi, t.Elapsed.TotalMilliseconds, ((pis |> Array.distinct).Length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## GPU Implementation with Parallel Transform-Aggregate\n",
"\n",
"Alea GPU comes with a special overload of the parallel aggregate which first applies a transform before the parallel aggregation. This allows us to safe one kernel call. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"let estimatePiCompact verbose rng seed batchSize batchs =\n",
" let t = System.Diagnostics.Stopwatch.StartNew()\n",
" use session = new Session(gpu)\n",
" session.UsingPseudoRandom(rngType=rng, seed=seed)\n",
" let points = session.Allocate batchSize\n",
" let pis = session.Allocate batchs\n",
"\n",
" for iBatch = 1 to batchs do\n",
" session.RandomUniform(points)\n",
"\n",
" session.Aggregate(batchSize,\n",
" (fun i ->\n",
" let point = points.[i]\n",
" let d = point.x * point.x + point.y * point.y\n",
" if d < 1.0 then 4.0 else 0.0),\n",
" (fun value -> pis.[iBatch - 1] <- value / (float batchSize)),\n",
" (fun a b -> a + b))\n",
"\n",
" let pis = Gpu.CopyToHost pis\n",
" let pi = pis |> Array.average\n",
" t.Stop()\n",
"\n",
" if verbose then\n",
" printfn \"%A\" pis\n",
" printfn \"PI = %f, %A\" pi t.Elapsed\n",
"\n",
" pi, t.Elapsed.TotalMilliseconds, ((pis |> Array.distinct).Length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CPU Implementation\n",
"\n",
"To measure the actual benefit of moving to the GPU we code a CPU version, which also used the cuRand library but in CPU mode."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"let estimatePiCpu verbose rngType seed batchSize batchs =\n",
" let t = System.Diagnostics.Stopwatch.StartNew()\n",
" let points = Array.zeroCreate (2*batchSize) \n",
" let values = Array.zeroCreate batchSize\n",
" let pis = Array.zeroCreate batchs \n",
" \n",
" use rng = cuRAND.Generator.CreateCpu rngType\n",
" \n",
" for iBatch = 1 to batchs do\n",
" // skip ahead to the right offset to generate non-overlapping blocks of random numbers\n",
" let offset = 2UL * (uint64 batchSize) * (uint64 iBatch)\n",
" rng.SetGeneratorOffset offset\n",
" rng.GenerateUniform(points)\n",
" \n",
" Parallel.For(0, batchSize, (fun i ->\n",
" let pointx = points.[i]\n",
" let pointy = points.[batchSize + 1]\n",
" let d = pointx * pointx + pointy * pointy\n",
" values.[i] <- if d < 1.0 then 4.0 else 0.0)) |> ignore\n",
"\n",
" pis.[iBatch - 1] <- values |> Array.average\n",
" \n",
" let pi = pis |> Array.average\n",
" t.Stop()\n",
"\n",
" if verbose then\n",
" printfn \"%A\" pis\n",
" printfn \"PI = %f, %A\" pi t.Elapsed\n",
"\n",
" pi, t.Elapsed.TotalMilliseconds, ((pis |> Array.distinct).Length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the Computations\n",
"\n",
"The NVIDIA cuRand library provides multiple random number generators. Let's run the code and measure the execution time."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"let estimatePi batches batchSize impl rng =\n",
" let cuRANDType () =\n",
" match rng with\n",
" | PseudoRandomType.XORWOW -> cuRAND.RngType.PSEUDO_XORWOW\n",
" | PseudoRandomType.MRG32K3A -> cuRAND.RngType.PSEUDO_MRG32K3A\n",
" | PseudoRandomType.PHILOX4_32_10 -> cuRAND.RngType.PSEUDO_PHILOX4_32_10\n",
" | _ -> failwith \"TODO\"\n",
" let seed = 1UL\n",
" let pi, t, l = \n",
" match impl with\n",
" | \"gpu compact\" -> estimatePiCompact false rng seed batchSize batches\n",
" | \"gpu two stage\" -> estimatePiTwoStage false rng seed batchSize batches\n",
" | \"cpu\" -> estimatePiCpu false (cuRANDType()) seed batchSize batches\n",
" | _ -> failwithf \"unknown implementation %s\" impl\n",
" pi, t, l, rng, impl"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"RngType | Implementation | Value | Timing | Speedup\n",
"--- | --- | --- | --- | ---\n",
"XORWOW | gpu compact | 3.141569664 | 0.05 s | 1542.0\n",
"XORWOW | gpu two stage | 3.141569664 | 0.07 s | 1171.6\n",
"XORWOW | cpu | 3.078599336 | 77.52 s | 1.0\n",
"MRG32K3A | gpu compact | 3.141664656 | 0.06 s | 1639.2\n",
"MRG32K3A | gpu two stage | 3.141664656 | 0.07 s | 1281.9\n",
"MRG32K3A | cpu | 3.184040656 | 93.44 s | 1.0\n",
"PHILOX4_32_10 | gpu compact | 3.141626248 | 0.05 s | 1946.9\n",
"PHILOX4_32_10 | gpu two stage | 3.141626248 | 0.06 s | 1454.6\n",
"PHILOX4_32_10 | cpu | 3.5850756 | 91.52 s | 1.0\n"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"let batches, batchSize = if Gpu.Default.Device.Cores <= 512 then 50, 1000000 else 50, 10000000\n",
"\n",
"// JIT to get more accurate timings \n",
"estimatePi batches batchSize \"gpu compact\" PseudoRandomType.XORWOW\n",
"estimatePi batches batchSize \"gpu two stage\" PseudoRandomType.XORWOW\n",
"\n",
"let results =\n",
" [\n",
" estimatePi batches batchSize \"gpu compact\" PseudoRandomType.XORWOW\n",
" estimatePi batches batchSize \"gpu two stage\" PseudoRandomType.XORWOW\n",
" estimatePi batches batchSize \"cpu\" PseudoRandomType.XORWOW\n",
" estimatePi batches batchSize \"gpu compact\" PseudoRandomType.MRG32K3A \n",
" estimatePi batches batchSize \"gpu two stage\" PseudoRandomType.MRG32K3A \n",
" estimatePi batches batchSize \"cpu\" PseudoRandomType.MRG32K3A \n",
" estimatePi batches batchSize \"gpu compact\" PseudoRandomType.PHILOX4_32_10 \n",
" estimatePi batches batchSize \"gpu two stage\" PseudoRandomType.PHILOX4_32_10 \n",
" estimatePi batches batchSize \"cpu\" PseudoRandomType.PHILOX4_32_10 \n",
" ]\n",
" \n",
"let cpuTimes = \n",
" results \n",
" |> List.filter (fun (pi, t, l, rng, impl) -> impl = \"cpu\")\n",
" |> List.map (fun (pi, t, l, rng, impl) -> rng, t)\n",
" |> Map.ofList \n",
"\n",
"type data = { RngType: string; Implementation: string; Value: float; Timing: string; Speedup: string }\n",
"results \n",
"|> List.map (fun (pi, t, l, rng, impl) -> \n",
" { \n",
" RngType = sprintf \"%A\" rng\n",
" Implementation = impl \n",
" Value = pi\n",
" Timing = sprintf \"%.2f s\" (t / 1000.0)\n",
" Speedup = sprintf \"%.1f\" ((cpuTimes |> Map.find rng) / t)\n",
" })\n",
"|> Util.Table "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\r\n",
" "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"let layout = \n",
" Layout(title = \"CPU / GPU Performance\", \n",
" xaxis=Xaxis(title = \"Random number generators\", showgrid = false, zeroline = false), \n",
" yaxis=Yaxis(title = \"Speedup\", showline = false), showlegend = true) \n",
" \n",
"let grouped = results |> List.groupBy (fun (pi, t, l, rng, impl) -> impl)\n",
"let bars = \n",
" grouped |> List.map(fun (impl, s) -> \n",
" let x = s |> List.map (fun (pi, t, l, rng, impl) -> sprintf \"%A\" rng)\n",
" let y = s |> List.map (fun (pi, t, l, rng, impl) -> (cpuTimes |> Map.find rng) / t)\n",
" Bar(x = x, y = y, name = impl))\n",
"\n",
"bars\n",
"|> Seq.toList \n",
"|> Chart.Plot\n",
"|> Chart.WithLayout layout"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "F#",
"language": "fsharp",
"name": "ifsharp"
},
"language": "fsharp",
"language_info": {
"codemirror_mode": "",
"file_extension": ".fs",
"mimetype": "text/x-fsharp",
"name": "fsharp",
"nbconvert_exporter": "",
"pygments_lexer": "",
"version": "4.3.1.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}