{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Emprical Cumulative Distribution Function on the GPU\n", "\n", "\n", "\n", "In this Jupyter notebook we show how to use the GPU seamlessly in a scripting mode with Alea GPU V3 to calculate the empirical cumulative distribution function for a very large sample size. The purpose of this notebook is to demonstrate the new highly efficient GPU primitives that will come with our next release 3.1 of Alea GPU, such as highly optimized parallel sort, scan, reduce, copy if, stream compactification, merge, join, etc. \n", "\n", "## Background\n", "\n", "The empirical cumulative distribution function $\\hat{F}_n(t)$ for the samples $x_1, \\ldots, x_n$ is defined as\n", "\n", "$${\\hat F}_{n}(t)={\\frac {{\\mbox{number of elements in the sample}}\\leq t}n}={\\frac {1}{n}}\\sum _{{i=1}}^{n}{\\mathbf {1}}_{{x_{i}\\leq t}}$$\n", "\n", "It is an estimator of the true distribution function from which the samples $x_1, \\ldots, x_n$ are generated. More details can be found on [Wikipedia](https://en.wikipedia.org/wiki/Empirical_distribution_function).\n", "\n", "\n", "\n", "## Let's Get Started!\n", "\n", "Before you continue you should install the [F# kernel for Jupyter](https://github.com/fsprojects/IfSharp).\n", "\n", "We use Paket to get the newest version 3.1 beta NuGet packages of Alea GPU. You can run Alea GPU free on any CUDA capable GeForce or Quadro GPU. In case you want to run it on an enterprise GPU such as a Tesla K80 or the brand new NVIDIA P100 you need a license. \n", "\n", "Unfortunately as of now (January 2017), it is not possible to run this notebook on [Azure Notebooks](https://notebooks.azure.com) with server side execution because the Azure Notebooks service does not yet provide GPU instances. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#load \"Paket.fsx\"\n", "Paket.Version [ (\"Alea\", \"3.0.3-beta2\") ]\n", "Paket.Package [ \"NUnit\" ]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#load \"packages/Alea/Alea.fsx\"\n", "#r \"packages/Alea/lib/net45/Alea.Parallel.dll\" \n", "#r \"packages/NUnit/lib/net45/nunit.framework.dll\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\r\n", "\r\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#load \"XPlot.Plotly.Paket.fsx\"\n", "#load \"XPlot.Plotly.fsx\"\n", "open XPlot.Plotly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect the GPU \n", "\n", "We first check what GPU is available. As we only need one GPU for our computations, we select the default GPU." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "GPU is [0|3.0|GeForce GT 750M], Number of Cores 384, GPU DRAM is 2.000 GB, Process is 64 bit" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "open System\n", "open Alea\n", "open Alea.CSharp\n", "open Alea.Parallel\n", "\n", "let gpu = Gpu.Default\n", "let s = sprintf \"GPU is %A, Number of Cores %A, GPU DRAM is %.3f GB, Process is %d bit\"\n", " gpu\n", " gpu.Device.Cores\n", " ((float) gpu.Device.TotalMemory / 1024. / 1024. / 1024.)\n", " (IntPtr.Size * 8)\n", " \n", "{ Html = s }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary Search\n", "\n", "To sample the empirical cumulative distribution function we use binary search. We code a binary search function that will be used on the CPU and the GPU, giving us full code reuse. In order to be able to compile it to run on the GPU, we add the `[ReflectedDefinition]` attribute, so that the F# compiler generates the quotation for this function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "[]\n", "let inline binarySearch (n:int) (v:int -> 'T) (x:'T) =\n", " let mutable l = 0\n", " let mutable u = n - 1\n", " while u - 1 > l do\n", " let m = int((uint32(l) + uint32(u)) >>> 1)\n", " if x < (v m) then u <- m\n", " else l <- m\n", " l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU Version\n", "\n", "We create a function to sample data and sample the empirical cumulative distribution function on a grid of points for plotting. Here is the CPU version:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "let ecdfCpu numSamples numPlotPoints (gen:float[] -> unit) =\n", " let numbers = Array.zeroCreate numSamples\n", " let plotX = Array.zeroCreate numPlotPoints\n", " let plotY = Array.zeroCreate numPlotPoints\n", " \n", " gen numbers\n", " \n", " // start measuring time\n", " GC.Collect()\n", " GC.WaitForPendingFinalizers()\n", " let t = System.Diagnostics.Stopwatch.StartNew()\n", " \n", " // first sort it\n", " Array.sortInPlace numbers\n", " \n", " // grab min max value\n", " let min = numbers.[0]\n", " let max = numbers.[numSamples - 1]\n", " let dist = max - min\n", "\n", " // binary search to create ecdf\n", " for i = 0 to numPlotPoints - 1 do\n", " let x = min + (dist / (float numPlotPoints)) * (float i)\n", " let v i = numbers.[i]\n", " let y = binarySearch numSamples v x\n", " let y = (float y) / (float numSamples)\n", " plotX.[i] <- x\n", " plotY.[i] <- y\n", " \n", " t.Stop()\n", " \n", " // to get more accurate timings, we keep these array alive so we are not interrupted by GC collection\n", " GC.KeepAlive(numbers)\n", " GC.KeepAlive(plotX) \n", " GC.KeepAlive(plotY)\n", " \n", " plotX, plotY, t.Elapsed " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPU Version\n", "\n", "The GPU implementation relies on the new highly efficient GPU sort primitive, which we provide with Alea GPU 3.1 beta and a new session concept, which simplifies the API by hiding all the requires scratch pad memory allocation and management:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "let ecdfGpu numSamples numPlotPoints (gen:Session -> float[] -> unit) =\n", " // create a GPU session, which manages the temporary memory required by the different algrotihms.\n", " use session = new Session(gpu)\n", " \n", " let numbers = session.Allocate(numSamples)\n", " let minmax = session.Allocate(2)\n", " let plotX = session.Allocate(numPlotPoints)\n", " let plotY = session.Allocate(numPlotPoints)\n", "\n", " gen session numbers\n", " \n", " // start measuring time, we synchonize gpu first\n", " session.Gpu.Synchronize()\n", " GC.Collect()\n", " GC.WaitForPendingFinalizers()\n", " let t = System.Diagnostics.Stopwatch.StartNew()\n", " \n", " // first sort it\n", " session.Sort(numbers, numbers, (fun a b -> a < b))\n", " \n", " // grab min max value\n", " session.Gpu.Launch((fun () ->\n", " minmax.[0] <- numbers.[0]\n", " minmax.[1] <- numbers.[numSamples - 1]),\n", " LaunchParam(1, 1))\n", " \n", " session.For(0, numPlotPoints, (fun i ->\n", " let min = minmax.[0]\n", " let max = minmax.[1]\n", " let dist = max - min\n", " let x = min + (dist / (float numPlotPoints)) * (float i)\n", " let v i = numbers.[i]\n", " let y = binarySearch numSamples v x\n", " let y = (float y) / (float numSamples)\n", " plotX.[i] <- x\n", " plotY.[i] <- y\n", " ))\n", " \n", " // synchornize gpu then stop timer\n", " gpu.Synchronize()\n", " t.Stop()\n", " \n", " // to get more accurate timings, we keep these array alive so we are not interrupted by GC collection\n", " GC.KeepAlive(numbers)\n", " GC.KeepAlive(minmax)\n", " GC.KeepAlive(plotX)\n", " GC.KeepAlive(plotY)\n", "\n", " Gpu.CopyToHost(plotX), Gpu.CopyToHost(plotY), t.Elapsed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run and Visualize \n", "\n", "We create sample generators for the normal and log-normal distribution using parallel random number generators from cuRand. Note that these generators also can be used to run on the CPU." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "let genGpuUniform (session:Session) (numbers:float[]) =\n", " session.UsingPseudoRandom(seed=42UL)\n", " session.RandomUniform(numbers)\n", " \n", "let genGpuNormal (session:Session) (numbers:float[]) =\n", " session.UsingPseudoRandom(seed=42UL)\n", " session.RandomNormal(numbers, 0.0, 1.0)\n", " \n", "let genGpuLogNormal (session:Session) (numbers:float[]) =\n", " session.UsingPseudoRandom(seed=42UL)\n", " session.RandomLogNormal(numbers, 0.0, 1.0)\n", " \n", "let genCpuUniform (data:float[]) =\n", " use rng = cuRAND.Generator.CreateCpu(cuRAND.RngType.PSEUDO_DEFAULT)\n", " rng.SetPseudoRandomGeneratorSeed(42UL)\n", " rng.GenerateUniform(data)\n", " \n", "let genCpuNormal (data:float[]) =\n", " use rng = cuRAND.Generator.CreateCpu(cuRAND.RngType.PSEUDO_DEFAULT)\n", " rng.SetPseudoRandomGeneratorSeed(42UL)\n", " rng.GenerateNormal(data, 0.0, 1.0)\n", " \n", "let genCpuLogNormal (data:float[]) =\n", " use rng = cuRAND.Generator.CreateCpu(cuRAND.RngType.PSEUDO_DEFAULT)\n", " rng.SetPseudoRandomGeneratorSeed(42UL)\n", " rng.GenerateLogNormal(data, 0.0, 1.0) \n", " \n", "let layout title x y= \n", " Layout(title = title, \n", " xaxis=Xaxis(title = x, showgrid = false, zeroline = false), \n", " yaxis=Yaxis(title = y, showline = false), \n", " showlegend = true) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Empirical Cummulative Distribution Function for the Normal Distribution\n", "\n", "Let's estimate the empirical cumulative distribution function on the CPU and GPU for the normal distribution. Because we are only interested in the visualization, we choose a moderate sample size. 1MB gives us already sufficient accuracy." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "let numSamples = 1024*1024 \n", "let numPlotPoints = 1000" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let x, y, _ = ecdfCpu numSamples numPlotPoints genCpuNormal\n", "Scatter(name = \"Normal\", x = x, y = y, mode = \"lines\")\n", "|> Chart.Plot \n", "|> Chart.WithLayout (layout \"Empirical Cumulative Distribution Function on CPU\" \"x\" \"probability\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let x, y, _ = ecdfGpu numSamples numPlotPoints genGpuNormal\n", "Scatter(name = \"Normal\", x = x, y = y, mode = \"lines\") \n", "|> Chart.Plot \n", "|> Chart.WithLayout (layout \"Empirical Cumulative Distribution Function on GPU\" \"x\" \"probability\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Empirical Cummulative Distribution Function for the Log-Normal Distribution\n", "\n", "We repeate the experiment for the log-normal distribution. Note that it has support on $[0, \\infty)$ only." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let x, y, _ = ecdfCpu numSamples numPlotPoints genCpuLogNormal\n", "Scatter(name = \"Log Normal\", x = x, y = y, mode = \"lines\")\n", "|> Chart.Plot \n", "|> Chart.WithLayout (layout \"Empirical Cumulative Distribution Function on CPU\" \"x\" \"probability\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let x, y, _ = ecdfGpu numSamples numPlotPoints genGpuLogNormal\n", "Scatter(name = \"Log Normal\", x = x, y = y, mode = \"lines\")\n", "|> Chart.Plot \n", "|> Chart.WithLayout (layout \"Empirical Cumulative Distribution Function on GPU\" \"x\" \"probability\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Timings\n", "\n", "The interesting question is of course how much faster the GPU does the job. Let's collect some timings on a larger sample set. \n", "\n", "### Speedup\n", "\n", "First we measure the speedup. For a small GPU choose a count of say 10 MB or 100 MB." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
DeviceTimingSpeedup
CPU13421.90 ms1
GPU145.44 ms92.2846826393253
" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let megaBytes = if Gpu.Default.Device.Cores <= 512 then 10 else 100\n", "let numSamples = 1024*1024*megaBytes\n", "\n", "let _, _, cpuTime = ecdfCpu numSamples 1000 genCpuNormal\n", "let _, _, gpuTime = ecdfGpu numSamples 1000 genGpuNormal\n", "let speedup = cpuTime.TotalMilliseconds / gpuTime.TotalMilliseconds\n", "\n", "type data = { Device: string; Timing: string; Speedup: float }\n", "let records = \n", " [|\n", " { Device = \"CPU\"; Timing = sprintf \"%.2f ms\" cpuTime.TotalMilliseconds; Speedup = 1.0}\n", " { Device = \"GPU\"; Timing = sprintf \"%.2f ms\" gpuTime.TotalMilliseconds; Speedup = speedup}\n", " |]\n", "\n", "records |> Util.Table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Timings for Varying Sample Size\n", "\n", "Let's analyze the GPU performance across different sample sizes. For a small GPU choose a smaller working set such as [1..10]." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
MegaBytesTimingMegaBytesPerSec
100145.49 ms687.32 MB/sec
200305.55 ms654.55 MB/sec
300451.81 ms664.00 MB/sec
400628.83 ms636.10 MB/sec
500786.48 ms635.74 MB/sec
600944.28 ms635.41 MB/sec
7001103.01 ms634.62 MB/sec
8001315.26 ms608.25 MB/sec
9001479.82 ms608.18 MB/sec
" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let megaBytes = if Gpu.Default.Device.Cores <= 512 then [1..10] else [100..100..900]\n", "\n", "let gpuPerformance =\n", " seq {\n", " for scale in megaBytes do\n", " let numSamples = scale * 1024 * 1024\n", " let _, _, time = ecdfGpu numSamples 1000 genGpuNormal\n", " yield (scale, time.TotalMilliseconds)\n", " } |> Seq.toList\n", "\n", "type data = { MegaBytes: int; Timing: string; MegaBytesPerSec: string }\n", "gpuPerformance \n", "|> List.map (fun (s, t) -> \n", " { \n", " MegaBytes = s\n", " Timing = sprintf \"%.2f ms\" t\n", " MegaBytesPerSec = sprintf \"%.2f MB/sec\" ((float s) / t * 1000.0)\n", " }) \n", "|> Util.Table" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let layout = \n", " Layout(title = \"Timing in ms\", \n", " xaxis=Xaxis(title = \"Size in MB\", showgrid = false, zeroline = false), \n", " yaxis=Yaxis(title = \"Timing (ms)\", showline = false)) \n", " \n", "Bar(x = (gpuPerformance |> List.map (fun (mb, t) -> sprintf \"%d MB\" mb)),\n", " y = (gpuPerformance |> List.map (fun (mb, t) -> t)))\n", "|> Chart.Plot\n", "|> Chart.WithLayout layout" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let layout = \n", " Layout(title = \"Performance in MB per second\", \n", " xaxis=Xaxis(title = \"Size in MB\", showgrid = false, zeroline = false), \n", " yaxis=Yaxis(title = \"MB per second\", showline = false)) \n", " \n", "Bar(x = (gpuPerformance |> List.map (fun (s, t) -> sprintf \"%d MB\" s)),\n", " y = (gpuPerformance |> List.map (fun (s, t) -> (float s) / t * 1000.0)))\n", "|> Chart.Plot\n", "|> Chart.WithLayout layout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Comparison \n", "\n", "We now compare the performance of CPU and the GPU version. For a small GPU choose a smaller working set such as [1..10]." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
MegaBytesGPUTimingCPUTimingSpeedup
100144.17 ms13122.91 ms91.02
200306.89 ms27465.22 ms89.50
300453.69 ms42691.84 ms94.10
400634.50 ms57360.08 ms90.40
500789.93 ms71462.31 ms90.47
600951.10 ms87471.71 ms91.97
7001109.42 ms101020.00 ms91.06
8001319.30 ms116294.78 ms88.15
9001480.24 ms133094.66 ms89.91
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let megaBytes = if Gpu.Default.Device.Cores <= 512 then [1..10] else [100..100..900]\n", "\n", "let performanceCpuGpu numSamples numPlotTimes =\n", " let _, _, cpuTime = ecdfCpu numSamples numPlotTimes genCpuNormal\n", " let _, _, gpuTime = ecdfGpu numSamples numPlotTimes genGpuNormal\n", " let cpuTime = cpuTime.TotalMilliseconds\n", " let gpuTime = gpuTime.TotalMilliseconds\n", " let speedup = cpuTime / gpuTime\n", " cpuTime, gpuTime, speedup\n", "\n", "let gpuCpuPerformance =\n", " seq { \n", " for scale in megaBytes do\n", " let numSamples = scale * 1024 * 1024\n", " let cpuTime, gpuTime, speedup = performanceCpuGpu numSamples 1000\n", " yield (scale, cpuTime, gpuTime, speedup)\n", " } |> Seq.toList\n", " \n", "type data = { MegaBytes: int; GPUTiming: string; CPUTiming: string; Speedup: string } \n", "gpuCpuPerformance |> List.map (fun (s, ct, gt, speedup) -> \n", " { \n", " MegaBytes = s\n", " GPUTiming = sprintf \"%.2f ms\" gt\n", " CPUTiming = sprintf \"%.2f ms\" ct\n", " Speedup = sprintf \"%.2f\" speedup\n", " }) |> Util.Table" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\r\n", " " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let layout = \n", " Layout(title = \"CPU / GPU Performance in MB per second\", \n", " xaxis=Xaxis(title = \"Size in MB\", showgrid = false, zeroline = false), \n", " yaxis=Yaxis(title = \"MB per second\", showline = false), showlegend = true) \n", " \n", "seq {\n", " let x = gpuCpuPerformance |> List.map (fun (s, _, _, _) -> sprintf \"%d MB\" s)\n", " let yCpu = gpuCpuPerformance |> List.map (fun (s, t, _, _) -> (float s) / t * 1000.0)\n", " let yGpu = gpuCpuPerformance |> List.map (fun (s, _, t, _) -> (float s) / t * 1000.0)\n", " yield Bar(x = x, y = yCpu, name = \"CPU\")\n", " yield Bar(x = x, y = yGpu, name = \"GPU\")\n", "} \n", "|> Seq.toList \n", "|> Chart.Plot\n", "|> Chart.WithLayout layout" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "F#", "language": "fsharp", "name": "ifsharp" }, "language": "fsharp", "language_info": { "codemirror_mode": "", "file_extension": ".fs", "mimetype": "text/x-fsharp", "name": "fsharp", "nbconvert_exporter": "", "pygments_lexer": "", "version": "4.3.1.0" } }, "nbformat": 4, "nbformat_minor": 0 }