{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Numba 0.52 CUDA release demo\n", "\n", "Key changes to the CUDA target for Release 0.52 include:\n", "\n", "* Support for Unified Memory on Linux (Experimental support only on Windows) (@maxpkatz)\n", "* Reduced kernel launch overhead for eagerly-compiled kernels (@gmarkall)\n", "* Access to all libdevice functions (@gmarkall)\n", "* Support for atomic subtraction (@testhound)\n", "* Additional `math` library functions (@zhihaoy)\n", "* Support for complex power (@gmarkall)\n", "* New convenience functions for creating mapped and pinned arrays like existing arrays (@c200chromebook)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# The usual imports\n", "\n", "from numba import cuda, float32, njit, void\n", "import math\n", "import numpy as np\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unified memory\n", "\n", "Unified Memory provides a single address space accessible by any CPU or GPU in a system, backed by a paging implementation that automatically ensures data is moved to a device or host only when it is needed. Some use cases this enables:\n", "\n", "* Allocating arrays accessible on the device that are larger than the device memory, paged out to system RAM.\n", "* Operations on data from both the CPU and GPU with no copying on Tegra systems.\n", "\n", "To allocate a managed array, use the new `cuda.managed_array()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# A small array\n", "arr = cuda.managed_array(100, dtype=np.float64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example using an array that is larger than GPU memory (note that if you don't have enough system memory, this may cause the kernel to be killed):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# If you have more than, or a lot less than, 16GB of GPU RAM then edit this:\n", "GB = 16\n", "\n", "n_elements = (GB + 1) * (1024 * 1024 * 1024)\n", "\n", "# Create a very large array\n", "big_arr = cuda.managed_array(n_elements, dtype=np.uint8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can set the memory on the device:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "@cuda.jit\n", "def initialize_array(x):\n", " start, stride = cuda.grid(1), cuda.gridsize(1)\n", " for i in range(start, len(x), stride):\n", " x[i] = 0xAB\n", "\n", "initialize_array[1024, 1024](big_arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can verify that all elements are set as expected:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Difference detected!\n" ] } ], "source": [ "@njit\n", "def check(x):\n", " difference = False\n", " for i in range(len(x)):\n", " if x[i] != 0xAB:\n", " difference = True\n", " \n", " if difference:\n", " print(\"Difference detected!\")\n", " else:\n", " print(\"All values as expected!\")\n", " \n", "check(big_arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel launch overhead\n", "\n", "Launch overhead for eagerly-compiled kernels (those where `@cuda.jit` is given a signature) has been reduced in Numba 0.52. 
The following code benchmarks launch overhead as the number of kernel arguments grows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18.9 µs ± 75.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", "31.5 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "39.9 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "48.8 µs ± 784 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "56.2 µs ± 497 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "@cuda.jit('void()')\n", "def some_kernel_1():\n", "    return\n", "\n", "@cuda.jit('void(float32[:])')\n", "def some_kernel_2(arr1):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:])')\n", "def some_kernel_3(arr1, arr2):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:], float32[:])')\n", "def some_kernel_4(arr1, arr2, arr3):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:], float32[:], float32[:])')\n", "def some_kernel_5(arr1, arr2, arr3, arr4):\n", "    return\n", "\n", "arr = cuda.device_array(10000, dtype=np.float32)\n", "\n", "%timeit some_kernel_1[1, 1]()\n", "%timeit some_kernel_2[1, 1](arr)\n", "%timeit some_kernel_3[1, 1](arr, arr)\n", "%timeit some_kernel_4[1, 1](arr, arr, arr)\n", "%timeit some_kernel_5[1, 1](arr, arr, arr, arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of this benchmark on an HP Z8 G4 with a Xeon Gold 6128 and a Quadro RTX 8000 are:\n", "\n", "Numba 0.51.2:\n", "\n", "```\n", "32.3 µs ± 461 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "64 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "86.6 µs ± 925 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "106 µs ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "125 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "```\n", "\n", "Numba 0.52:\n", "\n", "```\n", "20 µs ± 72.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "32.4 µs ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "41 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "48.6 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "56.7 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "```\n", "\n", "Your results may vary depending on your configuration - try out the benchmark with 0.51.2 and 0.52!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libdevice functions\n", "\n", "All [CUDA libdevice functions](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html) (with the exception of `__nv_nan` and `__nv_nanf`) are now available in the `cuda.libdevice` module. The leading `__nv_` is stripped from the names - for example, [`__nv_fast_cosf`](https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_cosf.html#__nv_fast_cosf) is available as `libdevice.fast_cosf`. 
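As a quick sketch of the naming convention (this kernel is illustrative, not part of the demo):\n", "\n", "```python\n", "from numba import cuda\n", "from numba.cuda import libdevice\n", "\n", "@cuda.jit\n", "def apply_fast_expf(r, x):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        # __nv_fast_expf is exposed as libdevice.fast_expf\n", "        r[i] = libdevice.fast_expf(x[i])\n", "```\n", "\n", "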
Here's a fuller example using some of the fast trigonometric functions:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard version time 5.148192998603918e-05s\n", "Libdevice version time 4.993253998691216e-05s\n" ] } ], "source": [ "from numba.cuda import libdevice\n", "\n", "# Implementation using the standard trigonometric functions\n", "@cuda.jit(void(float32[::1], float32[::1], float32[::1]))\n", "def trig_functions(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = math.sin(x[i]) * math.cos(y[i]) + math.tan(x[i] + y[i])\n", "\n", "# Implementation using the fast trigonometric functions\n", "@cuda.jit(void(float32[::1], float32[::1], float32[::1]))\n", "def fast_trig_functions(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = libdevice.fast_sinf(x[i]) * libdevice.fast_cosf(y[i]) + libdevice.fast_tanf(x[i] + y[i])\n", "\n", "\n", "# Create some random input\n", "N = 100000\n", "np.random.seed(1)\n", "x = np.random.random(N).astype(np.float32)\n", "y = np.random.random(N).astype(np.float32)\n", "\n", "# Copy input to the device and allocate space for output\n", "d_x = cuda.to_device(x)\n", "d_y = cuda.to_device(y)\n", "r_math = cuda.device_array_like(x)\n", "r_libdevice = cuda.device_array_like(x)\n", "\n", "n_runs = 100\n", "n_threads = 256\n", "n_blocks = math.ceil(N / n_threads)\n", "\n", "# Run and time the normal version\n", "start_math = time.perf_counter()\n", "for i in range(n_runs):\n", "    trig_functions[n_blocks, n_threads](r_math, d_x, d_y)\n", "cuda.synchronize()\n", "end_math = time.perf_counter()\n", "\n", "# Run and time the version using fast trig functions\n", "start_libdevice = time.perf_counter()\n", "for i in range(n_runs):\n", "    fast_trig_functions[n_blocks, n_threads](r_libdevice, d_x, d_y)\n", "cuda.synchronize()\n", "end_libdevice = time.perf_counter()\n", "\n", "\n", "# Note that the fast versions of the functions sacrifice accuracy for speed,\n", "# so a lower-than-default relative tolerance is required for this sanity check.\n", "np.testing.assert_allclose(r_math.copy_to_host(), r_libdevice.copy_to_host(), rtol=1.0e-2)\n", "\n", "# Note that timings will be fairly similar for this example, as the execution time will be\n", "# dominated by the kernel launch time. Times are in seconds per kernel launch.\n", "print(f\"Standard version time {(end_math - start_math) / n_runs}s\")\n", "print(f\"Libdevice version time {(end_libdevice - start_libdevice) / n_runs}s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Atomic subtract\n", "\n", "Atomic subtraction is now supported. 
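Like the other atomic operations, `cuda.atomic.sub(ary, idx, val)` performs `ary[idx] -= val` atomically and returns the value stored at `ary[idx]` before the subtraction. A minimal sketch using that return value to hand out work items (the kernel and names are illustrative, not part of this demo):\n", "\n", "```python\n", "@cuda.jit\n", "def claim_items(remaining, claimed):\n", "    i = cuda.grid(1)\n", "    # Atomically performs remaining[0] -= 1 and returns the old value\n", "    claimed[i] = cuda.atomic.sub(remaining, 0, 1)\n", "```\n", "\n", "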
The following example subtracts a set of values from a single element of an array, with every thread contending on the same location:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Result: 4217.0\n" ] } ], "source": [ "@cuda.jit\n", "def subtract_example(x, values):\n", "    i = cuda.grid(1)\n", "    cuda.atomic.sub(x, 0, values[i])\n", "\n", "initial = 12345.0\n", "n_blocks = 4\n", "n_threads = 32\n", "n_values = n_blocks * n_threads\n", "values = np.arange(n_values, dtype=np.float32)\n", "\n", "x = np.zeros(1, dtype=np.float32)\n", "x[0] = initial\n", "\n", "subtract_example[n_blocks, n_threads](x, values)\n", "\n", "# Floating point subtraction is not associative - the order in which subtractions\n", "# occur can cause a slight variation, so we use assert_allclose instead of checking\n", "# for exact equality.\n", "np.testing.assert_allclose(x, [initial - np.sum(values)])\n", "print(f\"Result: {x[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Math library functions\n", "\n", "The functions `math.frexp` and `math.ldexp` are now supported in CUDA kernels. Example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.4359949 0.02592623 0.5496625 0.4353224 0.4203678 0.3303348\n", " 0.20464863 0.619271 0.29965466 0.2668273 ]\n", "[0.4359949 0.02592623 0.5496625 0.4353224 0.4203678 0.3303348\n", " 0.20464863 0.619271 0.29965466 0.2668273 ]\n" ] } ], "source": [ "@cuda.jit\n", "def cuda_frexp_ldexp(x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(x):\n", "        fractional, exponent = math.frexp(x[i])\n", "        y[i] = math.ldexp(fractional, exponent)\n", "\n", "np.random.seed(2)\n", "n_values = 16384\n", "n_threads = 256\n", "n_blocks = n_values // n_threads\n", "\n", "values = np.random.random(n_values).astype(np.float32)\n", "results = np.zeros_like(values)\n", "\n", "cuda_frexp_ldexp[n_blocks, n_threads](values, results)\n", "\n", "# Sanity check\n", "np.testing.assert_equal(values, results)\n", "\n", "# Print the first few values and results\n", "print(values[:10])\n", "print(results[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Powers of complex numbers\n", "\n", "It is now possible to write a kernel using the power operator on complex numbers. 
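As a minimal illustrative sketch (assuming a real scalar exponent is also accepted, which this demo does not exercise), an elementwise complex square root could be written as:\n", "\n", "```python\n", "@cuda.jit\n", "def complex_sqrt(r, x):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        # Fractional power of a complex value\n", "        r[i] = x[i] ** 0.5\n", "```\n", "\n", "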
The demo below computes elementwise powers of random complex values and checks the results against the same computation on the CPU:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.55335274+0.49336656j 0.88211009+0.17821241j 0.72609239+0.0945577j\n", " 0.63333904+0.08188157j 0.6611122 +0.14269518j 0.42899033+0.39976284j\n", " 0.26301009-0.36953702j 0.14541033-0.01929919j 0.2840652 +0.32310602j\n", " 0.81477267+0.11021739j]\n", "[0.55335274+0.49336656j 0.88211009+0.17821241j 0.72609239+0.0945577j\n", " 0.63333904+0.08188157j 0.6611122 +0.14269518j 0.42899033+0.39976284j\n", " 0.26301009-0.36953702j 0.14541033-0.01929919j 0.2840652 +0.32310602j\n", " 0.81477267+0.11021739j]\n" ] } ], "source": [ "@cuda.jit\n", "def complex_power(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = x[i] ** y[i]\n", "\n", "np.random.seed(3)\n", "n_values = 16384\n", "n_threads = 256\n", "n_blocks = n_values // n_threads\n", "\n", "def random_complex():\n", "    \"Generate an array of random complex values\"\n", "    real = np.random.random(n_values)\n", "    imag = np.random.random(n_values)\n", "    return real + imag * 1j\n", "\n", "x = random_complex()\n", "y = random_complex()\n", "r = np.zeros_like(x)\n", "\n", "complex_power[n_blocks, n_threads](r, x, y)\n", "\n", "# Sanity check\n", "np.testing.assert_allclose(r, x ** y)\n", "\n", "# Print the first few results and the same computed on the CPU for comparison\n", "print(r[:10])\n", "print(x[:10] ** y[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `mapped_array_like` and `pinned_array_like`\n", "\n", "In addition to `device_array_like`, the new `mapped_array_like` and `pinned_array_like` functions create mapped and pinned arrays with the same shape and dtype as an existing array:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "x = np.zeros(16384, dtype=np.int16)\n", "\n", "d_x = cuda.device_array_like(x)\n", "m_x = cuda.mapped_array_like(x)\n", "p_x = cuda.pinned_array_like(x)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }