{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Numba 0.52 CUDA release demo\n", "\n", "Key changes to the CUDA target for Release 0.52 include:\n", "\n", "* Support for Unified Memory on Linux (Experimental support only on Windows) (@maxpkatz)\n", "* Reduced kernel launch overhead for eagerly-compiled kernels (@gmarkall)\n", "* Access to all libdevice functions (@gmarkall)\n", "* Support for atomic subtraction (@testhound)\n", "* Additional `math` library functions (@zhihaoy)\n", "* Support for complex power (@gmarkall)\n", "* New convenience functions for creating mapped and pinned arrays like existing arrays (@c200chromebook)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# The usual imports\n", "\n", "from numba import cuda, float32, njit, void\n", "import math\n", "import numpy as np\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unified memory\n", "\n", "Unified Memory provides a single address space accessible by any CPU or GPU in a system, backed by a paging implementation that automatically ensures data is moved to a device or host only when it is needed. Some use cases this enables:\n", "\n", "* Allocating arrays accessible on the device that are larger than the device memory, paged out to system RAM.\n", "* Operations on data from both the CPU and GPU with no copying on Tegra systems.\n", "\n", "To allocate a managed array, use the new `cuda.managed_array()` function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# A small array\n", "arr = cuda.managed_array(100, dtype=np.float64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example using an array that is larger than GPU memory (note that if you don't have enough system memory, this may cause the kernel to be killed):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# If you have more than, or a lot less than, 16GB of GPU RAM then edit this:\n", "GB = 16\n", "\n", "n_elements = (GB + 1) * (1024 * 1024 * 1024)\n", "\n", "# Create a very large array\n", "big_arr = cuda.managed_array(n_elements, dtype=np.uint8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can set the memory on the device:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "@cuda.jit\n", "def initialize_array(x):\n", " start, stride = cuda.grid(1), cuda.gridsize(1)\n", " for i in range(start, len(x), stride):\n", " x[i] = 0xAB\n", "\n", "initialize_array[1024, 1024](big_arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can verify that all elements are set as expected:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Difference detected!\n" ] } ], "source": [ "@njit\n", "def check(x):\n", " difference = False\n", " for i in range(len(x)):\n", " if x[i] != 0xAB:\n", " difference = True\n", " \n", " if difference:\n", " print(\"Difference detected!\")\n", " else:\n", " print(\"All values as expected!\")\n", " \n", "check(big_arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel launch overhead\n", "\n", "Launch overhead for eagerly-compiled kernels (those where `@cuda.jit` is given a signature) has been reduced in Numba 0.52. 
The following code benchmarks launch overhead as the number of kernel arguments grows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18.9 µs ± 75.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", "31.5 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "39.9 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "48.8 µs ± 784 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "56.2 µs ± 497 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "@cuda.jit('void()')\n", "def some_kernel_1():\n", "    return\n", "\n", "@cuda.jit('void(float32[:])')\n", "def some_kernel_2(arr1):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:])')\n", "def some_kernel_3(arr1, arr2):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:], float32[:])')\n", "def some_kernel_4(arr1, arr2, arr3):\n", "    return\n", "\n", "@cuda.jit('void(float32[:], float32[:], float32[:], float32[:])')\n", "def some_kernel_5(arr1, arr2, arr3, arr4):\n", "    return\n", "\n", "arr = cuda.device_array(10000, dtype=np.float32)\n", "\n", "%timeit some_kernel_1[1, 1]()\n", "%timeit some_kernel_2[1, 1](arr)\n", "%timeit some_kernel_3[1, 1](arr, arr)\n", "%timeit some_kernel_4[1, 1](arr, arr, arr)\n", "%timeit some_kernel_5[1, 1](arr, arr, arr, arr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of this benchmark on an HP Z8 G4 with a Xeon Gold 6128 and a Quadro RTX 8000 are:\n", "\n", "Numba 0.51.2:\n", "\n", "```\n", "32.3 µs ± 461 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "64 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "86.6 µs ± 925 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "106 µs ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "125 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "```\n", "\n", "Numba 0.52:\n", "\n", "```\n", "20 µs ± 72.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "32.4 µs ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "41 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "48.6 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "56.7 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "```\n", "\n", "Your results may vary depending on your configuration - try out the benchmark with 0.51.2 and 0.52!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libdevice functions\n", "\n", "All [CUDA libdevice functions](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html) (with the exception of `__nv_nan` and `__nv_nanf`) are now available in the `cuda.libdevice` module. The leading `__nv_` is stripped from the names - for example, [`__nv_fast_cosf`](https://docs.nvidia.com/cuda/libdevice-users-guide/__nv_fast_cosf.html#__nv_fast_cosf) is available as `libdevice.fast_cosf`. 
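As a quick sketch of the naming convention (this kernel is illustrative, not part of the demo):\n", "\n", "```python\n", "from numba import cuda\n", "from numba.cuda import libdevice\n", "\n", "@cuda.jit\n", "def apply_fast_expf(r, x):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        # __nv_fast_expf is exposed as libdevice.fast_expf\n", "        r[i] = libdevice.fast_expf(x[i])\n", "```\n", "\n", "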
Here's a fuller example using some of the fast trigonometric functions:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard version time 5.148192998603918e-05s\n", "Libdevice version time 4.993253998691216e-05s\n" ] } ], "source": [ "from numba.cuda import libdevice\n", "\n", "# Implementation using the standard trigonometric functions\n", "@cuda.jit(void(float32[::1], float32[::1], float32[::1]))\n", "def trig_functions(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = math.sin(x[i]) * math.cos(y[i]) + math.tan(x[i] + y[i])\n", "\n", "# Implementation using the fast trigonometric functions\n", "@cuda.jit(void(float32[::1], float32[::1], float32[::1]))\n", "def fast_trig_functions(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = libdevice.fast_sinf(x[i]) * libdevice.fast_cosf(y[i]) + libdevice.fast_tanf(x[i] + y[i])\n", "\n", "\n", "# Create some random input\n", "N = 100000\n", "np.random.seed(1)\n", "x = np.random.random(N).astype(np.float32)\n", "y = np.random.random(N).astype(np.float32)\n", "\n", "# Copy input to the device and allocate space for output\n", "d_x = cuda.to_device(x)\n", "d_y = cuda.to_device(y)\n", "r_math = cuda.device_array_like(x)\n", "r_libdevice = cuda.device_array_like(x)\n", "\n", "n_runs = 100\n", "n_threads = 256\n", "n_blocks = math.ceil(N / n_threads)\n", "\n", "# Run and time the normal version\n", "start_math = time.perf_counter()\n", "for i in range(n_runs):\n", "    trig_functions[n_blocks, n_threads](r_math, d_x, d_y)\n", "cuda.synchronize()\n", "end_math = time.perf_counter()\n", "\n", "# Run and time the version using fast trig functions\n", "start_libdevice = time.perf_counter()\n", "for i in range(n_runs):\n", "    fast_trig_functions[n_blocks, n_threads](r_libdevice, d_x, d_y)\n", "cuda.synchronize()\n", "end_libdevice = time.perf_counter()\n", "\n", "\n", "# Note that the fast versions of the functions sacrifice accuracy for speed,\n", "# so a lower-than-default relative tolerance is required for this sanity check.\n", "np.testing.assert_allclose(r_math.copy_to_host(), r_libdevice.copy_to_host(), rtol=1.0e-2)\n", "\n", "# Note that timings will be fairly similar for this example, as the execution time will be\n", "# dominated by the kernel launch time. Times are in seconds per kernel launch.\n", "print(f\"Standard version time {(end_math - start_math) / n_runs}s\")\n", "print(f\"Libdevice version time {(end_libdevice - start_libdevice) / n_runs}s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Atomic subtract\n", "\n", "Atomic subtraction is now supported. 
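Like the other atomic operations, `cuda.atomic.sub(ary, idx, val)` performs `ary[idx] -= val` atomically and returns the value stored at `ary[idx]` before the subtraction. A minimal sketch using that return value to hand out work items (the kernel and names are illustrative, not part of this demo):\n", "\n", "```python\n", "@cuda.jit\n", "def claim_items(remaining, claimed):\n", "    i = cuda.grid(1)\n", "    # Atomically performs remaining[0] -= 1 and returns the old value\n", "    claimed[i] = cuda.atomic.sub(remaining, 0, 1)\n", "```\n", "\n", "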
The following example subtracts a set of values from a single element of an array, with every thread contending on the same location:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Result: 4217.0\n" ] } ], "source": [ "@cuda.jit\n", "def subtract_example(x, values):\n", "    i = cuda.grid(1)\n", "    cuda.atomic.sub(x, 0, values[i])\n", "\n", "initial = 12345.0\n", "n_blocks = 4\n", "n_threads = 32\n", "n_values = n_blocks * n_threads\n", "values = np.arange(n_values, dtype=np.float32)\n", "\n", "x = np.zeros(1, dtype=np.float32)\n", "x[0] = initial\n", "\n", "subtract_example[n_blocks, n_threads](x, values)\n", "\n", "# Floating point subtraction is not associative - the order in which subtractions\n", "# occur can cause a slight variation, so we use assert_allclose instead of checking\n", "# for exact equality.\n", "np.testing.assert_allclose(x, [initial - np.sum(values)])\n", "print(f\"Result: {x[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Math library functions\n", "\n", "The functions `math.frexp` and `math.ldexp` are now supported in CUDA kernels. Example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.4359949 0.02592623 0.5496625 0.4353224 0.4203678 0.3303348\n", " 0.20464863 0.619271 0.29965466 0.2668273 ]\n", "[0.4359949 0.02592623 0.5496625 0.4353224 0.4203678 0.3303348\n", " 0.20464863 0.619271 0.29965466 0.2668273 ]\n" ] } ], "source": [ "@cuda.jit\n", "def cuda_frexp_ldexp(x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(x):\n", "        fractional, exponent = math.frexp(x[i])\n", "        y[i] = math.ldexp(fractional, exponent)\n", "\n", "np.random.seed(2)\n", "n_values = 16384\n", "n_threads = 256\n", "n_blocks = n_values // n_threads\n", "\n", "values = np.random.random(n_values).astype(np.float32)\n", "results = np.zeros_like(values)\n", "\n", "cuda_frexp_ldexp[n_blocks, n_threads](values, results)\n", "\n", "# Sanity check\n", "np.testing.assert_equal(values, results)\n", "\n", "# Print the first few values and results\n", "print(values[:10])\n", "print(results[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Powers of complex numbers\n", "\n", "It is now possible to write a kernel using the power operator on complex numbers. 
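As a minimal illustrative sketch (assuming a real scalar exponent is also accepted, which this demo does not exercise), an elementwise complex square root could be written as:\n", "\n", "```python\n", "@cuda.jit\n", "def complex_sqrt(r, x):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        # Fractional power of a complex value\n", "        r[i] = x[i] ** 0.5\n", "```\n", "\n", "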
The demo below computes elementwise powers of random complex values and checks the results against the same computation on the CPU:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.55335274+0.49336656j 0.88211009+0.17821241j 0.72609239+0.0945577j\n", " 0.63333904+0.08188157j 0.6611122 +0.14269518j 0.42899033+0.39976284j\n", " 0.26301009-0.36953702j 0.14541033-0.01929919j 0.2840652 +0.32310602j\n", " 0.81477267+0.11021739j]\n", "[0.55335274+0.49336656j 0.88211009+0.17821241j 0.72609239+0.0945577j\n", " 0.63333904+0.08188157j 0.6611122 +0.14269518j 0.42899033+0.39976284j\n", " 0.26301009-0.36953702j 0.14541033-0.01929919j 0.2840652 +0.32310602j\n", " 0.81477267+0.11021739j]\n" ] } ], "source": [ "@cuda.jit\n", "def complex_power(r, x, y):\n", "    i = cuda.grid(1)\n", "    if i < len(r):\n", "        r[i] = x[i] ** y[i]\n", "\n", "np.random.seed(3)\n", "n_values = 16384\n", "n_threads = 256\n", "n_blocks = n_values // n_threads\n", "\n", "def random_complex():\n", "    \"Generate an array of random complex values\"\n", "    real = np.random.random(n_values)\n", "    imag = np.random.random(n_values)\n", "    return real + imag * 1j\n", "\n", "x = random_complex()\n", "y = random_complex()\n", "r = np.zeros_like(x)\n", "\n", "complex_power[n_blocks, n_threads](r, x, y)\n", "\n", "# Sanity check\n", "np.testing.assert_allclose(r, x ** y)\n", "\n", "# Print the first few results and the same computed on the CPU for comparison\n", "print(r[:10])\n", "print(x[:10] ** y[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `mapped_array_like` and `pinned_array_like`\n", "\n", "In addition to `device_array_like`, the new `mapped_array_like` and `pinned_array_like` functions create mapped and pinned arrays with the same shape and dtype as an existing array:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "x = np.zeros(16384, dtype=np.int16)\n", "\n", "d_x = cuda.device_array_like(x)\n", "m_x = cuda.mapped_array_like(x)\n", "p_x = cuda.pinned_array_like(x)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }