{ "cells": [ { "cell_type": "markdown", "id": "f96f77ed", "metadata": {}, "source": [ "# Numba 0.54 CUDA Release Demo\n", "\n", "Key changes to the CUDA target for Release 0.54 demonstrated in this notebook:\n", "\n", "* Warnings for behavior that may have a negative impact on performance (Michael Collison):\n", " - Kernel launches using grids that are too small to make effective use of the device\n", " - Implicit copies that force a costly synchronization when launching a kernel\n", "* Support for implementing warp-aggregated atomics (Graham Markall):\n", " - The functions `activemask()` and `lanemask_lt()` are now supported\n", " - `cuda.ffs` now gives correct answers (its behavior mirrors that of [`__ffs()`](https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT_1gaf1eb22243e29e0b7222adee8ae7d4e4) in CUDA C.\n", "* Tuples can now be passed to CUDA ufuncs (Graham Markall).\n", "* Relaxed strides checking is now used to compute contiguity of arrays, enabling some new use cases (Graham Markall).\n", "\n", "Other key changes, not demonstrated in this notebook, include:\n", "\n", "* Debugging improvements (Graham Markall):\n", " - `cuda-gdb` can now find the source location at the beginning of a kernel (e.g. when breaking on the first instruction of kernels with `set cuda break_on_launch application`)\n", " - Breakpoints can now be set on mangled names of kernels.\n", "* Per-Thread Default Stream support (Graham Markall)\n", "* Support for adding lineinfo to CUDA kernels so that NSight Compute can display profile information for each source line of CUDA kernels (Max Katz)\n", "* IPC is now supported on Windows (Graham Markall).\n", "* High-level API support for extending the CUDA target (`@overload`) (Stuart Archibald and Graham Markall).\n", "* `nanosleep()` is added - this can be used to implement exponential backoff when atomic operations have high contention (Graham Markall)\n", "* Support for generating a fast division instruction in PTX when `fastmath=True` (Michael Collison)\n", "\n", "## Some useful imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "94bbb0bd", "metadata": {}, "outputs": [], "source": [ "from numba import cuda, guvectorize, vectorize, void, int32, float64, uint32\n", "import math\n", "import numpy as np\n", "np.random.seed(1)" ] }, { "cell_type": "markdown", "id": "bc234033", "metadata": {}, "source": [ "## Performance-hinting warnings\n", "\n", "Getting the best performance out of the CUDA target needs care to be taken with the grid dimensions and also with data transfers. 
{ "cell_type": "markdown", "id": "3f8e6b1a", "metadata": {}, "source": [ "### Small grid warnings\n", "\n", "Let's define a simple linear algebra kernel:" ] }, { "cell_type": "code", "execution_count": 2, "id": "2a5b7366", "metadata": {}, "outputs": [], "source": [ "@cuda.jit\n", "def axpy(r, a, x, y):\n", " i = cuda.grid(1)\n", " if i < len(r):\n", " r[i] = a * x[i] + y[i]" ] }, { "cell_type": "markdown", "id": "05a01303", "metadata": {}, "source": [ "We'll also write a function that creates input data of a given size, `N`, and launches the kernel on it with a grid large enough to cover all the elements of the input:" ] }, { "cell_type": "code", "execution_count": 3, "id": "707f8f89", "metadata": {}, "outputs": [], "source": [ "def create_and_add_vectors(N):\n", " # Create input data and transfer to GPU\n", " x = np.random.random(N)\n", " y = np.random.random(N)\n", " d_x = cuda.to_device(x)\n", " d_y = cuda.to_device(y)\n", " d_r = cuda.device_array_like(d_x)\n", " a = 4.5\n", "\n", " # Compute grid dimensions\n", " \n", " # An arbitrary reasonable choice of block size\n", " block_dim = 256\n", " # Enough blocks to cover the input\n", " grid_dim = math.ceil(len(d_x) / block_dim)\n", "\n", " # Launch the kernel\n", " axpy[grid_dim, block_dim](d_r, a, d_x, d_y)\n", " \n", " # Return the result\n", " return d_r.copy_to_host()" ] }, { "cell_type": "markdown", "id": "e2747af3", "metadata": {}, "source": [ "Now if we use the kernel on vectors of length 32, we end up with a very small grid, which results in a warning that it is too small for efficient execution:" ] }, { "cell_type": "code", "execution_count": 4, "id": "a30fe925", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: \u001b[1mGrid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] }, { "data": { "text/plain": [ "array([2.83448855, 3.77462551, 0.6923918 , 1.67601221, 1.34690244,\n", " 1.25014935, 0.85645923, 2.30516759, 2.77431472, 3.17284096,\n", " 2.16681931, 3.87276708, 1.02326113, 4.39942199, 1.03183967,\n", " 3.31071794, 2.16564695, 2.6441328 , 0.65110818, 1.57029223,\n", " 3.81497868, 4.62272375, 1.90198196, 3.16881432, 4.51786879,\n", " 4.17245856, 0.97200449, 0.87550488, 0.86657132, 4.36569725,\n", " 1.13696091, 2.30916358])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "create_and_add_vectors(32)" ] }, { "cell_type": "markdown", "id": "e36004d9", "metadata": {}, "source": [ "In general, the grid should contain at least twice as many blocks as there are SMs on the device. For example, a Quadro RTX 8000 has 72 SMs, so a minimum of 144 blocks would be required. Note that `2 * SMs` is not necessarily ideal - sometimes oversubscribing by a further factor of 3-4 is better for hiding latency." ] },
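{ "cell_type": "markdown", "id": "9c4d7e2b", "metadata": {}, "source": [ "The SM count can be queried from the device object, so this floor on the grid size need not be hard-coded. A sketch, assuming `N` and the block size from `create_and_add_vectors` above (`MULTIPROCESSOR_COUNT` is one of the CUDA device attributes Numba exposes on the device object):\n", "\n", "```python\n", "device = cuda.get_current_device()\n", "block_dim = 256\n", "\n", "# At least two blocks per SM, and enough blocks to cover the input;\n", "# the bounds check in axpy makes any surplus threads harmless\n", "grid_dim = max(2 * device.MULTIPROCESSOR_COUNT, math.ceil(N / block_dim))\n", "```" ] },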
{ "cell_type": "markdown", "id": "5b0e8d3c", "metadata": {}, "source": [ "For most GPUs, using input data with `2 ** 16` elements should result in a grid large enough for the device to be utilized effectively:" ] }, { "cell_type": "code", "execution_count": 5, "id": "7c48b343", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.79326341, 2.86602196, 3.74815365, ..., 1.23942119, 3.11112051,\n", " 4.51046983])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "create_and_add_vectors(2 ** 16)" ] }, { "cell_type": "markdown", "id": "f51190d2", "metadata": {}, "source": [ "#### Vectorize\n", "\n", "When using `@vectorize`, the grid is created implicitly, based on the size of the input. This is useful because it saves the user from thinking about the grid, but it can also hide the fact that a small grid is created for small data. To catch this, the warning is also emitted for `@vectorize`-d functions.\n", "\n", "Let's redo the example with `@vectorize`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "4ec99653", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: \u001b[1mEager compilation of device functions is deprecated (this occurs when a signature is provided)\u001b[0m\n", " warn(NumbaDeprecationWarning(msg))\n" ] } ], "source": [ "@vectorize([float64(float64, float64, float64)], target='cuda')\n", "def axpy_vectorize(a, x, y):\n", " return a * x + y\n", "\n", "def vectorize_add_vectors(N):\n", " x = np.random.random(N)\n", " y = np.random.random(N)\n", " d_x = cuda.to_device(x)\n", " d_y = cuda.to_device(y)\n", " d_r = cuda.device_array_like(d_x)\n", " a = 4.5\n", "\n", " axpy_vectorize(a, d_x, d_y, out=d_r)\n", " \n", " return d_r.copy_to_host()" ] }, { "cell_type": "markdown", "id": "6f70890d", "metadata": {}, "source": [ "(Note that an unrelated warning about deprecated behavior is produced here by the internal implementation of `@vectorize` - the deprecated behavior will be removed in 0.55, and the warning has no bearing on the use of `@vectorize`.)\n", "\n", "Now if we again call with small data, a small grid will be produced and warned about:" ] }, { "cell_type": "code", "execution_count": 7, "id": "8495fb3b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: \u001b[1mGrid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] }, { "data": { "text/plain": [ "array([0.9076929 , 0.84091496, 4.51281098, 1.60343059, 2.39170646,\n", " 3.57143795, 1.9939654 , 0.66315516, 1.01864565, 1.65200497,\n", " 1.4520746 , 3.12323577, 3.13834654, 3.20744399, 0.8351249 ,\n", " 0.61896314, 5.05435786, 2.13761441, 4.8652938 , 3.19708948,\n", " 4.05406256, 1.17937198, 0.27872906, 3.49203118, 3.56165816,\n", " 1.04358889, 3.96496876, 1.80941232, 0.94801354, 4.8046405 ,\n", " 1.84938901, 3.86637987])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorize_add_vectors(32)" ] }, { "cell_type": "markdown", "id": "824551cf", "metadata": {}, "source": [ "Now let's try again with larger data. `@vectorize` parameterizes the grid differently from the example above, so we need larger input data than before to ensure the warning doesn't trigger on larger GPUs:" ] }, { "cell_type": "code", "execution_count": 8, "id": "fe80b54e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4.56478069, 1.52891639, 3.09231419, ..., 3.26977228, 1.42833257,\n", " 2.86626516])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorize_add_vectors(2 ** 18)" ] },
{ "cell_type": "markdown", "id": "c5434ab8", "metadata": {}, "source": [ "#### GUVectorize\n", "\n", "`@guvectorize` can similarly create a small grid, so the warning is also enabled for it. Sometimes the combination of the shape of the input data and the shapes of the input variables can obscure the fact that few parallel invocations of the inner function are created, so this warning can be helpful.\n", "\n", "As an example, let's define a gufunc that multiplies pairs of vectors elementwise:" ] }, { "cell_type": "code", "execution_count": 9, "id": "b9bbaadc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: \u001b[1mEager compilation of device functions is deprecated (this occurs when a signature is provided)\u001b[0m\n", " warn(NumbaDeprecationWarning(msg))\n" ] } ], "source": [ "@guvectorize(['void(float32[:], float32[:], float32[:])'],\n", " '(n),(n)->(n)', target='cuda')\n", "def numba_dist_cuda(a, b, dist):\n", " n = a.shape[0]\n", " for i in range(n):\n", " dist[i] = a[i] * b[i]\n", " " ] }, { "cell_type": "markdown", "id": "9d7ef275", "metadata": {}, "source": [ "If we construct the inputs such that the function is invoked once on a single large input - and therefore uses only one thread - we see the warning:" ] }, { "cell_type": "code", "execution_count": 10, "id": "403c458b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: \u001b[1mGrid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] }, { "data": { "text/plain": [ "array([0.25242987, 0.15926176, 0.12577234, ..., 0.27449533, 0.00076392,\n", " 0.13365015], dtype=float32)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.random.rand(1024 * 32).astype('float32')\n", "b = np.random.rand(1024 * 32).astype('float32')\n", "dist = np.zeros(a.shape[0]).astype('float32')\n", "\n", "numba_dist_cuda(a, b, dist)" ] },
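{ "cell_type": "markdown", "id": "2e7f4a6d", "metadata": {}, "source": [ "Since the gufunc just forms elementwise products, we can sanity-check its result against plain NumPy. A quick check (not part of the original demo):\n", "\n", "```python\n", "# The gufunc computed dist[i] = a[i] * b[i], so NumPy's elementwise\n", "# product should match to within floating-point tolerance\n", "np.testing.assert_allclose(dist, a * b, rtol=1e-6)\n", "```" ] },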
{ "cell_type": "markdown", "id": "556a1b14", "metadata": {}, "source": [ "Now we construct inputs that result in many invocations of the function: 524288 instances, each operating on a 2-vector. There will be enough blocks to fill most devices, so the warning is not emitted:" ] }, { "cell_type": "code", "execution_count": 11, "id": "50485dc9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.27024668, 0.4367813 ],\n", " [0.02799038, 0.25452447],\n", " [0.03211394, 0.31761077],\n", " ...,\n", " [0.2193208 , 0.08123732],\n", " [0.15066122, 0.23769546],\n", " [0.00870252, 0.06313875]], dtype=float32)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.random.rand(524288 * 2).astype('float32').reshape((524288, 2))\n", "b = np.random.rand(524288 * 2).astype('float32').reshape((524288, 2))\n", "dist = np.zeros_like(a)\n", "\n", "numba_dist_cuda(a, b, dist)" ] }, { "cell_type": "markdown", "id": "fce20e55", "metadata": {}, "source": [ "### Implicit copy warnings\n", "\n", "Numba makes it easy to pass host-side NumPy arrays to CUDA kernels by automatically transferring them to and from the device around a kernel call. Whilst this is convenient in early development, a better strategy for performance is to control data movement to and from the device manually.\n", "\n", "To help ensure that a data movement strategy is properly implemented, Numba now warns when a kernel launch makes implicit copies.\n", "\n", "For example, let's call our `axpy` kernel again, this time on NumPy arrays:" ] }, { "cell_type": "code", "execution_count": 12, "id": "31badd36", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/devicearray.py:790: NumbaPerformanceWarning: \u001b[1mHost array used in CUDA kernel will incur copy overhead to/from device.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] } ], "source": [ "N = 2 ** 16\n", "x = np.random.random(N)\n", "y = np.random.random(N)\n", "r = np.empty_like(x)\n", "a = 5\n", "\n", "block_dim = 256\n", "grid_dim = math.ceil(len(x) / block_dim)\n", "\n", "axpy[grid_dim, block_dim](r, a, x, y)" ] }, { "cell_type": "markdown", "id": "c47f0058", "metadata": {}, "source": [ "The emitted warning makes it clear that we're incurring some overhead here." ] },
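{ "cell_type": "markdown", "id": "8a3d5c1f", "metadata": {}, "source": [ "A warning-free version of the same launch moves the data explicitly, mirroring `create_and_add_vectors` above. A sketch, reusing `x`, `y`, `a`, and the grid configuration from the previous cell:\n", "\n", "```python\n", "# Pay the transfer cost once, at a point of our choosing\n", "d_x = cuda.to_device(x)\n", "d_y = cuda.to_device(y)\n", "d_r = cuda.device_array_like(d_x)\n", "\n", "axpy[grid_dim, block_dim](d_r, a, d_x, d_y)\n", "\n", "# Copy the result back only when it is needed on the host\n", "r = d_r.copy_to_host()\n", "```" ] },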
{ "cell_type": "markdown", "id": "a0cbc8d8", "metadata": {}, "source": [ "## Warp-aggregated atomics\n", "\n", "The added functions `activemask()` and `lanemask_lt()`, together with the corrected implementation of `ffs()`, can be combined to implement warp-aggregated atomic operations, as described in [CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics](https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/). The following code is a translation of the examples from that post, filtering input by a predicate: it takes each of the input values and places those that match the predicate in an output array.\n", "\n", "First we define a device function in which:\n", "\n", "* the threads of a warp collaborate to count how many of them match the predicate (`change`),\n", "* the leader lane adds `change` to the counter of matching elements (`ctr`) with a single atomic operation, obtaining the warp's base offset into the output (`warp_res`), which is broadcast to the other lanes,\n", "* each thread computes its own offset into the output array (the returned `warp_res + rank` value)." ] }, { "cell_type": "code", "execution_count": 13, "id": "a55ad57b", "metadata": {}, "outputs": [], "source": [ "@cuda.jit(device=True)\n", "def atomicAggInc(ctr):\n", " # Mask of the lanes in the warp that are executing this call\n", " active = cuda.activemask()\n", " # The lowest-numbered active lane acts as the leader\n", " leader = uint32(cuda.ffs(active) - uint32(1))\n", " # Number of active lanes, and this lane's rank among them\n", " change = cuda.popc(active)\n", " rank = cuda.popc(uint32(active & cuda.lanemask_lt()))\n", "\n", " warp_res = 0\n", " if rank == 0:\n", " # The leader reserves space for the whole warp with one atomic\n", " warp_res = cuda.atomic.add(ctr, 0, change)\n", " # Broadcast the warp's base offset from the leader to all lanes\n", " warp_res = cuda.shfl_sync(active, warp_res, leader)\n", "\n", " return warp_res + rank" ] }, { "cell_type": "markdown", "id": "3ca48001", "metadata": {}, "source": [ "Next we can use this device function to implement our filter, which takes the values greater than zero and stores them in the output array:" ] }, { "cell_type": "code", "execution_count": 14, "id": "fd20415a", "metadata": {}, "outputs": [], "source": [ "@cuda.jit\n", "def filter_k(dst, nres, src):\n", " i = cuda.grid(1)\n", "\n", " if i >= len(src):\n", " return\n", "\n", " if src[i] > 0:\n", " dst[atomicAggInc(nres)] = src[i]" ] }, { "cell_type": "markdown", "id": "af579e75", "metadata": {}, "source": [ "We set up the run by creating some input data and printing a short summary of it:" ] }, { "cell_type": "code", "execution_count": 15, "id": "2f4ea177", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running with 16777216 elements, of which approximately 25.0% are zero\n", "\n", "There are 12584753 nonzeroes in:\n", "[0.417022 0.72032449 0. ... 0.20570723 0.36716537 0.0979951 ]\n" ] } ], "source": [ "# Parameters for the run\n", "N = 2 ** 24\n", "zero_factor = 0.25\n", "\n", "print(f'Running with {N} elements, of which approximately {zero_factor * 100}%'\n", " ' are zero\\n')\n", "\n", "# Seed the RNG for repeatability\n", "np.random.seed(1)\n", "\n", "# Create input data\n", "inputs = np.random.random(N)\n", "zeros = np.zeros(N)\n", "factors = np.random.random(N)\n", "values = np.where(factors > zero_factor, inputs, zeros)\n", "\n", "# Quick summary of the data\n", "value_count = np.sum(values > 0)\n", "print(f'There are {value_count} nonzeroes in:')\n", "print(values)" ] }, { "cell_type": "markdown", "id": "945f5962", "metadata": {}, "source": [ "Next we create the outputs for the kernel and copy the values to the device:" ] }, { "cell_type": "code", "execution_count": 16, "id": "e6a89849", "metadata": {}, "outputs": [], "source": [ "d_nres = cuda.to_device(np.zeros(1, dtype=np.uint32))\n", "d_result = cuda.to_device(np.zeros_like(values))\n", "d_values = cuda.to_device(values)" ] }, { "cell_type": "markdown", "id": "14d858c4", "metadata": {}, "source": [ "Then we can launch the kernel:" ] }, { "cell_type": "code", "execution_count": 17, "id": "9157d704", "metadata": {}, "outputs": [], "source": [ "n_threads = 128\n", "# N is an exact multiple of n_threads, so no rounding up is needed\n", "n_blocks = N // n_threads\n", "filter_k[n_blocks, n_threads](d_result, d_nres, d_values)" ] }, { "cell_type": "markdown", "id": "1d6f6948", "metadata": {}, "source": [ "Now we can summarize the kernel's output:" ] }, { "cell_type": "code", "execution_count": 18, "id": "838d2fd4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The kernel found 12584753 elements, resulting in the array:\n", "[0.09805002 0.52309677 0.44381494 ... 0. 0. 0. ]\n" ] } ], "source": [ "cuda_n = d_nres.copy_to_host()[0]\n", "print(f'The kernel found {cuda_n} elements, resulting in the array:')\n", "result = d_result.copy_to_host()\n", "print(result)" ] }, { "cell_type": "markdown", "id": "6431b8c1", "metadata": {}, "source": [ "Finally, some sanity checking:" ] }, { "cell_type": "code", "execution_count": 19, "id": "91a8efb9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sanity checks passed!\n" ] } ], "source": [ "# Did we filter the expected number of values?\n", "np.testing.assert_equal(value_count, cuda_n)\n", "\n", "# Are the first cuda_n elements all nonzero?\n", "np.testing.assert_equal(np.ones(value_count, dtype=np.bool_),\n", " result[:cuda_n] > 0)\n", "\n", "# Were elements after the cuda_nth element left as zero?\n", "np.testing.assert_equal(np.zeros(N - value_count), result[cuda_n:])\n", "\n", "print('Sanity checks passed!')" ] },
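{ "cell_type": "markdown", "id": "4f9b2e8c", "metadata": {}, "source": [ "For contrast, the blog post linked above compares the aggregated version with a filter that performs one atomic operation per matching element. A sketch of that simpler variant (`filter_plain` is a name introduced here) - it is correct, but every matching thread contends on `nres`:\n", "\n", "```python\n", "@cuda.jit\n", "def filter_plain(dst, nres, src):\n", " i = cuda.grid(1)\n", " if i < len(src) and src[i] > 0:\n", " # One atomic per matching element - heavy contention on nres\n", " dst[cuda.atomic.add(nres, 0, 1)] = src[i]\n", "```" ] },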
]\n" ] } ], "source": [ "cuda_n = d_nres.copy_to_host()[0]\n", "print(f'The kernel found {cuda_n} elements, resulting in the array:')\n", "result = d_result.copy_to_host()\n", "print(result)" ] }, { "cell_type": "markdown", "id": "6431b8c1", "metadata": {}, "source": [ "Finally, some sanity checking:" ] }, { "cell_type": "code", "execution_count": 19, "id": "91a8efb9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sanity checks passed!\n" ] } ], "source": [ "# Did we filter the expected number of values?\n", "np.testing.assert_equal(value_count, cuda_n)\n", "\n", "# Are the first cuda_n elements all nonzero?\n", "np.testing.assert_equal(np.ones(value_count, dtype=np.bool_),\n", " result[:cuda_n] > 0)\n", "\n", "# Were elements after the cuda_nth element left as zero?\n", "np.testing.assert_equal(np.zeros(N - value_count), result[cuda_n:])\n", "\n", "print('Sanity checks passed!')" ] }, { "cell_type": "markdown", "id": "6cd36330", "metadata": {}, "source": [ "## Relaxed strides checking\n", "\n", "Since NumPy 1.12, a *relaxed strides checking* algorithm for determining the contiguity of arrays has been the default (as opposed to *strict strides checking*). Numba originally implemented the strict variant for checking the contiguity of arrays, which can prevent it being used in certain cases - for example, using Dask to chunk an array could result in an array that was contiguous but was determined to be disjoint, preventing it from being transferred to the device. \n", "\n", "Numba now uses the relaxed algorithm, enabling this use case and other similar ones. The following example provides a demonstration of a case where relaxed strides checking is required - a Dask array is chunked and a CUDA gufunc is invoked on a block of the chunked array:" ] }, { "cell_type": "code", "execution_count": 20, "id": "0ba53425", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[90.]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: \u001b[1mEager compilation of device functions is deprecated (this occurs when a signature is provided)\u001b[0m\n", " warn(NumbaDeprecationWarning(msg))\n", "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: \u001b[1mGrid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] } ], "source": [ "import dask.array as da\n", "import numpy as np\n", "from numba import guvectorize\n", "\n", "x = da.arange(10000, dtype=np.float64).reshape(100, 100).rechunk((1, 10))\n", "f = np.ascontiguousarray(x.blocks[0, 0])\n", "g = np.ascontiguousarray(x.blocks[0, 0])\n", "\n", "@guvectorize(\n", " [\"void(float64[:], float64[:], float64[:], float64[:])\"],\n", " \"(n),(n),(p)->(p)\",\n", " nopython=True,\n", " target=\"cuda\",\n", ")\n", "def reduce_func(x, y, _, out) -> None:\n", " sum_ = 0.0\n", " for i in range(x.shape[0]):\n", " sum_ += x[i] + y[i]\n", " out[:] = sum_\n", "\n", "\n", "result = reduce_func(f, g, np.empty(1, dtype=x.dtype))\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "id": "b6063bd8", "metadata": {}, "source": [ "## Passing tuples to CUDA ufuncs\n", "\n", "It is now possible to pass tuples to CUDA ufuncs, in line with NumPy and CPU-targets Numba ufuncs:" ] }, { "cell_type": "code", "execution_count": 21, "id": "793faee7", "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "[ 0.54030231 -0.41614684 -0.9899925 ]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/gmarkall/numbadev/numba/numba/cuda/decorators.py:118: NumbaDeprecationWarning: \u001b[1mEager compilation of device functions is deprecated (this occurs when a signature is provided)\u001b[0m\n", " warn(NumbaDeprecationWarning(msg))\n", "/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:872: NumbaPerformanceWarning: \u001b[1mGrid size (1) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.\u001b[0m\n", " warn(NumbaPerformanceWarning(msg))\n" ] } ], "source": [ "from numba import vectorize, float64\n", "import math\n", "\n", "@vectorize([float64(float64)], target='cuda')\n", "def cuda_cos(x):\n", " return math.cos(x)\n", "\n", "print(cuda_cos((1.0, 2.0, 3.0)))" ] }, { "cell_type": "markdown", "id": "60e51824", "metadata": {}, "source": [ "## UUIDs\n", "\n", "The UUID of a device can now be accessed in its `uuid` property - whilst it is not yet possible to select devices by UUID, this can help ensure the intended GPU is in use - this can be helpful in a system using [Multi-Instance GPU (MIG)](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) or when the `CUDA_VISIBLE_DEVICES` environment variable is being used to control which devices can be seen by the application.\n", "\n", "An example getting the current device's UUID:" ] }, { "cell_type": "code", "execution_count": 22, "id": "94845147", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643\n" ] } ], "source": [ "ctx = cuda.current_context()\n", "print(ctx.device.uuid)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }