{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Numba 0.54 NUMBA-DPPY Release Demo\n", "\n", "[Numba-dppy](https://github.com/IntelPython/numba-dppy) is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dppy uses [dpctl](https://github.com/IntelPython/dpctl) to support SYCL features. Currently Intel’s DPC++ is the only SYCL runtime supported by Numba-dppy.\n", "\n", "Numba-dppy provides two ways to express SYCL parallelism:\n", "\n", "* **An automatic offload mode** for NumPy data-parallel expressions and [Numba parallel loops](https://numba.pydata.org/numba-doc/dev/user/parallel.html#explicit-parallel-loops) via `@numba.jit`. This automatic approach extends Numba's existing auto-parallelizer to support generating SYCL kernels from data-parallel code regions. Using the automatic offload approach a programmer needs only minimal changes to the existing code and can try to offload an existing `@numba.jit` decorated function to a SYCL device by invoking the function from a `dpctl.device_context`.\n", " \n", "* **An explicit kernel programming mode** using the `@numba_dppy.kernel` decorator. The explicit kernel approach is similar to Numba's other GPU backends: `numba.cuda`. The `@numba_dppy.kernel` decorator is provided by the numba-dppy package. Several advanced SYCL features such as indexing, synchronization, fences, atomcis are provided by the `@numba_dppy.kernel` decorator. Thus, using the decorator a relatively low-level SYCL kernel can be written directly in Python. The feature is intended for programmers who already have SYCL and GPU programming experience." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some useful imports" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import numba_dppy as dppy # numba-dppy package should be installed for the examples below.\n", "import dpctl\n", "from numba import njit\n", "import math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An automatic offload\n", "\n", "The automatic offload feature in numba-dppy is triggered when a `@numba.jit` function is invoked inside a `dpctl.device_context`scope. The following example demonstrates the usage of numba-dppy's automatic offload functionality. Note that the example is identical to the normal Numba parallel example, the only difference is that the function is called in the `dpctl.device_context`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using device ...\n", "\n", "[2. 2. 2. ... 2. 2. 2.]\n" ] } ], "source": [ "@njit\n", "def f1(a, b):\n", " c = a + b\n", " return c\n", "\n", "N = 64 * 32\n", "a = np.ones(N, dtype=np.float32)\n", "b = np.ones(N, dtype=np.float32)\n", "\n", "# Use the environment variable SYCL_DEVICE_FILTER to change the default device.\n", "# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.\n", "device = dpctl.select_default_device()\n", "print(\"Using device ...\")\n", "print(device)\n", "\n", "with dpctl.device_context(device):\n", " c = f1(a, b)\n", " print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Controllable fallback behavior during automatic offload** \n", "\n", "By default, if a section of code cannot be offloaded to the GPU, it is automatically executed on the CPU and a warning is printed. This behavior is only applicable to JIT functions, auto-offloading of NumPy calls, array expressions and prange loops. To disable this functionality and force code running on GPU, set the environment variable `NUMBA_DPPY_FALLBACK_OPTION` to false (for example, `export NUMBA_DPPY_FALLBACK_OPTION=0`). In this case the code is not automatically offloaded to the CPU and errors occur if any. \n", "\n", "**Diagnostic reporting for automatic offload**\n", "\n", "`Export NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1`:\n", "\n", "Setting the debug environment variable `NUMBA_DPPY_OFFLOAD_DIAGNOSTICS` provides emission of the parallel and offload diagnostics information based on produced parallel transforms. The level of detail depends on the integer value between 1 and 4 that is set to the environment variable (higher is more detailed). In the \"Auto-offloading\" section there is the information on which device (device name) this kernel was offloaded." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing Explicit Kernels in numba-dppy\n", "\n", "Writing a SYCL kernel using the `@numba_dppy.kernel` decorator has similar syntax to writing OpenCL kernels. The numba-dppy module provides similar indexing and other functions as OpenCL. Some of the indexing functions supported inside a numba_dppy.kernel are:\n", "\n", "* `numba_dppy.get_global_id`: Gets the global ID of the item\n", "* `numba_dppy.get_local_id`: Gets the local ID of the item\n", "* `numba_dppy.get_local_size`: Gets the local work group size of the device\n", "* `numba_dppy.get_group_id` : Gets the group ID of the item\n", "* `numba_dppy.get_num_groups`: Gets the number of gropus in a worksgroup \n", "\n", "Refer https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868\n", " 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294\n", " 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634\n", " 0.50359565 1.2648091 ]\n", "[1.1647326 0.5044042 1.0928384 1.6194623 0.64363265 0.923868\n", " 0.9901773 0.16170211 0.6585165 0.37717268 1.7218891 0.7935294\n", " 1.1921285 1.0631248 0.97428465 1.2411709 0.12518258 1.3276634\n", " 0.50359565 1.2648091 ]\n" ] } ], "source": [ "@dppy.kernel\n", "def sum(a, b, c):\n", " i = dppy.get_global_id(0)\n", " c[i] = a[i] + b[i]\n", "\n", "a = np.array(np.random.random(20), dtype=np.float32)\n", "b = np.array(np.random.random(20), dtype=np.float32)\n", "c = np.ones_like(a)\n", "\n", "device = dpctl.select_default_device()\n", "\n", "with dpctl.device_context(device):\n", " sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)\n", "\n", "print(a+b)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numba-DPPY Atomics\n", "\n", "Numba-dppy supports several atomic operations supported by DPC++. \n", "\n", "`class numba_dppy.ocl.stubs.atomic` \n", "atomic namespace\n", "\n", "* `add(ary, idx, val)` Perform atomic `ary[idx] += val`. Returns the old value at the index location as if it is loaded atomically.\n", "* `sub(ary, idx, val)` Perform atomic `ary[idx] -= val`. Returns the old value at the index location as if it is loaded atomically." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[100.]\n", "Done...\n" ] } ], "source": [ "\"\"\"\n", "The example demonstrates the use of numba_dppy's ``atomic_add`` intrinsic\n", "function on a SYCL device. The ``dpctl.select_gpu_device`` is\n", "equivalent to ``sycl::gpu_selector`` and returns a sycl::device of type GPU.\n", "\n", "For more information please look at:\n", "https://github.com/IntelPython/numba-dppy/blob/0.16.0/numba_dppy/examples/atomic_op.py\n", "\n", "Without these two environment variables Numba_dppy will use other\n", "implementation for floating point atomics.\n", "\"\"\"\n", "\n", "@dppy.kernel\n", "def atomic_add(a):\n", " dppy.atomic.add(a, 0, 1)\n", "\n", "global_size = 100\n", "a = np.array([0], dtype=np.float32)\n", "\n", "device = dpctl.select_default_device()\n", "\n", "with dppy.offload_to_sycl_device(device):\n", " atomic_add[global_size, dppy.DEFAULT_LOCAL_SIZE](a)\n", "\n", "print(a)\n", "\n", "print(\"Done...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Expected 100, because global_size = 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numba-DPPY device functions\n", "\n", "OpenCL and SYCL do not directly have a notion for device-only functions, i.e. functions that can be only invoked from a kernel and not from a host function. However, numba-dppy provides a special decorator `numba_dppy.func` specifically to implement device functions." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done...\n" ] } ], "source": [ "@dppy.func\n", "def a_device_function(a):\n", " \"\"\"\n", " A ``func`` is a device callable function that can be invoked from\n", " ``kernel`` and other ``func`` functions.\n", " \"\"\"\n", " return a + 1\n", "\n", "\n", "@dppy.func\n", "def another_device_function(a):\n", " return a_device_function(a)\n", "\n", "\n", "@dppy.kernel\n", "def a_kernel_function(a, b):\n", " i = dppy.get_global_id(0)\n", " b[i] = another_device_function(a[i])\n", "\n", "\n", "N = 10\n", "a = np.ones(N)\n", "b = np.ones(N)\n", "\n", "device = dpctl.select_default_device()\n", "\n", "with dppy.offload_to_sycl_device(device):\n", " a_kernel_function[N, dppy.DEFAULT_LOCAL_SIZE](a, b)\n", "\n", "print(\"Done...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reduction on SYCL-supported Devices\n", "\n", "This example demonstrates a summation reduction on a one-dimensional array. \n", "In this example, to reduce the array we invoke the kernel multiple times." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Actual: 1035.3582\n", "Expected: 1035.3582\n" ] } ], "source": [ "@dppy.kernel\n", "def sum_reduction_kernel(A, R, stride):\n", " i = dppy.get_global_id(0)\n", " # sum two element\n", " R[i] = A[i] + A[i + stride]\n", " # store the sum to be used in nex iteration\n", " A[i] = R[i]\n", "\n", "\n", "def sum_reduce(A):\n", " \"\"\"Size of A should be power of two.\"\"\"\n", " total = len(A)\n", " # max size will require half the size of A to store sum\n", " R = np.array(np.random.random(math.ceil(total / 2)), dtype=A.dtype)\n", "\n", " device = dpctl.select_default_device()\n", "\n", " with dppy.offload_to_sycl_device(device):\n", " while total > 1:\n", " global_size = total // 2\n", " sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](\n", " A, R, global_size\n", " )\n", " total = total // 2\n", "\n", " return R[0]\n", "\n", "# This test will only work for size = power of two\n", "N = 2048\n", "assert N % 2 == 0\n", "\n", "A = np.array(np.random.random(N), dtype=np.float32)\n", "A_copy = A.copy()\n", "\n", "actual = sum_reduce(A)\n", "expected = A_copy.sum()\n", "\n", "print(\"Actual: \", actual)\n", "print(\"Expected:\", expected)\n", "\n", "assert expected - actual < 1e-2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numba-dppy Github repository: https://github.com/IntelPython/numba-dppy\n", "### More examples: https://github.com/IntelPython/numba-dppy/tree/main/numba_dppy/examples\n", "### Tests: https://github.com/IntelPython/numba-dppy/tree/main/numba_dppy/tests" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }