{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numba 0.54 NUMBA-DPPY Release Demo\n",
    "\n",
    "[Numba-dppy](https://github.com/IntelPython/numba-dppy) is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dppy uses [dpctl](https://github.com/IntelPython/dpctl) to support SYCL features. Currently Intel’s DPC++ is the only SYCL runtime supported by Numba-dppy.\n",
    "\n",
    "Numba-dppy provides two ways to express SYCL parallelism:\n",
    "\n",
    "* **An automatic offload mode** for NumPy data-parallel expressions and [Numba parallel loops](https://numba.pydata.org/numba-doc/dev/user/parallel.html#explicit-parallel-loops) via `@numba.jit`. This automatic approach extends Numba's existing auto-parallelizer to support generating SYCL kernels from data-parallel code regions. Using the automatic offload approach a programmer needs only minimal changes to the existing code and can try to offload an existing `@numba.jit` decorated function to a SYCL device by invoking the function from a `dpctl.device_context`.\n",
    "  \n",
    "* **An explicit kernel programming mode** using the `@numba_dppy.kernel` decorator. The explicit kernel approach is similar to Numba's other GPU backends: `numba.cuda`. The `@numba_dppy.kernel` decorator is provided by the numba-dppy package. Several advanced SYCL features such as indexing, synchronization, fences, atomcis are provided by the `@numba_dppy.kernel` decorator. Thus, using the decorator a relatively low-level SYCL kernel can be written directly in Python. The feature is intended for programmers who already have SYCL and GPU programming experience."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some useful imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import numba_dppy as dppy  # numba-dppy package should be installed for the examples below.\n",
    "import dpctl\n",
    "from numba import njit\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An automatic offload\n",
    "\n",
    "The automatic offload feature in numba-dppy is triggered when a `@numba.jit` function is invoked inside a `dpctl.device_context`scope. The following example demonstrates the usage of numba-dppy's automatic offload functionality. Note that the example is identical to the normal Numba parallel example, the only difference is that the function is called in the `dpctl.device_context`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using device ...\n",
      "<dpctl.SyclDevice [backend_type.level_zero, device_type.gpu,  Intel(R) Graphics [0x5917]] at 0x18db438cfb0>\n",
      "[2. 2. 2. ... 2. 2. 2.]\n"
     ]
    }
   ],
   "source": [
    "@njit\n",
    "def f1(a, b):\n",
    "    c = a + b\n",
    "    return c\n",
    "\n",
    "N = 64 * 32\n",
    "a = np.ones(N, dtype=np.float32)\n",
    "b = np.ones(N, dtype=np.float32)\n",
    "\n",
    "# Use the environment variable SYCL_DEVICE_FILTER to change the default device.\n",
    "# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.\n",
    "device = dpctl.select_default_device()\n",
    "print(\"Using device ...\")\n",
    "print(device)\n",
    "\n",
    "with dpctl.device_context(device):\n",
    "    c = f1(a, b)\n",
    "    print(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Controllable fallback behavior during automatic offload**  \n",
    "\n",
    "By default, if a section of code cannot be offloaded to the GPU, it is automatically executed on the CPU and a warning is printed. This behavior is only applicable to JIT functions, auto-offloading of NumPy calls, array expressions and prange loops. To disable this functionality and force code running on GPU, set the environment variable `NUMBA_DPPY_FALLBACK_OPTION` to false (for example, `export NUMBA_DPPY_FALLBACK_OPTION=0`). In this case the code is not automatically offloaded to the CPU and errors occur if any.  \n",
    "\n",
    "**Diagnostic reporting for automatic offload**\n",
    "\n",
    "`Export NUMBA_DPPY_OFFLOAD_DIAGNOSTICS=1`:\n",
    "\n",
    "Setting the debug environment variable `NUMBA_DPPY_OFFLOAD_DIAGNOSTICS` provides emission of the parallel and offload diagnostics information based on produced parallel transforms. The level of detail depends on the integer value between 1 and 4 that is set to the environment variable (higher is more detailed). In the \"Auto-offloading\" section there is the information on which device (device name) this kernel was offloaded."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing Explicit Kernels in numba-dppy\n",
    "\n",
    "Writing a SYCL kernel using the `@numba_dppy.kernel` decorator has similar syntax to writing OpenCL kernels. The numba-dppy module provides similar indexing and other functions as OpenCL. Some of the indexing functions supported inside a numba_dppy.kernel are:\n",
    "\n",
    "* `numba_dppy.get_global_id`: Gets the global ID of the item\n",
    "* `numba_dppy.get_local_id`: Gets the local ID of the item\n",
    "* `numba_dppy.get_local_size`: Gets the local work group size of the device\n",
    "* `numba_dppy.get_group_id` : Gets the group ID of the item\n",
    "* `numba_dppy.get_num_groups`: Gets the number of gropus in a worksgroup  \n",
    "\n",
    "Refer https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1.1647326  0.5044042  1.0928384  1.6194623  0.64363265 0.923868\n",
      " 0.9901773  0.16170211 0.6585165  0.37717268 1.7218891  0.7935294\n",
      " 1.1921285  1.0631248  0.97428465 1.2411709  0.12518258 1.3276634\n",
      " 0.50359565 1.2648091 ]\n",
      "[1.1647326  0.5044042  1.0928384  1.6194623  0.64363265 0.923868\n",
      " 0.9901773  0.16170211 0.6585165  0.37717268 1.7218891  0.7935294\n",
      " 1.1921285  1.0631248  0.97428465 1.2411709  0.12518258 1.3276634\n",
      " 0.50359565 1.2648091 ]\n"
     ]
    }
   ],
   "source": [
    "@dppy.kernel\n",
    "def sum(a, b, c):\n",
    "    i = dppy.get_global_id(0)\n",
    "    c[i] = a[i] + b[i]\n",
    "\n",
    "a = np.array(np.random.random(20), dtype=np.float32)\n",
    "b = np.array(np.random.random(20), dtype=np.float32)\n",
    "c = np.ones_like(a)\n",
    "\n",
    "device = dpctl.select_default_device()\n",
    "\n",
    "with dpctl.device_context(device):\n",
    "    sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)\n",
    "\n",
    "print(a+b)\n",
    "print(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numba-DPPY Atomics\n",
    "\n",
    "Numba-dppy supports several atomic operations supported by DPC++.  \n",
    "\n",
    "`class numba_dppy.ocl.stubs.atomic`  \n",
    "atomic namespace\n",
    "\n",
    "* `add(ary, idx, val)` Perform atomic `ary[idx] += val`. Returns the old value at the index location as if it is loaded atomically.\n",
    "* `sub(ary, idx, val)` Perform atomic `ary[idx] -= val`. Returns the old value at the index location as if it is loaded atomically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[100.]\n",
      "Done...\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "The example demonstrates the use of numba_dppy's ``atomic_add`` intrinsic\n",
    "function on a SYCL device. The ``dpctl.select_gpu_device`` is\n",
    "equivalent to ``sycl::gpu_selector`` and returns a sycl::device of type GPU.\n",
    "\n",
    "For more information please look at:\n",
    "https://github.com/IntelPython/numba-dppy/blob/0.16.0/numba_dppy/examples/atomic_op.py\n",
    "\n",
    "Without these two environment variables Numba_dppy will use other\n",
    "implementation for floating point atomics.\n",
    "\"\"\"\n",
    "\n",
    "@dppy.kernel\n",
    "def atomic_add(a):\n",
    "    dppy.atomic.add(a, 0, 1)\n",
    "\n",
    "global_size = 100\n",
    "a = np.array([0], dtype=np.float32)\n",
    "\n",
    "device = dpctl.select_default_device()\n",
    "\n",
    "with dppy.offload_to_sycl_device(device):\n",
    "    atomic_add[global_size, dppy.DEFAULT_LOCAL_SIZE](a)\n",
    "\n",
    "print(a)\n",
    "\n",
    "print(\"Done...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expected 100, because global_size = 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numba-DPPY device functions\n",
    "\n",
    "OpenCL and SYCL do not directly have a notion for device-only functions, i.e. functions that can be only invoked from a kernel and not from a host function. However, numba-dppy provides a special decorator `numba_dppy.func` specifically to implement device functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done...\n"
     ]
    }
   ],
   "source": [
    "@dppy.func\n",
    "def a_device_function(a):\n",
    "    \"\"\"\n",
    "    A ``func`` is a device callable function that can be invoked from\n",
    "    ``kernel`` and other ``func`` functions.\n",
    "    \"\"\"\n",
    "    return a + 1\n",
    "\n",
    "\n",
    "@dppy.func\n",
    "def another_device_function(a):\n",
    "    return a_device_function(a)\n",
    "\n",
    "\n",
    "@dppy.kernel\n",
    "def a_kernel_function(a, b):\n",
    "    i = dppy.get_global_id(0)\n",
    "    b[i] = another_device_function(a[i])\n",
    "\n",
    "\n",
    "N = 10\n",
    "a = np.ones(N)\n",
    "b = np.ones(N)\n",
    "\n",
    "device = dpctl.select_default_device()\n",
    "\n",
    "with dppy.offload_to_sycl_device(device):\n",
    "    a_kernel_function[N, dppy.DEFAULT_LOCAL_SIZE](a, b)\n",
    "\n",
    "print(\"Done...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reduction on SYCL-supported Devices\n",
    "\n",
    "This example demonstrates a summation reduction on a one-dimensional array.  \n",
    "In this example, to reduce the array we invoke the kernel multiple times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Actual:   1035.3582\n",
      "Expected: 1035.3582\n"
     ]
    }
   ],
   "source": [
    "@dppy.kernel\n",
    "def sum_reduction_kernel(A, R, stride):\n",
    "    i = dppy.get_global_id(0)\n",
    "    # sum two element\n",
    "    R[i] = A[i] + A[i + stride]\n",
    "    # store the sum to be used in nex iteration\n",
    "    A[i] = R[i]\n",
    "\n",
    "\n",
    "def sum_reduce(A):\n",
    "    \"\"\"Size of A should be power of two.\"\"\"\n",
    "    total = len(A)\n",
    "    # max size will require half the size of A to store sum\n",
    "    R = np.array(np.random.random(math.ceil(total / 2)), dtype=A.dtype)\n",
    "\n",
    "    device = dpctl.select_default_device()\n",
    "\n",
    "    with dppy.offload_to_sycl_device(device):\n",
    "        while total > 1:\n",
    "            global_size = total // 2\n",
    "            sum_reduction_kernel[global_size, dppy.DEFAULT_LOCAL_SIZE](\n",
    "                A, R, global_size\n",
    "            )\n",
    "            total = total // 2\n",
    "\n",
    "    return R[0]\n",
    "\n",
    "# This test will only work for size = power of two\n",
    "N = 2048\n",
    "assert N % 2 == 0\n",
    "\n",
    "A = np.array(np.random.random(N), dtype=np.float32)\n",
    "A_copy = A.copy()\n",
    "\n",
    "actual = sum_reduce(A)\n",
    "expected = A_copy.sum()\n",
    "\n",
    "print(\"Actual:  \", actual)\n",
    "print(\"Expected:\", expected)\n",
    "\n",
    "assert expected - actual < 1e-2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Numba-dppy Github repository: https://github.com/IntelPython/numba-dppy\n",
    "### More examples: https://github.com/IntelPython/numba-dppy/tree/main/numba_dppy/examples\n",
    "### Tests: https://github.com/IntelPython/numba-dppy/tree/main/numba_dppy/tests"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}