{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Numba 0.51 CUDA Release demo\n", "\n", "Key changes to the CUDA target include:\n", "\n", "* Support for CUDA Toolkit 11, Ampere, and Compute Capability 8.0\n", "* Stream callbacks ([demo](#Stream-callbacks))\n", "* Async awaitable streams ([demo](#Async-awaitable-streams))\n", "* Printing of SASS code for kernels ([demo](#Inspecting-SASS-code))\n", "* Atomic ``nanmin`` and ``nanmax`` ([demo](#Atomic-nanmin-and-nanmax))" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# A couple of useful imports\n", "from numba import cuda\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stream callbacks\n", "\n", "Adding a callback to a stream allows a function on the host to be called when all the items presently enqueued on the stream have completed. This can be useful for notifying the host code that a certain sequence of events has completed.\n", "\n", "The CUDA Toolkit Samples include an example of using stream callbacks in a multithreaded application in the `0_Simple/simpleCallback` directory. This notebook demonstrates the CUDA Python API only.\n", "\n", "First let's define a kernel and a callback function:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# A simple kernel to add two arrays\n", "@cuda.jit\n", "def add_kernel(r, x, y):\n", "    i = cuda.grid(1)\n", "\n", "    # Guard against threads beyond the end of the arrays\n", "    if i < len(r):\n", "        r[i] = x[i] + y[i]\n", "\n", "# Python function for callback\n", "# `data` is the data value set when the callback was added (see below)\n", "def print_callback(stream, status, data):\n", "    print(f'Callback function reached!
Data is {data}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll create a stream then use it to transfer data and launch a kernel before enqueueing a callback:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Callback function reached! Data is HELLO WORLD\n" ] } ], "source": [ "# Create a stream\n", "s1 = cuda.stream()\n", "\n", "# Transfer all data to the device on the stream\n", "n_elements = 256\n", "x = cuda.to_device(np.random.random(n_elements), stream=s1)\n", "y = cuda.to_device(np.random.random(n_elements), stream=s1)\n", "r = cuda.device_array_like(x, stream=s1)\n", "\n", "# Launch the kernel on the stream\n", "add_kernel[1, n_elements, s1](r, x, y)\n", "\n", "# Add a callback that will be called on the host when the kernel launch is complete.\n", "# The first parameter is the function to call.\n", "# The second parameter is passed into the `data` (3rd) argument of the callback.\n", "s1.add_callback(print_callback, 'HELLO WORLD')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Callbacks can also be used to notify `Event` objects. 
For example:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Waiting for event\n", "Waiting returned: True\n" ] } ], "source": [ "import threading\n", "\n", "# Create a stream and an event that will be notified\n", "s2 = cuda.stream()\n", "callback_event = threading.Event()\n", "\n", "# Define a callback function to notify the event\n", "def notify_callback(stream, status, event):\n", " event.set()\n", "\n", "# Add the callback to the stream:\n", "s2.add_callback(notify_callback, callback_event)\n", "\n", "# Wait for the event\n", "print(\"Waiting for event\")\n", "ret = callback_event.wait(1.0)\n", "print(f\"Waiting returned: {ret}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Async awaitable streams\n", "\n", "`Stream.async_done()` returns an awaitable that resolves once all preceding stream operations are complete:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "# Define a new stream\n", "s3 = cuda.stream()\n", "\n", "# Directly awaiting on async_done works in the notebook as it already\n", "# has a running event loop\n", "await s3.async_done()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example creating multiple tasks on multiple streams and gathering:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, 2.0, 3.0, 4.0]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "async def async_cuda_fn(value_in):\n", " stream = cuda.stream()\n", " h_src, h_dst = cuda.pinned_array(8), cuda.pinned_array(8)\n", " h_src[:] = value_in\n", " d_ary = cuda.to_device(h_src, stream=stream)\n", " d_ary.copy_to_host(h_dst, stream=stream)\n", " await stream.async_done()\n", " return h_dst.mean()\n", "\n", "values_in = [1, 2, 3, 4]\n", "tasks = 
[asyncio.create_task(async_cuda_fn(v)) for v in values_in]\n", "values_out = await asyncio.gather(*tasks)\n", "values_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting SASS code\n", "\n", "Let's define a simple kernel and compile it eagerly by passing a signature to `cuda.jit`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "@cuda.jit('void(float32[::1], float32[::1], float32[::1])')\n", "def add(r, x, y):\n", "    i = cuda.grid(1)\n", "\n", "    # Guard against threads beyond the end of the arrays\n", "    if i < len(r):\n", "        r[i] = x[i] + y[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can print the SASS code:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t.headerflags\t@\"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM75 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM75)\"\n", "\t.elftype\t@\"ET_EXEC\"\n", "\n", "\n", "//--------------------- .debug_frame --------------------------\n", "\t.section\t.debug_frame,\"\",@progbits\n", ".debug_frame:\n", " /*0000*/ \t.byte\t0xff, 0xff, 0xff, 0xff, 0x28, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff\n", " /*0010*/ \t.byte\t0xff, 0xff, 0xff, 0xff, 0x03, 0x00, 0x04, 0x7c, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0c, 0x81, 0x80\n", " /*0020*/ \t.byte\t0x80, 0x28, 0x00, 0x08, 0xff, 0x81, 0x80, 0x28, 0x08, 0x81, 0x80, 0x80, 0x28, 0x00, 0x00, 0x00\n", " /*0030*/ \t.byte\t0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0x30, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00\n", " /*0040*/ \t.byte\t0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00\n", " /*0048*/ \t.dword\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE\n", " /*0050*/ \t.byte\t0x70, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x02, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00\n", " /*0060*/ \t.byte\t0x00, 0x00, 0x0c, 0x81, 0x80, 0x80, 0x28, 0x00, 0x04, 0x7a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00\n", "\n", "\n", 
"//--------------------- .nv.info --------------------------\n", "\t.section\t.nv.info,\"\",@\"SHT_CUDA_INFO\"\n", "\t.align\t4\n", "\n", "\n", "\t//----- nvinfo : EIATTR_FRAME_SIZE\n", "\t.align\t\t4\n", " /*0000*/ \t.byte\t0x04, 0x11\n", " /*0002*/ \t.short\t(.L_11 - .L_10)\n", "\t.align\t\t4\n", ".L_10:\n", " /*0004*/ \t.word\tindex@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", " /*0008*/ \t.word\t0x00000000\n", "\n", "\n", "\t//----- nvinfo : EIATTR_REGCOUNT\n", "\t.align\t\t4\n", ".L_11:\n", " /*000c*/ \t.byte\t0x04, 0x2f\n", " /*000e*/ \t.short\t(.L_13 - .L_12)\n", "\t.align\t\t4\n", ".L_12:\n", " /*0010*/ \t.word\tindex@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", " /*0014*/ \t.word\t0x0000000d\n", "\n", "\n", "\t//----- nvinfo : EIATTR_MIN_STACK_SIZE\n", "\t.align\t\t4\n", ".L_13:\n", " /*0018*/ \t.byte\t0x04, 0x12\n", " /*001a*/ \t.short\t(.L_15 - .L_14)\n", "\t.align\t\t4\n", ".L_14:\n", " /*001c*/ \t.word\tindex@(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", " /*0020*/ \t.word\t0x00000000\n", ".L_15:\n", "\n", "\n", "//--------------------- .nv.info._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE --------------------------\n", "\t.section\t.nv.info._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,\"\",@\"SHT_CUDA_INFO\"\n", "\t.align\t4\n", "\n", "\n", "\t//----- nvinfo : EIATTR_EXIT_INSTR_OFFSETS\n", "\t.align\t\t4\n", " /*0000*/ \t.byte\t0x04, 0x1c\n", " /*0002*/ \t.short\t(.L_17 - .L_16)\n", "\n", "\n", "\t// ....[0]....\n", ".L_16:\n", " /*0004*/ \t.word\t0x00000070\n", "\n", "\n", "\t// ....[1]....\n", " /*0008*/ \t.word\t0x000001f0\n", "\n", "\n", "\t//----- 
nvinfo : EIATTR_MAXREG_COUNT\n", "\t.align\t\t4\n", ".L_17:\n", " /*000c*/ \t.byte\t0x03, 0x1b\n", " /*000e*/ \t.short\t0x00ff\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", " /*0010*/ \t.byte\t0x04, 0x17\n", " /*0012*/ \t.short\t(.L_19 - .L_18)\n", ".L_18:\n", " /*0014*/ \t.word\t0x00000000\n", " /*0018*/ \t.short\t0x0000\n", " /*001a*/ \t.short\t0x0000\n", " /*001c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_19:\n", " /*0020*/ \t.byte\t0x04, 0x17\n", " /*0022*/ \t.short\t(.L_21 - .L_20)\n", ".L_20:\n", " /*0024*/ \t.word\t0x00000000\n", " /*0028*/ \t.short\t0x0001\n", " /*002a*/ \t.short\t0x0008\n", " /*002c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_21:\n", " /*0030*/ \t.byte\t0x04, 0x17\n", " /*0032*/ \t.short\t(.L_23 - .L_22)\n", ".L_22:\n", " /*0034*/ \t.word\t0x00000000\n", " /*0038*/ \t.short\t0x0002\n", " /*003a*/ \t.short\t0x0010\n", " /*003c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_23:\n", " /*0040*/ \t.byte\t0x04, 0x17\n", " /*0042*/ \t.short\t(.L_25 - .L_24)\n", ".L_24:\n", " /*0044*/ \t.word\t0x00000000\n", " /*0048*/ \t.short\t0x0003\n", " /*004a*/ \t.short\t0x0018\n", " /*004c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_25:\n", " /*0050*/ \t.byte\t0x04, 0x17\n", " /*0052*/ \t.short\t(.L_27 - .L_26)\n", ".L_26:\n", " /*0054*/ \t.word\t0x00000000\n", " /*0058*/ \t.short\t0x0004\n", " /*005a*/ \t.short\t0x0020\n", " /*005c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_27:\n", " /*0060*/ \t.byte\t0x04, 0x17\n", " /*0062*/ \t.short\t(.L_29 - .L_28)\n", ".L_28:\n", " /*0064*/ \t.word\t0x00000000\n", " /*0068*/ \t.short\t0x0005\n", " /*006a*/ \t.short\t0x0028\n", " 
/*006c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_29:\n", " /*0070*/ \t.byte\t0x04, 0x17\n", " /*0072*/ \t.short\t(.L_31 - .L_30)\n", ".L_30:\n", " /*0074*/ \t.word\t0x00000000\n", " /*0078*/ \t.short\t0x0006\n", " /*007a*/ \t.short\t0x0030\n", " /*007c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_31:\n", " /*0080*/ \t.byte\t0x04, 0x17\n", " /*0082*/ \t.short\t(.L_33 - .L_32)\n", ".L_32:\n", " /*0084*/ \t.word\t0x00000000\n", " /*0088*/ \t.short\t0x0007\n", " /*008a*/ \t.short\t0x0038\n", " /*008c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_33:\n", " /*0090*/ \t.byte\t0x04, 0x17\n", " /*0092*/ \t.short\t(.L_35 - .L_34)\n", ".L_34:\n", " /*0094*/ \t.word\t0x00000000\n", " /*0098*/ \t.short\t0x0008\n", " /*009a*/ \t.short\t0x0040\n", " /*009c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_35:\n", " /*00a0*/ \t.byte\t0x04, 0x17\n", " /*00a2*/ \t.short\t(.L_37 - .L_36)\n", ".L_36:\n", " /*00a4*/ \t.word\t0x00000000\n", " /*00a8*/ \t.short\t0x0009\n", " /*00aa*/ \t.short\t0x0048\n", " /*00ac*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_37:\n", " /*00b0*/ \t.byte\t0x04, 0x17\n", " /*00b2*/ \t.short\t(.L_39 - .L_38)\n", ".L_38:\n", " /*00b4*/ \t.word\t0x00000000\n", " /*00b8*/ \t.short\t0x000a\n", " /*00ba*/ \t.short\t0x0050\n", " /*00bc*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_39:\n", " /*00c0*/ \t.byte\t0x04, 0x17\n", " /*00c2*/ \t.short\t(.L_41 - .L_40)\n", ".L_40:\n", " /*00c4*/ \t.word\t0x00000000\n", " /*00c8*/ \t.short\t0x000b\n", " /*00ca*/ \t.short\t0x0058\n", " /*00cc*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : 
EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_41:\n", " /*00d0*/ \t.byte\t0x04, 0x17\n", " /*00d2*/ \t.short\t(.L_43 - .L_42)\n", ".L_42:\n", " /*00d4*/ \t.word\t0x00000000\n", " /*00d8*/ \t.short\t0x000c\n", " /*00da*/ \t.short\t0x0060\n", " /*00dc*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_43:\n", " /*00e0*/ \t.byte\t0x04, 0x17\n", " /*00e2*/ \t.short\t(.L_45 - .L_44)\n", ".L_44:\n", " /*00e4*/ \t.word\t0x00000000\n", " /*00e8*/ \t.short\t0x000d\n", " /*00ea*/ \t.short\t0x0068\n", " /*00ec*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_45:\n", " /*00f0*/ \t.byte\t0x04, 0x17\n", " /*00f2*/ \t.short\t(.L_47 - .L_46)\n", ".L_46:\n", " /*00f4*/ \t.word\t0x00000000\n", " /*00f8*/ \t.short\t0x000e\n", " /*00fa*/ \t.short\t0x0070\n", " /*00fc*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_47:\n", " /*0100*/ \t.byte\t0x04, 0x17\n", " /*0102*/ \t.short\t(.L_49 - .L_48)\n", ".L_48:\n", " /*0104*/ \t.word\t0x00000000\n", " /*0108*/ \t.short\t0x000f\n", " /*010a*/ \t.short\t0x0078\n", " /*010c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_49:\n", " /*0110*/ \t.byte\t0x04, 0x17\n", " /*0112*/ \t.short\t(.L_51 - .L_50)\n", ".L_50:\n", " /*0114*/ \t.word\t0x00000000\n", " /*0118*/ \t.short\t0x0010\n", " /*011a*/ \t.short\t0x0080\n", " /*011c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_51:\n", " /*0120*/ \t.byte\t0x04, 0x17\n", " /*0122*/ \t.short\t(.L_53 - .L_52)\n", ".L_52:\n", " /*0124*/ \t.word\t0x00000000\n", " /*0128*/ \t.short\t0x0011\n", " /*012a*/ \t.short\t0x0088\n", " /*012c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_53:\n", " /*0130*/ \t.byte\t0x04, 
0x17\n", " /*0132*/ \t.short\t(.L_55 - .L_54)\n", ".L_54:\n", " /*0134*/ \t.word\t0x00000000\n", " /*0138*/ \t.short\t0x0012\n", " /*013a*/ \t.short\t0x0090\n", " /*013c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_55:\n", " /*0140*/ \t.byte\t0x04, 0x17\n", " /*0142*/ \t.short\t(.L_57 - .L_56)\n", ".L_56:\n", " /*0144*/ \t.word\t0x00000000\n", " /*0148*/ \t.short\t0x0013\n", " /*014a*/ \t.short\t0x0098\n", " /*014c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_KPARAM_INFO\n", "\t.align\t\t4\n", ".L_57:\n", " /*0150*/ \t.byte\t0x04, 0x17\n", " /*0152*/ \t.short\t(.L_59 - .L_58)\n", ".L_58:\n", " /*0154*/ \t.word\t0x00000000\n", " /*0158*/ \t.short\t0x0014\n", " /*015a*/ \t.short\t0x00a0\n", " /*015c*/ \t.byte\t0x00, 0xf0, 0x21, 0x00\n", "\n", "\n", "\t//----- nvinfo : EIATTR_CBANK_PARAM_SIZE\n", "\t.align\t\t4\n", ".L_59:\n", " /*0160*/ \t.byte\t0x03, 0x19\n", " /*0162*/ \t.short\t0x00a8\n", "\n", "\n", "\t//----- nvinfo : EIATTR_PARAM_CBANK\n", "\t.align\t\t4\n", " /*0164*/ \t.byte\t0x04, 0x0a\n", " /*0166*/ \t.short\t(.L_61 - .L_60)\n", "\t.align\t\t4\n", ".L_60:\n", " /*0168*/ \t.word\tindex@(.nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", " /*016c*/ \t.short\t0x0160\n", " /*016e*/ \t.short\t0x00a8\n", ".L_61:\n", "\n", "\n", "//--------------------- .nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE --------------------------\n", "\t.section\t.nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,\"a\",@progbits\n", "\t.align\t4\n", ".nv.constant0._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE:\n", "\t.zero\t\t520\n", "\n", "\n", 
"//--------------------- .text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE --------------------------\n", "\t.section\t.text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,\"ax\",@progbits\n", "\t.sectioninfo\t@\"SHI_REGISTERS=13\"\n", "\t.align\t128\n", " .global _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE\n", " .type _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@function\n", " .size _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,(.L_62 - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", " .other _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@\"STO_CUDA_ENTRY STV_DEFAULT\"\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE:\n", ".text._ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE:\n", " /*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;\n", " /*0010*/ S2R R0, SR_TID.X ;\n", " /*0020*/ S2R R3, SR_CTAID.X ;\n", " /*0030*/ IMAD R0, R3, c[0x0][0x0], R0 ;\n", " /*0040*/ ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x188], PT ;\n", " /*0050*/ SHF.R.S32.HI R6, RZ, 0x1f, R0 ;\n", " /*0060*/ ISETP.GE.AND.EX P0, PT, R6, c[0x0][0x18c], PT, P0 ;\n", " /*0070*/ @P0 EXIT ;\n", " /*0080*/ ISETP.GE.AND P0, PT, R0, RZ, PT ;\n", " /*0090*/ SEL R7, RZ, c[0x0][0x1c0], P0 ;\n", " /*00a0*/ SEL R3, RZ, c[0x0][0x1f8], P0 ;\n", " /*00b0*/ SEL R9, RZ, c[0x0][0x1c4], P0 ;\n", " /*00c0*/ SEL R5, RZ, c[0x0][0x1fc], P0 ;\n", " 
/*00d0*/ IADD3 R7, P1, R0.reuse, R7, RZ ;\n", " /*00e0*/ IADD3 R3, P2, R0, R3, RZ ;\n", " /*00f0*/ IMAD.X R10, R6.reuse, 0x1, R9, P1 ;\n", " /*0100*/ LEA R4, P1, R7, c[0x0][0x1b8], 0x2 ;\n", " /*0110*/ IMAD.X R8, R6, 0x1, R5, P2 ;\n", " /*0120*/ LEA R2, P2, R3, c[0x0][0x1f0], 0x2 ;\n", " /*0130*/ LEA.HI.X R5, R7, c[0x0][0x1bc], R10, 0x2, P1 ;\n", " /*0140*/ LEA.HI.X R3, R3, c[0x0][0x1f4], R8, 0x2, P2 ;\n", " /*0150*/ LDG.E.SYS R5, [R4] ;\n", " /*0160*/ LDG.E.SYS R2, [R2] ;\n", " /*0170*/ SEL R7, RZ, c[0x0][0x188], P0 ;\n", " /*0180*/ IADD3 R0, P1, R0, R7, RZ ;\n", " /*0190*/ SEL R7, RZ, c[0x0][0x18c], P0 ;\n", " /*01a0*/ IMAD.X R7, R6, 0x1, R7, P1 ;\n", " /*01b0*/ LEA R6, P0, R0, c[0x0][0x180], 0x2 ;\n", " /*01c0*/ LEA.HI.X R7, R0, c[0x0][0x184], R7, 0x2, P0 ;\n", " /*01d0*/ FADD R9, R2, R5 ;\n", " /*01e0*/ STG.E.SYS [R6], R9 ;\n", " /*01f0*/ EXIT ;\n", ".L_9:\n", " /*0200*/ BRA `(.L_9);\n", ".L_62:\n", "\n", "\n", "//--------------------- .nv.global --------------------------\n", "\t.section\t.nv.global,\"aw\",@nobits\n", "\t.align\t8\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidx__:\n", ".nv.global:\n", "\t.zero\t\t4\n", 
"\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidz__:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__errcode__:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__\n", 
"\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__ctaidy__:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidx__:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__,@object\n", 
"\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__,(_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__ - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidz__:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__\n", "\t.type\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__,@object\n", "\t.size\t\t_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__,(.L_4 - _ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__)\n", "_ZN6cudapy8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE__tidy__:\n", "\t.zero\t\t4\n", ".L_4:\n", "\t.zero\t\t4\n", "\t.global\t\t_ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE\n", "\t.type\t\t_ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,@object\n", "\t.size\t\t_ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE,(.L_8 - _ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE)\n", 
"_ZN08NumbaEnv8__main__7add$242E5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE5ArrayIfLi1E1C7mutable7alignedE:\n", "\t.zero\t\t8\n", ".L_8:\n", "\n" ] } ], "source": [ "print(add.inspect_sass())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that SASS inspection uses `nvdisasm` from the [CUDA Binary Utilities](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html), so the full CUDA toolkit must be installed - the conda `cudatoolkit` package does not include `nvdisasm`.\n", "\n", "# Atomic `nanmin` and `nanmax`\n", "\n", "Numba 0.50 changed the semantics of `cuda.atomic.min` and `cuda.atomic.max` to more closely match that of `math.max` and `numpy.max`, which always return the first argument if one of the operands is a NaN. The old behaviour, treating NaN values as missing data is now implemented as `cuda.atomic.nanmin` and `cuda.atomic.nanmax` (c.f. [`nanmin`](https://numpy.org/doc/stable/reference/generated/numpy.nanmin.html) and [`nanmax`](https://numpy.org/doc/stable/reference/generated/numpy.nanmax.html) in NumPy).\n", "\n", "To demonstrate the difference, we'll define two kernels, one using `max`, and the other using `nanmax`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "@cuda.jit\n", "def atomic_max_array(val, arr):\n", " i = cuda.grid(1)\n", " cuda.atomic.max(arr, i, val[0])\n", "\n", "@cuda.jit\n", "def atomic_nanmax_array(val, arr):\n", " i = cuda.grid(1)\n", " cuda.atomic.nanmax(arr, i, val[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we define input arrays - we'll set the value of one element to NaN: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0., 1., 2., 3., 4., nan, 6., 7.], dtype=float32)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_elements = 8\n", "x_max = np.arange(n_elements, dtype=np.float32)\n", 
"x_max[5] = np.nan\n", "\n", "# A copy of the input for the other kernel\n", "x_nanmax = x_max.copy()\n", "\n", "# Print an array to show the NaN in position:\n", "x_max" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll launch the kernels on our arrays:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# A value to compare the array values with\n", "val = np.ones(1, dtype=np.float32) + 2\n", "\n", "# Launch kernels\n", "atomic_max_array[1, n_elements](val, x_max)\n", "atomic_nanmax_array[1, n_elements](val, x_nanmax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the output from `cuda.atomic.max`, we see that it didn't treat the NaN value as missing data, but instead returned it:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 3., 3., 3., 3., 4., nan, 6., 7.], dtype=float32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_max" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas `cuda.atomic.nanmax` treated the NaN as missing data and replaced it with the comparison value:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3., 3., 3., 3., 4., 3., 6., 7.], dtype=float32)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_nanmax" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }