{ "cells": [ { "cell_type": "markdown", "id": "6d9dce90", "metadata": {}, "source": [ "# Benchmarking\n", "\n", "In this tutorial we will compare DaCe with other popular Python-accelerating libraries. The NumPy results should be a bit faster if an optimized version is installed (for example, compiled with Intel MKL).\n", "\n", "**NOTE**: Running this notebook on a VM/cloud instance may run out of memory and crash the Jupyter kernel, due to inefficiency of the other frameworks. In that case, rerun the cells in [Dependencies](#Dependencies) and continue.\n", "\n", "Table of Contents:\n", "* [Dependencies](#Dependencies)\n", "* [Simple programs](#Simple-programs-with-multiple-operators)\n", "* [Loops](#Loops)\n", " * [Varying sizes](#Varying-sizes)\n", "* [Auto-parallelization](#Auto-parallelization)\n", "* [Example: 3D Heat Diffusion](#3D-Heat-Diffusion)\n", "* [Benchmarking and Instrumentation API](#Benchmarking-and-Instrumentation-API)\n" ] }, { "cell_type": "markdown", "id": "d0a07e84", "metadata": {}, "source": [ "TL;DR DaCe is fast:\n", "\n", "![performance](performance.png \"performance\")" ] }, { "cell_type": "markdown", "id": "93f7d90c", "metadata": {}, "source": [ "## Dependencies\n", "\n", "First, let's make sure we have all the frameworks ready to go:" ] }, { "cell_type": "code", "execution_count": 1, "id": "dbb480ef", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "...\n" ] } ], "source": [ "%pip install jax jaxlib\n", "%pip install numba\n", "%pip install pythran\n", "# Your library here" ] }, { "cell_type": "code", "execution_count": 2, "id": "b2b10c42", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "...\n" ] } ], "source": [ "# MKL for performance\n", "%conda install mkl mkl-include mkl-devel\n", "\n", "# matplotlib to draw the results\n", "%pip install matplotlib" ] }, { "cell_type": "code", "execution_count": 3, "id": "927781f2", "metadata": {}, "outputs": 
[], "source": [ "# Setup code for plotting\n", "import matplotlib.pyplot as plt\n", "\n",
"def barplot(title, labels=False):\n",
"    # Relies on the globals `TIMES` and `np`, which later cells define before the first call\n",
"    x = ['numpy'] + sorted(TIMES.keys() - {'numpy'})\n",
"    bars = [np.median(TIMES[key].timings) for key in x]\n",
"    yerr = [np.std(TIMES[key].timings) for key in x]\n",
"    color = [('#86add9' if 'dace' in key else 'salmon') for key in x]\n",
"\n",
"    p = plt.bar(x, bars, yerr=yerr, color=color)\n",
"    plt.ylabel('Runtime [s]')\n",
"    plt.xlabel('Implementation')\n",
"    plt.title(title)\n",
"    if labels:\n",
"        plt.gca().bar_label(p)" ] }, { "cell_type": "code", "execution_count": 4, "id": "317721fd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Setup code for benchmarked frameworks\n", "import numpy as np\n", "import jax\n", "import numba\n", "import dace" ] }, { "cell_type": "code", "execution_count": 5, "id": "46238b6e", "metadata": {}, "outputs": [], "source": [ "# Pythran loads in a separate cell\n", "%load_ext pythran.magic" ] }, { "cell_type": "markdown", "id": "0e44fb94", "metadata": {}, "source": [ "## Simple programs with multiple operators\n", "\n", "Let's start with a basic program with three different operations. This example program was taken from the [JAX README](https://github.com/google/jax#compilation-with-jit):" ] }, { "cell_type": "code", "execution_count": 6, "id": "d9828ae7", "metadata": {}, "outputs": [], "source": [ "def slow_f(x):\n", "    return x * x + x * 2.0" ] }, { "cell_type": "markdown", "id": "74047bc8", "metadata": {}, "source": [ "First, let's measure the performance of NumPy as-is on this function:" ] }, { "cell_type": "code", "execution_count": 7, "id": "afe8910d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "68.6 ms ± 2.36 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "a = np.random.rand(5000, 5000)\n", "\n", "TIMES = {}\n", "\n", "TIMES['numpy'] = %timeit -o slow_f(a)" ] }, { "cell_type": "markdown", "id": "0d5afed2", "metadata": {}, "source": [ "Now we can construct Just-In-Time (JIT) compiled versions of this function, for each framework:" ] }, { "cell_type": "code", "execution_count": 8, "id": "f66e04d1", "metadata": {}, "outputs": [], "source": [ "jax_f = jax.jit(slow_f)\n", "numba_f = numba.jit(slow_f)\n", "dace_f = dace.program(auto_optimize=True)(slow_f)" ] }, { "cell_type": "code", "execution_count": 9, "id": "8b6f4f7b", "metadata": {}, "outputs": [], "source": [ "%%pythran\n", "#pythran export pythran_f(float64[:,:])\n", "def pythran_f(x):\n", " return x * x + x * 2.0" ] }, { "cell_type": "markdown", "id": "de148d29", "metadata": {}, "source": [ "Before we measure the time, we will run the functions first as a warmup, to allow compilers to run JIT compilation:" ] }, { "cell_type": "code", "execution_count": 10, "id": "99491394", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", "323 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", "1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" ] } ], "source": [ "# On your marks...\n", "%timeit -r 1 -n 1 jax_f(a).block_until_ready()\n", "%timeit -r 1 -n 1 numba_f(a)\n", "%timeit -r 1 -n 1 dace_f(a)\n", "%timeit -r 1 -n 1 pythran_f(a)\n", "pass\n", "# ...get set..." ] }, { "cell_type": "code", "execution_count": 11, "id": "067febc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43.6 ms ± 4.87 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "# ...Go!\n", "TIMES['jax'] = %timeit -o jax_f(a).block_until_ready()" ] }, { "cell_type": "code", "execution_count": 12, "id": "e7f811ff", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27.8 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "TIMES['numba'] = %timeit -o numba_f(a)" ] }, { "cell_type": "code", "execution_count": 13, "id": "e6d98ce6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31.3 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "TIMES['pythran'] = %timeit -o pythran_f(a)" ] }, { "cell_type": "code", "execution_count": 14, "id": "9db35692", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25.7 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "TIMES['dace_jit'] = %timeit -o dace_f(a)" ] }, { "cell_type": "markdown", "id": "4bb9be46", "metadata": {}, "source": [ "You could also precompile the program for faster runtimes (be aware that the return value is retained across calls!):" ] }, { "cell_type": "code", "execution_count": 15, "id": "a3c0702e", "metadata": {}, "outputs": [], "source": [ "# Either provide type annotations on the `@dace.program`, or call `compile` with sample arguments\n", "cprog = dace_f.compile(a)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d0754f47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21.5 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "TIMES['dace'] = %timeit -o cprog(a)" ] }, { "cell_type": "markdown", "id": "78d0830a", "metadata": {}, "source": [ "We can now plot the results:" ] }, { "cell_type": "code", "execution_count": 17, "id": "01ae5917", "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "barplot('Simple program, multiple operators')" ] }, { "cell_type": "markdown", "id": "16be4882", "metadata": {}, "source": [ "## Loops\n", "\n", "Here we test how interpreter overhead can be mitigated by the Python compiling frameworks. Let's take another application from Numba's [5 minute guide](https://numba.readthedocs.io/en/stable/user/5minguide.html):" ] }, { "cell_type": "code", "execution_count": 18, "id": "c7134a92", "metadata": {}, "outputs": [], "source": [ "def go_fast(a):\n", " trace = 0.0\n", " for i in range(a.shape[0]):\n", " trace += np.tanh(a[i, i])\n", " return a + trace" ] }, { "cell_type": "code", "execution_count": 19, "id": "844c1c84", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "b = np.random.rand(1000, 1000)\n", "\n", "TIMES = {}" ] }, { "cell_type": "code", "execution_count": 20, "id": "69ef66f2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.94 ms ± 109 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" ] } ], "source": [ "TIMES['numpy'] = %timeit -o go_fast(b)" ] }, { "cell_type": "code", "execution_count": 21, "id": "1b6aef84", "metadata": {}, "outputs": [], "source": [ "numba_fast = numba.jit(go_fast)" ] }, { "cell_type": "code", "execution_count": 22, "id": "e74804c4", "metadata": {}, "outputs": [], "source": [ "import jax.numpy as jnp\n", "\n", "@jax.jit\n",
"def jax_fast(a):\n",
"    trace = 0.0\n",
"    for i in range(a.shape[0]):\n",
"        trace += jnp.tanh(a[i, i])\n",
"    return a + trace" ] }, { "cell_type": "code", "execution_count": 23, "id": "f88a24c6", "metadata": {}, "outputs": [], "source": [ "N = dace.symbol('N')\n", "\n", "@dace.program(auto_optimize=True)\n",
"def dace_fast(a: dace.float64[N, N]):\n",
"    trace = 0.0\n",
"    for i in range(N):\n",
"        trace += np.tanh(a[i, i])\n",
"    return a + trace" ] }, { "cell_type": "code", "execution_count": 24, "id": "e6f18b89", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%pythran\n", "from numpy import tanh\n", "\n", "#pythran export pythran_fast(float64[:,:])\n",
"def pythran_fast(a):\n",
"    trace = 0.0\n",
"    for i in range(a.shape[0]):\n",
"        trace += tanh(a[i, i])\n",
"    return a + trace" ] }, { "cell_type": "code", "execution_count": 25, "id": "e7e5ab60", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DaCe compilation time: 0.5581727027893066 seconds\n" ] } ], "source": [ "import time\n", "start = time.time()\n", "csdfg = dace_fast.compile(b)\n", "print('DaCe compilation time:', time.time() - start, 'seconds')" ] }, { "cell_type": "code", "execution_count": 26, "id": "a67d01ac", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", "147 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n" ] } ], "source": [ "%timeit -r 1 -n 1 jax_fast(b).block_until_ready()\n", "%timeit -r 1 -n 1 numba_fast(b)\n", "%timeit -r 1 -n 1 pythran_fast(b)" ] }, { "cell_type": "markdown", "id": "7e0ffab9", "metadata": {}, "source": [ "Note that the slow JAX first run time is due to the inspector/executor model, in which the compilation time depends on the size of the array." ] }, { "cell_type": "code", "execution_count": 27, "id": "97657722", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.28 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "TIMES['jax'] = %timeit -o jax_fast(b).block_until_ready()" ] }, { "cell_type": "code", "execution_count": 28, "id": "6696626d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "970 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "TIMES['numba'] = %timeit -o numba_fast(b)" ] }, { "cell_type": "code", "execution_count": 29, "id": "98a80c82", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "673 µs ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "TIMES['pythran'] = %timeit -o pythran_fast(b)" ] }, { "cell_type": "code", "execution_count": 30, "id": "7a741c90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "668 µs ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "TIMES['dace'] = %timeit -o csdfg(b, N=b.shape[0])" ] }, { "cell_type": "code", "execution_count": 31, "id": "fc4c6fb2", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "barplot('Loops')" ] }, { "cell_type": "markdown", "id": "e888f1f6", "metadata": {}, "source": [ "### Varying sizes\n", "\n", "Since the DaCe program was defined symbolically, the input array size can be changed without recompilation:" ] }, { "cell_type": "code", "execution_count": 32, "id": "fbdc52c3", "metadata": {}, "outputs": [], "source": [ "sizes = [np.random.randint(700, 5000) for _ in range(10)]\n", "arrays = [np.random.rand(n, n) for n in sizes]\n", "\n", "def vary_size(call):\n", " for a in arrays:\n", " call(a)\n", "\n", "def vary_size_dace(call):\n", " for a, n in zip(arrays, sizes):\n", " call(a, N=n)\n", " \n", "def vary_size_jax(call):\n", " for a in arrays:\n", " call(a).block_until_ready()\n", " \n", "TIMES = {}" ] }, { "cell_type": "code", "execution_count": 33, "id": "2aa26e86", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "155 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "125 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "124 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "114 ms ± 8.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "334 ms ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "TIMES['numpy'] = %timeit -o vary_size(go_fast)\n", "TIMES['numba'] = %timeit -o vary_size(numba_fast)\n", "TIMES['pythran'] = %timeit -o vary_size(pythran_fast)\n", "TIMES['dace'] = %timeit -o vary_size_dace(csdfg)\n", "TIMES['jax'] = %timeit -o vary_size_jax(jax_fast)" ] }, { "cell_type": "code", "execution_count": 34, "id": "144b470a", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "barplot('Loop - Varying sizes')" ] }, { "cell_type": "markdown", "id": "16405894", "metadata": {}, "source": [ "## Auto-parallelization\n", "\n", "DaCe can use data-centric dependency analysis to not only track and reduce data movement, but also automatically extract parallel regions in code. Here we look at a simple program and how it is run in parallel. We use the `auto_optimize` flag in the `dace.program` decorator to automatically apply optimization heuristics." ] }, { "cell_type": "code", "execution_count": 35, "id": "eb5b28ca", "metadata": {}, "outputs": [], "source": [ "def element_update(a):\n", " return a * 5\n", "\n", "def someforloop(A):\n", " for i in range(A.shape[0]):\n", " for j in range(A.shape[1]):\n", " A[i, j] = element_update(A[i, j])" ] }, { "cell_type": "code", "execution_count": 36, "id": "d80217b2", "metadata": {}, "outputs": [], "source": [ "a = np.random.rand(1000, 1000)\n", "daceloop = dace.program(auto_optimize=True)(someforloop)" ] }, { "cell_type": "markdown", "id": "f2ba2545", "metadata": {}, "source": [ "Here it is compared with numpy and numba's similar capability:" ] }, { "cell_type": "code", "execution_count": 37, "id": "8420d1f0", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "446 ms ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "406 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", "549 µs ± 212 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" ] } ], "source": [ "numbaloop = numba.jit(parallel=True)(someforloop)\n", "csdfg = daceloop.compile(a)\n", "\n", "TIMES = {}\n", "TIMES['numpy'] = %timeit -o someforloop(a)\n", "TIMES['numba'] = %timeit -o numbaloop(a)\n", "TIMES['dace'] = %timeit -o csdfg(a)" ] }, { "cell_type": "code", "execution_count": 38, "id": "36a48195", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "barplot('Automatic parallelization', labels=True)" ] }, { "cell_type": "markdown", "id": "3864237f", "metadata": {}, "source": [ "As we can see, the nested call triggered the numba code to stay sequential, whereas the global data dependency analysis in DaCe allowed it to parallelize the code, yielding a performance of **549 µs** vs. 406 ms." ] }, { "cell_type": "markdown", "id": "f71c9c78", "metadata": {}, "source": [ "## 3D Heat Diffusion\n", "\n", "As a more realistic application, the following program, `heat3d` is taken from the [NPBench numpy benchmark](https://github.com/spcl/npbench). It runs a three-dimensional stencil repeatedly to perform heat diffusion:" ] }, { "cell_type": "code", "execution_count": 39, "id": "8b7f6433", "metadata": {}, "outputs": [], "source": [ "def heat3d(TSTEPS, A, B):\n", " for t in range(1, TSTEPS):\n", " B[1:-1, 1:-1,\n", " 1:-1] = (0.125 * (A[2:, 1:-1, 1:-1] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[:-2, 1:-1, 1:-1]) + 0.125 *\n", " (A[1:-1, 2:, 1:-1] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[1:-1, :-2, 1:-1]) + 0.125 *\n", " (A[1:-1, 1:-1, 2:] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[1:-1, 1:-1, 0:-2]) + A[1:-1, 1:-1, 1:-1])\n", " A[1:-1, 1:-1,\n", " 1:-1] = (0.125 * (B[2:, 1:-1, 1:-1] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[:-2, 1:-1, 1:-1]) + 0.125 *\n", " (B[1:-1, 2:, 1:-1] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[1:-1, :-2, 1:-1]) + 0.125 *\n", " (B[1:-1, 1:-1, 2:] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[1:-1, 1:-1, 0:-2]) + B[1:-1, 1:-1, 1:-1])" ] }, { "cell_type": "code", "execution_count": 40, "id": "d8e54447", "metadata": {}, "outputs": [], "source": [ "# Using the \"L\" size\n", "TSTEPS, N = 100, 70\n", "A = np.fromfunction(lambda i, j, k: (i + j + (N - k)) * 10 / N, (N, N, N),\n", " dtype=np.float64)\n", "B = np.copy(A)" ] }, { "cell_type": "code", "execution_count": 41, "id": "29ef687a", "metadata": {}, "outputs": [], "source": [ 
"dace_heat3d = dace.program(auto_optimize=True)(heat3d)\n", "numba_heat3d = numba.jit(nopython=True, parallel=True)(heat3d)" ] }, { "cell_type": "code", "execution_count": 42, "id": "a7606742", "metadata": {}, "outputs": [], "source": [ "%%pythran\n", "#pythran export pythran_heat3d(int, float64[:,:,:], float64[:,:,:])\n", "def pythran_heat3d(TSTEPS, A, B):\n", " for t in range(1, TSTEPS):\n", " B[1:-1, 1:-1,\n", " 1:-1] = (0.125 * (A[2:, 1:-1, 1:-1] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[:-2, 1:-1, 1:-1]) + 0.125 *\n", " (A[1:-1, 2:, 1:-1] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[1:-1, :-2, 1:-1]) + 0.125 *\n", " (A[1:-1, 1:-1, 2:] - 2.0 * A[1:-1, 1:-1, 1:-1] +\n", " A[1:-1, 1:-1, 0:-2]) + A[1:-1, 1:-1, 1:-1])\n", " A[1:-1, 1:-1,\n", " 1:-1] = (0.125 * (B[2:, 1:-1, 1:-1] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[:-2, 1:-1, 1:-1]) + 0.125 *\n", " (B[1:-1, 2:, 1:-1] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[1:-1, :-2, 1:-1]) + 0.125 *\n", " (B[1:-1, 1:-1, 2:] - 2.0 * B[1:-1, 1:-1, 1:-1] +\n", " B[1:-1, 1:-1, 0:-2]) + B[1:-1, 1:-1, 1:-1])" ] }, { "cell_type": "code", "execution_count": 43, "id": "3b2218b6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.28 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", "3.75 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", "216 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" ] } ], "source": [ "# Warmup\n", "%timeit -r 1 -n 1 dace_heat3d(TSTEPS, A, B)\n", "%timeit -r 1 -n 1 numba_heat3d(TSTEPS, A, B)\n", "%timeit -r 1 -n 1 pythran_heat3d(TSTEPS, A, B)" ] }, { "cell_type": "code", "execution_count": 44, "id": "d3975c40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "799 ms ± 11.3 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "TIMES = {}\n", "TIMES['numpy'] = %timeit -o heat3d(TSTEPS, A, B)" ] }, { "cell_type": "code", "execution_count": 45, "id": "452597f9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11.2 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "77.1 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "184 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "TIMES['dace'] = %timeit -o dace_heat3d(TSTEPS, A, B)\n", "TIMES['numba'] = %timeit -o numba_heat3d(TSTEPS, A, B)\n", "TIMES['pythran'] = %timeit -o pythran_heat3d(TSTEPS, A, B)" ] }, { "cell_type": "code", "execution_count": 46, "id": "38bb42f7", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "barplot('3D Heat Diffusion', labels=True)" ] }, { "cell_type": "markdown", "id": "c08e4fd2", "metadata": {}, "source": [ "## Benchmarking and Instrumentation API\n", "\n", "When optimizing programs in DaCe, it is useful to know the raw time the compiled program takes or any of its components. For this purpose, DaCe includes an instrumentation API, which allows you to time each SDFG, state, map, or tasklet directly from the code.\n", "\n", "The instrumentation providers given in DaCe can measure different metrics: wall-clock time, GPU (CUDA/HIP) events, PAPI performance counters, and more (it's extensible).\n", "\n", "Performance results are saved as report files in CSV format or the `chrome://tracing` JSON format for easy timeline view." ] }, { "cell_type": "markdown", "id": "8cc5a3cd", "metadata": {}, "source": [ "### Profiling API\n", "First, we demonstrate the profiling API, which is a simple low-level timer that will run every called DaCe program a number of times and print out the median runtime." 
] }, { "cell_type": "code", "execution_count": 47, "id": "c4e40139", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "...\n" ] } ], "source": [ "# Setup some optional dependencies for viewing results and printing progress\n", "%pip install pandas tqdm" ] }, { "cell_type": "code", "execution_count": 48, "id": "06675f18", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Profiling...\n", "Profiling: 100%|██████████| 100/100 [00:00<00:00, 1106.55it/s]\n", "DaCe 0.2954999217763543 ms\n" ] } ], "source": [
"# Temporarily set the DACE_profiling config to True\n",
"with dace.config.set_temporary('profiling', value=True):\n",
"    # You can control the number of times a program is run with the treps configuration\n",
"    with dace.config.set_temporary('treps', value=100):\n",
"        daceloop(a)" ] }, { "cell_type": "markdown", "id": "b624ef29", "metadata": {}, "source": [ "This can also be controlled with environment variables. Setting `DACE_profiling=1` and `DACE_treps=100` achieves the same effect on the entire script.\n", "\n", "The report is saved as a CSV file in the `.dacecache/<name>/profiling` folder, where `<name>` is the program or SDFG name." ] }, { "cell_type": "code", "execution_count": 49, "id": "e8f7f6b1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Program</th><th>Optimization</th><th>Problem_Size</th><th>Runtime_sec</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.006098</td></tr>\n",
"    <tr><th>1</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000454</td></tr>\n",
"    <tr><th>2</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000398</td></tr>\n",
"    <tr><th>3</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000367</td></tr>\n",
"    <tr><th>4</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000271</td></tr>\n",
"    <tr><th>5</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000304</td></tr>\n",
"    <tr><th>6</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000249</td></tr>\n",
"    <tr><th>7</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.004182</td></tr>\n",
"    <tr><th>8</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000413</td></tr>\n",
"    <tr><th>9</th><td>someforloop_0</td><td>DaCe</td><td>-f</td><td>0.000379</td></tr>\n",
"  </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " Program Optimization Problem_Size Runtime_sec\n", "0 someforloop_0 DaCe -f 0.006098\n", "1 someforloop_0 DaCe -f 0.000454\n", "2 someforloop_0 DaCe -f 0.000398\n", "3 someforloop_0 DaCe -f 0.000367\n", "4 someforloop_0 DaCe -f 0.000271\n", "5 someforloop_0 DaCe -f 0.000304\n", "6 someforloop_0 DaCe -f 0.000249\n", "7 someforloop_0 DaCe -f 0.004182\n", "8 someforloop_0 DaCe -f 0.000413\n", "9 someforloop_0 DaCe -f 0.000379" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('.dacecache/someforloop/profiling/results-1644308750891.csv')\n", "df.head(10)" ] }, { "cell_type": "markdown", "id": "68cef56c", "metadata": {}, "source": [ "### Instrumentation API\n", "\n", "The Instrumentation API allows more fine-grained control over measuring program metrics. It creates a JSON report in `.dacecache//perf`, which can be obtained with the API or viewed with any Chrome Tracing capable viewer. More usage information and how to use the API to tune programs can be found in the [program tuning sample](https://github.com/spcl/dace/blob/master/samples/optimization/tuning.py)." ] }, { "cell_type": "code", "execution_count": 50, "id": "6ecccc92", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@dace.program\n", "def twomaps(A):\n", "    B = np.sin(A)\n", "    return B * 2.0\n", "\n", "a = np.random.rand(1000, 1000)\n", "sdfg = twomaps.to_sdfg(a)\n", "sdfg" ] }, { "cell_type": "markdown", "id": "91a124d6", "metadata": {}, "source": [ "We will now instrument each of the maps in the program separately, to see which one is a potential bottleneck:" ] }, { "cell_type": "code", "execution_count": 51, "id": "4dc5531f", "metadata": {}, "outputs": [], "source": [ "# Get all maps\n", "maps = [n for n, _ in sdfg.all_nodes_recursive() if isinstance(n, dace.nodes.MapEntry)]\n", "\n", "# Instrument with wall-clock timer\n", "for m in maps:\n", "    m.instrument = dace.InstrumentationType.Timer" ] }, { "cell_type": "code", "execution_count": 52, "id": "429a942d", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"array([[0.57072424, 1.62590182, 0.54045806, ..., 1.42865334, 0.74420338,\n",
"        1.34051505],\n",
"       [0.56169953, 0.33241204, 1.18265858, ..., 1.18433834, 0.45687267,\n",
"        0.03173654],\n",
"       [0.21026808, 1.38539332, 1.13363577, ..., 1.20282264, 1.26179853,\n",
"        0.94529241],\n",
"       ...,\n",
"       [0.58080043, 1.38410909, 1.12745291, ..., 1.54076988, 0.73878048,\n",
"        0.76149314],\n",
"       [1.34720999, 1.08957421, 0.75846927, ..., 1.01317063, 0.13351551,\n",
"        1.13468273],\n",
"       [1.2947957 , 1.0325859 , 1.50298925, ..., 0.56601298, 1.08368357,\n",
"        1.29880744]])" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Run SDFG and create report\n", "sdfg(a)" ] }, { "cell_type": "code", "execution_count": 53, "id": "bc980e7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Instrumentation report\n", "SDFG Hash: 0f02b642249b861dc94b7cbc729190d4b27cab79607b8f28c7de3946e62d5977\n", "---------------------------------------------------------------------------\n", "Element 
Runtime (ms) \n", " Min Mean Median Max \n", "---------------------------------------------------------------------------\n", "SDFG (0) \n", "|-State (0) \n", "| |-Node (0) \n", "| | |Map _numpy_sin__map: \n", "| | | 11.654 11.654 11.654 11.654 \n", "| |-Node (5) \n", "| | |Map _Mult__map: \n", "| | | 1.524 1.524 1.524 1.524 \n", "---------------------------------------------------------------------------\n", "\n" ] } ], "source": [ "# Get the latest instrumentation report from .dacecache/twomaps/perf\n", "report = sdfg.get_latest_report()\n", "\n", "# Print report in a nicely readable format\n", "print(report)" ] }, { "cell_type": "markdown", "id": "5e2a0e77", "metadata": {}, "source": [ "As we can see, the `np.sin` statement is more expensive than the multiplication statement (about 11.7 ms vs. 1.5 ms)." ] }, { "cell_type": "markdown", "id": "590a0a2d", "metadata": {}, "source": [ "These reports can also be loaded directly into the Visual Studio Code plugin to overlay the information on the graph, as shown below:\n", "\n", "![](https://raw.githubusercontent.com/spcl/dace-vscode/master/images/analysis.gif)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "metadata": { "interpreter": { "hash": "ef60a094ca1873cf2e62a8dbe2e76beaf211a154f1b9ff0db0c7157806bcfce0" } } }, "nbformat": 4, "nbformat_minor": 5 }