{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Performance optimization overview\n", "\n", "The purpose of this tutorial is two-fold\n", "\n", "* Illustrate the performance optimizations applied to the code generated by an `Operator`.\n", "* Describe the options Devito provides to users to steer the optimization process.\n", "\n", "As we shall see, most optimizations are automatically applied as they're known to systematically improve performance.\n", "\n", "An Operator has several preset **optimization levels**; the fundamental ones are `noop` and `advanced`. With `noop`, no performance optimizations are introduced. With `advanced`, several flop-reducing and data locality optimization passes are applied. Examples of flop-reducing optimization passes are common sub-expressions elimination and factorization, while examples of data locality optimization passes are loop fusion and cache blocking. Optimization levels in Devito are conceptually akin to the `-O2, -O3, ...` flags in classic C/C++/Fortran compilers.\n", "\n", "An optimization pass may provide knobs, or **options**, for fine-grained tuning. As explained in the next sections, some of these options are given at compile-time, others at run-time.\n", "\n", "**\\*\\* Remark \\*\\***\n", "\n", "Parallelism -- both shared-memory (e.g., OpenMP) and distributed-memory (MPI) -- is _by default disabled_ and is _not_ controlled via the optimization level. In this tutorial we will also show how to enable OpenMP parallelism (you'll see it's trivial!). Another mini-guide about parallelism in Devito and related aspects is available [here](https://github.com/devitocodes/devito/tree/master/benchmarks/user#a-step-back-configuring-your-machine-for-reliable-benchmarking). \n", "\n", "**\\*\\*\\*\\***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "* [API](#API)\n", "* [Default values](#Default-values)\n", "* [Running example](#Running-example)\n", "* [OpenMP parallelism](#OpenMP-parallelism)\n", "* [The `advanced` mode](#The-advanced-mode)\n", "* [The `advanced-fsg` mode](#The-advanced-fsg-mode)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API\n", "\n", "The optimization level may be changed in various ways:\n", "\n", "* globally, through the `DEVITO_OPT` environment variable. For example, to disable all optimizations on all `Operator`'s, one could run with\n", "\n", "```\n", "DEVITO_OPT=noop python ...\n", "```\n", "\n", "* programmatically, adding the following lines to a program\n", "\n", "```\n", "from devito import configuration\n", "configuration['opt'] = 'noop'\n", "```\n", "\n", "* locally, as an `Operator` argument\n", "\n", "```\n", "Operator(..., opt='noop')\n", "```\n", "\n", "Local takes precedence over programmatic, and programmatic takes precedence over global.\n", "\n", "The optimization options, instead, may only be changed locally. The syntax to specify an option is\n", "\n", "```\n", "Operator(..., opt=('advanced', {})\n", "```\n", "\n", "A concrete example (you can ignore the meaning for now) is\n", "\n", "```\n", "Operator(..., opt=('advanced', {'blocklevels': 2})\n", "```\n", "\n", "That is, options are to be specified _together with_ the optimization level (`advanced`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Default values\n", "\n", "By default, all `Operator`s are run with the optimization level set to `advanced`. 
So this\n", "\n", "```\n", "Operator(Eq(...))\n", "```\n", "\n", "is equivalent to\n", "\n", "```\n", "Operator(Eq(...), opt='advanced')\n", "```\n", "\n", "and obviously also to\n", "\n", "```\n", "Operator(Eq(...), opt=('advanced', {}))\n", "```\n", "\n", "In virtually all scenarios, regardless of application and underlying architecture, this ensures very good performance -- but not necessarily the very best." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Misc\n", "\n", "The following functions will be used throughout the notebook for printing generated code." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from examples.performance import unidiff_output, print_kernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is only needed for Continuous Integration. But actually it's also an example of how \"programmatic takes precedence over global\" (see [API](#API) section)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from devito import configuration\n", "configuration['language'] = 'C'\n", "configuration['platform'] = 'bdw' # Optimize for an Intel Broadwell\n", "configuration['opt-options']['par-collapse-ncores'] = 1 # Maximize use loop collapsing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running example\n", "\n", "Throughout the notebook we will generate `Operator`'s for the following time-marching `Eq`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from devito import Eq, Grid, Operator, Function, TimeFunction, sin\n", "\n", "grid = Grid(shape=(80, 80, 80))\n", "\n", "f = Function(name='f', grid=grid)\n", "u = TimeFunction(name='u', grid=grid, space_order=4)\n", "\n", "eq = Eq(u.forward, f**2*sin(f)*u.dy.dy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite its simplicity, this `Eq` is all we need to showcase the key components of the Devito optimization engine." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OpenMP parallelism\n", "\n", "There are several ways to enable OpenMP parallelism. The one we use here consists of supplying an _option_ to an `Operator`. The next cell illustrates the difference between two `Operator`'s generated with the `noop` optimization level, but with OpenMP enabled on the latter one." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op0 = Operator(eq, opt=('noop'))\n", "op0_omp = Operator(eq, opt=('noop', {'openmp': True}))\n", "\n", "# print(op0)\n", "# print(unidiff_output(str(op0), str(op0_omp))) # Uncomment to print out the diff only\n", "print_kernel(op0_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The OpenMP-ized `op0_omp` `Operator` includes:\n", "\n", " - the header file `\"omp.h\"`\n", " - a `#pragma omp parallel num_threads(nthreads)` directive\n", " - a `#pragma omp for collapse(...) schedule(dynamic,1)` directive\n", " \n", "More complex `Operator`'s will have more directives, more types of directives, different iteration scheduling strategies based on heuristics and empirical tuning (e.g., `static` instead of `dynamic`), etc.\n", "\n", "We note how the OpenMP pass introduces a new symbol, `nthreads`. This allows users to explicitly control the number of threads with which an `Operator` is run.\n", "\n", "```\n", "op0_omp.apply(time_M=0) # Picks up `nthreads` from the standard environment variable OMP_NUM_THREADS\n", "op0_omp.apply(time_M=0, nthreads=2) # Runs with 2 threads per parallel loop\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few optimization options are available for this pass (but not on all platforms, see [here](https://github.com/devitocodes/devito/blob/master/examples/performance/README.md)), though in our experience the default values do a fine job:\n", "\n", "* `par-collapse-ncores`: use a collapse clause only if the number of available physical cores is greater than this value (default=4).\n", "* `par-collapse-work`: use a collapse clause only if the trip count of the collapsable loops is statically known to exceed this value (default=100).\n", "* `par-chunk-nonaffine`: a coefficient to adjust the chunk size in non-affine parallel loops. 
The larger the coefficient, the smaller the chunk size (default=3).\n", "* `par-dynamic-work`: use dynamic scheduling if the operation count per iteration exceeds this value. Otherwise, use static scheduling (default=10).\n", "* `par-nested`: use nested parallelism if the number of hyperthreads per core is greater than this value (default=2).\n", "\n", "So, for instance, we could switch to static scheduling by constructing the following `Operator`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op0_b0_omp = Operator(eq, opt=('noop', {'openmp': True, 'par-dynamic-work': 100}))\n", "print_kernel(op0_b0_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `advanced` mode\n", "\n", "The default optimization level in Devito is `advanced`. This mode performs several compilation passes to optimize the `Operator` for computation (number of flops), working set size, and data locality. In the next paragraphs we'll dissect the `advanced` mode to analyze, one by one, some of its key passes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loop blocking\n", "\n", "The next cell creates a new `Operator` that adds loop blocking to what we had in `op0_omp`. 
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size)\n", " {\n", " for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size)\n", " {\n", " for (int x = x0_blk0; x <= MIN(x_M, x0_blk0 + x0_blk0_size - 1); x += 1)\n", " {\n", " for (int y = y0_blk0; y <= MIN(y_M, y0_blk0 + y0_blk0_size - 1); y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op1_omp = Operator(eq, opt=('blocking', {'openmp': True}))\n", "# print(op1_omp) # Uncomment to see the *whole* generated code\n", "print_kernel(op1_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\\*\\* Remark \\*\\***\n", "\n", "`'blocking'` is **not** an optimization level -- it rather identifies a specific compilation pass. In other words, the `advanced` mode defines an ordered sequence of passes, and `blocking` is one such pass.\n", "\n", "**\\*\\*\\*\\***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `blocking` pass creates additional loops over blocks. In this simple `Operator` there's just one loop nest, so only a pair of additional loops are created. In more complex `Operator`'s, several loop nests may individually be blocked, whereas others may be left unblocked -- this is decided by the Devito compiler according to certain heuristics. The size of a block is represented by the symbols `x0_blk0_size` and `y0_blk0_size`, which are runtime parameters akin to `nthreads`. \n", "\n", "By default, Devito applies 2D blocking and sets the default block shape to 8x8. There are two ways to set a different block shape:\n", "\n", "* passing an explicit value. For instance, below we run with a 24x8 block shape\n", "\n", "```\n", "op1_omp.apply(..., x0_blk0_size=24)\n", "```\n", "\n", "* letting the autotuner pick up a better block shape for us. There are several autotuning modes. 
A short summary is available [here](https://github.com/devitocodes/devito/wiki/FAQ#devito_autotuning)\n", "\n", "```\n", "op1_omp.apply(..., autotune='aggressive')\n", "```\n", "\n", "Loop blocking also provides two optimization options:\n", "\n", "* `blockinner={False, True}` -- to enable 3D (or any nD, n>2) blocking\n", "* `blocklevels={int}` -- to enable hierarchical blocking, to exploit multiple levels of the cache hierarchy \n", "\n", "In the example below, we construct an `Operator` with six-dimensional loop blocking: the first three loops represent outer blocks, whereas the second three loops represent inner blocks within an outer block." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size)\n", " {\n", " for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size)\n", " {\n", " for (int z0_blk0 = z_m; z0_blk0 <= z_M; z0_blk0 += z0_blk0_size)\n", " {\n", " for (int x0_blk1 = x0_blk0; x0_blk1 <= MIN(x_M, x0_blk0 + x0_blk0_size - 1); x0_blk1 += x0_blk1_size)\n", " {\n", " for (int y0_blk1 = y0_blk0; y0_blk1 <= MIN(y_M, y0_blk0 + y0_blk0_size - 1); y0_blk1 += y0_blk1_size)\n", " {\n", " for (int z0_blk1 = z0_blk0; z0_blk1 <= MIN(z_M, z0_blk0 + z0_blk0_size - 1); z0_blk1 += z0_blk1_size)\n", " {\n", " for (int x = x0_blk1; x <= MIN(x_M, x0_blk1 + x0_blk1_size - 1); x += 1)\n", " {\n", " for (int y = y0_blk1; y <= MIN(y_M, y0_blk1 + y0_blk1_size - 1); y += 1)\n", " {\n", " for (int z = z0_blk1; z <= MIN(z_M, z0_blk1 + z0_blk1_size - 1); z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op1_omp_6D = Operator(eq, opt=('blocking', {'blockinner': True, 'blocklevels': 2, 'openmp': True}))\n", "# print(op1_omp_6D) # Uncomment to see the *whole* generated code\n", "print_kernel(op1_omp_6D)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SIMD vectorization\n", "\n", "Devito enforces SIMD vectorization through OpenMP pragmas." 
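, "\n", "\n", "Concretely, the `simd` pass annotates the innermost (typically `z`) loops with directives of the form below, as will be visible in several of the generated kernels later in this notebook:\n", "\n", "```\n", "#pragma omp simd aligned(u:32)\n", "for (int z = z_m; z <= z_M; z += 1)\n", "{\n", "  ...\n", "}\n", "```"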
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "op2_omp = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))\n", "# print(op2_omp) # Uncomment to see the generated code\n", "# print(unidiff_output(str(op1_omp), str(op2_omp))) # Uncomment to print out the diff only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Code motion\n", "\n", "The `advanced` mode has a code motion pass. In explicit PDE solvers, this is most commonly used to lift expensive time-invariant sub-expressions out of the inner loops. The pass is however quite general in that it is not restricted to the concept of time-invariance -- any sub-expression invariant with respect to a subset of `Dimension`s is a code motion candidate. In our running example, `sinf(f)` gets hoisted out of the inner loops since it is determined to be an expensive invariant sub-expression. In other words, the compiler trades the redundant computation of `sinf(f)` for additional storage (the `r0[...]` array)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "float r1 = 1.0F/h_y;\n", "\n", "START(section0)\n", "#pragma omp parallel num_threads(nthreads)\n", "{\n", " #pragma omp for collapse(3) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y][z] = sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", "}\n", "STOP(section0,timers)\n", "\n", "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section1)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F*r1)*(8.33333333e-2F*r1*u[t0][x + 4][y + 1][z + 4] - 6.66666667e-1F*r1*u[t0][x + 4][y + 2][z + 4] + 6.66666667e-1F*r1*u[t0][x + 4][y + 4][z + 4] - 8.33333333e-2F*r1*u[t0][x + 4][y + 5][z + 4]) + (-8.33333333e-2F*r1)*(8.33333333e-2F*r1*u[t0][x + 4][y + 4][z + 4] - 6.66666667e-1F*r1*u[t0][x + 4][y + 5][z + 4] + 6.66666667e-1F*r1*u[t0][x + 4][y + 7][z + 4] - 8.33333333e-2F*r1*u[t0][x + 4][y + 8][z + 4]) + (8.33333333e-2F*r1)*(8.33333333e-2F*r1*u[t0][x + 4][y][z + 4] - 6.66666667e-1F*r1*u[t0][x + 4][y + 1][z + 4] + 6.66666667e-1F*r1*u[t0][x + 4][y + 3][z + 4] - 8.33333333e-2F*r1*u[t0][x + 4][y + 4][z + 4]) + (6.66666667e-1F*r1)*(8.33333333e-2F*r1*u[t0][x + 4][y + 3][z + 4] - 6.66666667e-1F*r1*u[t0][x + 4][y + 4][z + 4] + 6.66666667e-1F*r1*u[t0][x + 4][y + 6][z + 4] - 8.33333333e-2F*r1*u[t0][x + 4][y + 7][z + 4]))*pow(f[x + 1][y + 1][z + 1], 2)*r0[x][y][z];\n", " }\n", " }\n", " }\n", " }\n", " STOP(section1,timers)\n", "}\n" ] } ], "source": [ "op3_omp = Operator(eq, opt=('lift', {'openmp': True}))\n", "print_kernel(op3_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic flop-reducing transformations\n", "\n", "Among the simpler flop-reducing transformations applied by the `advanced` mode we find:\n", "\n", "* \"classic\" common sub-expressions elimination (CSE),\n", "* factorization,\n", "* optimization of powers\n", "\n", "The cell below shows how the computation of `u` changes by incrementally applying these three passes. 
First of all, we observe how the reciprocal of the symbolic spacing `h_y` gets assigned to a temporary, `r0`, as it appears in several sub-expressions. This is the effect of CSE." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- \n", "+++ \n", "@@ -42,7 +42,8 @@\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", "- u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", "+ float r0 = 1.0F/h_y;\n", "+ u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 1][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 2][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 5][z + 4]) + (-8.33333333e-2F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 5][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 7][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 8][z + 4]) + (8.33333333e-2F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 1][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 3][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4]) + (6.66666667e-1F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 3][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 6][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 7][z + 4]))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", "\n" ] } ], "source": [ "op4_omp = Operator(eq, opt=('cse', {'openmp': True}))\n", "print(unidiff_output(str(op0_omp), str(op4_omp)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The factorization pass makes sure `r0` is collected to reduce the number of multiplications." 
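, "\n", "\n", "Schematically, the transformation collects common factors such as `r0` and the finite-difference coefficients, that is\n", "\n", "```\n", "a*r0*u[...] + b*r0*u[...]  -->  r0*(a*u[...] + b*u[...])\n", "```\n", "\n", "as the diff below shows."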
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- \n", "+++ \n", "@@ -43,7 +43,7 @@\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " float r0 = 1.0F/h_y;\n", "- u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 1][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 2][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 5][z + 4]) + (-8.33333333e-2F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 5][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 7][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 8][z + 4]) + (8.33333333e-2F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 1][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 3][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4]) + (6.66666667e-1F*r0)*(8.33333333e-2F*r0*u[t0][x + 4][y + 3][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 6][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 7][z + 4]))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", "+ u[t1][x + 4][y + 4][z + 4] = pow(r0, 2)*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", "\n" ] } ], "source": [ "op5_omp = Operator(eq, opt=('cse', 'factorize', {'openmp': True}))\n", "print(unidiff_output(str(op4_omp), str(op5_omp)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, `opt-pows` turns costly `pow` calls into multiplications." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- \n", "+++ \n", "@@ -43,7 +43,7 @@\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " float r0 = 1.0F/h_y;\n", "- u[t1][x + 4][y + 4][z + 4] = pow(r0, 2)*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", "+ u[t1][x + 4][y + 4][z + 4] = (r0*r0)*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", "\n" ] } ], "source": [ "op6_omp = Operator(eq, opt=('cse', 'factorize', 'opt-pows', {'openmp': True}))\n", "print(unidiff_output(str(op5_omp), str(op6_omp)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-iteration redundancy elimination (CIRE)\n", "\n", "This is perhaps the most advanced among the optimization passes in Devito. CIRE [1] searches for redundancies across consecutive loop iterations. 
These are often induced by a mixture of nested, high-order, partial derivatives. The underlying idea is very simple. Consider:\n", "\n", "```\n", "r0 = a[i-1] + a[i] + a[i+1]\n", "```\n", "\n", "at `i=1`, we have\n", "\n", "```\n", "r0 = a[0] + a[1] + a[2]\n", "```\n", "\n", "at `i=2`, we have\n", "\n", "```\n", "r0 = a[1] + a[2] + a[3]\n", "```\n", "\n", "So the sub-expression `a[1] + a[2]` is computed twice, by two consecutive iterations.\n", "What makes CIRE complicated is the generalization to arbitrary expressions, the presence of multiple dimensions, the scheduling strategy due to the trade-off between redundant compute and working set, and the co-existance with other optimizations (e.g., blocking, vectorization). All these aspects won't be treated here. What instead we will show is the effect of CIRE in our running example and the optimization options at our disposal to drive the detection and scheduling of the captured redundancies.\n", "\n", "In our running example, some cross-iteration redundancies are induced by the nested first-order derivatives along `y`. As we see below, these redundancies are captured and assigned to the two-dimensional temporary `r0`.\n", "\n", "Note: the name `cire-sops` means \"Apply CIRE to sum-of-product expressions\". A sum-of-product is what a finite difference derivative produces." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid];\n", "\n", " #pragma omp for schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y;\n", " }\n", " }\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*r0[y + 1][z] + (-8.33333333e-2F/h_y)*r0[y + 4][z] + (8.33333333e-2F/h_y)*r0[y][z] + (6.66666667e-1F/h_y)*r0[y + 3][z])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op7_omp = Operator(eq, opt=('cire-sops', 'simd', {'openmp': True}))\n", "print_kernel(op7_omp)\n", "# print(unidiff_output(str(op7_omp), str(op0_omp))) # Uncomment to print out the diff only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We note that since there are no redundancies along `x`, the compiler is smart to figure out that `r0` and `u` can safely be computed within the same `x` loop. 
This is nice -- not only is the reuse distance decreased, but also a grid-sized temporary avoided.\n", "\n", "#### The `min-storage` option\n", "\n", "Let's now consider a variant of our running example" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m - 2; x <= x_M + 2; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x + 2][y + 2][z] = 8.33333333e-2F*u[t0][x + 2][y + 4][z + 4]/h_x - 6.66666667e-1F*u[t0][x + 3][y + 4][z + 4]/h_x + 6.66666667e-1F*u[t0][x + 5][y + 4][z + 4]/h_x - 8.33333333e-2F*u[t0][x + 6][y + 4][z + 4]/h_x;\n", "\n", " r1[x + 2][y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y;\n", " }\n", " }\n", " }\n", " }\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_x)*r0[x + 1][y + 2][z] + (-8.33333333e-2F/h_x)*r0[x + 4][y + 2][z] + (8.33333333e-2F/h_x)*r0[x][y + 2][z] + (6.66666667e-1F/h_x)*r0[x + 3][y + 2][z] + (-6.66666667e-1F/h_y)*r1[x + 2][y + 1][z] + (-8.33333333e-2F/h_y)*r1[x + 2][y + 4][z] + (8.33333333e-2F/h_y)*r1[x + 2][y][z] + (6.66666667e-1F/h_y)*r1[x + 2][y + 3][z])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "eq = Eq(u.forward, f**2*sin(f)*(u.dy.dy + u.dx.dx))\n", "op7_b0_omp = Operator(eq, opt=('cire-sops', {'openmp': True}))\n", "print_kernel(op7_b0_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, there are now two temporaries, one stemming from `u.dy.dy` and the other from `u.dx.dx`. A key difference with respect to `op7_omp` here is that both are grid-size temporaries. This might seem odd at first -- why should the `u.dy.dy` temporary, that is `r1`, now be a three-dimensional temporary when we know already, from what seen in the previous cell, it could be a two-dimensional temporary? This is merely a compiler heuristic: by adding an extra dimension to `r1`, both temporaries can be scheduled within the same loop nest, thus augmenting data reuse and potentially enabling further cross-expression optimizations. We can disable this heuristic through the `min-storage` option." 
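, "\n", "\n", "Like any other option, `min-storage` is passed in the options dictionary, and it composes with the full `advanced` mode too, e.g.\n", "\n", "```\n", "Operator(eq, opt=('advanced', {'min-storage': True}))\n", "```"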
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x = x_m - 2; x <= x_M + 2; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x + 2][y][z] = 8.33333333e-2F*u[t0][x + 2][y + 4][z + 4]/h_x - 6.66666667e-1F*u[t0][x + 3][y + 4][z + 4]/h_x + 6.66666667e-1F*u[t0][x + 5][y + 4][z + 4]/h_x - 8.33333333e-2F*u[t0][x + 6][y + 4][z + 4]/h_x;\n", " }\n", " }\n", " }\n", " }\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r1)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr1[tid];\n", "\n", " #pragma omp for schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r1[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y;\n", " }\n", " }\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_x)*r0[x + 1][y][z] + (-8.33333333e-2F/h_x)*r0[x + 4][y][z] + (8.33333333e-2F/h_x)*r0[x][y][z] + (6.66666667e-1F/h_x)*r0[x + 3][y][z] + (-6.66666667e-1F/h_y)*r1[y + 1][z] + (-8.33333333e-2F/h_y)*r1[y + 4][z] + (8.33333333e-2F/h_y)*r1[y][z] + (6.66666667e-1F/h_y)*r1[y + 3][z])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op7_b1_omp = Operator(eq, opt=('cire-sops', 'simd', {'openmp': True, 'min-storage': True}))\n", "print_kernel(op7_b1_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `cire-mingain` option\n", "\n", "So far so good -- we've seen that Devito can capture and schedule cross-iteration redundancies. But what if, for example, we actually do _not_ want certain redundancies to be captured? There are a few reasons we may like that way, for example we're allocating too much extra memory for the tensor temporaries, and we rather prefer to avoid that. For this, the user can tell Devito what the *minimum gain*, or simply `mingain`, which stems from optimizing a CIRE candidate must be in order to trigger the transformation. Such `mingain` is an integer number reflecting the number and type of arithmetic operations in a CIRE candidate. The Devito compiler computes the `gain` of a CIRE candidate, compares it to the `mingain`, and acts as follows:\n", "\n", "* if `gain < mingain`, then the CIRE candidate is discarded;\n", "* if `mingain <= gain <= mingain*3`, then the CIRE candidate is optimized if and only if it doesn't lead to an increase in working set size;\n", "* if `gain > mingain*3`, then the CIRE candidate is optimized.\n", "\n", "The default `mingain` is set to 10, a value that was determined empirically. 
The coefficient `3` in the relationals above was also determined empirically. The user can change the `mingain`'s default value via the `cire-mingain` optimization option.\n", "\n", "To compute the `gain` of a CIRE candidate, Devito applies the following rules:\n", "\n", "* Any integer arithmetic operation (e.g. for array indexing) has a cost of 0.\n", "* A basic arithmetic operation such as `+` and `*` has a cost of 1.\n", "* A `/` has a cost of 5.\n", "* A power with non-integer exponent has a cost of 50.\n", "* A power with non-negative integer exponent `n` has a cost of `n-1` (i.e., the number of `*` it will be converted into).\n", "* A trascendental function (`sin`, `cos`, etc.) has a cost of 100.\n", "\n", "Further, if a CIRE candidate is invariant with respect to one or more dimensions, then the `gain` is proportionately scaled up.\n", "\n", "Let's now take a look at `cire-mingain` in action." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example\n", "op8_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'cire-mingain': 31}))\n", "print_kernel(op8_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe how setting `cire-mingain=31` makes the tensor temporary produced by `op7_omp` disappear. Any `cire-mingain` greater than 31 will have the same effect. Let's see why. The CIRE candidate in question is `u.dy`, which expands to:\n", "\n", "```\n", "8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y\n", "```\n", "\n", "* Three `+` (or `-`) => 3*1 = 3\n", "* Four `/` by `h_y` => 4*5 = 20\n", "* Four two-way `*` with operands `(1/h_y, u)` => 8\n", "\n", "So we have a cost of 31. Since the CIRE candidate stems from `.dy`, and since `u`'s space order is 4, the CIRE candidates appears 4 times in total, with shifted indices along `y`. 
It follows that the gain due to optimizing it away through a tensor temporary would be 31\\*(4 - 1) = 93. However, `mingain*3 = 31*3 = 93`, so `mingain < gain <= mingain*3`, and since there would be an increase in working set size (the new tensor temporary to be allocated), then the CIRE candidate is discarded.\n", "\n", "As shown below, any `cire-mingain` smaller than 31 will lead to capturing the tensor temporary." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid];\n", "\n", " #pragma omp for schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y;\n", " }\n", " }\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*r0[y + 1][z] + (-8.33333333e-2F/h_y)*r0[y + 4][z] + (8.33333333e-2F/h_y)*r0[y][z] + (6.66666667e-1F/h_y)*r0[y + 3][z])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op8_b1_omp = Operator(eq, opt=('cire-sops', 'simd', {'openmp': True, 'cire-mingain': 30}))\n", "print_kernel(op8_b1_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dimension-invariant sub-expressions\n", "\n", "Cross-iteration redundancies may also be searched across dimension-invariant sub-expressions, typically time-invariants." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example\n", "op11_omp = Operator(eq, opt=('lift', {'openmp': True}))\n", "# print_kernel(op11_omp) # Uncomment to see the generated code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `lift` pass triggers CIRE for dimension-invariant sub-expressions. As seen before, this leads to producing one tensor temporary. By setting `cire-mingain` to a larger value, we can avoid a grid-size temporary to be allocated, in exchange for a trascendental function (`sinf(...)`) to be computed at each iteration." 
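, "\n", "\n", "To put the trade-off in perspective: with the 80x80x80 grid used in this notebook, and assuming the default single-precision floats, the grid-size temporary produced by `lift` takes roughly\n", "\n", "```\n", ">>> 80*80*80*4 / 1e6  # grid points times 4 bytes per float, in MB\n", "2.048\n", "```\n", "\n", "of extra storage, which is what a larger `cire-mingain` allows us to trade for the recomputation of `sinf`."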
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y) + (-8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y) + (8.33333333e-2F/h_y)*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y) + (6.66666667e-1F/h_y)*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y))*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op12_omp = Operator(eq, opt=('lift', {'openmp': True, 'cire-mingain': 34}))\n", "print_kernel(op12_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `cire-maxpar` option\n", "\n", "Sometimes it's possible to trade storage for parallelism (i.e., for more parallel dimensions). 
For this, Devito provides the `cire-maxpar` option which is by default set to:\n", "\n", "* False on CPU backends\n", "* True on GPU backends\n", "\n", "Let's see what happens when we switch it on" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y;\n", " }\n", " }\n", " }\n", " }\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(3) schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = ((-6.66666667e-1F/h_y)*r0[x][y + 1][z] + (-8.33333333e-2F/h_y)*r0[x][y + 4][z] + (8.33333333e-2F/h_y)*r0[x][y][z] + (6.66666667e-1F/h_y)*r0[x][y + 3][z])*sinf(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "}\n" ] } ], "source": [ "op13_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'cire-maxpar': True}))\n", "print_kernel(op13_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated code uses a three-dimensional temporary that gets written and subsequently read in two separate `x-y-z` loop nests. Now, both loops can safely be openmp-collapsed, which is vital when running on GPUs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Impact of CIRE in the `advanced` mode\n", "\n", "The `advanced` mode triggers all of the passes we've seen so far... and in fact, many more! Some of them, however, aren't visible in our running example (e.g., all of the MPI-related optimizations). These will be treated in a future notebook.\n", "\n", "Obviously, all of the optimization options (e.g. `cire-mingain`, `blocklevels`) are applicable and composable in any arbitrary way." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#define _POSIX_C_SOURCE 200809L\n", "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n", "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n", "#define MIN(a,b) (((a) < (b)) ? 
(a) : (b))\n", "\n", "#include \"stdlib.h\"\n", "#include \"math.h\"\n", "#include \"sys/time.h\"\n", "#include \"xmmintrin.h\"\n", "#include \"pmmintrin.h\"\n", "#include \"omp.h\"\n", "\n", "struct dataobj\n", "{\n", " void *restrict data;\n", " unsigned long * size;\n", " unsigned long * npsize;\n", " unsigned long * dsize;\n", " int * hsize;\n", " int * hofs;\n", " int * oofs;\n", " void * dmap;\n", "} ;\n", "\n", "struct profiler\n", "{\n", " double section0;\n", " double section1;\n", "} ;\n", "\n", "\n", "int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict u_vec, const float h_y, const int time_M, const int time_m, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, const int z_size, const int x_size, const int y_size, struct profiler * timers)\n", "{\n", " float **restrict pr2_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&pr2_vec),64,nthreads*sizeof(float*));\n", " float *restrict r0_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&r0_vec),64,x_size*y_size*z_size*sizeof(float));\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " posix_memalign((void**)(&(pr2_vec[tid])),64,z_size*(y0_blk0_size + 4)*sizeof(float));\n", " }\n", "\n", " float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;\n", " float **pr2 = (float**) pr2_vec;\n", " float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec;\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;\n", "\n", " /* Flush denormal numbers to zero in hardware */\n", " _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n", " _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n", "\n", " float r1 = 1.0F/h_y;\n", "\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y][z] = sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "\n", " for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", " {\n", " START(section1)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r2)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr2[tid];\n", "\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size)\n", " {\n", " for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size)\n", " {\n", " for (int x = x0_blk0; x <= MIN(x_M, x0_blk0 + x0_blk0_size - 1); x += 1)\n", " {\n", " for (int y = y0_blk0 - 2, ys = 0; y <= MIN(y_M + 2, y0_blk0 + y0_blk0_size + 1); y += 1, ys += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r2[ys][z] = r1*(8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]));\n", " }\n", " }\n", " 
for (int y = y0_blk0, ys = 0; y <= MIN(y_M, y0_blk0 + y0_blk0_size - 1); y += 1, ys += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = r1*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r2[ys][z] - r2[ys + 4][z]) + 6.66666667e-1F*(-r2[ys + 1][z] + r2[ys + 3][z]))*r0[x][y][z];\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " STOP(section1,timers)\n", " }\n", "\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " free(pr2_vec[tid]);\n", " }\n", " free(pr2_vec);\n", " free(r0_vec);\n", "\n", " return 0;\n", "}\n", "\n" ] } ], "source": [ "op14_omp = Operator(eq, openmp=True)\n", "print(op14_omp)\n", "# op14_b0_omp = Operator(eq, opt=('advanced', {'min-storage': True}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A crucial observation here is that CIRE is applied on top of loop blocking -- the `r2` temporary is computed within the same block as `u`, which in turn requires an additional iteration at the block edge along `y` to be performed (the first `y` loop starts at `y0_blk0 - 2`, while the second one at `y0_blk0`). Further, the size of `r2` is now a function of the block shape. Altogether, this implements what in the literature is often referred to as \"overlapped tiling\" (or \"overlapped blocking\"): data reuse across consecutive loop nests is obtained by cross-loop blocking, which in turn requires a certain degree of redundant computation at the block edge. Clearly, there's a tension between the block shape and the amount of redundant computation. For example, a small block shape guarantees a small(er) working set, and thus better data reuse, but also requires more redundant computation." 
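, "\n", "\n", "Since the block shape is a runtime parameter, this tension can be explored without regenerating any code -- either by passing explicit block sizes (the values below are merely illustrative) or by resorting to the autotuner, just like we did earlier on\n", "\n", "```\n", "op14_omp.apply(..., x0_blk0_size=32, y0_blk0_size=32)\n", "op14_omp.apply(..., autotune='aggressive')\n", "```"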
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `cire-rotate` option\n", "\n", "So far we've seen two ways to compute the tensor temporaries:\n", "\n", "* The temporary dimensions span the whole grid;\n", "* The temporary dimensions span a block created by the loop blocking pass.\n", "\n", "There are a few other ways, and in particular there's a third way supported in Devito, enabled through the `cire-rotate` option:\n", "\n", "* The temporary outermost-dimension is a function of the stencil radius; all other temporary dimensions are a function of the loop blocking shape.\n", "\n", "Let's jump straight into an example" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "float r1 = 1.0F/h_y;\n", "\n", "START(section0)\n", "#pragma omp parallel num_threads(nthreads)\n", "{\n", " #pragma omp for collapse(2) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y][z] = sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", "}\n", "STOP(section0,timers)\n", "\n", "for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", "{\n", " START(section1)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r2)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr2[tid];\n", "\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size)\n", " {\n", " for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size)\n", " {\n", " for (int x = x0_blk0; x <= MIN(x_M, x0_blk0 + x0_blk0_size - 1); x += 1)\n", " {\n", " for (int y = y0_blk0, ys = 0, yr0 = (ys)%(5), yr1 = (ys + 3)%(5), yr2 = (ys + 4)%(5), yr3 = (ys + 1)%(5), yii = -2; y <= MIN(y_M, y0_blk0 + y0_blk0_size - 1); y += 1, ys += 1, yr0 = (ys)%(5), yr1 = (ys + 3)%(5), yr2 = (ys + 4)%(5), yr3 = (ys + 1)%(5), yii = 2)\n", " {\n", " for (int yc = yii, yi = (yc + ys + 2)%(5); yc <= 2; yc += 1, yi = (yc + ys + 2)%(5))\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r2[yi][z] = r1*(8.33333333e-2F*(u[t0][x + 4][y + yc + 2][z + 4] - u[t0][x + 4][y + yc + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + yc + 3][z + 4] + u[t0][x + 4][y + yc + 5][z + 4]));\n", " }\n", " }\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = r1*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r2[yr0][z] - r2[yr2][z]) + 6.66666667e-1F*(r2[yr1][z] - r2[yr3][z]))*r0[x][y][z];\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " STOP(section1,timers)\n", "}\n" ] } ], "source": [ "op15_omp = Operator(eq, opt=('advanced', {'openmp': True, 'cire-rotate': True}))\n", "print_kernel(op15_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two key things to notice here:\n", "\n", "* The `r2` temporary is a pointer to a two-dimensional array of size `[2][z_size]`. 
It's obtained via casting of `pr2[tid]`, which in turn is defined as" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "posix_memalign((void**)(&r0_vec),64,x_size*y_size*z_size*sizeof(float));\n" ] } ], "source": [ "print(op15_omp.body.allocs[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Within the `y` loop there are several iteration variables, some of which (`yr0`, `yr1`, ...) employ modulo increment to cyclically produce the indices 0 and 1.\n", "\n", "In essence, with `cire-rotate`, instead of computing an entire slice of `y` values, at each `y` iteration we only keep track of the values that are strictly necessary to evaluate `u` at `y` -- only two values in this case. This results in a working set reduction, at the price of turning one parallel loop (`y`) into a sequential one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `advanced-fsg` mode\n", "\n", "The alternative `advanced-fsg` optimization level applies the same passes as `advanced`, but in a different order. The key difference is that `-fsg` does not generate overlapped blocking code across CIRE-generated loop nests." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#define _POSIX_C_SOURCE 200809L\n", "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n", "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n", "\n", "#include \"stdlib.h\"\n", "#include \"math.h\"\n", "#include \"sys/time.h\"\n", "#include \"xmmintrin.h\"\n", "#include \"pmmintrin.h\"\n", "#include \"omp.h\"\n", "\n", "struct dataobj\n", "{\n", " void *restrict data;\n", " unsigned long * size;\n", " unsigned long * npsize;\n", " unsigned long * dsize;\n", " int * hsize;\n", " int * hofs;\n", " int * oofs;\n", " void * dmap;\n", "} ;\n", "\n", "struct profiler\n", "{\n", " double section0;\n", " double section1;\n", "} ;\n", "\n", "\n", "int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict u_vec, const float h_y, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, const int z_size, const int x_size, const int y_size, struct profiler * timers)\n", "{\n", " float **restrict pr2_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&pr2_vec),64,nthreads*sizeof(float*));\n", " float *restrict r0_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&r0_vec),64,x_size*y_size*z_size*sizeof(float));\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " posix_memalign((void**)(&(pr2_vec[tid])),64,z_size*(y_size + 4)*sizeof(float));\n", " }\n", "\n", " float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;\n", " float **pr2 = (float**) pr2_vec;\n", " float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec;\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;\n", "\n", " /* Flush denormal numbers to zero in hardware */\n", " 
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n", " _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n", "\n", " float r1 = 1.0F/h_y;\n", "\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y][z] = sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "\n", " for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", " {\n", " START(section1)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " float (*restrict r2)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr2[tid];\n", "\n", " #pragma omp for schedule(dynamic,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m - 2; y <= y_M + 2; y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r2[y + 2][z] = r1*(8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]));\n", " }\n", " }\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = r1*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r2[y][z] - r2[y + 4][z]) + 6.66666667e-1F*(-r2[y + 1][z] + r2[y + 3][z]))*r0[x][y][z];\n", " }\n", " }\n", " }\n", " }\n", " STOP(section1,timers)\n", " }\n", "\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " const int tid = omp_get_thread_num();\n", " free(pr2_vec[tid]);\n", " }\n", " free(pr2_vec);\n", " free(r0_vec);\n", "\n", " return 0;\n", "}\n", "\n" ] } ], "source": [ "eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example\n", "op17_omp = Operator(eq, opt=('advanced-fsg', {'openmp': True}))\n", "print(op17_omp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `x` loop here is still shared by the two loop nests, but the `y` one isn't. Analogously, if we consider the alternative `eq` already used in `op7_b0_omp`, we get two completely separate, and therefore individually blocked, loop nests." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#define _POSIX_C_SOURCE 200809L\n", "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n", "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n", "#define MIN(a,b) (((a) < (b)) ? 
(a) : (b))\n", "\n", "#include \"stdlib.h\"\n", "#include \"math.h\"\n", "#include \"sys/time.h\"\n", "#include \"xmmintrin.h\"\n", "#include \"pmmintrin.h\"\n", "#include \"omp.h\"\n", "\n", "struct dataobj\n", "{\n", " void *restrict data;\n", " unsigned long * size;\n", " unsigned long * npsize;\n", " unsigned long * dsize;\n", " int * hsize;\n", " int * hofs;\n", " int * oofs;\n", " void * dmap;\n", "} ;\n", "\n", "struct profiler\n", "{\n", " double section0;\n", " double section1;\n", "} ;\n", "\n", "\n", "int Kernel(struct dataobj *restrict f_vec, struct dataobj *restrict u_vec, const float h_x, const float h_y, const int time_M, const int time_m, const int x0_blk0_size, const int x1_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y1_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, const int x_size, const int y_size, const int z_size, struct profiler * timers)\n", "{\n", " float *restrict r0_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&r0_vec),64,x_size*y_size*z_size*sizeof(float));\n", " float *restrict r3_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&r3_vec),64,z_size*(x_size + 4)*(y_size + 4)*sizeof(float));\n", " float *restrict r4_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&r4_vec),64,z_size*(x_size + 4)*(y_size + 4)*sizeof(float));\n", "\n", " float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;\n", " float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec;\n", " float (*restrict r3)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r3_vec;\n", " float (*restrict r4)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r4_vec;\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;\n", "\n", " /* Flush denormal numbers to zero in hardware */\n", " _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n", " _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n", "\n", " float r1 = 1.0F/h_x;\n", " float r2 = 1.0F/h_y;\n", "\n", " START(section0)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(static,1)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " #pragma omp simd aligned(f:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r0[x][y][z] = sinf(f[x + 1][y + 1][z + 1]);\n", " }\n", " }\n", " }\n", " }\n", " STOP(section0,timers)\n", "\n", " for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", " {\n", " START(section1)\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x0_blk0 = x_m - 2; x0_blk0 <= x_M + 2; x0_blk0 += x0_blk0_size)\n", " {\n", " for (int y0_blk0 = y_m - 2; y0_blk0 <= y_M + 2; y0_blk0 += y0_blk0_size)\n", " {\n", " for (int x = x0_blk0; x <= MIN(x_M + 2, x0_blk0 + x0_blk0_size - 1); x += 1)\n", " {\n", " for (int y = y0_blk0; y <= MIN(y_M + 2, y0_blk0 + y0_blk0_size - 1); y += 1)\n", " {\n", " #pragma omp simd aligned(u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " r3[x + 2][y + 2][z] = r1*(8.33333333e-2F*(u[t0][x + 2][y + 4][z + 4] - 
u[t0][x + 6][y + 4][z + 4]) + 6.66666667e-1F*(-u[t0][x + 3][y + 4][z + 4] + u[t0][x + 5][y + 4][z + 4]));\n", " r4[x + 2][y + 2][z] = r2*(8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]));\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " #pragma omp parallel num_threads(nthreads)\n", " {\n", " #pragma omp for collapse(2) schedule(dynamic,1)\n", " for (int x1_blk0 = x_m; x1_blk0 <= x_M; x1_blk0 += x1_blk0_size)\n", " {\n", " for (int y1_blk0 = y_m; y1_blk0 <= y_M; y1_blk0 += y1_blk0_size)\n", " {\n", " for (int x = x1_blk0; x <= MIN(x_M, x1_blk0 + x1_blk0_size - 1); x += 1)\n", " {\n", " for (int y = y1_blk0; y <= MIN(y_M, y1_blk0 + y1_blk0_size - 1); y += 1)\n", " {\n", " #pragma omp simd aligned(f,u:32)\n", " for (int z = z_m; z <= z_M; z += 1)\n", " {\n", " u[t1][x + 4][y + 4][z + 4] = (f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(r1*(8.33333333e-2F*(r3[x][y + 2][z] - r3[x + 4][y + 2][z]) + 6.66666667e-1F*(-r3[x + 1][y + 2][z] + r3[x + 3][y + 2][z])) + r2*(8.33333333e-2F*(r4[x + 2][y][z] - r4[x + 2][y + 4][z]) + 6.66666667e-1F*(-r4[x + 2][y + 1][z] + r4[x + 2][y + 3][z])))*r0[x][y][z];\n", " }\n", " }\n", " }\n", " }\n", " }\n", " }\n", " STOP(section1,timers)\n", " }\n", "\n", " free(r0_vec);\n", " free(r3_vec);\n", " free(r4_vec);\n", "\n", " return 0;\n", "}\n", "\n" ] } ], "source": [ "eq = Eq(u.forward, f**2*sin(f)*(u.dy.dy + u.dx.dx))\n", "op17_b0 = Operator(eq, opt=('advanced-fsg', {'openmp': True}))\n", "print(op17_b0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "\n", "[1] Fabio Luporini, Mathias Louboutin, Michael Lange, Navjot Kukreja, Philipp Witte, Jan Hückelheim, Charles Yount, Paul H. J. Kelly, Felix J. Herrmann, and Gerard J. Gorman. 2020. Architecture and Performance of Devito, a System for Automated Stencil Computation. ACM Trans. Math. Softw. 46, 1, Article 6 (April 2020), 28 pages. DOI:https://doi.org/10.1145/3374916" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "position": { "height": "459.85px", "left": "1723.22px", "right": "20px", "top": "120px", "width": "352.767px" }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }