{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "This notebook contains examples which are expected *to be run with exactly 4 MPI processes*; not because they wouldn't work otherwise, but simply because it's what their description assumes. For this, you need to:\n", "\n", "* Install an MPI distribution on your system, such as OpenMPI, MPICH, or Intel MPI (if not already available).\n", "* Install some optional dependencies, including `mpi4py` and `ipyparallel`; from the root Devito directory, run\n", "```bash\n", "pip install -r requirements-optional.txt\n", "```\n", "* Create an `ipyparallel` MPI profile, by running our simple setup script. From the root directory, run\n", "```bash\n", "./scripts/create_ipyparallel_mpi_profile.sh\n", "```\n", "\n", "## Launch and connect to an ipyparallel cluster\n", "\n", "We're finally ready to launch an ipyparallel cluster. Open a new terminal and run the following command\n", "```bash\n", "ipcluster start -n 4 --profile=mpi --engines=mpi\n", "```\n", "\n", "Wait until the logs report that engines have started successfully:\n", "```bash\n", "2022-05-20 11:57:31.730 [IPClusterStart] Starting ipcluster with [daemonize=False]\n", "2022-05-20 11:57:32.754 [IPClusterStart] Starting 4 engines with \n", "2022-05-20 11:58:02.785 [IPClusterStart] Engines appear to have started successfully\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the engines have started successfully, we can connect to the cluster" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import ipyparallel as ipp\n", "c = ipp.Client(profile='mpi')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, to run commands in parallel over the engines, we will use the %px line magic." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "Hi, I'm rank 0.\n", "[stdout:1] \n", "Hi, I'm rank 1.\n", "[stdout:2] \n", "Hi, I'm rank 2.\n", "[stdout:3] \n", "Hi, I'm rank 3.\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "\n", "from mpi4py import MPI\n", "print(f\"Hi, I'm rank %d.\" % MPI.COMM_WORLD.rank)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of MPI in Devito\n", "\n", "Distributed-memory parallelism via MPI is designed so that users can \"think sequentially\" for as much as possible. The few things requested to the user are:\n", "\n", "* Like any other MPI program, run with `mpirun -np X python ...`\n", "* Some pre- and/or post-processing may be rank-specific (e.g., we may want to plot on a given MPI rank only, even though this might be hidden away in the next Devito releases, when newer support APIs will be provided.\n", "* Parallel I/O (if and when necessary) to populate the MPI-distributed datasets in input to a Devito Operator. If a shared file system is available, there are a few simple alternatives to pick from, such as NumPy’s memory-mapped arrays.\n", "\n", "To enable MPI, users have two options. 
"\n", "To enable MPI, users have two options. Either export the environment variable `DEVITO_MPI=1` or, programmatically:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%px\n", "from devito import configuration\n", "configuration['mpi'] = True" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%%px\n", "# Keep generated code as simple as possible\n", "configuration['language'] = 'C'\n", "# Fix platform so that this notebook can be tested by py.test --nbval\n", "configuration['platform'] = 'knl7210'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `Operator` will then generate MPI code, including sends/receives for halo exchanges. Below, we introduce a running example through which we explain how domain decomposition as well as data access (read/write) and distribution work. Performance optimizations are discussed [in a later section](#Performance-optimizations).\n", "\n", "Let's start by creating a `TimeFunction`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%%px\n", "from devito import Grid, TimeFunction, Eq, Operator\n", "grid = Grid(shape=(4, 4))\n", "u = TimeFunction(name=\"u\", grid=grid, space_order=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Domain decomposition is performed when creating a `Grid`. Users may supply their own domain decomposition, but this is not shown in this notebook. Devito exploits the MPI Cartesian topology abstraction to logically split the `Grid` over the available MPI processes. Since `u` is defined over a decomposed `Grid`, its data gets distributed too." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[0. 0.]\n", " [0. 0.]]\n", "[stdout:1] \n", "[[0. 0.]\n", " [0. 0.]]\n", "[stdout:2] \n", "[[0. 0.]\n", " [0. 0.]]\n", "[stdout:3] \n", "[[0. 0.]\n", " [0. 0.]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "print(u.data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Globally, `u` consists of 2 time-buffer slices of 4x4 points -- this is what users \"see\". But locally, as shown above, each rank has a 2x2 subdomain per time slice. The key point is: **for the user, the fact that `u.data` is distributed is completely abstracted away -- the perception is that of indexing into a classic NumPy array, regardless of whether MPI is enabled or not**. All sorts of NumPy indexing schemes (basic, slicing, etc.) are supported. For example, we can write into a slice-generated view of our data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "%%px\n", "u.data[0, 1:-1, 1:-1] = 1." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[0. 0.]\n", " [0. 1.]]\n", "[stdout:1] \n", "[[0. 0.]\n", " [1. 0.]]\n", "[stdout:2] \n", "[[0. 1.]\n", " [0. 0.]]\n", "[stdout:3] \n", "[[1. 0.]\n", " [0. 0.]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "print(u.data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only limitation, currently, is that a data access cannot require a direct data exchange among two or more processes (e.g., the assignment `u.data[0, 0, 0] = u.data[0, 3, 3]` will raise an exception unless both entries belong to the same MPI rank).\n", "\n", "We can finally write a trivial `Operator` and try running something." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[stderr:0] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:1] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:2] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:3] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "#NBVAL_IGNORE_OUTPUT\n", "\n", "op = Operator(Eq(u.forward, u + 1))\n", "summary = op.apply(time_M=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now check the (distributed) content of our `u.data` again, this time at `u.data[1]`, since the computation ran for one timestep (writing to the next time-buffer slice)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[1. 1.]\n", " [1. 2.]]\n", "[stdout:1] \n", "[[1. 1.]\n", " [2. 1.]]\n", "[stdout:2] \n", "[[1. 2.]\n", " [1. 1.]]\n", "[stdout:3] \n", "[[2. 1.]\n", " [1. 1.]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "print(u.data[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything is as expected. We can also peek at the generated code, as we may be curious to see what sort of MPI calls Devito has generated..."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[stdout:0] #define _POSIX_C_SOURCE 200809L\n", "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n", "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n", "\n", "#include \"stdlib.h\"\n", "#include \"math.h\"\n", "#include \"sys/time.h\"\n", "#include \"xmmintrin.h\"\n", "#include \"pmmintrin.h\"\n", "#include \"mpi.h\"\n", "\n", "struct dataobj\n", "{\n", " void *restrict data;\n", " unsigned long * size;\n", " unsigned long * npsize;\n", " unsigned long * dsize;\n", " int * hsize;\n", " int * hofs;\n", " int * oofs;\n", " void * dmap;\n", "} ;\n", "\n", "struct profiler\n", "{\n", " double section0;\n", "} ;\n", "\n", "\n", "int Kernel(struct dataobj *restrict u_vec, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, struct profiler * timers)\n", "{\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n", "\n", " /* Flush denormal numbers to zero in hardware */\n", " _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n", " _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n", "\n", " for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", " {\n", " START(section0)\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " #pragma omp simd aligned(u:64)\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " u[t1][x + 2][y + 2] = u[t0][x + 2][y + 2] + 1;\n", " }\n", " }\n", " STOP(section0,timers)\n", " }\n", "\n", " return 0;\n", "}\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%px --targets 0\n", "print(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hang on. There's nothing MPI-specific here! At least apart from the header file `#include \"mpi.h\"`. What's going on? Well, it's simple. Devito was smart enough to realize that this trivial `Operator` doesn't even need any sort of halo exchange -- the `Eq` implements a pure \"map computation\" (i.e., fully parallel), so it can just let each MPI process do its job without ever synchronizing with halo exchanges. We might want try again with a proper stencil `Eq`." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "%%px\n", "op = Operator(Eq(u.forward, u.dx + 1))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[stdout:0] #define _POSIX_C_SOURCE 200809L\n", "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n", "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n", "\n", "#include \"stdlib.h\"\n", "#include \"math.h\"\n", "#include \"sys/time.h\"\n", "#include \"xmmintrin.h\"\n", "#include \"pmmintrin.h\"\n", "#include \"mpi.h\"\n", "\n", "struct dataobj\n", "{\n", " void *restrict data;\n", " unsigned long * size;\n", " unsigned long * npsize;\n", " unsigned long * dsize;\n", " int * hsize;\n", " int * hofs;\n", " int * oofs;\n", " void * dmap;\n", "} ;\n", "\n", "struct neighborhood\n", "{\n", " int ll, lc, lr;\n", " int cl, cc, cr;\n", " int rl, rc, rr;\n", "} ;\n", "\n", "struct profiler\n", "{\n", " double section0;\n", "} ;\n", "\n", "static void gather0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy);\n", "static void scatter0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy);\n", "static void sendrecv0(struct dataobj *restrict u_vec, const int x_size, const int y_size, int ogtime, int ogx, int ogy, int ostime, int osx, int osy, int fromrank, int torank, MPI_Comm comm);\n", "static void haloupdate0(struct dataobj *restrict u_vec, MPI_Comm comm, struct neighborhood * nb, int otime);\n", "\n", "int Kernel(struct dataobj *restrict u_vec, const float h_x, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, MPI_Comm comm, struct neighborhood * nb, struct profiler * timers)\n", "{\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n", "\n", " /* Flush denormal numbers to zero in hardware */\n", " _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n", " _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n", "\n", " float r0 = 1.0F/h_x;\n", "\n", " for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n", " {\n", " START(section0)\n", " haloupdate0(u_vec,comm,nb,t0);\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " #pragma omp simd aligned(u:64)\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " u[t1][x + 2][y + 2] = r0*(-u[t0][x + 2][y + 2]) + r0*u[t0][x + 3][y + 2] + 1;\n", " }\n", " }\n", " STOP(section0,timers)\n", " }\n", "\n", " return 0;\n", "}\n", "\n", "static void gather0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy)\n", "{\n", " float (*restrict buf)[bx_size][by_size] __attribute__ ((aligned (64))) = (float (*)[bx_size][by_size]) buf_vec;\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n", "\n", " const int x_m = 0;\n", " const int y_m = 0;\n", " const int x_M = bx_size - 1;\n", " const int y_M = by_size - 1;\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " #pragma omp simd aligned(u:64)\n", " for (int y = 
y_m; y <= y_M; y += 1)\n", " {\n", " buf[0][x][y] = u[otime][x + ox][y + oy];\n", " }\n", " }\n", "}\n", "\n", "static void scatter0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy)\n", "{\n", " float (*restrict buf)[bx_size][by_size] __attribute__ ((aligned (64))) = (float (*)[bx_size][by_size]) buf_vec;\n", " float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n", "\n", " const int x_m = 0;\n", " const int y_m = 0;\n", " const int x_M = bx_size - 1;\n", " const int y_M = by_size - 1;\n", "\n", " for (int x = x_m; x <= x_M; x += 1)\n", " {\n", " #pragma omp simd aligned(u:64)\n", " for (int y = y_m; y <= y_M; y += 1)\n", " {\n", " u[otime][x + ox][y + oy] = buf[0][x][y];\n", " }\n", " }\n", "}\n", "\n", "static void sendrecv0(struct dataobj *restrict u_vec, const int x_size, const int y_size, int ogtime, int ogx, int ogy, int ostime, int osx, int osy, int fromrank, int torank, MPI_Comm comm)\n", "{\n", " float *restrict bufg_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&bufg_vec),64,x_size*y_size*sizeof(float));\n", " float *restrict bufs_vec __attribute__ ((aligned (64)));\n", " posix_memalign((void**)(&bufs_vec),64,x_size*y_size*sizeof(float));\n", "\n", " MPI_Request rrecv;\n", " MPI_Request rsend;\n", "\n", " MPI_Irecv(bufs_vec,x_size*y_size,MPI_FLOAT,fromrank,13,comm,&(rrecv));\n", " if (torank != MPI_PROC_NULL)\n", " {\n", " gather0(bufg_vec,x_size,y_size,u_vec,ogtime,ogx,ogy);\n", " }\n", " MPI_Isend(bufg_vec,x_size*y_size,MPI_FLOAT,torank,13,comm,&(rsend));\n", " MPI_Wait(&(rsend),MPI_STATUS_IGNORE);\n", " MPI_Wait(&(rrecv),MPI_STATUS_IGNORE);\n", " if (fromrank != MPI_PROC_NULL)\n", " {\n", " scatter0(bufs_vec,x_size,y_size,u_vec,ostime,osx,osy);\n", " }\n", "\n", " free(bufg_vec);\n", " free(bufs_vec);\n", "}\n", "\n", "static void haloupdate0(struct dataobj *restrict u_vec, MPI_Comm comm, struct neighborhood * nb, int otime)\n", "{\n", " sendrecv0(u_vec,u_vec->hsize[3],u_vec->npsize[2],otime,u_vec->oofs[2],u_vec->hofs[4],otime,u_vec->hofs[3],u_vec->hofs[4],nb->rc,nb->lc,comm);\n", "}\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%px --targets 0\n", "print(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uh-oh -- now the generated code looks more complicated than before, though it still is pretty much human-readable. We can spot the following routines:\n", "\n", "* `haloupdate0` performs a blocking halo exchange, relying on three additional functions, `gather0`, `sendrecv0`, and `scatter0`;\n", "* `gather0` copies the (generally non-contiguous) boundary data into a contiguous buffer;\n", "* `sendrecv0` takes the buffered data and sends it to one or more neighboring processes; then it waits until all data from the neighboring processes is received;\n", "* `scatter0` copies the received data into the proper array locations.\n", "\n", "This is the simplest halo exchange scheme available in Devito. There are a few, and some of them apply aggressive optimizations, [as shown later on](#Performance-optimizations)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at other scenarios and performance optimizations, there is one last thing it is worth discussing -- the `data_with_halo` view." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 1.]]\n", "[stdout:1] \n", "[[0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [1. 0. 0. 0.]]\n", "[stdout:2] \n", "[[0. 0. 0. 1.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]]\n", "[stdout:3] \n", "[[1. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]\n", " [0. 0. 0. 0.]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "print(u.data_with_halo[0])\n", "# Uncomment to see halo for the next (computed) timestep\n", "# u.data_with_halo[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is again a global data view. The shown *with_halo* is the \"true\" halo surrounding the physical domain (often referred to as the \"outer halo\"), **not** the halo used for the MPI halo exchanges (often referred to as \"ghost region\" or \"inner halo\"). A user can straightforwardly initialize the \"true\" halo region (which is typically read by a stencil `Eq` when an `Operator` iterates in proximity of the domain bounday).\n", "\n", "Note: This \"halo\" is often encountered as \"ghost cell area\" in literature" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "%%px\n", "u.data_with_halo[:] = 1." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]\n", "\n", " [[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]]\n", "[stdout:1] \n", "[[[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]\n", "\n", " [[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]]\n", "[stdout:2] \n", "[[[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]\n", "\n", " [[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]]\n", "[stdout:3] \n", "[[[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]\n", "\n", " [[1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]\n", " [1. 1. 1. 1.]]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "# Note: Both time buffer slices are now printed\n", "print(u.data_with_halo[:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MPI and SparseFunction\n", "\n", "A `SparseFunction` represents a sparse set of points which are generically unaligned with the `Grid`. A sparse point could be anywhere within a grid, and is therefore attached some coordinates. Given a sparse point, Devito looks at its coordinates and, based on the domain decomposition, **logically** assigns it to a given MPI process; this is purely logical ownership, as in Python-land, before running an Operator, the sparse point physically lives on the MPI rank which created it. Within `op.apply`, right before jumping to C-land, the sparse points are scattered to their logical owners; upon returning to Python-land, the sparse points are gathered back to their original location.\n", "\n", "In the following example, we attempt injection of four sparse points into the neighboring grid points via linear interpolation." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "%%px\n", "from devito import Function, SparseFunction\n", "grid = Grid(shape=(4, 4), extent=(3.0, 3.0))\n", "x, y = grid.dimensions\n", "f = Function(name='f', grid=grid)\n", "coords = [(0.5, 0.5), (1.5, 2.5), (1.5, 1.5), (2.5, 1.5)]\n", "sf = SparseFunction(name='sf', grid=grid, npoint=len(coords), coordinates=coords)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let:\n", "* O be a grid point\n", "* x be a halo point\n", "* A, B, C, D be the sparse points\n", "\n", "We show the global view, that is what the user \"sees\".\n", "\n", "```\n", "O --- O --- O --- O\n", "| A | | |\n", "O --- O --- O --- O\n", "| | C | B |\n", "O --- O --- O --- O\n", "| | D | |\n", "O --- O --- O --- O\n", "```\n", "\n", "And now the local view, that is what the MPI ranks own when jumping to C-land. \n", "\n", "```\n", "Rank 0 Rank 1\n", "O --- O --- x x --- O --- O\n", "| A | | | | |\n", "O --- O --- x x --- O --- O\n", "| | C | | C | B |\n", "x --- x --- x x --- x --- x\n", "\n", "Rank 2 Rank 3\n", "x --- x --- x x --- x --- x\n", "| | C | | C | B |\n", "O --- O --- x x --- O --- O\n", "| | D | | D | |\n", "O --- O --- x x --- O --- O\n", "```\n", "\n", "We observe that the sparse points along the boundary of two or more MPI ranks are _duplicated_ and thus redundantly computed over multiple processes. However, the contributions from these points to the neighboring halo points are naturally ditched, so the final result of the interpolation is as expected. Let's convince ourselves that this is the case. We assign a value of $5$ to each sparse point. Since we are using linear interpolation and all points are placed at the exact center of a grid quadrant, we expect that the contribution of each sparse point to a neighboring grid point will be $5 * 0.25 = 1.25$. Based on the global view above, we eventually expect `f` to look like as follows:\n", "\n", "```\n", "1.25 --- 1.25 --- 0.00 --- 0.00\n", "| | | |\n", "1.25 --- 2.50 --- 2.50 --- 1.25\n", "| | | |\n", "0.00 --- 2.50 --- 3.75 --- 1.25\n", "| | | |\n", "0.00 --- 1.25 --- 1.25 --- 0.00\n", "```\n", "\n", "Let's check this out." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[stderr:0] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:1] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:2] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n", "[stderr:3] \n", "Operator `Kernel` ran in 0.01 s\n", "INFO:Devito:Operator `Kernel` ran in 0.01 s\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "#NBVAL_IGNORE_OUTPUT\n", "\n", "sf.data[:] = 5.\n", "op = Operator(sf.inject(field=f, expr=sf))\n", "summary = op.apply()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[stdout:0] \n", "[[1.25 1.25]\n", " [1.25 2.5 ]]\n", "[stdout:1] \n", "[[0. 0. ]\n", " [2.5 1.25]]\n", "[stdout:2] \n", "[[0. 2.5 ]\n", " [0. 1.25]]\n", "[stdout:3] \n", "[[3.75 1.25]\n", " [1.25 0. 
]]\n" ] } ], "source": [ "%%px --no-stream --group-outputs=engine\n", "print(f.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance optimizations\n", "\n", "The Devito compiler applies several optimizations before generating code.\n", "\n", "* Redundant halo exchanges are identified and removed. A halo exchange is redundant if a prior halo exchange carries out the same `Function` update and the data is not “dirty” yet.\n", "* Computation/communication overlap, with explicit prodding of the asynchronous progress engine to make sure that non-blocking communications execute in background during the compute part.\n", "* Halo exchanges could also be reshuffled to maximize the extension of the computation/communication overlap region.\n", "\n", "To run with all these optimizations enabled, instead of `DEVITO_MPI=1`, users should set `DEVITO_MPI=full`, or, equivalently" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "%%px\n", "configuration['mpi'] = 'full'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now peek at the generated code to see that things now look differently." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%px\n", "op = Operator(Eq(u.forward, u.dx + 1))\n", "# Uncomment below to show code (it's quite verbose)\n", "# print(op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The body of the time-stepping loop has changed, as it now implements a classic computation/communication overlap scheme:\n", "\n", "* `haloupdate0` triggers non-blocking communications;\n", "* `compute0` executes the core domain region, that is the sub-region which doesn't require reading from halo data to be computed;\n", "* `halowait0` wait and terminates the non-blocking communications;\n", "* `remainder0`, which internally calls `compute0`, computes the boundary region requiring the now up-to-date halo data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }