{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "This notebook contains examples which are expected *to be run with exactly 4 MPI processes*; not because they wouldn't work otherwise, but simply because it's what their description assumes. For this, you need to:\n",
    "\n",
    "* Install an MPI distribution on your system, such as OpenMPI, MPICH, or Intel MPI (if not already available).\n",
    "* Install some optional dependencies, including `mpi4py` and `ipyparallel`; from the root Devito directory, run\n",
    "```bash\n",
    "pip install -r requirements-mpi.txt\n",
    "```\n",
    "* Create an `ipyparallel` MPI profile, by running our simple setup script. From the root directory, run\n",
    "```bash\n",
    "./scripts/create_ipyparallel_mpi_profile.sh\n",
    "```\n",
    "\n",
    "## Launch and connect to an ipyparallel cluster\n",
    "\n",
    "We're finally ready to launch an ipyparallel cluster. Open a new terminal and run the following command\n",
    "```bash\n",
    "ipcluster start -n 4 --profile=mpi --engines=mpi\n",
    "```\n",
    "\n",
    "Wait until the logs report that engines have started successfully:\n",
    "```bash\n",
    "2022-05-20 11:57:31.730 [IPClusterStart] Starting ipcluster with [daemonize=False]\n",
    "2022-05-20 11:57:32.754 [IPClusterStart] Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>\n",
    "2022-05-20 11:58:02.785 [IPClusterStart] Engines appear to have started successfully\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the engines have started successfully, we can connect to the cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ipyparallel as ipp\n",
    "c = ipp.Client(profile='mpi')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, to run commands in parallel over the engines, we will use the %px line magic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "Hi, I'm rank 0.\n",
      "[stdout:1] \n",
      "Hi, I'm rank 1.\n",
      "[stdout:2] \n",
      "Hi, I'm rank 2.\n",
      "[stdout:3] \n",
      "Hi, I'm rank 3.\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "\n",
    "from mpi4py import MPI\n",
    "print(f\"Hi, I'm rank %d.\" % MPI.COMM_WORLD.rank)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview of MPI in Devito\n",
    "\n",
    "Distributed-memory parallelism via MPI is designed so that users can \"think sequentially\" for as much as possible. The few things requested to the user are:\n",
    "\n",
    "* Like any other MPI program, run with `mpirun -np X python ...`\n",
    "* Some pre- and/or post-processing may be rank-specific (e.g., we may want to plot on a given MPI rank only, even though this might be hidden away in the next Devito releases, when newer support APIs will be provided.\n",
    "* Parallel I/O (if and when necessary) to populate the MPI-distributed datasets in input to a Devito Operator. If a shared file system is available, there are a few simple alternatives to pick from, such as NumPy’s memory-mapped arrays.\n",
    "\n",
    "To enable MPI, users have two options. Either export the environment variable `DEVITO_MPI=1` or, programmatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "from devito import configuration\n",
    "configuration['mpi'] = True\n",
    "\n",
    "# Feel free to change the log level, and see more detailed logging\n",
    "configuration['log-level'] = 'INFO'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "# Keep generated code as simple as possible\n",
    "configuration['language'] = 'C'\n",
    "# Fix platform so that this notebook can have asserted output\n",
    "# when tested by ``py.test --nbval\" in any platform\n",
    "configuration['platform'] = 'knl7210'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An `Operator` will then generate MPI code, including sends/receives for halo exchanges. Below, we introduce a running example through which we explain how domain decomposition as well as data access (read/write) and distribution work. Performance optimizations are discussed [in a later section](#Performance-optimizations).\n",
    "\n",
    "Let's start by creating a `TimeFunction`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "from devito import Grid, TimeFunction, Eq, Operator \n",
    "grid = Grid(shape=(4, 4))\n",
    "u = TimeFunction(name=\"u\", grid=grid, space_order=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Domain decomposition is performed when creating a `Grid`. Users may supply their own domain decomposition, but this is not shown in this notebook. Devito exploits the MPI Cartesian topology abstraction to logically split the `Grid` over the available MPI processes. Since `u` is defined over a decomposed `Grid`, its data get distributed too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[0. 0.]\n",
      " [0. 0.]]\n",
      "[stdout:1] \n",
      "[[0. 0.]\n",
      " [0. 0.]]\n",
      "[stdout:2] \n",
      "[[0. 0.]\n",
      " [0. 0.]]\n",
      "[stdout:3] \n",
      "[[0. 0.]\n",
      " [0. 0.]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "print(u.data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Globally, `u` consists of 2 time-buffer slices of 4x4 points -- this is what users \"see\". But locally, as shown above, each rank has got a 2x2 subdomain per time slice. The key point is: **for the user, the fact that `u.data` is distributed is completely abstracted away -- the perception is that of indexing into a classic NumPy array, regardless of whether MPI is enabled or not**. All sort of NumPy indexing schemes (basic, slicing, etc.) are supported. For example, we can write into a slice-generated view of our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "u.data[0, 1:-1, 1:-1] = 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[0. 0.]\n",
      " [0. 1.]]\n",
      "[stdout:1] \n",
      "[[0. 0.]\n",
      " [1. 0.]]\n",
      "[stdout:2] \n",
      "[[0. 1.]\n",
      " [0. 0.]]\n",
      "[stdout:3] \n",
      "[[1. 0.]\n",
      " [0. 0.]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "print(u.data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only limitation, currently, is that a data access cannot require a direct data exchange among two or more processes (e.g., the assignment `u.data[0, 0, 0] = u.data[0, 3, 3]` will raise an exception unless both entries belong to the same MPI rank).\n",
    "\n",
    "We can finally write out a trivial `Operator` to try running something."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[stderr:0] \n",
      "Operator `Kernel` ran in 0.01 s\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "#NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "op = Operator(Eq(u.forward, u + 1))\n",
    "summary = op.apply(time_M=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can now check again the (distributed) content of our `u.data`, at u.data[1] since the computation ran for one timestep (writing to the next time-buffer slice)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[1. 1.]\n",
      " [1. 2.]]\n",
      "[stdout:1] \n",
      "[[1. 1.]\n",
      " [2. 1.]]\n",
      "[stdout:2] \n",
      "[[1. 2.]\n",
      " [1. 1.]]\n",
      "[stdout:3] \n",
      "[[2. 1.]\n",
      " [1. 1.]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "print(u.data[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Everything as expected. We could also peek at the generated code, because we may be curious to see what sort of MPI calls Devito has generated..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[stdout:0] /* Devito generated code for Operator `Kernel` */\n",
       "\n",
       "#define _POSIX_C_SOURCE 200809L\n",
       "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n",
       "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n",
       "\n",
       "#include \"stdlib.h\"\n",
       "#include \"math.h\"\n",
       "#include \"sys/time.h\"\n",
       "#include \"xmmintrin.h\"\n",
       "#include \"pmmintrin.h\"\n",
       "#include \"mpi.h\"\n",
       "\n",
       "struct dataobj\n",
       "{\n",
       "  void *restrict data;\n",
       "  int * size;\n",
       "  unsigned long nbytes;\n",
       "  unsigned long * npsize;\n",
       "  unsigned long * dsize;\n",
       "  int * hsize;\n",
       "  int * hofs;\n",
       "  int * oofs;\n",
       "  void * dmap;\n",
       "} ;\n",
       "\n",
       "struct profiler\n",
       "{\n",
       "  double section0;\n",
       "} ;\n",
       "\n",
       "\n",
       "int Kernel(struct dataobj *restrict u_vec, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, struct profiler * timers)\n",
       "{\n",
       "  float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n",
       "\n",
       "  /* Flush denormal numbers to zero in hardware */\n",
       "  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n",
       "  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n",
       "\n",
       "  for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n",
       "  {\n",
       "    START(section0)\n",
       "    for (int x = x_m; x <= x_M; x += 1)\n",
       "    {\n",
       "      #pragma omp simd aligned(u:64)\n",
       "      for (int y = y_m; y <= y_M; y += 1)\n",
       "      {\n",
       "        u[t1][x + 2][y + 2] = u[t0][x + 2][y + 2] + 1;\n",
       "      }\n",
       "    }\n",
       "    STOP(section0,timers)\n",
       "  }\n",
       "\n",
       "  return 0;\n",
       "}\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%px --targets 0\n",
    "print(op)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hang on. There's nothing MPI-specific here! At least apart from the header file `#include \"mpi.h\"`. What's going on? Well, it's simple. Devito was smart enough to realize that this trivial `Operator` doesn't even need any sort of halo exchange -- the `Eq` implements a pure \"map computation\" (i.e., fully parallel), so it can just let each MPI process do its job without ever synchronizing with halo exchanges. We might want try again with a proper stencil `Eq`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "op = Operator(Eq(u.forward, u.dx + 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[stdout:0] /* Devito generated code for Operator `Kernel` */\n",
       "\n",
       "#define _POSIX_C_SOURCE 200809L\n",
       "#define START(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);\n",
       "#define STOP(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;\n",
       "\n",
       "#include \"stdlib.h\"\n",
       "#include \"math.h\"\n",
       "#include \"sys/time.h\"\n",
       "#include \"xmmintrin.h\"\n",
       "#include \"pmmintrin.h\"\n",
       "#include \"mpi.h\"\n",
       "\n",
       "struct dataobj\n",
       "{\n",
       "  void *restrict data;\n",
       "  int * size;\n",
       "  unsigned long nbytes;\n",
       "  unsigned long * npsize;\n",
       "  unsigned long * dsize;\n",
       "  int * hsize;\n",
       "  int * hofs;\n",
       "  int * oofs;\n",
       "  void * dmap;\n",
       "} ;\n",
       "\n",
       "struct neighborhood\n",
       "{\n",
       "  int ll, lc, lr;\n",
       "  int cl, cc, cr;\n",
       "  int rl, rc, rr;\n",
       "} ;\n",
       "\n",
       "struct profiler\n",
       "{\n",
       "  double section0;\n",
       "} ;\n",
       "\n",
       "static void sendrecv0(struct dataobj *restrict u_vec, const int x_size, const int y_size, int ogtime, int ogx, int ogy, int ostime, int osx, int osy, int fromrank, int torank, MPI_Comm comm);\n",
       "static void haloupdate0(struct dataobj *restrict u_vec, MPI_Comm comm, struct neighborhood * nb, int otime);\n",
       "static void gather0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy);\n",
       "static void scatter0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy);\n",
       "\n",
       "int Kernel(struct dataobj *restrict u_vec, const float h_x, const int time_M, const int time_m, const int x_M, const int x_m, const int y_M, const int y_m, MPI_Comm comm, struct neighborhood * nb, struct profiler * timers)\n",
       "{\n",
       "  float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n",
       "\n",
       "  /* Flush denormal numbers to zero in hardware */\n",
       "  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n",
       "  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n",
       "\n",
       "  float r0 = 1.0F/h_x;\n",
       "\n",
       "  for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))\n",
       "  {\n",
       "    START(section0)\n",
       "    haloupdate0(u_vec,comm,nb,t0);\n",
       "    for (int x = x_m; x <= x_M; x += 1)\n",
       "    {\n",
       "      #pragma omp simd aligned(u:64)\n",
       "      for (int y = y_m; y <= y_M; y += 1)\n",
       "      {\n",
       "        u[t1][x + 2][y + 2] = -r0*u[t0][x + 2][y + 2] + r0*u[t0][x + 3][y + 2] + 1;\n",
       "      }\n",
       "    }\n",
       "    STOP(section0,timers)\n",
       "  }\n",
       "\n",
       "  return 0;\n",
       "}\n",
       "\n",
       "static void sendrecv0(struct dataobj *restrict u_vec, const int x_size, const int y_size, int ogtime, int ogx, int ogy, int ostime, int osx, int osy, int fromrank, int torank, MPI_Comm comm)\n",
       "{\n",
       "  MPI_Request rrecv;\n",
       "  MPI_Request rsend;\n",
       "\n",
       "  float *restrict bufg_vec __attribute__ ((aligned (64)));\n",
       "  posix_memalign((void**)(&bufg_vec),64,sizeof(float)*(long)y_size*(long)x_size);\n",
       "  float *restrict bufs_vec __attribute__ ((aligned (64)));\n",
       "  posix_memalign((void**)(&bufs_vec),64,sizeof(float)*(long)y_size*(long)x_size);\n",
       "\n",
       "  MPI_Irecv(bufs_vec,x_size*y_size,MPI_FLOAT,fromrank,13,comm,&rrecv);\n",
       "  if (torank != MPI_PROC_NULL)\n",
       "  {\n",
       "    gather0(bufg_vec,x_size,y_size,u_vec,ogtime,ogx,ogy);\n",
       "  }\n",
       "  MPI_Isend(bufg_vec,x_size*y_size,MPI_FLOAT,torank,13,comm,&rsend);\n",
       "  MPI_Wait(&rsend,MPI_STATUS_IGNORE);\n",
       "  MPI_Wait(&rrecv,MPI_STATUS_IGNORE);\n",
       "  if (fromrank != MPI_PROC_NULL)\n",
       "  {\n",
       "    scatter0(bufs_vec,x_size,y_size,u_vec,ostime,osx,osy);\n",
       "  }\n",
       "\n",
       "  free(bufg_vec);\n",
       "  free(bufs_vec);\n",
       "}\n",
       "\n",
       "static void haloupdate0(struct dataobj *restrict u_vec, MPI_Comm comm, struct neighborhood * nb, int otime)\n",
       "{\n",
       "  sendrecv0(u_vec,u_vec->hsize[3],u_vec->npsize[2],otime,u_vec->oofs[2],u_vec->hofs[4],otime,u_vec->hofs[3],u_vec->hofs[4],nb->rc,nb->lc,comm);\n",
       "}\n",
       "\n",
       "static void gather0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy)\n",
       "{\n",
       "  float (*restrict buf)[bx_size][by_size] __attribute__ ((aligned (64))) = (float (*)[bx_size][by_size]) buf_vec;\n",
       "  float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n",
       "\n",
       "  const int x_m = 0;\n",
       "  const int y_m = 0;\n",
       "  const int x_M = bx_size - 1;\n",
       "  const int y_M = by_size - 1;\n",
       "\n",
       "  for (int x = x_m; x <= x_M; x += 1)\n",
       "  {\n",
       "    #pragma omp simd aligned(u:64)\n",
       "    for (int y = y_m; y <= y_M; y += 1)\n",
       "    {\n",
       "      buf[0][x][y] = u[otime][x + ox][y + oy];\n",
       "    }\n",
       "  }\n",
       "}\n",
       "\n",
       "static void scatter0(float *restrict buf_vec, int bx_size, int by_size, struct dataobj *restrict u_vec, const int otime, const int ox, const int oy)\n",
       "{\n",
       "  float (*restrict buf)[bx_size][by_size] __attribute__ ((aligned (64))) = (float (*)[bx_size][by_size]) buf_vec;\n",
       "  float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;\n",
       "\n",
       "  const int x_m = 0;\n",
       "  const int y_m = 0;\n",
       "  const int x_M = bx_size - 1;\n",
       "  const int y_M = by_size - 1;\n",
       "\n",
       "  for (int x = x_m; x <= x_M; x += 1)\n",
       "  {\n",
       "    #pragma omp simd aligned(u:64)\n",
       "    for (int y = y_m; y <= y_M; y += 1)\n",
       "    {\n",
       "      u[otime][x + ox][y + oy] = buf[0][x][y];\n",
       "    }\n",
       "  }\n",
       "}\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%px --targets 0\n",
    "print(op)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uh-oh -- now the generated code looks more complicated than before, though it still is pretty much human-readable. We can spot the following routines:\n",
    "\n",
    "* `haloupdate0` performs a blocking halo exchange, relying on three additional functions, `gather0`, `sendrecv0`, and `scatter0`;\n",
    "* `gather0` copies the (generally non-contiguous) boundary data into a contiguous buffer;\n",
    "* `sendrecv0` takes the buffered data and sends it to one or more neighboring processes; then it waits until all data from the neighboring processes is received;\n",
    "* `scatter0` copies the received data into the proper array locations.\n",
    "\n",
    "This is the simplest halo exchange scheme available in Devito. There are a few, and some of them apply aggressive optimizations, [as shown later on](#Performance-optimizations)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before looking at other scenarios and performance optimizations, there is one last thing it is worth discussing -- the `data_with_halo` view."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 1.]]\n",
      "[stdout:1] \n",
      "[[0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [1. 0. 0. 0.]]\n",
      "[stdout:2] \n",
      "[[0. 0. 0. 1.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]]\n",
      "[stdout:3] \n",
      "[[1. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]\n",
      " [0. 0. 0. 0.]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "print(u.data_with_halo[0])\n",
    "# Uncomment to see halo for the next (computed) timestep\n",
    "# u.data_with_halo[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is again a global data view. The shown *with_halo* is the \"true\" halo surrounding the physical domain (often referred to as the \"outer halo\"), **not** the halo used for the MPI halo exchanges (often referred to as \"ghost region\" or \"inner halo\"). A user can straightforwardly initialize the \"true\" halo region (which is typically read by a stencil `Eq` when an `Operator` iterates in proximity of the domain boundary).\n",
    "\n",
    "Note: This \"halo\" is often encountered as \"ghost cell area\" in literature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "u.data_with_halo[:] = 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]\n",
      "\n",
      " [[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]]\n",
      "[stdout:1] \n",
      "[[[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]\n",
      "\n",
      " [[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]]\n",
      "[stdout:2] \n",
      "[[[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]\n",
      "\n",
      " [[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]]\n",
      "[stdout:3] \n",
      "[[[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]\n",
      "\n",
      " [[1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]\n",
      "  [1. 1. 1. 1.]]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "# Note: Both time buffer slices are now printed\n",
    "print(u.data_with_halo[:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MPI and SparseFunction\n",
    "\n",
    "A `SparseFunction` represents a sparse set of points which are generically unaligned with the `Grid`. A sparse point could be anywhere within a grid, and is therefore attached some coordinates. Given a sparse point, Devito looks at its coordinates and, based on the domain decomposition, **logically** assigns it to a given MPI process; this is purely logical ownership, as in Python-land, before running an Operator, the sparse point physically lives on the MPI rank which created it. Within `op.apply`, right before jumping to C-land, the sparse points are scattered to their logical owners; upon returning to Python-land, the sparse points are gathered back to their original location.\n",
    "\n",
    "In the following example, we attempt injection of four sparse points into the neighboring grid points via linear interpolation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "from devito import Function, SparseFunction\n",
    "grid = Grid(shape=(4, 4), extent=(3.0, 3.0))\n",
    "x, y = grid.dimensions\n",
    "f = Function(name='f', grid=grid)\n",
    "coords = [(0.5, 0.5), (1.5, 2.5), (1.5, 1.5), (2.5, 1.5)]\n",
    "sf = SparseFunction(name='sf', grid=grid, npoint=len(coords), coordinates=coords)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let:\n",
    "* O be a grid point\n",
    "* x be a halo point\n",
    "* A, B, C, D be the sparse points\n",
    "\n",
    "We show the global view, that is what the user \"sees\".\n",
    "\n",
    "```\n",
    "O --- O --- O --- O\n",
    "|  A  |     |     |\n",
    "O --- O --- O --- O\n",
    "|     |  C  |  B  |\n",
    "O --- O --- O --- O\n",
    "|     |  D  |     |\n",
    "O --- O --- O --- O\n",
    "```\n",
    "\n",
    "And now the local view, that is what the MPI ranks own when jumping to C-land.        \n",
    "\n",
    "```\n",
    "Rank 0          Rank 1\n",
    "O --- O --- x   x --- O --- O\n",
    "|  A  |     |   |     |     |\n",
    "O --- O --- x   x --- O --- O\n",
    "|     |  C  |   |  C  |  B  |\n",
    "x --- x --- x   x --- x --- x\n",
    "\n",
    "Rank 2           Rank 3\n",
    "x --- x --- x   x --- x --- x\n",
    "|     |  C  |   |  C  |  B  |\n",
    "O --- O --- x   x --- O --- O\n",
    "|     |  D  |   |  D  |     |\n",
    "O --- O --- x   x --- O --- O\n",
    "```\n",
    "\n",
    "We observe that the sparse points along the boundary of two or more MPI ranks are _duplicated_ and thus redundantly computed over multiple processes. However, the contributions from these points to the neighboring halo points are naturally ditched, so the final result of the interpolation is as expected. Let's convince ourselves that this is the case. We assign a value of $5$ to each sparse point. Since we are using linear interpolation and all points are placed at the exact center of a grid quadrant, we expect that the contribution of each sparse point to a neighboring grid point will be $5 * 0.25 = 1.25$. Based on the global view above, we eventually expect `f` to look like as follows:\n",
    "\n",
    "```\n",
    "1.25 --- 1.25 --- 0.00 --- 0.00\n",
    "|         |        |        |\n",
    "1.25 --- 2.50 --- 2.50 --- 1.25\n",
    "|         |        |        |\n",
    "0.00 --- 2.50 --- 3.75 --- 1.25\n",
    "|         |        |        |\n",
    "0.00 --- 1.25 --- 1.25 --- 0.00\n",
    "```\n",
    "\n",
    "Let's check this out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[stderr:0] \n",
      "Operator `Kernel` ran in 0.01 s\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "#NBVAL_IGNORE_OUTPUT\n",
    "\n",
    "sf.data[:] = 5.\n",
    "op = Operator(sf.inject(field=f, expr=sf))\n",
    "summary = op.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[stdout:0] \n",
      "[[1.25 1.25]\n",
      " [1.25 2.5 ]]\n",
      "[stdout:1] \n",
      "[[0.   0.  ]\n",
      " [2.5  1.25]]\n",
      "[stdout:2] \n",
      "[[0.   2.5 ]\n",
      " [0.   1.25]]\n",
      "[stdout:3] \n",
      "[[3.75 1.25]\n",
      " [1.25 0.  ]]\n"
     ]
    }
   ],
   "source": [
    "%%px --no-stream --group-outputs=engine\n",
    "print(f.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performance optimizations\n",
    "\n",
    "The Devito compiler applies several optimizations before generating code.\n",
    "\n",
    "* Redundant halo exchanges are identified and removed. A halo exchange is redundant if a prior halo exchange carries out the same `Function` update and the data is not “dirty” yet.\n",
    "* Halo exchange communications that could be fired together are preferred over being scattered all over the code.\n",
    "* Halo exchanges could also be reshuffled to maximize the extension of the computation/communication overlap region.\n",
    "\n",
    "## Computation/communication patterns\n",
    "\n",
    "![mpi-modes](https://gist.githubusercontent.com/georgebisbas/aa0e6a2f658728f1bb360f328ee6984a/raw/8c625fb2216dc6f67035856e63985516bbdeb340/mpi-modes.drawio.svg)\n",
    "\n",
    "Additionally, the Devito compiler offers a few modes of different computation and communication strategies, each exhibiting superiority under specific conditions for a kernel, such as operational intensity, memory footprint, the number of utilized ranks, and the characteristics of the cluster’s interconnect. Some of the best patterns are namely `basic`, `diagonal`, and `full`. Those have proven to be effective in improving the efficiency and scalability of computations, under several scnarios.\n",
    "\n",
    "- `basic`: The basic pattern is the simplest among the methods presented in this section and targets CPUs and GPUs. This mode, illustrated in Figure 5a, involves blocking point-to-point (P2P) data exchanges perpendicular to the 2D and 3D planes of the Cartesian topology between MPI ranks. For\n",
    "example, each rank issues 4 in 2D and 6 communications in 3D. While this mode benefits from fewer communications, it may encounter synchronization bottlenecks during grid updates before computing the next timestep. This method allocates the memory needed to exchange halos in C-land before every communication, only adding negligible overhead.\n",
    "\n",
    "- `diag2`: Compared to the `basic`, this pattern also performs diagonal data exchanges, facilitating the communication of the corner points in our domains in a single step. This results in more  communications, with 8 in 2D and 26 in 3D. Although it involves more communications, they are issued\n",
    "in a single step, and the messages are smaller compared to basic. Compared to basic, this mode slightly benefits from preallocated buffers in python-land, eliminating the need to allocate data in C-land before every communication. The latter is why this version is not supported on GPUs since the\n",
    "mechanism of pre-allocating buffers on device memory still needs to be supported.\n",
    "\n",
    "- `full`: This pattern leverages communication/computation overlap. The local-per-rank domain is logically decomposed into an inner (CORE) and an outer (OWNED/remainder) area. In a 3D example, the remainder areas take the form of faces and vector-like areas along the decomposed dimensions. The number of communications is the same as in the diagonal mode. This mode benefits from overlapping\n",
    "two steps: halo updating and the stencil computations in the CORE area. After this step, stencil updates are computed in the ``remainder” areas. In the ideal case, assuming that communication is perfectly hidden, the execution time should converge to the time needed to compute the CORE plus the time needed to compute the remainder. An important drawback of this mode is the slower GPts/s achieved at the remainder areas. The elements in the remainder are not contiguous; therefore,\n",
    "we have less efficient memory access patterns (strides) along vectorizable dimensions. These areas have lower cache utilization and vectorization efficiency."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see the `diag2` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%px\n",
    "configuration['mpi'] = 'diag2'\n",
    "\n",
    "op = Operator(Eq(u.forward, u.dx + 1))\n",
    "# Uncomment below to show code (it's quite verbose)\n",
    "# print(op)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The body of the time-stepping loop has slightly changed compared to `basic`:\n",
    "\n",
    "Some differences are:\n",
    "\n",
    "* The communication buffers `bufg`, `bufs` are not allocated at C-land, as this already happens in Python-land\n",
    "* We now fire `ncomms` communications which are not only vertical or horizontal, but also diagonal.\n",
    "This leads to more messages, but slightly smaller compared to `basic`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could now peek at the generated code of the `full` mode and see that things now look differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%px\n",
    "configuration['mpi'] = 'full'\n",
    "\n",
    "op = Operator(Eq(u.forward, u.dx + 1))\n",
    "# Uncomment below to show code (it's quite verbose)\n",
    "# print(op)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The body of the time-stepping loop has changed, as it now implements a classic computation/communication overlap scheme:\n",
    "\n",
    "* `haloupdate0` triggers non-blocking communications;\n",
    "* `compute0` executes the core domain region, that is the sub-region which doesn't require reading from halo data to be computed;\n",
    "* `halowait0` wait and terminates the non-blocking communications;\n",
    "* `remainder0`, which internally calls `compute0`, computes the boundary region requiring the now up-to-date halo data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "More information on Devito's MPI, can be found in this pre-print:\n",
    "[Automated MPI-X code generation for scalable finite-difference solvers](https://arxiv.org/abs/2312.13094)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}