{
 "metadata": {
  "name": "",
  "signature": "sha256:122ca286b322cdc3bbb7f27e17a0e1d35466b24179fffa4131b07bbbee16f059"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# `MPI4Py`:  Message-Passing Interface for Python\n",
      "\n",
      "MPI is the _de facto_ standard for writing parallel programs on distributed-memory clusters.  Until recently, it has only been available directly in C, C++, and Fortran.  The [`MPI4Py` library ](http://mpi4py.scipy.org) now brings the functionality of MPI to Python\u2014and what's more, it does it with the elegance typical of Python, obscuring much of the boilerplate and minutiae required with the reference implementation.\n",
      "\n",
      "_A note on the interface_:  As running `MPI4Py` requires `mpiexec`, we won't execute any parallel code directly in this notebook, instead writing the files to disk and invoking them through the shell.  Otherwise, we can only see one thread (that running the IPython server)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import mpi4py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mpi4py.__dict__"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file hello_world.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "if rank == 0:\n",
      "   data = {'a': 7, 'b': 3.14}\n",
      "   comm.send(data, dest=1, tag=11)\n",
      "elif rank == 1:\n",
      "   data = comm.recv(source=0, tag=11)\n",
      "print rank, data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 2 python hello_world.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "## MPI in 5 Minutes\n",
      "\n",
      "If you haven't worked with MPI previously, let's take a quick look at what it is and what it offers.\n",
      "\n",
      "[MPI (*M*essage *P*assing *I*nterface)](http://www.mcs.anl.gov/research/projects/mpi/) is a library specification describing a parallel programming paradigm suitable for use on distributed-memory machines, such as modern supercomputers and distributed clusters.\n",
      "\n",
      "Basically, you can think of working on a cluster consisting of physically separate nodes (motherboards with CPUs) communicating via a network, each running its own copy of a program and working in tandem with the others.\n",
      "\n",
      "[](./img/MIMD-DM-base.png)\n",
      "\n",
      "Each of these nodes has one or more processors and its own memory array.  Thus we refer to it as a _distributed-memory_ system and incur the overhead of having each process only have a small slice of overall memory which is addressable and accessible to it.  The advantage of this tedious system is that now we can scale to have hundreds or thousands of processors work each on a small section of our problem (as divided by physical domain, Fourier transform, or other criterion).  This is the promise (and the headache!) of parallel supercomputing.\n",
      "\n",
      "### Using MPI in Practice\n",
      "\n",
      "MPI is typically invoked directly via library commands and preprocessor directives in a systems language like C, C++, or Fortran.  Not incidentally, these languages are among the most popular for heavy-duty scientific and numerical programming due to their efficient use of memory and relative ease of use for linear algebraic operations.  A minimal \"Hello World\" program in C is shown below and may be compiled and executed with the subsequent shell commands.  In this context, _rank_ refers to the number assigned to uniquely identify a process."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file hello_world_mpi.c\n",
      "#include <stdio.h>\n",
      "#include <mpi.h>\n",
      "\n",
      "int main(int argc, char *argv[]) {\n",
      "    int     rank_id, ierr, num_ranks;\n",
      "    double  start_time, wall_time;\n",
      "    \n",
      "    printf(\"C MPI minimal working example\\n\");\n",
      "    \n",
      "    ierr = MPI_Init(&argc, &argv);\n",
      "    start_time = MPI_Wtime();\n",
      "    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);\n",
      "    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);\n",
      "    \n",
      "    if (rank_id == 0) {\n",
      "        printf(\"Number of available processors = %d.\\n\", num_ranks);\n",
      "    }\n",
      "    printf(\"\\tProcess number %d branching off.\\n\", rank_id);\n",
      "    \n",
      "    wall_time = MPI_Wtime() - start_time;\n",
      "    \n",
      "    if (rank_id == 0) {\n",
      "        printf(\"Elapsed wallclock time = %8.6fs.\\n\", wall_time);\n",
      "    }\n",
      "    \n",
      "    MPI_Finalize();\n",
      "    return 0;\n",
      "}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpicc -o hello_world_mpi hello_world_mpi.c\n",
      "!./hello_world_mpi"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You see that only one processor is involved, which we neither expect nor desire.  To run things in tandem, we need the MPI wrapper script `mpiexec` to handle setting up separate processes and intercommunications automatically for us."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 8 ./hello_world_mpi"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For more information about MPI and example codes, please check out the [CSE training tutorial on MPI](nbviewer.ipython.org/github/uiuc-cse/hpc-fa14/blob/gh-pages/lessons/mpi-c.ipynb)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "## MPI4Py Basics\n",
      "\n",
      "There are a few competing packages to implement MPI operations in Python, and we will here examine [`MPI4Py`](http://mpi4py.scipy.org/).  The `MPI4Py` module provides a Python interface to access and run Python codes in parallel much as you would do with C or Fortran.  We perform similar setup and communications operations to C in the course of a typical program.  (Obviously you will take the relative efficiency hit of using a scripting language over a compiled language, but the code is often easier to write and debug.)\n",
      "\n",
      "### Hello World"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file hello_world.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "print rank, \"saying hello!\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 16 python hello_world.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This gives us a first glimpse of a persistent problem in parallel programming, particularly memory access and input/output operations:  _race conditions_.  A race condition indicates that our threads are working on a problem in conflict with each other for some resource (such as the `stdout` stream) and that the output of the program is thus somewhat unpredictable.  We can modify the previous program trivially to write all data to a string first and then output the string rather than constructing it at the time of `print`ing."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file hello_world.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "string = \"%d saying hello!\\n\" % rank\n",
      "print string,"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 16 python hello_world.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The order of the strings is still indeterminate, but at least the output is coherent.  That program had no communication\u2014let's look at a slightly more involved snippet of code.\n",
      "\n",
      "### Sending and Receiving Messages"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file sendrecv.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "if rank == 0:\n",
      "   data = {'a': 7, 'b': 3.1415926535879, 'kalamazoo': (5,4,'a')}\n",
      "   comm.send(data, dest=1, tag=11)\n",
      "elif rank == 1:\n",
      "   data = comm.recv(source=0, tag=11)\n",
      "\n",
      "print rank, data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 2 python sendrecv.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compare the relative simplicity of this Python example with little setup and less cleanup with the `MPI_Init`, `MPI_Finalize`, and associated boilerplate in the C code above.  Now take a look at a typical (although without context) pair of send/receive statements in C:\n",
      "\n",
      "    ierr = MPI_Send(&n_circle, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);\n",
      "    ierr = MPI_Recv(&total_circle[i], 1, MPI_INT, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);\n",
      "\n",
      "Each requires a reference-passed source variable, a message size, data type, tag, destination, and communicator.  `MPI_Recv` additionally requires a status variable.  By contrast, Python and MPI4Py together hide much of that noisiness, both by the magic of `**kwargs` and automatic pickling, leading to clean short statements without additional interpretive junk.\n",
      "\n",
      "    MPI.COMM_WORLD.send(data, dest=1)\n",
      "    data = MPI.COMM_WORLD.recv(source=0)\n",
      "\n",
      "### Exercise\n",
      "- Write a short code causing four processors to generate each a unique string such as their rank written out (`'four'`) and exchange it forward (_i.e._, 0 send to 1, 1 to 2, 2 to 3, and 3 to 0."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file exercise1.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "# add your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 4 python exercise1.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "### Application:  Calculating Series Terms\n",
      "\n",
      "Consider the following series statement for a Bessel function of the first kind,\n",
      "$$J_{\\alpha}(x) = \\sum_{m=0}^{\\infty} \\frac{(-1)^{m}}{m! \\, \\Gamma(m+\\alpha+1)} {\\left(\\frac{x}{2}\\right)}^{2m+\\alpha} $$\n",
      "With $\\alpha = 0$ and $m \\in \\mathbb{Z^{*}}$, the $\\Gamma$ function reduces to the factorial and we write\n",
      "$$J_{0}(x) = \\sum_{m=0}^{\\infty} \\frac{(-1)^{m}}{m! (m+1)!} {\\left(\\frac{x}{2}\\right)}^{2m} \\text{.} $$\n",
      "\n",
      "Let's parallelize this series by having each process calculate a piece of the series and then perform a _reduction_, or joint summation, across the data points."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file series.py\n",
      "from mpi4py import MPI\n",
      "from scipy.misc import factorial as fact\n",
      "from scipy.special import j0 # for testing error\n",
      "\n",
      "def term(m, x):\n",
      "    return ((-1)**m)/(fact(m)*fact(m+1))*(0.5*x)**(2*m)\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "size = comm.Get_size()\n",
      "\n",
      "value = 0.5\n",
      "my_term = term(rank, value)\n",
      "\n",
      "print '%d: %f'%(rank, my_term)\n",
      "\n",
      "if rank == 0:\n",
      "    term = [my_term]\n",
      "    for i in range(1, size):\n",
      "        datum = comm.recv(source=i)\n",
      "        term.append(datum)\n",
      "    print 'series gives %f'%sum(term)\n",
      "    print 'scipy gives %f'%j0(value)\n",
      "    print 'error is %f'%(sum(term)-j0(value))\n",
      "else:\n",
      "    comm.send(my_term, dest=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 16 python series.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Exercise\n",
      "- Write a sorting algorithm which works in parallel with communications.  The root process should generate a large data set, then slice it unto bins by value and send each bin (save one) to another processor to sort.  That processer can utilize any appropriate method to sort the data (hint:  `numpy.array.sort`), then transmit the sorted data set back to the root process.  The data are now completely sorted when put in rank order.\n",
      "    \n",
      "    As an example, consider sorting this data set on four processors:\n",
      "    \n",
      "        3 5 7 4 6 7 1 9 2 8 3 2\n",
      "    \n",
      "    The first (root) process looks at the range of these data and divides them into four groups, one for each process (0\u20132, 3\u20135, 6\u20138, 9\u201311):\n",
      "    \n",
      "        Group:  1  1  2  1  2  2  3  3  0  2  1  0\n",
      "        Data:   3  5  7  4  6  7 11  9  2  8  3  2\n",
      "    \n",
      "    Thus process rank 0 receives two data points, `[2]`, while process rank 2 receives four, `[7, 6, 7, 8]`.  (A better algorithm will balance the load better but that's not our concern right now.)  When each process has sorted its own data points, then reunifying them on root will produce a completely sorted data set.\n",
      "    \n",
      "        Group:  0  0 | 1  1  1  1 | 2  2  2  2 | 3  3\n",
      "        Data:   2  2 | 3  3  4  5 | 6  7  7  8 | 9 11\n",
      "    \n",
      "    (Sorting algorithms don't typically get engineers that excited, but it's a good illustration of the necessary communication processes involved.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file exercise2.py\n",
      "from mpi4py import MPI\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "# add your code here"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 4 python exercise2.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Challenge Exercise\n",
      "- Write a two-step merge sort algorithm.  The first step should partition a large data set between several processors which then sort the data locally using any efficient algorithm.  These data are transmitted back to the root process, which then breaks them up into $P$ groups by value and sends them back to the processors to merge into one array each.  Finally, the fully sorted data set should be merged back together on the root process."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "## Numerical Applications with MPI4Py\n",
      "\n",
      "Now we get to the real meat of the discussion:  how can we compose effective numerical codes using MPI4Py?  In the spirit of the series summation for the Bessel function above, what follows illustrates aspects of both point-to-point and collective MPI operations in Python."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Application:  Kinetic Monte Carlo\n",
      "\n",
      "The [kinetic Monte Carlo](https://en.wikipedia.org/wiki/Kinetic_Monte_Carlo) method seeks to model rate-based processes in nature, including diffusion and transport, chemical reaction, and polymer relaxation processes.  Typically, a table of processes expected to occur are expressed in terms of a time step interval.  A random number is generated and used to select between possible reactions (and no reaction at all).  Thus dynamic processes can be simulated if rate laws are known.  It is apparent how this algorithm is highly parallel since only at physical domain borders are cross-process interactions possible.\n",
      "\n",
      "For instance, we can simulate vacancy diffusion on the surface of an alloy.  We write a probability for the vacancy to diffuse in any valid direction according to the composition, number of nearest neighbors of surface atoms, etc. (essentially a random walk).  To make this problem tractable for a workshop, we will use one-dimensional diffusion along a line of atoms.\n",
      "\n",
      "![](https://bytebucket.org/davis68/parallel/raw/fcbc7ceb169194454ac28506832177d318413d5b/lessons/img/kmc-jump.png?token=195dc06c7bcb5e779e6d6e96f8f9e17984a6891e)\n",
      "\n",
      "We will write the probability of diffusion of an atom in a given direction as\n",
      "$$ D = D_{0} v \\exp\\left( -\\frac{E_D}{k_B T} \\right) $$\n",
      "where $D_{0}$ is a rate prefactor, $E_D$ is the diffusion activation energy (relative to absolute temperature $T$), $k_B$ is the Boltzmann rate constant (which we take to be unity), and $v \\in \\left\\{0, 1\\right\\}$ indicates whether or not a given site is vacant (_i.e._, whether or not diffusion is physically possible in a given case).\n",
      "\n",
      "Adjusting to a system of units near $10^{0}$ and simplifying, the equation we will actually program into the computer below is\n",
      "$$ \\rho_{\\left\\{\\ell,r\\right\\}} = v \\exp\\left( -\\delta \\right) $$\n",
      "with $\\delta = E_{D}/T$, a reduced temperature.\n",
      "\n",
      "The total reaction rate will be calculated for a constant time step size $\\tau = \\rho_{\\text{total}}^{-1}$.  Thus we will include a fractional reaction probability for a left transition $\\rho_\\ell$ (if possible), a right transition $\\rho_r$ (if possible), and no transition at all $\\rho_{\\text{total}}-\\rho_\\ell-\\rho_r$.\n",
      "\n",
      "Let's look at a serial code first to get a feel for the [algorithm](https://en.wikipedia.org/wiki/Kinetic_Monte_Carlo#Algorithm)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file kmc_series.py\n",
      "import numpy as np\n",
      "import numpy.random as npr\n",
      "\n",
      "#npr.seed(1000)\n",
      "\n",
      "# Prepare site grid and initial vacancy locations.\n",
      "N = 200  # atom sites\n",
      "V = 0.1 # fraction of vacancies\n",
      "maxit = 1000\n",
      "\n",
      "sites = np.ones(N)\n",
      "vacancies = npr.random((N)) >= V\n",
      "sites *= vacancies\n",
      "allsites = np.ones((maxit+1,N))\n",
      "allsites[0] = sites\n",
      "\n",
      "# Calculate rate of diffusion for current conditions.  In this case, it's basically between\n",
      "# reaction and non-reaction unless there are two adjoining vacancies.\n",
      "R = []\n",
      "lt = 0\n",
      "rt = 1\n",
      "delta = 4.0\n",
      "R.append(np.exp(-delta))\n",
      "R.append(R[lt] + np.exp(-delta))\n",
      "\n",
      "# Execute main KMC residence-time algorithm loop.\n",
      "it = 0\n",
      "while (it < maxit):\n",
      "    for i in range(0, N):\n",
      "        if sites[i] == 1.0: continue # compare only vacancies---don't generally compare FP numbers (but we set this rather than calculated it)\n",
      "        \n",
      "        # Find the event to carry out from the table from a uniform random number.\n",
      "        if i > 0:\n",
      "            n_lt = i-1\n",
      "            if sites[n_lt] != 1.0: continue\n",
      "            u = npr.uniform()\n",
      "            if (u < R[lt]):\n",
      "                # swap atom with vacancy\n",
      "                sites[n_lt] = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "        if i < N-1:\n",
      "            n_rt = i+1\n",
      "            if sites[n_rt] != 1.0: continue\n",
      "            u = npr.uniform()\n",
      "            if (u < R[rt] and u >= R[lt]):\n",
      "                # swap atom with vacancy\n",
      "                sites[n_rt] = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "    \n",
      "    it += 1\n",
      "    allsites[it] = sites\n",
      "np.savetxt('allsites.txt', allsites)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python kmc_series.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "allsites = np.loadtxt('allsites.txt')\n",
      "import matplotlib.pyplot as plt\n",
      "from matplotlib import cm\n",
      "%matplotlib inline\n",
      "mpl.rcParams['figure.figsize']=[18,6]\n",
      "plt.imshow(allsites.T, cmap=cm.Blues_r)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Okay, now let's divide the domain into four segments and assign one to each process.  (The segmentation is apparent in the code, but the number of processors is not, and generally should not be, unless you need a square number of processors.)  Thus each domain will take \u00bc of the total, and communication will only be necessary at the boundaries.  The code actually changes very little from what it was before, just adding a few lines to deal with communication and communicated boundary site data.\n",
      "\n",
      "![](https://bytebucket.org/davis68/parallel/raw/fcbc7ceb169194454ac28506832177d318413d5b/lessons/img/kmc-domains.png?token=b9512263c9cf666fa25a2796263b3a4a85a8bdf5)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file kmc_par.py\n",
      "from mpi4py import MPI\n",
      "import numpy as np\n",
      "import numpy.random as npr\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "size = comm.Get_size()\n",
      "npr.seed(rank+1000)\n",
      "\n",
      "# Prepare site grid and initial vacancy locations.\n",
      "N = 100  # atom sites (per process)\n",
      "V = 0.05 # fraction of vacancies (per process, initially)\n",
      "maxit = 1000\n",
      "\n",
      "sites = np.ones(N)\n",
      "vacancies = npr.random((N)) >= V\n",
      "sites *= vacancies\n",
      "allsites = np.ones((maxit+1,N))\n",
      "allsites[0] = sites\n",
      "\n",
      "# Calculate rate of diffusion for current conditions.  In this case, it's basically between\n",
      "# reaction and non-reaction unless there are two adjoining vacancies.\n",
      "R = []\n",
      "lt = 0\n",
      "rt = 1\n",
      "delta = 4.0\n",
      "R.append(np.exp(-delta))\n",
      "R.append(R[lt] + np.exp(-delta))\n",
      "\n",
      "# Execute main KMC residence-time algorithm loop.\n",
      "it = 0\n",
      "comm.send(sites[-1], dest=(rank+1)%size)\n",
      "s_lt = comm.recv(source=(rank-1)%size)\n",
      "comm.send(sites[0], dest=(rank-1)%size)\n",
      "s_rt = comm.recv(source=(rank+1)%size)\n",
      "while (it < maxit):\n",
      "    # Communicate the end sites' statuses to neighbors (periodic BCs).  We use ghost cells.\n",
      "    # Typically we would only send data if a change had occurred, and sending a large data set here\n",
      "    # would cause MPI4Py to freeze with this configuration (small values use buffered sending).\n",
      "    comm.send(sites[-1], dest=(rank+1)%size)\n",
      "    t_lt = comm.recv(source=(rank-1)%size)\n",
      "    comm.send(sites[0], dest=(rank-1)%size)\n",
      "    t_rt = comm.recv(source=(rank+1)%size)\n",
      "    \n",
      "    comm.send(s_rt, dest=(rank+1)%size)\n",
      "    sites[0] = comm.recv(source=(rank-1)%size)\n",
      "    comm.send(s_lt, dest=(rank-1)%size)\n",
      "    sites[-1] = comm.recv(source=(rank+1)%size)\n",
      "    \n",
      "    s_rt = t_rt\n",
      "    s_lt = t_lt\n",
      "    #print '%d: %f %f %f %f\\n'%(rank, s_rt, s_lt, sites[0], sites[-1])\n",
      "    \n",
      "    for i in range(0, N):\n",
      "        if sites[i] == 1.0: continue # compare only vacancies---don't generally compare FP numbers (but we set this rather than calculated it)\n",
      "        \n",
      "        # Find the event to carry out from the table from a uniform random number.\n",
      "        # Note that we only allow diffusion to occur from a communicated ghost cell\n",
      "        # outbound, not inbound---that is handled from the other side, where it is\n",
      "        # also outbound.\n",
      "        if i == 0:\n",
      "            if s_lt != 1.0: continue # no point in swapping two vacancies\n",
      "            u = npr.uniform()\n",
      "            if (u < R[lt]):\n",
      "                # We consider the vacancy to have diffused outwards.\n",
      "                s_lt = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "        elif i > 0:\n",
      "            n_lt = i-1\n",
      "            if sites[n_lt] != 1.0: continue # no point in swapping two vacancies\n",
      "            u = npr.uniform()\n",
      "            if (u < R[lt]):\n",
      "                # swap atom with vacancy\n",
      "                sites[n_lt] = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "        if i == N-1:\n",
      "            if s_rt != 1.0: continue # no point in swapping two vacancies\n",
      "            u = npr.uniform()\n",
      "            if (u < R[rt] and u >= R[lt]):\n",
      "                # We consider the vacancy to have diffused outwards.\n",
      "                s_rt = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "        if i < N-1:\n",
      "            n_rt = i+1\n",
      "            if sites[n_rt] != 1.0: continue # no point in swapping two vacancies\n",
      "            u = npr.uniform()\n",
      "            if (u < R[rt] and u >= R[lt]):\n",
      "                # swap atom with vacancy\n",
      "                sites[n_rt] = 0.0\n",
      "                sites[i] = 1.0\n",
      "                continue\n",
      "    \n",
      "    it += 1\n",
      "    allsites[it] = sites\n",
      "\n",
      "# Communicate and merge all data.\n",
      "recvbuff = comm.gather(allsites, 0)\n",
      "# recvbuff is a list of ndarrays which need to be merged along their second axis\n",
      "if (rank == 0):\n",
      "    alldata = np.concatenate(recvbuff, axis = 1)\n",
      "    np.savetxt('alldata.txt', alldata)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 4 python kmc_par.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "alldata = np.loadtxt('alldata.txt')\n",
      "import matplotlib as mpl\n",
      "import matplotlib.pyplot as plt\n",
      "from matplotlib import cm\n",
      "%matplotlib inline\n",
      "mpl.rcParams['figure.figsize']=[18,6]\n",
      "plt.imshow(alldata.T, cmap=cm.Reds_r)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Basically all we have to add to get things working is a buffer ghost cell at each end to store the state of the adjoining domain elements as well as some logic to handle those cases.  For output purposes, we also perform a `gather` operation at the end and `concatenate` the list of `nparray`s into one which we can then output to disk and plot."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Challenge Exercise\n",
      "- Convert this code to track two kinds of alloys, each with their own reaction rate constants.\n",
      "- Convert this code to work in two dimensions (a surface) rather than just one (an edge).  (You could also track vacancies instead of atoms as there are fewer of them to process and they are conserved.)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "### Application:  Heat Equation (Finite Difference Method)\n",
      "\n",
      "Finite-difference models are used throughout engineering to obtain numerical solutions to differential equations.  This particular system models the heat equation\n",
      "\n",
      "$$ \\frac{1}{\\alpha} \\frac{\\partial u}{\\partial t} = \\frac{\\partial^2 u}{\\partial x^2}$$\n",
      "\n",
      "given an initial condition of $u(x,t=0) = \\sin\\left(\\pi x/L\\right)$ and boundary conditions of $u(x=0,t) = 0$ and $u(x=L,t) = 0$.\n",
      "\n",
      "To approximate a derivative, the most straightforward way is to take the formal definition\n",
      "\n",
      "$$f'(x) = \\frac{f(x+h)-f(x)}{h}$$\n",
      "\n",
      "and use a small but nonzero step $h$ in your calculation.\n",
      "\n",
      "Application of this principle to the heat equation leads to a statement of the form\n",
      "\n",
      "$$ \\frac{1}{\\alpha} \\frac{u^m_i - u^{m-1}_i}{\\Delta t} = \\frac{u^{m-1}_{i-1} - 2 u^{m-1}_{i} + u^{m-1}_{i+1}}{\\Delta x^2} $$\n",
      "\n",
      "or $u^m_i = \\frac{\\alpha \\Delta t}{\\Delta x^2}u^{m-1}_{i-1} + \\left[1 - 2\\left(\\frac{\\alpha \\Delta t}{\\Delta x^2}\\right)\\right]u^{m-1}_{i} + \\frac{\\alpha \\Delta t}{\\Delta x^2}u^{m-1}_{i+1}$.\n",
      "\n",
      "This clearly yields a way to calculate subsequent time steps point-by-point from the previous time step's data.\n",
      "\n",
      "Let's look again at a serial code first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file fd_ser.py\n",
      "import numpy as np\n",
      "\n",
      "# Basic parameters\n",
      "nt = 120\n",
      "nx = 25\n",
      "alpha = 0.1 \n",
      "length = 1.0 \n",
      "tmax = 0.5\n",
      "\n",
      "# Derived parameters:  mesh spacing and time step size\n",
      "dx = length / (nx-1)\n",
      "dt = tmax / (nt-1)\n",
      "\n",
      "# Create arrays to save data in process.\n",
      "x = np.linspace(0, length+1e-15, nx)\n",
      "t = np.linspace(0, tmax+1e-15, nt)\n",
      "u = np.zeros([nx, nt])\n",
      "\n",
      "# Set initial and boundary conditions.\n",
      "u[:, 0] = np.sin(np.pi*x/length)**2\n",
      "#boundaries are implicitly set by this initial condition\n",
      "\n",
      "# Loop through each time step.\n",
      "r  = alpha * dt / (dx*dx)\n",
      "s  = 1 - 2*r\n",
      "for n in range(1, nt):\n",
      "    for j in range(1, nx-1):\n",
      "        u[j, n] = r*u[j-1, n-1] + s*u[j, n-1] + r*u[j+1, n-1]\n",
      "\n",
      "# Output the results.\n",
      "np.savetxt('fd_data.txt', u)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python fd_ser.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from mpl_toolkits.mplot3d import Axes3D\n",
      "from matplotlib import cm\n",
      "from matplotlib.ticker import LinearLocator, FormatStrFormatter\n",
      "import matplotlib as mpl\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline\n",
      "mpl.rcParams['figure.figsize']=[18,6]\n",
      "import numpy as np\n",
      "\n",
      "# Basic parameters\n",
      "nt = 120\n",
      "nx = 25\n",
      "alpha = 0.1 \n",
      "length = 1.0 \n",
      "tmax = 0.5\n",
      "\n",
      "fig = plt.figure()\n",
      "ax = fig.gca(projection='3d')\n",
      "x = np.linspace(0, length+1e-15, nx)[:, np.newaxis]\n",
      "t = np.linspace(0, tmax+1e-15, nt)[:, np.newaxis]\n",
      "t, x = np.meshgrid(t, x)\n",
      "u = np.loadtxt('fd_data.txt')\n",
      "surf = ax.plot_surface(x, t, u, rstride=1, cstride=1, cmap=cm.jet,\n",
      "        linewidth=0, antialiased=False)\n",
      "ax.set_zlim(0.0, u.max())\n",
      "\n",
      "ax.zaxis.set_major_locator(LinearLocator(10))\n",
      "ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))\n",
      "\n",
      "fig.colorbar(surf, shrink=0.5, aspect=5)\n",
      "\n",
      "plt.show()\n",
      "\n",
      "plt.imshow(u, cmap=cm.jet)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now the threaded MPI version.  The discretization that makes most sense for this short code is by $x$ coordinate, as the problem is highly dependent along the $t$ direction and can't be parallelized that way.  Thus each problem should share its data at the boundaries in order for this finite difference stencil to be effective.\n",
      "\n",
      "<img src=\"https://bytebucket.org/davis68/python/raw/fa6209bae02c22c0d67559098ec423b839f5e3f3/lessons/img/border-cells.png?token=e9d9c785003138c348c699fa13709e7f265524f5\" alt=\"Drawing\" style=\"width: 500px;\"/>\n",
      "\n",
      "Each process will possess a completely separate (nonoverlapping) set of cells, together with a left and right ghost cell."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file fd_par.py\n",
      "from __future__ import division\n",
      "from mpi4py import MPI\n",
      "import numpy as np\n",
      "\n",
      "comm = MPI.COMM_WORLD\n",
      "rank = comm.Get_rank()\n",
      "size = comm.Get_size()\n",
      "nbr_l = rank-1 if rank > 0 else -1\n",
      "nbr_r = rank+1 if rank < size-1 else -1\n",
      "\n",
      "# Basic parameters\n",
      "nt = 1000\n",
      "nx = 10  # per process\n",
      "alpha = 0.1\n",
      "length = 1.0\n",
      "tmax = 2.5\n",
      "\n",
      "# Derived parameters:  mesh spacing and time step size\n",
      "dx = length / ((nx-1)*size)\n",
      "dt = tmax / (nt-1)\n",
      "\n",
      "# Create arrays to save data in process.\n",
      "offset_x = rank/size * length\n",
      "x = np.linspace(offset_x, (rank+1)/size*length-dx+1e-15, nx)\n",
      "t = np.linspace(0, tmax+1e-15, nt)\n",
      "u = np.zeros([nx, nt])\n",
      "\n",
      "# Set initial and boundary conditions.\n",
      "u[:, 0] = np.sin(np.pi*x/length)**2\n",
      "#boundaries are implicitly set by this initial condition\n",
      "\n",
      "# Loop through each time step.\n",
      "r  = alpha * dt / (dx*dx)\n",
      "s  = 1 - 2*r\n",
      "u_l = 0.0\n",
      "u_r = 0.0\n",
      "for n in range(1, nt):\n",
      "    # communicate boundary information to ghost cells\n",
      "    if (nbr_l >= 0): comm.send(u[ 0, n-1], dest=nbr_l) #; print rank, \" sent \", u[ 0,n-1], \" to \", nbr_l\n",
      "    if (nbr_r >= 0): comm.send(u[-1, n-1], dest=nbr_r) #; print rank, \" sent \", u[-1,n-1], \" to \", nbr_r\n",
      "    if (nbr_l >= 0): u_l = comm.recv(source=nbr_l) #; print nbr_l, \" recv \", u_l, \" of \", rank\n",
      "    if (nbr_r >= 0): u_r = comm.recv(source=nbr_r) #; print nbr_r, \" recv \", u_r, \" of \", rank\n",
      "    # loop over cells and solve main equation including ghost cell data\n",
      "    for j in range(0, nx):\n",
      "        if (rank == 0 and j == 0) or (rank == size and j == nx-1): continue\n",
      "        \"\"\"if (rank == 0 and j == nx-1):\n",
      "            print \"%f*(%f if %d>0 else %f) + %f*%f + %f*(%f if %d<nx-1 else %f)\"%(r, u[j-1, n-1], j, u_l, s, u[j, n-1], r, u[j+1, n-1], j, _r)\n",
      "        \"\"\"\n",
      "        u[j, n] = r*(u[j-1, n-1] if j>0 else u_l) + s*u[j, n-1] + r*(u[j+1, n-1] if j<nx-1 else u_r)\n",
      "    #print n, rank, u[:,n], \"\\n---\\n\"\n",
      "# Combine the results and output.\n",
      "#np.savetxt('fd_data_%d.txt'%(rank), u) # use with joint output block (in notes at end).\n",
      "recvbufr = comm.gather(u, 0)\n",
      "if rank==0:\n",
      "    u_all = np.zeros((nx*size,nt))\n",
      "    for i in range(0, size):\n",
      "        offset = i*nx\n",
      "        u_all[offset:offset+nx,:] = recvbufr[i]\n",
      "    np.savetxt('fd_data.txt', u_all)\n",
      "    print u_all[:,nx-1]\n",
      "    print \"Complete.\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!mpiexec -np 4 python fd_par.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from mpl_toolkits.mplot3d import Axes3D\n",
      "from matplotlib import cm\n",
      "from matplotlib.ticker import LinearLocator, FormatStrFormatter\n",
      "import matplotlib as mpl\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline\n",
      "mpl.rcParams['figure.figsize']=[18,6]\n",
      "import numpy as np\n",
      "\n",
      "# Basic parameters\n",
      "nt = 1000\n",
      "nx = 4*10  # per process\n",
      "alpha = 0.1\n",
      "length = 1.0\n",
      "tmax = 2.5\n",
      "\n",
      "fig = plt.figure()\n",
      "ax = fig.gca(projection='3d')\n",
      "x = np.linspace(0, length+1e-15, nx)[:, np.newaxis]\n",
      "t = np.linspace(0, tmax+1e-15, nt)[:, np.newaxis]\n",
      "t, x = np.meshgrid(t, x)\n",
      "u = np.loadtxt('fd_data.txt')\n",
      "surf = ax.plot_surface(x, t, u, rstride=1, cstride=1, cmap=cm.jet,\n",
      "        linewidth=0, antialiased=False)\n",
      "ax.set_zlim(0.0, u.max())\n",
      "\n",
      "ax.zaxis.set_major_locator(LinearLocator(10))\n",
      "ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))\n",
      "\n",
      "fig.colorbar(surf, shrink=0.5, aspect=5)\n",
      "plt.show()\n",
      "\n",
      "plt.imshow(u[:,::10], cmap=cm.jet)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "### Feedback\n",
      "Please take a few minutes during this break and [provide us with some feedback](https://uiuc-cse.github.io/python-fa14/#feedback) about this tutorial."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "### Application:  Calculating $\\pi$ (Dynamic Threads)\n",
      "\n",
      "One interesting feature of the way MPI (and thus `MPI4Py`) handles processes is that it is reasonably straightforward for a serial master task to dynamically spawn a group of subordinate parallel threads to complete a specific task.  (This is somewhat more involved in vanilla MPI.)  Let's see how that looks in practice\u2014we'll write a master code and then a worker script which will be called for a specific task then terminated.\n",
      "\n",
      "(This example is modified from [one in the documentation](http://mpi4py.scipy.org/docs/usrman/tutorial.html#dynamic-process-management))."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file master.py\n",
      "#!/usr/bin/env python\n",
      "from mpi4py import MPI\n",
      "import numpy\n",
      "import sys\n",
      "\n",
      "comm = MPI.COMM_SELF.Spawn(sys.executable,\n",
      "                           args=['worker.py'],\n",
      "                           maxprocs=5)\n",
      "\n",
      "N = numpy.array(100, 'i')\n",
      "comm.Bcast([N, MPI.INT], root=MPI.ROOT)\n",
      "PI = numpy.array(0.0, 'd')\n",
      "comm.Reduce(None, [PI, MPI.DOUBLE], op=MPI.SUM, root=MPI.ROOT)\n",
      "print PI\n",
      "\n",
      "comm.Disconnect()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%file worker.py\n",
      "#!/usr/bin/env python\n",
      "from mpi4py import MPI\n",
      "import numpy\n",
      "\n",
      "comm = MPI.Comm.Get_parent()\n",
      "size = comm.Get_size()\n",
      "rank = comm.Get_rank()\n",
      "\n",
      "N = numpy.array(0, dtype='i')\n",
      "comm.Bcast([N, MPI.INT], root=0)\n",
      "h = 1.0 / N; s = 0.0\n",
      "for i in range(rank, N, size):\n",
      "    x = h * (i + 0.5)\n",
      "    s += 4.0 / (1.0 + x**2)\n",
      "PI = numpy.array(s * h, dtype='d')\n",
      "comm.Reduce([PI, MPI.DOUBLE], None, op=MPI.SUM, root=0)\n",
      "\n",
      "comm.Disconnect()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!python master.py\n",
      "# Your mileage may vary depending on the MPI library installed."
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Although I haven't studied the question in detail yet, it has occurred to me that this may be the best way to use `MPI4Py` in practice:  run your regular Python script, and when you need a specific module in parallel for performance reasons, step down to a worker script and execute it dynamically."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "## MPI Operations & `MPI4Py`\n",
      "Most of the functionality of MPI is implemented in the current version of MPI4Py (1.3), with a few exceptions.  [Interface documentation](http://mpi4py.scipy.org/docs/usrman/mpi4py.html) is available online.\n",
      "\n",
      "### `pickle`\n",
      "If you know classic MPI, then you're aware that it is well equipped to send and receive basic data types like `float`, `double`, `int`, and arrays of these\u2014the foundational data types of C, C++, and Fortran.  But with `MPI4Py` it is trivially easy to send much more complex data types, such as objects, with little to no programmer overhead.  (There is, of course, a cost in performance somewhere.)  What is going on?\n",
      "\n",
      "Python supports [a concept called \"pickling\"](https://docs.python.org/2/library/pickle.html), which refers to converting an object directly to or from a byte stream.  The `pickle` module aids in this process.  It is used by many data storage and communications libraries in Python, and `MPI4Py` is no exception.  An object is `pickle`d behind the scenes, and that byte stream is sent via MPI and then deserialized behind the scenes on the other end.  (Incidentally, the `cPickle` module is up to 100 times faster, and should be used in practice.)\n",
      "\n",
      "Let's see what `pickle` does and how it works, and then we'll look at it in `MPI4Py`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pickle\n",
      "\n",
      "data1 = {'a': [1, 2.0, 3, 4+6j],\n",
      "         'b': ('string', u'\u6230\u570b\u6642\u4ee3'),\n",
      "         'c': None}\n",
      "\n",
      "selfref_list = [1, 2, 3]\n",
      "selfref_list.append(selfref_list)\n",
      "\n",
      "output = open('data.pkl', 'wb')\n",
      "\n",
      "# Pickle dictionary using protocol 0.\n",
      "pickle.dump(data1, output)\n",
      "\n",
      "# Pickle the list using the highest protocol available.\n",
      "pickle.dump(selfref_list, output, -1)\n",
      "output.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pprint, pickle\n",
      "\n",
      "pkl_file = open('data.pkl', 'rb')\n",
      "\n",
      "data1 = pickle.load(pkl_file)\n",
      "pprint.pprint(data1)\n",
      "\n",
      "data2 = pickle.load(pkl_file)\n",
      "pprint.pprint(data2)\n",
      "\n",
      "pkl_file.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It doesn't look like much, but being able to pack data away (as either binary files or just byte streams) for later manipulation can save you trouble.  You can use it to benchmark the state of a calculation, for instance.   Other motivations are discussed [here](https://stackoverflow.com/questions/8968884/python-serialization-why-pickle).\n",
      "\n",
      "Ultimately, the fact that `pickle` makes working with advanced data structures like `ndarray`s trivial to implement is the real value added (like in the master/worker example above)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "---\n",
      "Neal Davis developed these materials for [Computational Science and Engineering](http://cse.illinois.edu/) at the University of Illinois at Urbana\u2013Champaign.  This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).\n",
      "\n",
      "![](http://i.creativecommons.org/l/by/3.0/88x31.png)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}