{ "metadata": { "name": "", "signature": "sha256:57c65c0d9de47fa5f7e561e0ed4f6e3815122c1f06280d2a1df6bea3b65c2ada" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Threaded Programming for Engineers (OpenMP)\n", "\n", "\n", "## Contents\n", "- [Threaded Programs](#intro)\n", "- [Hello World!](#hello)\n", "- [Numerical Integration](#numint)\n", "- [Recursion (Fibonacci Sequence)](#fib)\n", "- [Vector & Matrix Multiplication](#mmult)\n", "- [Series Formul\u00e6](#series)\n", "- [Monte Carlo](#montecarlo)\n", "- [A Few More Tips](#tips)\n", "- [Using Vectorization (SIMD) in Practice](#vector)\n", "- [Assessing Performance](#assess)\n", "- [Best Practices for Scientific Parallel Programming](#bestprac)\n", "- [Conclusion](#conc)\n", "- [References](#refs)\n", "- [Credits](#credits)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Threaded Programs\n", "\n", "[OpenMP](http://www.openmp.org/) is an interface for C/C++/Fortran to direct multithreaded, shared-memory parallel execution. OpenMP allows a group of processors to co\u00f6rdinate work between themselves using the _fork\u2013join_ model of parallelism. In this model, a _master_ thread (or process) spawns groups of _slave_ threads as appropriate to solve certain calculations.\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/fork-join.png?token=8f9936a462a4862f1a2894d28b33ce7aa57e7cb4)\n", "\n", "An example of a parallel addition in this model is shown below. Note that, in OpenMP terms, separate processes could be spawned to perform each addition\u2014this differs from _vectorization_ in which a single process would perform the operation as a single step across multiple vector elements ([ref](https://stackoverflow.com/questions/10509872/comparison-between-openmp-and-vectorization)). (More on that distinction\u2014and how to exploit both to your advantage\u2014below.)\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/SIMD-vector-add.png?token=7d71987d20cfb22695d0444a321a9d9480610c6f)\n", "\n", "The interface for working with OpenMP is reasonably straightforward (in comparison to MPI), as it is implemented entirely with `#pragma omp`s in C and C++ and `!$OMP` sentinels in Fortran. Thus existing serial code can be na\u00efvely parallelized with almost no effort (although this guarantees no improvement in performance without tweaks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Hello World\n", "\n", "Let's take a look at a trivial C++ code which uses OpenMP.\n", "\n", "As we are using an [IPython](ipython.org) notebook interface to unite code and context, simply select each cell and press Shift+Enter to run the following three blocks of code in succession.\n", "1. The first writes the C++ code in it to disk as the `openmp.cpp`.\n", "1. The second compiles the C++ code.\n", "1. The third executes the code in a `bash` shell and echoes the output back in the notebook." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file openmp.cpp\n", "\n", "#include \n", "#include \n", "using namespace std;\n", " \n", "int main(int argc, char *argv[]) {\n", " int th_id, nthreads;\n", "\n", " #pragma omp parallel private(th_id) shared(nthreads)\n", " {\n", " th_id = omp_get_thread_num();\n", " \n", " cout << \"Hello World from thread \" << th_id << '\\n';\n", " #pragma omp barrier\n", " \n", " #pragma omp single\n", " cout << \"---\" << endl;\n", " \n", " #pragma omp critical\n", " cout << \"Hello World from thread \" << th_id << '\\n';\n", " \n", " #pragma omp barrier\n", "\n", " cout << \"There are \" << omp_get_max_threads() << \" threads\" << '\\n';\n", " }\n", "\n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic openmp.cpp -o openmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./openmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's actually several things going on there, so let's unpack the output first and then the code body to get a taste of OpenMP." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Timing code in IPython\n", "%timeit !./openmp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OpenMP `#pragma`s\n", "\n", "What would happen if we forgot the `-fopenmp` flag? Well, `#pragma`s are preprocessor commands to the compiler to invoke certain features, and thus those features (in this case OpenMP) wouldn't be invoked. In some cases, like the \"Hello World\" example, this _may_ make no difference, but that's hardly the general case. Hey, let's just try it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -O2 -pedantic openmp.cpp -o openmp_nopragma" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Numerical Integration\n", "Let's do something more interesting, assuming that you find writing parallel numerical methods interesting :). Consider the following decomposition of the composite Simpson's rule for integral evaluation.\n", "\n", "$$I = \\int_{a}^{b} \\textrm{d}x\\, f(x)$$\n", "\n", "$$\n", "T[f] = \\Delta x \\left(\n", "\\frac{1}{2} f_{\\textrm{0}} + \\frac{1}{2} f_{n} + \\sum_{j=1}^{n-1} f_{j}\n", "\\right)\n", "\\approx\n", "I\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the integral\n", "$$\n", "\\int_{\\pi}^{5\\pi/2}\n", "\\textrm{d}x\\,\n", "\\frac{\\cos(x)}{2 + \\cos(x)}\n", "$$\n", "which is depicted below.\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/int-cosby2.png?token=91e032a5d2586d8a80d37c03adbe5a5e5072f20a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's look at a serial implementation of the trapezoid rule in C." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file int_trap_serial.cpp\n", "#include \n", "#include \n", "using namespace std;\n", "\n", "#define PI 3.14159265\n", "#define N 1000\n", "\n", "double integrand(double x) {\n", " return cos(x)/(2+cos(x));\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " double result = 0,\n", " x;\n", " double a = PI,\n", " b = 5*PI/2;\n", " double dx = (b - a) / N;\n", " \n", " for (int j = 1; j < N; j++) {\n", " x = dx * j;\n", " result += integrand(x);\n", " }\n", " // Add endpoint calculations.\n", " result += 1/2 * integrand(a);\n", " result += 1/2 * integrand(b);\n", " // Multiply by prefactor\n", " result *= dx;\n", " \n", " cout << \"The calculated value is \" << result << \".\\n\";\n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic int_trap_serial.cpp -o int_trap_serial" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./int_trap_serial" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's convert the same code into OpenMP parallel code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file int_trap_omp.cpp\n", "\n", "#include \n", "#include \n", "using namespace std;\n", "#include \n", "\n", "#define PI 3.14159265\n", "#define N 1000\n", "\n", "double integrand(double x) {\n", " return cos(x)/(2+cos(x));\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " double result = 0,\n", " x;\n", " double a = PI,\n", " b = 5*PI/2;\n", " double dx = (b - a) / N;\n", " \n", " int j;\n", " #pragma omp parallel for private(x) reduction(+:result) schedule(static,1)\n", " for (j = 1; j < N; j++) {\n", " x = dx * j;\n", " result += integrand(x);\n", " }\n", " cout << \"Using \" << omp_get_max_threads() << \" threads, \";\n", " \n", " // Add endpoint calculations.\n", " result += 1/2 * integrand(a);\n", " result += 1/2 * integrand(b);\n", " // Multiply by prefactor\n", " result *= dx;\n", " \n", " cout << \"the calculated value is \" << result << \".\\n\";\n", " return 0;\n", "} \n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic int_trap_omp.cpp -o int_trap_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./int_trap_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, with every iterative calculation being essentially independent of the others, we have to add very little. Specifically, we added this compound statement. Let's deconstruct it a bit to see what OpenMP is doing.\n", " #pragma omp parallel for private(x) reduction(+:result) schedule(static,1)\n", " \n", "- [`#pragma`](https://gcc.gnu.org/onlinedocs/cpp/Pragmas.html) instructs the compiler that a special category of compiler-specific instructions follows. Thus `#pragma omp` invokes compilation with OpenMP support for the following code section.\n", "- `parallel` begins a block of\u2014you guessed it\u2014parallel code. It forms a team of parallel threads which begin executing the subsequent code block in parallel.\n", "- `for` signals that the following block is a `for` loop. 
The number of iterations must be known when the loop is entered (_i.e._, you can't alter `N` mid-loop or `break` out early). The iterations will be partitioned among the threads as defined by `schedule`.\n", "- `private` indicates a list of variables (previously declared) which are considered to be private to each thread\u2014that is, every thread has its own copy of the variable.\n", "- `reduction` is an operation which takes a group of values and combines them by some well-defined operation like addition or multiplication. Here `+:result` means that each thread accumulates a private partial sum and the partial sums are added together at the end; `result` is the `double` which will contain the answer.\n", "- `schedule` refers to how iterations are to be divided between threads. `static,1` means that the iterations are divided into groups of size `1` and assigned to threads in round-robin fashion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "- Write a Simpson's rule integrator which solves the integral\n", "$$\n", "\\int_{0}^{2}\n", "\\textrm{d}x\\,\n", "\\left( \\frac{\\sinh{x}}{\\cosh{x}} - 1 \\right)\n", "\\textrm{.}\n", "$$\n", "\n", "Recall that the composite Simpson's rule (for an even number of intervals $n$) is\n", "$$\n", "S[f] = \\frac{\\Delta x}{3} \\left(\n", "f_{0} + 4 \\sum_{j\\,\\mathrm{odd}} f_{j} + 2 \\sum_{j\\,\\mathrm{even},\\, 0 < j < n} f_{j} + f_{n}\n", "\\right)\n", "\\approx\n", "I\n", "\\text{.}\n", "$$\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/int-sinhbycosh.png?token=95b968dc849b4128087c5cc1dadeb1f45e035c14)\n", "\n", "A skeleton code for this is provided below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file int_simp_omp.cpp\n", "\n", "#include <iostream>\n", "#include <cmath>\n", "using namespace std;\n", "#include <omp.h>\n", "\n", "#define PI 3.14159265\n", "#define N 1000\n", "\n", "double integrand(double x) {\n", " double value;\n", " // TODO\n", " return value;\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " double result = 0,\n", " x;\n", " double a = 0,\n", " b = 2;\n", " double dx = (b - a) / N;\n", " \n", " int j;\n", " #pragma omp parallel for private(x) reduction(+:result) schedule(static,1)\n", " for (j = 1; j < N; j++) {\n", " //TODO\n", " }\n", " cout << \"Using \" << omp_get_max_threads() << \" threads, \";\n", " \n", " // Add endpoint calculations.\n", " // TODO\n", " // Multiply by prefactor\n", " // TODO\n", " \n", " cout << \"the calculated value is \" << result << \".\\n\";\n", " return 0;\n", "} \n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!/usr/local/bin/g++-4.8 -Wall -fopenmp -O2 -std=c++11 -pedantic int_simp_omp.cpp -o int_simp_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./int_simp_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Recursion (Fibonacci Sequence)\n", "\n", "Next, let's examine a recursive code written in OpenMP to see a different kind of internal structure. 
(This is a purely didactic code: there are much better algorithms to calculate the Fibonacci sequence than a recursive algorithm, much less a parallel one.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file fib_serial.cpp\n", "\n", "#include \n", "#include \n", "using namespace std;\n", "\n", "long fib(int n) {\n", " if (n < 2) return n;\n", " else {\n", " long x, y;\n", " x = fib(n-1);\n", " y = fib(n-2);\n", " return x + y;\n", " }\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " int n = atoi(argv[1]); // Note no error checking here.\n", " double F = fib(n);\n", " cout << \"The \" << n << \"th Fibonacci number is \" << F << \".\\n\";\n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Overwriting fib_serial.cpp\n" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "!/usr/local/bin/g++-4.8 -Wall -fopenmp -O2 -std=c++11 -pedantic fib_serial.cpp -o fib_serial" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "!./fib_serial 18\n", "# Don't go higher than 40." ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "The 18th Fibonacci number is 2584.\r\n" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "%%file fib_omp.cpp\n", "\n", "#include \n", "#include \n", "using namespace std;\n", "#include \n", "#include \n", "\n", "long fib(int n) {\n", " if (n < 2)\n", " return n;\n", " else {\n", " long x, y;\n", " #pragma omp task shared(x)\n", " x=fib(n-1);\n", " #pragma omp task shared(y)\n", " y=fib(n-2);\n", " #pragma omp taskwait\n", " return x+y;\n", " }\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " int n = atoi(argv[1]); // Note no error checking here.\n", " double F;\n", " #pragma omp parallel\n", " F = fib(n);\n", " cout << \"The \" << n << \"th Fibonacci number is \" << F << \".\\n\";\n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Overwriting fib_omp.cpp\n" ] } ], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic fib_omp.cpp -o fib_omp" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 43 }, { "cell_type": "code", "collapsed": false, "input": [ "!./fib_omp 18\n", "# Don't go higher than 40." ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "The 18th Fibonacci number is 2584.\r\n" ] } ], "prompt_number": 45 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As an aside, this code works fine without OpenMP.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -O2 -pedantic fib_omp.cpp -o fib_nomp\n", "!./fib_nomp 8" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit !./fib_serial 18\n", "%timeit !./fib_omp 18" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we run the `fib` method inside of a `omp parallel` block.\n", "\n", "This code forks new processes as children of the main (master) process to progressively solve the problem. 
In OpenMP terms these are _tasks_ rather than processes: the total number of threads is still bounded by `OMP_NUM_THREADS`, and the queued tasks are executed by that existing team of threads.\n", "\n", "As mentioned, this is not a good algorithm, nor is it more efficient in parallel. It's merely a demonstration of a principle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Vector & Matrix Multiplication\n", "\n", "An extremely common set of operations in scientific computing codes is vector\u2013vector, vector\u2013matrix, and matrix\u2013matrix addition and multiplication. (This says more about the way we do mathematical physics and what is straightforward to do on a computer than it does about the way the world works.)\n", "\n", "Let's examine a simple vector\u2013vector addition first, then look at some other cases. Here we follow the data-parallel pattern shown in the figure at the start (reproduced below), with each thread simply performing its own share of the additions.\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/SIMD-vector-add.png?token=7d71987d20cfb22695d0444a321a9d9480610c6f)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file vadd.cpp\n", "#include <iostream>\n", "#include <cstdlib>\n", "#include <vector>\n", "using namespace std;\n", "#include <omp.h>\n", "\n", "#define N 16\n", "\n", "int main(int argc, char** argv) {\n", " vector<double> a(N);\n", " vector<double> b(N);\n", " vector<double> c(N);\n", " \n", " // Initialize in parallel.\n", " # pragma omp parallel for\n", " for (int i = 0; i < (int)a.size(); i++) {\n", " a[i] = i;\n", " b[i] = 2*i;\n", " }\n", " \n", " // Add in parallel.\n", " # pragma omp parallel for\n", " for (int i = 0; i < (int)a.size(); i++) {\n", " c[i] = a[i] + b[i];\n", " }\n", " \n", " // Output result of calculation.\n", " for (int i = 0; i < (int)a.size(); i++) {\n", " cout << a[i] << \" + \" << b[i] << \"\\t= \" << c[i] << endl;\n", " }\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic vadd.cpp -o vadd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./vadd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "- Modify the program `vadd.cpp` so that each thread has its own copy of a single variable `i` declared _before_ the `#pragma omp` (hint: use `private`), and use `dynamic` scheduling. (You can copy it into the code block below to preserve a working example above.)" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic vadd.cpp -o vadd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./vadd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "- Write a vector\u2013vector multiplication code in C (no STL but just C-style `double` arrays) using OpenMP. (A hint is sketched just below.)"
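, "\n", "\n", "As a hint, one possible shape for the parallel loop is sketched here. It treats the product as a dot product (an assumption about the intent of the exercise); an element-wise product `c[i] = a[i] * b[i]` parallelizes the same way, just without the reduction.\n", "\n", "    #include <stdio.h>\n", "    #include <omp.h>\n", "\n", "    #define N 16\n", "\n", "    int main(void) {\n", "        int i;\n", "        double a[N], b[N], sum = 0.0;\n", "        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }\n", "\n", "        /* The loop variable is private to each thread automatically; each thread\n", "           accumulates a private partial sum and the reduction combines them. */\n", "        #pragma omp parallel for reduction(+:sum)\n", "        for (i = 0; i < N; i++)\n", "            sum += a[i] * b[i];\n", "\n", "        printf(\"dot(a, b) = %f\\n\", sum);\n", "        return 0;\n", "    }\n"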
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file vmult.c\n", "#include \n", "#include \n", "#include \n", "\n", "#define N 16\n", "\n", "int main(int argc, char** argv) {\n", " double a[N][N];\n", " double b[N][N];\n", " double c[N][N];\n", " \n", " // your code here\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic vmult.c -o vmult" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./vmult" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counterexample\n", "Ultimately, the reason we can get away with working inside of an STL container is that we don't need threads to write to the same memory location, nor do we reallocate memory. ([ref](https://stackoverflow.com/questions/9269097/openmp-and-stl-vector)) The following code, in contrast, will fail because the behavior of `push_back` requires a consistent memory location, violated by several threads competing for the same allocation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file vinit.cpp\n", "\n", "#include \n", "using namespace std;\n", "#include \n", "\n", "int main(int argc, char** argv) {\n", " vector a(0);\n", " \n", " // Initialize a vector in parallel by adding new elements to it.\n", " # pragma omp parallel for\n", " for (int i = 0; i < 12; i++) {\n", " a.push_back(i);\n", " }\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic vinit.cpp -o vinit" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./vinit" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matrix\u2013Matrix Operations\n", "Matrix\u2013matrix operations are more involved since they often involve _blocking_ or _chunking_, which separates a large dense matrix into several smaller square pieces and solves them separately (with dependencies, of course). The following C code illustrates this (original by Blaise Barney, LLNL, [src](https://computing.llnl.gov/tutorials/openMP/samples/C/omp_mm.c))." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file mmult-block.c\n", "// Modified from an original example by Blaise Barney for LLNL.\n", "// https://computing.llnl.gov/tutorials/openMP/samples/C/omp_mm.c\n", "#include \n", "#include \n", "#include \n", "\n", "#define NRA 62 /* number of rows in matrix A */\n", "#define NCA 15 /* number of columns in matrix A */\n", "#define NCB 7 /* number of columns in matrix B */\n", "\n", "int main (int argc, char *argv[]) {\n", " int tid, nthreads, i, j, k, chunk;\n", " double a[NRA][NCA], /* matrix A to be multiplied */\n", " b[NCA][NCB], /* matrix B to be multiplied */\n", " c[NRA][NCB]; /* result matrix C */\n", " \n", " chunk = 10; /* set loop iteration chunk size */\n", " \n", " #pragma omp parallel shared(a, b, c, nthreads, chunk) private(tid, i, j, k)\n", " {\n", " tid = omp_get_thread_num();\n", " if (tid == 0) {\n", " nthreads = omp_get_num_threads();\n", " printf(\"Starting matrix multiple example with %d threads\\n\", nthreads);\n", " }\n", " /*** Initialize matrices ***/\n", " #pragma omp for schedule (static, chunk) \n", " for (i = 0; i < NRA; i++)\n", " for (j = 0; j < NCA; j++)\n", " a[i][j] = i + j;\n", " #pragma omp for schedule (static, chunk)\n", " for (i = 0; i < NCA; i++)\n", " for (j = 0; j < NCB; j++)\n", " b[i][j] = i * j;\n", " #pragma omp for schedule (static, chunk)\n", " for (i = 0; i < NRA; i++)\n", " for (j = 0; j < NCB; j++)\n", " c[i][j] = 0;\n", " \n", " /*** Do matrix multiply sharing iterations on outer loop ***/\n", " /*** Display who does which iterations for demonstration purposes ***/\n", " printf(\"Thread %d starting matrix multiply...\\n\",tid);\n", " #pragma omp for schedule (static, chunk)\n", " for (i = 0; i < NRA; i++) {\n", " printf(\"Thread=%d did row=%d\\n\",tid,i);\n", " for (j = 0; j < NCB; j++) \n", " for (k = 0; k < NCA; k++)\n", " c[i][j] += a[i][k] * b[k][j];\n", " }\n", " } /*** end parallel section ***/\n", " \n", " printf(\"\\nFirst Matrix:\\n\");\n", " for (i=0; i\n", "## Series Formul\u00e6\n", "\n", "Series representations are common useful forms for function approximations like trigonometric functions and spherical harmonics. For instance, the sine function (for $z \\in \\mathbb{C}$) may be written in a product form as\n", "$$\\frac{\\sin z}{z} = \\prod_{k=1}^{\\infty} \\left( 1 - \\frac{z^2}{k^2\\pi^2} \\right) \\text{.}$$\n", "For small $z$, this form converges quickly. This being the case, we can calculate each term independently over $P$ processes and reduce the result by multiplication." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file series.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "using namespace std;\n", "\n", "#define PI 3.1415926535\n", "\n", "// The k-th term of the infinite product series for sine.\n", "static complex term(int k, complex z) {\n", " return (complex(1.0, 0.0) - pow(z, 2)/(pow(k, 2) * PI*PI));\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " int N = 10;\n", " if (argc > 1) N = atoi(argv[1]);\n", " complex z = complex(PI*0.25, PI*0.25);\n", " if (argc > 2) z = complex(atof(argv[2]), atof(argv[3]));\n", " \n", " complex res, i_res;\n", " double res_r = 1.0, res_i = 1.0;\n", " int j;\n", " #pragma omp parallel for private(i_res, j) reduction(*:res_r,res_i) schedule(static,1)\n", " for (j = 1; j < N; j++) {\n", " i_res = term(j, z);\n", " res_r *= real(i_res);\n", " res_i *= imag(i_res);\n", " }\n", " // note that reduction over STL data types is not yet supported\n", " res = complex(res_r, res_i) * z;\n", " cout << \"Using \" << omp_get_max_threads() << \" threads and \" << N << \" terms, \";\n", " cout << \"the value of sin(\" << z << \") is \" << res << \".\\n\";\n", " cout << \"This compares to the library value of \" << sin(z) << \" with an error of \" << (sin(z)-res) << \".\\n\";\n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic series.cpp -o series" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./series 16 0.0 0.31415" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "- Write a series summation code to calculate sine:\n", "$$\n", "\\begin{eqnarray}\n", "\\sin x\n", "& = x - \\frac{x^3}{3!} + \\frac{x^5}{5!} - \\frac{x^7}{7!} + \\cdots \\\\[8pt]\n", "& = \\sum_{n=0}^\\infty \\frac{(-1)^n}{(2n+1)!}x^{2n+1} \\\\\n", "\\end{eqnarray}\n", "$$\n", "\n", "Test it with $\\sin (\\pi/3) \\approx 0.866025403784438...$." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file sine.cpp\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic sine.cpp -o sine" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./sine 1.0471975511966 #pi/3" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Monte Carlo\n", "\n", "Now, you'll probably never write your own matrix multiplication code (you should use a library instead), but you may very well find yourself needing to write a Monte Carlo code from scratch for a specific problem. Monte Carlo problems use random numbers to sample an integral (or, effectively the same thing, integrate a differential equation) and can be very competitive for some classes of problems because they are often embarassingly parallel." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Number Generation\n", "\n", "The tricky thing about Monte Carlo codes is that the random numbers really do need to be independent. 
As you are trying to explore some region of phase space thoroughly, even high-order repetitions can bias your results in subtle and difficult-to-discover ways.\n", "\n", "[![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/xkcd-random_number.png?token=35967e7728e79fb403034d834ccf92f127f4d215)](http://xkcd.com/)\n", "\n", "We have previously glossed over this problem by simply invoking a uniform random number generator (RNG), $U \\rightarrow x \\in [0,1)$, that I hacked together because it works and is concise and readable without being distracting.\n", "\n", " double uniform_rng() { return (double)rand() / (double)RAND_MAX; } // Not good\n", "\n", "Generally speaking, RNGs that are not reentrant (_i.e._, able to be accessed multiple times simultaneously and independently as well as interrupted and resumed indiscriminately) are not thread safe, and cannot be used in parallel code. For a role-playing game, say, this is probably fine. When exploring multidimensional phase space for a mission-critical activity, it's better to not take chances. The _problem_ with this RNG, as I mentioned before, is that it is not _thread safe_. That is, we can make no guarantees about whether or not subsequent numbers drawn from this distribution are in fact independent of prior numbers. (Incidentally, I found this problem to be more pronounced with MPI than with OpenMP.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file rng_bad.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "#define N 100000\n", "#define R 1\n", "\n", "double uniform_rng() { return (double)rand() / (double)RAND_MAX; }\n", "\n", "int main(int argc, char *argv[]) {\n", " int tid;\n", " srand(time(NULL));\n", " \n", " // Calculate and display a few \"random\" numbers.\n", " #pragma omp parallel private(tid)\n", " {\n", " tid = omp_get_thread_num();\n", " printf(\"%d\\t%f\\n\", tid, uniform_rng());\n", " }\n", " \n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic rng_bad.cpp -o rng_bad\n", "!./rng_bad" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next option you have is to use the RAND48 library which is a collection of POSIX standards and some associated GNU extensions for random number generation. Below we illustrate `erand48`, which is [nominally thread safe](http://www.unix.com/man-page/POSIX/3posix/erand48/) although opinions vary." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file rng_good.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "#define N 100000\n", "#define R 1\n", "\n", "int main(int argc, char *argv[]) {\n", " int tid;\n", " unsigned short rand_seed[3];\n", " srand48(time(NULL));\n", " \n", " // Calculate and display a few random numbers.\n", " #pragma omp parallel private(tid, rand_seed)\n", " {\n", " tid = omp_get_thread_num();\n", " rand_seed[0] = tid + 120 + time(NULL);\n", " rand_seed[1] = tid * 2 + time(NULL);\n", " rand_seed[2] = tid * 100 + time(NULL);\n", " printf(\"%d\\t%f\\n\", tid, erand48(rand_seed));\n", " }\n", " \n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic rng_good.cpp -o rng_good\n", "!./rng_good" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are using C++ instead of C, we can also `#include ` to access some thread-safe generators [(ref)](http://msdn.microsoft.com/en-us/library/bb982398.aspx). (This is the same as the [Boost](http://www.boost.org/) `random` library, and we will make particular use of the Mersenne twister algorithm.) We examine it first as a serial code to clarify what is going on, then as a parallel OpenMP code." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file rng_better.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "#define N 100000\n", "#define R 1\n", "\n", "int main(int argc, char *argv[]) {\n", " int tid;\n", " std::random_device rd; // non-deterministic bit generator\n", " std::mt19937 gen; // Mersenne twister URNG\n", " std::uniform_real_distribution<> dist(0,1); // transforms results to a useful distribution and range\n", " \n", " // Calculate and display a few random numbers.\n", " #pragma omp parallel private(tid, rd, gen, dist)\n", " {\n", " tid = omp_get_thread_num();\n", " gen.seed(time(NULL)+tid);\n", " \n", " printf(\"%d\\t%f\\n\", tid, dist(gen));\n", " }\n", " \n", " return 0;\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic rng_better.cpp -o rng_better\n", "!./rng_better" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the _best_ general option is to use a RNG library designed for parallel Monte Carlo simulations like [Tina's RNG](http://numbercrunch.de/trng/). Since installation is specific to a system, we will forgo a demonstration here. (Keep in mind that you can install libraries to your home folder though, even if the system administrators don't jump for a site installation.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# plotting code if you'd like to see RNGs v. 
rank.\n", "from numpy import linspace, zeros, append\n", "stdout = !./rng_better\n", "x = []\n", "y = []\n", "for out in stdout:\n", " x.append(float(out.split('\\t')[0]))\n", " y.append(float(out.split('\\t')[1]))\n", "\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "from matplotlib import cm\n", "%matplotlib inline\n", "\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16,4))\n", "axes[0].plot(x, y, 'r.')\n", "axes[1].plot(zeros(len(x)),y,'r.')\n", "fig.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monte Carlo problem\n", "\n", "That aside was longer than I had intended, but I hope it was profitable in laying out how to deal with random number generation safely and practically. Let's take a look at our Monte Carlo problem again.\n", "\n", "Consider the problem of determining, for the distance between a pair of random points, the average distance and standard deviation of that value. The following code illustrates this principle." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file monte_carlo_omp.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "#define NPTS 10\n", "\n", "using namespace std;\n", "\n", "// Termination criteria\n", "double rtol = 1e-2; // Relative tolerance\n", "long max_trials = 1e6; // Maximum number of MC trials\n", "long num_batch = 500; // Number of batches per trial\n", "\n", "/** is_converged()\n", " * Use adaptive error estimation to terminate the simulation when the error bars\n", " * at one standard deviation are below rtol.\n", " */\n", "bool is_converged(double sum_x, // Sum of x\n", " double sum_x2, // Sum of x, squared\n", " double num_trials) { // Number of trials\n", " double E_x = sum_x / num_trials;\n", " double E_x2 = sum_x2 / num_trials;\n", " double var_x = E_x2 - E_x * E_x;\n", " return (var_x/(E_x*E_x) < rtol*rtol || num_trials > max_trials);\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " int num_threads = omp_get_max_threads();\n", " double E_x, E_x2, sigma_x;\n", " \n", " // Monte Carlo result variables\n", " double all_sum_x = 0;\n", " double all_sum_x2 = 0;\n", " long all_ntrials = 0;\n", " \n", " // Private state variables\n", " int f_done; // Flag for completion\n", " int tid, // ID of this OpenMP thread\n", " tnbatch; // \n", " double sum_x, // \n", " sum_x2; // \n", " long seed; // Unique seed for Mersenne twister PRNG\n", " \n", " std::random_device rd; // non-deterministic bit generator\n", " std::mt19937 gen; // Mersenne twister URNG\n", " std::uniform_real_distribution<> dist(0,1); // transforms results to a useful distribution and range\n", " \n", " #pragma omp parallel shared(all_sum_x, all_sum_x2, all_ntrials, num_batch) private(gen, dist, seed, f_done, tid, sum_x, sum_x2)\n", " {\n", " tnbatch = num_batch;\n", " tid = omp_get_thread_num();\n", " f_done = false;\n", " \n", " #pragma omp critical\n", " seed = static_cast(std::time(0)) + tid;\n", " gen.seed(seed);\n", " \n", " do {\n", " // Run batch of experiments.\n", " sum_x = 0;\n", " sum_x2 = 0;\n", " for (int t = 0; t < tnbatch; ++t) {\n", " double x = dist(gen);\n", " double xx[2][NPTS];\n", " double d2 = -1;\n", " \n", " for (int i = 0; i < NPTS; i++) {\n", " double x_i = dist(gen);\n", " double y_i = dist(gen);\n", " xx[0][i] = x_i;\n", " xx[1][i] = y_i;\n", " // Assess distance between points.\n", " for (int j = 0; j < i; j++) {\n", " double d_x_j = xx[0][j] - x_i;\n", " 
double d_y_j = xx[1][j] - y_i;\n", " double d_ij2 = d_x_j*d_x_j + d_y_j*d_y_j;\n", " if (d2 < 0 || d_ij2 < d2)\n", " d2 = d_ij2;\n", " }\n", " }\n", " x = sqrt(d2);\n", " sum_x += x;\n", " sum_x2 += x*x;\n", " }\n", " \n", " // Update global counts and test for termination.\n", " #pragma omp critical\n", " {\n", " f_done = (f_done || is_converged(all_sum_x, all_sum_x2, all_ntrials));\n", " all_sum_x += sum_x;\n", " all_sum_x2 += sum_x2;\n", " all_ntrials += tnbatch;\n", " f_done = (f_done || is_converged(all_sum_x, all_sum_x2, all_ntrials));\n", " }\n", " } while (!f_done);\n", " }\n", " \n", " // Compute expected value and error bars at one standard deviation.\n", " E_x = all_sum_x / all_ntrials;\n", " E_x2 = all_sum_x2 / all_ntrials;\n", " sigma_x = sqrt((E_x2-E_x*E_x) / all_ntrials);\n", " \n", " // Output the value and error bar.\n", " printf(\"With %d threads, the result was $\\\\bar{x}$ = %f and $\\\\sigma_{x}$ = %f.\\n\", num_threads, E_x, sigma_x);\n", " return 0;\n", "}\n", "//rewrite this using TINA? using thread lock on master RNG? _both_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic monte_carlo_omp.cpp -o monte_carlo_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./monte_carlo_omp" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## A Few More Tips\n", "**Environment Variables**\n", "\n", "OpenMP permits you to specify characteristics of the system's behavior through environment variables as well, letting you tune your application for a given platform without having to deal with those idiosyncrasies in code.\n", "\n", "`$OMP_NUM_THREADS 8/4,2`\n", "Specifies the default number of threads to use in parallel regions. \"Having more OpenMP threads than hardware threads/cores can improve load balancing.\" ([forum](http://openmp.org/forum/viewtopic.php?f=3&t=1566)) (But watch out for resource conflicts and cache thrashing!)\n", "\n", "`$OMP_DYNAMIC TRUE/FALSE`\n", "\n", "`$OMP_NESTED TRUE/FALSE`\n", "\n", "`$OMP_PROC_BIND TRUE/FALSE`. With processor binding, the programmer instructs the operating system that a thread in the program should run on the same processor throughout the execution of the program.\n", "\n", "Performance can be enhanced with processor binding, but performance degradation will occur if multiple threads are bound to the same virtual processor. ([(Oracle/Sun)](http://docs.oracle.com/cd/E24457_01/html/E21996/aewce.html#scrolltoc)) ([Hyperthreading poor performance](https://stackoverflow.com/questions/24368576/poor-performance-due-to-hyper-threading-with-openmp-how-to-bind-threads-to-core))\n", "\n", "Binding threads to processors can result in better cache utilization, thereby reducing costly memory accesses. This is the primary motivation for binding threads to processors. ([LLNL](https://computing.llnl.gov/tutorials//openMP/index.html#Stack))\n", "\n", "`$OMP_STACKSIZE 10M`\n", "\n", "`$OMP_WAIT_POLICY ACTIVE/PASSIVE`\n", "\n", "`$OMP_SCHEDULE 'STATIC'/'GUIDED,4'` ([good explanation of when to use](http://openmp.org/forum/viewtopic.php?f=3&t=83))\n", "\n", "`$GOMP_CPU_AFFINITY \"0 1 2 3 4 5 6 7\"`. Binds threads to specific CPUs. (Differs from `$OMP_PROC_BIND` which merely binds threads to processors without switching, not specifying which ones.) Can help performance, esp with `$OMP_PROC_BIND=TRUE`. 
([forum](https://stackoverflow.com/questions/24368576/poor-performance-due-to-hyper-threading-with-openmp-how-to-bind-threads-to-core))\n", "\n", "**Sharing Memory-Managed STL Containers**\n", "\n", "It's a little tricky to use memory-managed containers (STL) in OpenMP, but some clever tricks can enable it (mainly, the judicious use of `master`, `single`, and especially `critical`). For instance, the following technique should let you create a vector in parallel and then use the `critical` keyword to populate the communal vector.\n", "\n", " std::vector vec;\n", " #pragma omp parallel\n", " {\n", " std::vector vec_private;\n", " #pragma omp for nowait //fill vec_private in parallel\n", " for(int i=0; i<100; i++) {\n", " vec_private.push_back(i);\n", " }\n", " #pragma omp critical\n", " vec.insert(vec.end(), vec_private.begin(), vec_private.end());\n", " }\n", "\n", "**Faster `for` Loops**\n", "\n", "The default scheduling behavior for OpenMP is to partition out the iterations between threads with a chunk size of `n_indices / n_threads`. Thus these threads, when they reach the end of their designated workload, just idle until all threads catch up. A better way to handle this is to invoke some automatic load balancing with `schedule(dynamic, 1)`, which forces the system to hand out new work until all of it is done at once. ([src](http://brainvis.wustl.edu/wiki/index.php/Caret7:Development/OpenMP_tricks))\n", "\n", "**Barriers**\n", "\n", "The use of `barrier` should be avoided, as it can cause inefficient overhead in many cases by making all threads wait to carry out an operation. (This goes for `ordered` as well\u2014and don't use large `critical` regions.)\n", "\n", "**`private` Variables**\n", "\n", "Apparently on certain compilers, `private` has not worked properly in all cases. Should you find this to be a problem, simply move your `private` variable declarations within the `parallel` code block, automatically rendering them private. (This should not be a problem on contemporary compilers but may be on systems from 2011 or older.) ([src](http://brainvis.wustl.edu/wiki/index.php/Caret7:Development/OpenMP_tricks))\n", "\n", " int i;\n", " #pragma omp parallel for private(i)\n", " { ... }\n", " \n", " #pragma omp parallel for\n", " for (int i = 0; i < N; i++) { ... }\n", "\n", "**Loop Granularity**\n", "\n", "Generally speaking, you will find better scalability with OpenMP if you keep loop granularity as fine as reasonable (but no finer!). What this means is that you want to use the smallest iteration that doesn't incur sufficient thread creation overhead to make it inefficient. This translates into two tips: use nested OpenMP loops as appropriate; and use the `collapse` keyword to generate collapsed `for` loops. `parallel` regions in inner loops incur overhead repeatedly and should be avoided. ([ref](https://software.intel.com/en-us/articles/openmp-related-tips))\n", "\n", " #pragma omp parallel for collapse(2)\n", " for (i = 0; i < imax; i++)\n", " for (j = 0; j < jmax; j++)\n", " a[ j + jmax*i] = 1.; \n", "\n", "You want to maximize parallel regions to get good benefit from OpenMP.\n", "\n", "**Are My Loop Iterations Independent?**\n", "\n", "Run it backwards. If it still works, they most likely are.\n", "\n", "**Load Balance**\n", "\n", "Just do it. If practical.\n", "\n", "**User-Defined Reduction Operations**\n", "\n", "As of [OpenMP 4.0](http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf), user-defined reduction operations are supported. 
This means that instead of being limited to scalar reductions (over `int`, `double`, etc.), you can now define reduction over, say, an STL `vector`, thus enabling vector mathematics to use OpenMP. (This is gold, by the way.)\n", "\n", " #pragma omp declare reduction (merge : std::vector : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \n", "\n", "(For what it may be worth, when I first wanted to test this, my installed GCC version (4.8) didn't support the feature yet. I upgraded to GCC 4.9 [which does support OpenMP 4.0](http://www.techenablement.com/gcc-4-9-1-adds-openmp-4-0-fortran-support-for-multicore/) and then tested the code which follows as `reduce.cpp`.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file reduce.cpp\n", "#include \n", "#include \n", "#include \n", "\n", "using namespace std;\n", "\n", "struct force_t { double x, y; };\n", "\n", "int main(int argc, char** argv) {\n", " #pragma omp declare reduction (+ : force_t : omp_out.x += omp_in.x, omp_out.y += omp_in.y)\n", " force_t pt;\n", " int N = 10;\n", " if (argc > 1) N = atoi(argv[1]);\n", " \n", " #pragma omp parallel for reduction(+ : pt) schedule(static,1)\n", " for (int j = 0; j < N; j++) {\n", " force_t my_pt;\n", " my_pt.x = 1.0*j;\n", " my_pt.y = 2.0*j;\n", " pt.x += my_pt.x;\n", " pt.y += my_pt.y;\n", " }\n", " cout << \"Using \" << omp_get_max_threads() << \" threads, the resultant force is (\" << pt.x << \",\" << pt.y << \").\\n\";\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++-4.9 -Wall -fopenmp -O2 -std=c++11 -pedantic reduce.cpp -o reduce" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./reduce 32" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "- Write a reduction code for the C++ STL `complex` data type addition operation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file reduce-complex.cpp\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic reduce-complex.cpp -o reduce-complex" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./reduce-complex 32" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Using Vectorization (SIMD) in Practice\n", "\n", "Vectorization (SIMD) is typically implemented by using compilation options in many modern processor/compiler workflows, however, so the scientific programmer rarely needs to directly address this operation in code. For instance, with GCC, vectorization may be included by using the following compiler flags:\n", "\n", "- `-O3` Full optimization typically turns on all SSE vector operations available on the target processor platform, but makes it extremely difficult to debug the code effectively and should only be used in testing or release codes ([src]()) ([ref](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)).\n", "\n", "- `-msse4.2` The current generation of x86-64 processors supports some SSE standard; in this case, the current extension set supported is SSE 4.2. (This enables all lower SSE extensions automatically, _e.g._, `-msse3` is not necessary with `-msse4` invoked. 
([src](http://stackoverflow.com/a/10687419)) ([ref](https://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html)))\n", "\n", "- `-mavx` The AVX (*A*dvanced *V*ector *E*xtensions) instruction set architecture is only supported on Intel Sandy Bridge and AMD Bulldozer processors and later models. It widens the SIMD registers to 256 bits (512 bits with AVX-512), increasing the performance of floating-point operations and other vectorizable array operations. ([src](https://01.org/blogs/dfineber/2014/devops-does-processor-matter-server-side-java-applications)) \n", "\n", "**Caution!** Know your architecture, however. Specialized hardware, such as the Intel Xeon Phi, may not support a particular method. The first-generation Xeon Phi coprocessor, for instance, supports neither SSE nor AVX; it is not intended for multimedia operations but for numerical work on supercomputers, and it uses its own 512-bit vector instruction set instead. ([src](https://communities.intel.com/thread/43942?start=0&tstart=0))\n", "\n", "Thus, effectively using vectorization is often simply a matter of compiling your program correctly on a supporting architecture:\n", "\n", " gcc -O3 -msse4.2 -std=c99 -Wall -o hello_world_vector hello_world_vector.c\n", "\n", "Consult the following references for more information.\n", "\n", "- [(Stack Overflow: Comparison between OpenMP and Vectorization)](https://stackoverflow.com/questions/10509872/comparison-between-openmp-and-vectorization)\n", "\n", "- [(Stack Overflow: Using OpenMP Stops GCC Auto Vectorising)](https://stackoverflow.com/questions/14861447/using-openmp-stops-gcc-auto-vectorising?rq=1)\n", "\n", "- [(Performance Essentials with OpenMP 4.0 Vectorization)](https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization)\n", "\n", "- [(SIMD Programming with GCC)](http://basis13.wordpress.com/2014/02/20/simd-programming-with-gcc-first-steps/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Assessing Performance\n", "\n", "We've gone to a fair amount of work to make our codes parallel and vectorized, but is it worth the effort? As always, it depends. We'll run a few sample codes below to assess their relative performance under different library and compilation conditions.\n", "\n", "The basic function we need to do this is actually provided by `omp.h`: `omp_get_wtime`, which returns the elapsed wall-clock time in seconds as a `double` and so is quite adequate for timing both parallel and serial code segments.\n", "\n", "Let's take a look at the series product problem from earlier." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file series-time.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "using namespace std;\n", "\n", "#define PI 3.1415926535\n", "\n", "// The k-th term of the infinite product series for sine.\n", "static complex term(int k, complex z) {\n", " return (complex(1.0, 0.0) - pow(z, 2)/(pow(k, 2) * PI*PI));\n", "}\n", "\n", "int main(int argc, char** argv) {\n", " int N = 10;\n", " if (argc > 1) N = atoi(argv[1]);\n", " complex z = complex(PI*0.25, PI*0.25);\n", " if (argc > 2) z = complex(atof(argv[2]), atof(argv[3]));\n", " \n", " double ser_start, ser_end, par_start, par_end;\n", " \n", " complex res, i_res;\n", " double res_r = 1.0, res_i = 1.0;\n", " int j;\n", " \n", " // Solve the problem serially first.\n", " ser_start = omp_get_wtime();\n", " for (j = 1; j < N; j++) {\n", " i_res = term(j, z);\n", " res_r *= real(i_res);\n", " res_i *= imag(i_res);\n", " }\n", " // note that reduction over STL data types is not yet supported\n", " res = complex(res_r, res_i) * z;\n", " ser_end = omp_get_wtime();\n", " cout << \"The serial code takes \" << (ser_end - ser_start) << \" s to get the answer \" << res << \".\\n\";\n", " \n", " // Solve the parallel problem.\n", " par_start = omp_get_wtime();\n", " #pragma omp parallel for private(i_res, j) reduction(*:res_r,res_i) schedule(static,1)\n", " for (j = 1; j < N; j++) {\n", " i_res = term(j, z);\n", " res_r *= real(i_res);\n", " res_i *= imag(i_res);\n", " }\n", " // note that reduction over STL data types is not yet supported\n", " res = complex(res_r, res_i) * z;\n", " par_end = omp_get_wtime();\n", " cout << \"The parallel code takes \" << (par_end - par_start) << \" s to get the answer \" << res << \" with \" << omp_get_max_threads() << \" threads.\\n\";\n", "}" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!g++ -Wall -fopenmp -O2 -pedantic series-time.cpp -o series-time" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!./series-time 1\n", "!./series-time 10\n", "!./series-time 100\n", "!./series-time 1000\n", "!./series-time 10000\n", "!./series-time 100000\n", "!./series-time 1000000\n", "!./series-time 10000000\n", "!./series-time 100000000" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, for small numbers of the series there is clearly a penalty to be paid for parallelizing the code (see the figure below from some data on my own machine, a 2013 MacBook Pro with quad-core 2.7 GHz Intel i7). However, as the number of terms in the series increases, there is a crossover point and parallel computation becomes more efficient from a _wall time_ perspective. (That is, for time as we observe it. In processor time, the total time spent on the calculation should be multiplied by the number of processes involved.)\n", "\n", "![](https://bytebucket.org/davis68/parallel/raw/fc24d274c088ba5087688ef25cf8e724f532319d/lessons/img/series-time.png?token=136c4e9258fff5a3714dd9fbe0f8317bb0c4dc8b)\n", "\n", "You will find this generally true for a well-crafted parallel algorithm\u2014small problems incur too much overhead to be worth the effort, but for large domains and arrays OpenMP can yield a handsome benefit." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Best Practices for Scientific Parallel Programming\n", "\n", "Now this may be kind of a funny aside, but it is a concern I often had as a student. MPI owes a lot to the local computer science and parallel programming folks on campus, so is it just a tool we prefer because it is \"in-house\"? There are tools that get talked about a lot, as well as advanced research projects that offer interesting features but are not widely adopted, like [CHARM++](http://charm.cs.uiuc.edu/) or [UPC](http://upc.lbl.gov/). No, MPI is used everywhere and really is the standard for supercomputing today.\n", "\n", "### The Process of Creating a Parallel Program\n", "\n", "**Matrix Relaxation**\n", "\n", "The best parallel algorithm is oten not the best sequential algorithm. For instance, in matrix solution, we can select from Jacobi relaxation or Gauss\u2013Seidel relaxation. [Jacobi relaxation](https://en.wikipedia.org/wiki/Jacobi_method) is a na\u00efve algorithm to solve a matrix equation iteratively by inverting only the diagonal elements and modifying an initial guess progressively until a solution within certain tolerances is found. [Gauss\u2013Seidel relaxation](https://en.wikipedia.org/wiki/Gauss%E2%80%93Seidel_method) is formally a better algorithm, since it also takes advantage of updated data points from _the current iteration_ to improve the subsequent data points in-flight as well. However, Jacobi relaxation, relying exclusively on a prior iteration, can be sensibly implemented in parallel while Gauss\u2013Seidel iteration can not.\n", "\n", "**Prefix Summation**\n", "\n", "Consider also summing vector elements of $A$ such that each resulting element $B_i$ is the sum over all elements prior to index $i$:\n", "$$\\begin{eqnarray} B_{0} = & A_{0} \\\\ B_i = & \\sum_{j = 1}^i A_j \\text{.} \\end{eqnarray}$$\n", "\n", "In serial, this algorithm requires $N = P$ stages. In parallel, this can be accomplished with a _scan_ operation in $\\log P$ stages. ([ref1](https://en.wikipedia.org/wiki/Prefix_sum)) ([ref2](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html))\n", "\n", "\n", "\n", "\n", "\n", "\n", "

Serial Version

Parallel Version

\n", "\n", "**Basic Process**\n", "\n", "- Determine algorithms\n", "- Decide on decomposition of computation into work and data units\n", "- Decide on assignment of these units to processors\n", "- Express decisions in language available for the machine\n", "\n", "**Tips**\n", "\n", "- Architect a new parallel top-level structure rather than parallelizing a sequential program\n", "- Insert existing sequential code where appropriate\n", "- Refactor a portion of the sequential code into a parallel structure\n", "- Rewrite the remainder\n", "- Express algorithms in terms of $N$ and $P$.\n", "- Visualize data flow, not control flow, to reveal dependencies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Conclusion\n", "\n", "There's a lot more to OpenMP than what we could cover in this brief introduction. Here are a few resources to learn more about OpenMP and its applications.\n", "\n", "- Open MP 4.0 ([standard](http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf)) ([examples](http://openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf))\n", "- Lawrence Livermore National Laboratory [tutorial](https://computing.llnl.gov/tutorials/openMP/)\n", "- Archer Advanced OpenMP ([workshop](http://www.archer.ac.uk/training/course-material/2014/05/AdvancedOpenMP_Oxford/))\n", "- [32 Traps for OpenMP Developers](http://www.codeguru.com/cpp/cpp/cpp_mfc/general/article.php/c15419/32-OpenMP-Traps-for-C-Developers.htm) (dated now but still good basic troubleshooting)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## References\n", "- [NCSA CI-Tutor](http://www.citutor.org/login.php).\n", "- [LLNL tutorial](https://computing.llnl.gov/tutorials/openMP/)\n", "- [UIUC CS Introduction to OpenMP](http://www.cs.uiuc.edu/class/sp06/cs232/section/disc7.pdf)\n", "- [OpenMP with MPI](http://www.slideserve.com/MikeCarlo/parallel-programming-in-c-with-mpi-and-openmp-)\n", "- [Using OpenMP](https://mitpress.mit.edu/books/using-openmp) (book)\n", "- [Guide Into Using OpenMP](http://bisqwit.iki.fi/story/howto/openmp/)\n", "- [Thoughts on OpenMP in Scientific Computing](http://www.dalkescientific.com/writings/diary/archive/2012/01/16/my_views_on_openmp.html)\n", "- [Official OpenMP Code Examples](http://openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Credits\n", "\n", "Neal Davis developed these materials for [Computational Science and Engineering](http://cse.illinois.edu/) at the University of Illinois at Urbana\u2013Champaign.\n", "\n", "\n", "This content is available under a [Creative Commons Attribution 4.0 Unported License](https://creativecommons.org/licenses/by/4.0/).\n", "\n", "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Code Snippets for Teaching" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "from numpy import pi, cos, linspace\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "mpl.rcParams['figure.figsize']=[18,3]\n", "x = linspace(pi-0.05, 2.5*pi+0.02, 1001)\n", "f = cos(x)/(2+cos(x))\n", "f[0] = f[-1] = 0\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.fill(x, f, color='cornflowerblue')\n", "ax.set_xlim(pi, 2.5*pi)\n", "ax.set_ylim(-1.0, 1.0)\n", "unit = 
0.5\n", "x_tick = np.arange(1, 2.5+unit, unit)\n", "x_label= [r\"$\\pi$\", r\"$\\frac{3\\pi}{2}$\", r\"$2\\pi$\", r\"$\\frac{5\\pi}{2}$\"]\n", "ax.set_xticks(x_tick*np.pi)\n", "ax.set_xticklabels(x_label, fontsize=16)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "x = linspace(0, 2.02, 1001)\n", "f = np.sinh(x) / np.cosh(x) - 1.0\n", "f[0] = f[-1] = 0\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.fill(x, f, color='maroon')\n", "ax.set_xlim(0.0, 2.0)\n", "ax.set_ylim(-1.2,0.2)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%file reduce-complex.cpp\n", "#include \n", "#include \n", "#include \n", "#include \n", "\n", "int main(int argc, char** argv) {\n", " #pragma omp declare reduction (+ : std::complex : omp_out += omp_in)\n", " int N = 10;\n", " if (argc > 1) N = atoi(argv[1]);\n", " std::complex val;\n", " \n", " #pragma omp parallel for reduction(+ : val) schedule(dynamic,1)\n", " for (int j = 0; j < N; j++) {\n", " std::complex my_val = std::complex(2.0,2.0);\n", " val += my_val;\n", " }\n", " std::cout << \"Using \" << omp_get_max_threads() << \" threads, the summed complex value is \" << val.real() << \"+\" << val.imag() << \"i.\\n\";\n", "}" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }