{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3: Parallel programming with Cython\n", "\n", "Cython is an extension of the Python language which provides improved performance similar to compiled C-code. \n", "\n", "Cython is sometimes described as a hybrid-style language mixing Python and C, using C-like static type definitions to provide the Cython compiler with enough extra information to produce highly-performant code, offering performance similar to traditional compiled C code.\n", "\n", "In this mini-tutorial, we are going to look at a particular subset of the Cython project that provides parallel programming support, using the `cython.parallel` module.\n", "\n", "## A brief crash course in Cython\n", "\n", "Before we look at the `cython.parallel` module, however, we should cover some basic Cython first to get a feel of how it looks and works compared to pure Python and C code!\n", "\n", "Here is a programming example favourite - a function that computes the n-th Fibonacci number, taking `n` as its input argument.\n", "\n", "Here it is in __Python__:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def fib(n):\n", " a = 0.0\n", " b = 1.0\n", " for i in range(n):\n", " a, b = a + b, a\n", " return a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All Python code is also valid Cython code, and has the same behaviour whether it is run through the Python interpreter, or compiled with the Cython compiler. Using the Cython compiler on the above code will not have much effect on its performance however, but here is where the special __Cython__ syntax comes in. \n", "\n", "As Cython is a kind of Python-C hybrid, let's consider what the above code would look like translated into __pure C__ code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```C\n", "/* C version of the fibonacci calculation*/\n", "double fib(int n)\n", "{\n", " int i;\n", " double a = 0.0, b = 1.0, tmp;\n", " for (i=0; i<n; ++i)\n", " {\n", " tmp = a;\n", " a = a + b;\n", " b = tmp;\n", " }\n", " return a;\n", "} \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the C version, we have to define the types of variables we use (e.g. `int`, `double`), in contrast to the way Python infers the types dynamically at runtime. We also have the usual for-loop style, curly braces, and semi-colon syntax not present in Python." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the __Cython__ version, we can blend the static types of the C code with more Python-like syntax.\n", "\n", "__Cython__ version of the Fibonacci function:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note on running the examples in Jupyter notebook (or IPython)\n", "\n", "When using Cython in a Jupyter notebook, you need the following line added at the start of a Cython session:\n", "\n", "`%load_ext Cython`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%load_ext Cython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_And this line needs to be added for every notebook cell that contains Cython code:_\n", "\n", "`%%cython`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%%cython\n", "def fib(int n):\n", " cdef int i\n", " cdef double a = 0.0\n", " cdef double b = 1.0\n", " for i in range(n):\n", " a, b = a + b, a\n", " return a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Cython we can use `cdef` to define static variables, just like in C. But note we are still using the Python-syntax for `for` loops and function definitions. Think of Cython as a superset of the Python language, giving us some 'extra' Python syntax and keywords that we can use to help speed up our code and make it more C-like.\n", "\n", "We have not yet used any parallelism, but when compiled with the Cython compiler this code will offer significant speed up at runtime compared to the dynamically interpreted pure-Python version.\n", "\n", "If we were to compile this cython code and run it, we might expect something like at least an order of magnitude speed up compared to the native Python version." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running the examples as standalone scripts\n", "\n", "It is quicker and easier to follow the Cython examples using the Jupyter/IPython \"cellmagic\" commands. But if you want to run these examples as standalone scripts, there are a few extra steps to using Cython. You will need to create:\n", "\n", " - A `.pyx` file which contains any Cython code.\n", " - A `setup.py` script to build your Cython extension\n", " - A separate `.py` script which is used to `import` the Cython extension module and call any functions defined in it.\n", " \n", "I recommend spending 5 minutes reading the first part of the introductory Cython tutorial from the Cython documentation https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallelism in Cython\n", "\n", "Cython alone may offer sufficient performance gains for an application written in Python. 
However, since we are here to explore parallel programming in Python, let's turn to the `cython.parallel` module.\n", "\n",
"The `prange` function from this module can be used in place of `range` in a for loop, allowing the loop iterations to be executed in parallel:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "435\n" ] } ], "source": [ "%%cython\n", "from cython.parallel import prange\n", "\n", "# First declare the variables we are going to use with cdef:\n", "cdef int i\n", "cdef int n = 30\n", "cdef int sum = 0\n", "\n", "# Use prange instead of range\n", "for i in prange(n, nogil=True):\n", "    sum += i\n", "\n", "print(sum)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `prange` function takes extra arguments in addition to the number of items to iterate over, `n`. Here we have passed the argument `nogil=True`. This tells Cython that we can safely release Python's Global Interpreter Lock (GIL) for the body of the loop. Python's normal restriction on thread-based parallelism is relaxed for the duration of the loop, so each iteration is free to be computed in parallel, exploiting multiple CPU cores if they are available on the system." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes when we are debugging code, it's useful to be able to print out the values of variables. Let's try printing out the current loop iteration within a `prange` loop. What happens if you run this code snippet?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%cython\n", "# Try printing inside a prange loop\n", "from cython.parallel import prange\n", "\n", "cdef int i\n", "cdef int sum = 0\n", "\n", "# Use prange instead of range\n", "for i in prange(4, nogil=True):\n", "    sum += i\n", "    print(\"Current loop iter:\", i)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Yikes! Cython did not like that...\n", "\n", "Cython has another restriction when you are using the parallel functionality combined with turning off the GIL (i.e. when `nogil=True`) - you may no longer manipulate pure Python objects within the parallel block (since Python objects rely on the presence of the GIL).\n", "\n", "`sum += i` is fine because it can easily be compiled into pure C code, as we have declared the types of `i` and `sum` using `cdef`.\n", "\n", "But we have a problem here because we would like to use Python's `print` function, which, being a Python object itself, requires the GIL.\n", "\n", "Instead, Cython provides a way of calling GIL-free C functions from the C standard library. In place of the Python `print()` function, we can use the `printf()` function from C.\n", "\n", "It can be accessed by importing it like so, and then calling it within the body of our prange loop:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%%cython\n", "# Printing inside a prange loop with C's printf\n", "from cython.parallel import prange\n", "# Give me some C functions!\n", "from libc.stdio cimport printf\n", "\n", "cdef int i\n", "cdef int sum = 0\n", "\n", "# Use prange instead of range\n", "for i in prange(4, nogil=True):\n", "    sum += i\n", "    printf(\"Current loop iter: %d\\n\", i)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that this doesn't print out directly to the Jupyter notebook. 
Instead, because we are calling a lower-level C function, the output goes to the terminal from which we launched the Jupyter notebook. Have a look in the terminal and check the output. It should be something like this:\n", "\n", "```\n", "Current loop iter: 0\n", "Current loop iter: 1\n", "Current loop iter: 2\n", "Current loop iter: 3\n", "```\n", "Success! (Hopefully.) We can now use lower-level C functions when we need to perform operations such as printing inside a `nogil` block, such as a `prange` loop." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The Cython `parallel` module uses the OpenMP library (https://www.openmp.org/) for parallelisation. If you have not come across OpenMP before, do not worry at this stage - most of the main features of the cython.parallel module can be used without an in-depth knowledge of OpenMP.\n", "\n", "When threads are created to execute the code within a parallel block, it can be useful to know which thread is doing the work at any given point. We can get this information with the `cython.parallel.threadid()` function or the native OpenMP function `omp_get_thread_num()`.\n", "\n", "To use the native OpenMP function we use the special `cimport` statement to say we are importing some C-library code.\n", "\n", "### `with nogil`\n", "\n", "We introduce a new bit of Cython syntax here: `with nogil:`. This is a standard Python `with` statement, but following it with `nogil` says we want to construct a `with` block that will be free of the restrictions imposed by the GIL. However, remember it is up to us, the programmer, to make sure we only use functions that are safe to call without the GIL, otherwise Cython will give us error messages!\n", "\n", "Here's a simple block that just prints out the number of available threads and each thread ID number:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "%%cython --compile-args=-fopenmp --link-args=-fopenmp\n", "from cython.parallel cimport parallel\n", "from libc.stdio cimport printf\n", "cimport openmp\n", "\n", "# We need variables to store the thread count and each thread's id\n", "cdef int num_threads\n", "cdef int thread_id\n", "\n", "with nogil, parallel():\n", "    num_threads = openmp.omp_get_num_threads()\n", "    printf(\"Number of threads: %d\\n\", num_threads)\n", "    thread_id = openmp.omp_get_thread_num()\n", "    printf(\"Thread ID: %d\\n\", thread_id)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "***Note for Jupyter notebook users***: *Notice how we needed to give our Jupyter notebook some extra compilation information at the top of the code cell:*\n", "\n", "```\n", "%%cython --compile-args=-fopenmp --link-args=-fopenmp\n", "```\n", "\n", "*This tells the Cython compiler (which runs automatically when the code cell is run in Jupyter) that we need to compile and link against the OpenMP library.*\n", "\n", "*For users wanting to run 'standalone' Python/Cython scripts (outwith Jupyter/IPython), you will need to modify the `setup.py` file as instructed here: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html?highlight=openmp#compiling*" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "After running the above code, the output (in the terminal, because we are using the C-level `printf()` function) 
should be something like:\n", "```\n", "Number of threads: 4\n", "Thread ID: 0\n", "Number of threads: 4\n", "Thread ID: 1\n", "Number of threads: 4\n", "Thread ID: 2\n", "Number of threads: 4\n", "Thread ID: 3\n", "```\n", "Note that the output is not guaranteed to appear in the above order - the threads execute the printf calls in a seemingly random order." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### More options with `prange` and OpenMP\n", "\n", "#### The Julia set\n", "\n", "A classic example of parallel programming is calculating the [Julia set](https://en.wikipedia.org/wiki/Julia_set) (for making pretty pictures of fractals...). It is an embarrassingly parallel, CPU-bound computation, ideal for speeding up with Cython-OpenMP threads.\n", "\n", "Here is the Cython code used to calculate the Julia set. We are not going to go through what every line of the code does. I shall leave it as an \"exercise for the reader\" to determine how it works, if you are so inclined.\n", "\n", "(This would be the `.pyx` file if you were compiling this outwith the Jupyter notebook/IPython environment.)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%%cython --compile-args=-fopenmp --link-args=-fopenmp\n", "# julia.pyx\n", "from cython cimport boundscheck, wraparound\n", "from cython.parallel cimport prange\n", "\n", "import numpy as np\n", "\n", "cdef inline double norm2(double complex z) nogil:\n", "    return z.real * z.real + z.imag * z.imag\n", "\n", "\n", "cdef int escape(double complex z,\n", "                double complex c,\n", "                double z_max,\n", "                int n_max) nogil:\n", "\n", "    cdef:\n", "        int i = 0\n", "        double z_max2 = z_max * z_max\n", "\n", "    while norm2(z) < z_max2 and i < n_max:\n", "        z = z * z + c\n", "        i += 1\n", "\n", "    return i\n", "\n", "\n", "@boundscheck(False)\n", "@wraparound(False)\n", "def calc_julia(int resolution, double complex c,\n", "               double bound=1.5, double z_max=4.0, int n_max=1000):\n", "\n", "    cdef:\n", "        double step = 2.0 * bound / resolution\n", "        int i, j\n", "        double complex z\n", "        double real, imag\n", "        int[:, ::1] counts\n", "\n", "    counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)\n", "\n", "    for i in prange(resolution + 1, nogil=True,\n", "                    schedule='static', chunksize=1):\n", "        real = -bound + i * step\n", "        for j in range(resolution + 1):\n", "            imag = -bound + j * step\n", "            z = real + imag * 1j\n", "            counts[i,j] = escape(z, c, z_max, n_max)\n", "\n", "    return np.asarray(counts)\n", "\n", "\n", "@boundscheck(False)\n", "@wraparound(False)\n", "def julia_fraction(int[:,::1] counts, int maxval=1000):\n", "    cdef:\n", "        int total = 0\n", "        int i, j, N, M\n", "    N = counts.shape[0]; M = counts.shape[1]\n", "\n", "    for i in prange(N, nogil=True):\n", "        for j in range(M):\n", "            if counts[i,j] == maxval:\n", "                total += 1\n", "    return total / float(counts.size)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And this would be the code that calls the Cython code above. If you were doing this with a separate compilation script, you would need to add `import julia` or whatever you called your Cython `.pyx` extension module."
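] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For standalone use, the build script also needs the OpenMP flags mentioned earlier. Here is a minimal `setup.py` sketch - it assumes the Cython code above has been saved as `julia.pyx` and that you are using a GCC/clang-style compiler that understands `-fopenmp` (other compilers use different flags; see the Cython parallelism documentation linked above):\n", "\n",
"```python\n", "# setup.py - minimal sketch for building julia.pyx with OpenMP support\n", "from setuptools import setup, Extension\n", "from Cython.Build import cythonize\n", "\n", "ext = Extension(\n", "    \"julia\",\n", "    sources=[\"julia.pyx\"],\n", "    extra_compile_args=[\"-fopenmp\"],  # enable OpenMP when compiling\n", "    extra_link_args=[\"-fopenmp\"],     # and when linking\n", ")\n", "\n", "setup(ext_modules=cythonize([ext]))\n", "```\n", "\n",
"Build it with `python setup.py build_ext --inplace`, after which `import julia` works from an ordinary Python script."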
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "time: 4.604583\n", "julia fraction: 0.23719749320741929\n" ] } ], "source": [ "# julia.py\n", "import numpy as np\n", "from time import clock\n", "#import matplotlib.pyplot as plt\n", "\n", "t1 = clock()\n", "jl = calc_julia(2000, (0.322 + 0.05j))\n", "print(\"time:\", clock() - t1)\n", "\n", "print(\"julia fraction:\", julia_fraction(jl))\n", "\n", "# To plot a nice fractal - uncomment these lines\n", "#plt.imshow(np.log(jl))\n", "#plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The serial version of the code is as so (no `prange`s, mainly):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "%%cython\n", "# The original, non-parallelized version.\n", "\n", "from cython cimport boundscheck, wraparound\n", "\n", "import numpy as np\n", "\n", "cdef inline double norm2(double complex z) nogil:\n", " return z.real * z.real + z.imag * z.imag\n", "\n", "\n", "cdef int escape(double complex z,\n", " double complex c,\n", " double z_max,\n", " int n_max) nogil:\n", "\n", " cdef:\n", " int i = 0\n", " double z_max2 = z_max * z_max\n", "\n", " while norm2(z) < z_max2 and i < n_max:\n", " z = z * z + c\n", " i += 1\n", "\n", " return i\n", "\n", "\n", "@boundscheck(False)\n", "@wraparound(False)\n", "def calc_julia(int resolution, double complex c,\n", " double bound=1.5, double z_max=4.0, int n_max=1000):\n", "\n", " cdef:\n", " double step = 2.0 * bound / resolution\n", " int i, j\n", " double complex z\n", " double real, imag\n", " int[:, ::1] counts\n", "\n", " counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)\n", "\n", " for i in range(resolution + 1):\n", " real = -bound + i * step\n", " for j in range(resolution + 1):\n", " imag = -bound + j * step\n", " z = real + imag * 1j\n", " counts[i,j] = escape(z, c, z_max, n_max)\n", "\n", " return np.asarray(counts)\n", "\n", "\n", "@boundscheck(False)\n", "@wraparound(False)\n", "def julia_fraction(int[:,::1] counts, int maxval=1000):\n", " cdef:\n", " int total = 0\n", " int i, j, N, M\n", " N = counts.shape[0]; M = counts.shape[1]\n", "\n", " for i in range(N):\n", " for j in range(M):\n", " if counts[i,j] == maxval:\n", " total += 1\n", " return total / float(counts.size)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "time: 9.230494\n", "julia fraction: 0.23719749320741929\n" ] } ], "source": [ "import numpy as np\n", "from time import clock\n", "#import matplotlib.pyplot as plt\n", "\n", "#t1 = clock()\n", "jl = calc_julia(2000, (0.322 + 0.05j))\n", "print(\"time:\", clock() - t1)\n", "\n", "print(\"julia fraction:\", julia_fraction(jl))\n", "\n", "# To plot a nice fractal - uncomment these lines\n", "#plt.imshow(np.log(jl))\n", "#plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On my laptop I got roughly a 2x speed up with the parallel version on four CPU cores. 
(Note that we have not compared this to the pure Python version at any stage - both of these examples are using Cython.)\n", "\n", "### Cython compiler directives\n", "\n", "We are using two new decorators, which are not parallelism-specific, but are useful potential performance enhancers:\n", "\n", "`@boundscheck(False)` - do not check that array indices are within the bounds of the array\n", "\n", "`@wraparound(False)` - do not allow negative (wraparound) Python-style array indexing, e.g. `A[-1]` etc.\n", "\n", "Further information on the available directives is here: https://github.com/cython/cython/wiki/enhancements-compilerdirectives\n", "\n", "### OpenMP scheduling directives\n", "\n", "You might also have noticed we have used some extra arguments in the `prange` call:\n", "\n", "`schedule='static', chunksize=1`\n", "\n", "These arguments correspond to the OpenMP loop scheduling directives (if you are already familiar with OpenMP). They allow finer-grained control of how work is distributed among OpenMP threads. `static` means the work is divided into chunks up front and the assignment of chunks to threads does not change over the course of the computation. Other options allow the distribution to vary at runtime (e.g. `dynamic`, `guided`). The `chunksize` parameter controls how many loop iterations make up each chunk.\n", "\n", "Scheduling keywords are explained in more detail here:\n", "\n", "https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Summary and taking it further\n", "\n", " - Cython is a relatively mature Python technology (compared to, say, numba or mpi4py), providing an opportunity to do thread-based parallel programming when we can guarantee not to use Python objects in our code, but instead translate it to 'lower-level' Cython code.\n", "\n", " - Cython's `cython.parallel` module provides an interface to OpenMP - a shared-memory parallel programming model suited to multicore CPU systems (i.e. a single node of a cluster, or a multicore desktop/laptop computer). Much of the OpenMP API can be accessed through this module.\n", " \n", " - However, the user may still stick to the basics of functions like `prange` and `with nogil...` to access simple parallelism techniques without needing an in-depth knowledge of OpenMP.\n", " \n", " - Cython requires more programming effort compared to something like `numba`, but offers finer-grained control, interoperability with C and C++ code, and access to some of the power of the OpenMP library. (We have not covered interoperability with C code, but other tutorials and resources on this are available.)\n", " \n", "#### Tutorials that take it further\n", " \n", "https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html\n", "\n", "https://software.intel.com/en-us/articles/thread-parallelism-in-cython" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }