{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from IPython.display import display, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Calling C Libraries from Numba Using CFFI\n", "\n", "Joshua L. Adelman\n", "11 February 2016\n", "\n", "**TL;DR** - The python CFFI library provides an easy and efficient way to call C code from within a function jitted (just-in-time compiled) by Numba. This makes it simple to produce fast code with functionality that is not yet available directly in Numba. As a simple demonstration, I wrap several statistical functions from the Rmath library. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background and Motivation\n", "\n", "A large fraction of the code that I write has a performance requirement attached to it. Either I'm churning through a large amount of data in an analysis pipeline, or it is part of a real-time system and needs to complete a specific calculation in a constrained amount of time. Sometimes I can rely on numpy, pandas or existing Python packages that wrap C or Fortran code under the covers to get sufficient performance. Often times though, I'm dealing with algorithms that are difficult to implement efficiently using these tools.\n", "\n", "Since I started coding primarily in Python ~6 years ago, in those instances I'd typically reach for [Cython](http://cython.org/) to either wrap something I or others wrote in C/C++/Fortran or to provide sufficient type information to my code so that Cython could generate a performant C-extension that I could call from Python. Although Cython has been a pretty rock solid solution for me, the amount of boilerplate often required and some of the strange semantic of mixing python and low-level C code often feels less than ideal. I also collaborate with people who know Python, but don't have backgrounds in C and/or haven't had enough experience with Cython to understand how it all fits together. \n", "\n", "More and more frequently, I find myself using [Numba](http://numba.pydata.org/) in instances that I had traditionally used Cython. In short, through a simple decorator mechanism, Numba converts a subset of Python code into efficient machine code using LLVM. It uses type inference so you don't have to specify the type of every variable in a function like you do in Cython to generate fast code. This subset primarily deals with numerical code operating on scalars or Numpy arrays, but that covers 95% of the cases where I need efficient code so it does not feel that limiting. That said, the most common mistake I see people making with Numba is trying to use it as a general Python compiler and then being confused/disappointed when it doesn't speed up their code. The library has matured incredibly over the last 6-12 months to the point where at work we have it deployed in a couple of critical pieces of production code. When I first seriously prototyped it maybe a year and a half ago, it was super buggy and missing a number of key features (e.g. caching of jitted functions, memory management of numpy arrays, etc). 
But now it feels stable and I rarely run into problems, although I've written a very extensive unit test suite for every bit of code that it touches.\n", "\n", "One of the limitations that I do encounter semi-regularly, though, is when I need some specialized function that is available in Numpy or Scipy, but that function has not been re-implemented in the Numba core library so that it can be called in the so-called [\"nopython\" mode](http://numba.pydata.org/numba-doc/0.23.1/glossary.html#term-nopython-mode). Basically this means that if you want to call one of these functions, you have to go through Numba's [object mode](http://numba.pydata.org/numba-doc/0.23.1/glossary.html#term-object-mode), which typically cannot generate nearly as efficient code (a short sketch of what this limitation looks like in practice appears a little further below).\n", "\n", "While there is a [proposal](http://numba.pydata.org/numba-doc/dev/proposals/extension-points.html) [under development](http://numba.pydata.org/numba-doc/dev/extending/index.html) that should allow external libraries to define an interface that makes them usable in nopython mode, it is not complete and will then require adoption within the larger Scipy/PyData communities. I'm looking forward to that day, but for now you have to fall back on other options. The first is to re-implement a function yourself using Numba. This is often possible for functionality that is small and limited in scope, but for anything non-trivial this approach can rapidly become untenable.\n", "\n", "In the remainder of this notebook, I'm going to describe a second technique that involves using [CFFI](https://cffi.readthedocs.org) to call external C code directly from within Numba jitted code. This turns out to be a really great solution if the functionality you want has already been written either in C or in a language with a C interface. It is mentioned [in the Numba docs](http://numba.pydata.org/numba-doc/0.23.1/reference/pysupported.html#cffi), but there aren't any examples that I have seen, and looking at the tests only helped a little.\n", "\n", "I had not used CFFI before integrating it with Numba for a recent project. I had largely overlooked it for two reasons: (1) Cython covered the basic use case of exposing external C code to Python, and I was already very comfortable with Cython, and (2) I had the (incorrect) impression that CFFI was mostly useful in the PyPy ecosystem. Since PyPy is a non-starter for all of my projects, I largely just ignored its existence. I'm thankfully correcting that mistake now." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rmath. It's not just for R\n", "\n", "Every once in a while I fire up R, usually through rpy2, to do something that I can't do using Statsmodels or Scikit-Learn. But for the most part I live squarely in the Python world, and my experience with R is rudimentary. So it wasn't totally surprising that I only recently discovered that the math library that underpins R, Rmath, can be built in a standalone mode without invoking R at all. In fact, the [Julia](http://julialang.org/) programming language uses Rmath for its probability distributions library and maintains a fork of the package called [Rmath-julia](https://github.com/JuliaLang/Rmath-julia).\n", "\n", "Discovering Rmath over the summer led to the following tweet (apologies for the Jupyter Notebook input cell) and a horrific amalgamation of code that worked, but was pretty difficult to maintain and extend:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Today's \n", "the sort of day that I wrote C code that used the Rmath-julia library and then called that from Python via Cython. \n", "Don't ask
— Joshua Adelman (@synapticarbors) \n", "July 11, 2015\n", "
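{ "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, before getting back to Rmath, the cell below is a small sketch of the nopython-mode limitation described in the background section above. The function names are hypothetical and purely illustrative: plain numerical code compiles fine under `nopython=True`, while a call into something like `scipy.stats` cannot be typed by Numba and raises a `TypingError` rather than producing fast code.\n", "\n", "```python\n", "import numpy as np\n", "from numba import jit\n", "from scipy.stats import norm\n", "\n", "@jit(nopython=True)\n", "def sum_of_squares(x):\n", "    # plain loops over numpy arrays compile fine in nopython mode\n", "    total = 0.0\n", "    for i in range(x.shape[0]):\n", "        total += x[i] * x[i]\n", "    return total\n", "\n", "@jit(nopython=True)\n", "def normal_logpdf_sum(x):\n", "    # scipy.stats is not supported by Numba's type inference, so calling\n", "    # this function raises a TypingError instead of producing fast code\n", "    return norm.logpdf(x).sum()\n", "\n", "x = np.random.randn(1000)\n", "sum_of_squares(x)       # compiles and runs in nopython mode\n", "# normal_logpdf_sum(x)  # raises TypingError in nopython mode\n", "```\n", "\n", "Everything the first function needs can be typed by Numba, while the second depends on `scipy.stats`, which Numba knows nothing about." ] },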
'''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I began to introduce more and more Numba into various code bases at work, I recently decided to revisit this particular bit and see if I could re-implement the whole thing using Numba + CFFI + Rmath. This would cut out the C code that I wrote, the Cython wrapper that involved a bunch of boilerplate strewn across multiple .pyx and .pxd files, and hopefully would make the code easier to extend in the future by people who didn't know C or Cython, but could write some Python and apply the appropriate Numba jit decorator.\n", "\n", "So to begin with, I vendorized the whole Rmath-julia library into our project under `externals/Rmath-julia`. I'll do the same here in this example. Now the fun begins..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the Rmath library using CFFI\n", "\n", "Since we are going to use what cffi calls the \"API-level, out-of-line\", we need to define a build script (`build_rmath.py`) that we will use to compile the Rmath source and produce an importable extension module. The notebook \"cell magic\", `%%file` will write the contents of the below cell to a file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting build_rmath.py\n" ] } ], "source": [ "%%file build_rmath.py\n", "\n", "import glob\n", "import os\n", "import platform\n", "\n", "from cffi import FFI\n", "\n", "\n", "include_dirs = [os.path.join('externals', 'Rmath-julia', 'src'),\n", " os.path.join('externals', 'Rmath-julia', 'include')]\n", "\n", "rmath_src = glob.glob(os.path.join('externals', 'Rmath-julia', 'src', '*.c'))\n", "\n", "# Take out dSFMT dependant files; Just use the basic rng\n", "rmath_src = [f for f in rmath_src if ('librandom.c' not in f) and ('randmtzig.c' not in f)]\n", "\n", "extra_compile_args = ['-DMATHLIB_STANDALONE']\n", "if platform.system() == 'Windows':\n", " extra_compile_args.append('-std=c99')\n", "\n", "ffi = FFI()\n", "ffi.set_source('_rmath_ffi', '#includeToday's \n", "the sort of day that I wrote C code that used the Rmath-julia library and then called that from Python via Cython. \n", "Don't ask
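{ "cell_type": "markdown", "metadata": {}, "source": [ "Once `_rmath_ffi` has been built, the basic pattern for calling the wrapped Rmath functions from nopython-mode Numba code looks roughly like the sketch below. It assumes the `pnorm` declaration from the `ffi.cdef` block above, and it uses the `numba.cffi_support` module that ships with Numba releases of this era: `register_module` teaches Numba how to type the functions exposed by the cffi-built module, after which they can be called directly inside a jitted function.\n", "\n", "```python\n", "import numpy as np\n", "from numba import jit, cffi_support\n", "\n", "import _rmath_ffi\n", "\n", "# Let numba know about the cffi-built extension module so that its\n", "# functions can be called from nopython-mode code.\n", "cffi_support.register_module(_rmath_ffi)\n", "\n", "# Grab the C function at import time; numba then calls it directly\n", "# inside the jitted function.\n", "pnorm = _rmath_ffi.lib.pnorm\n", "\n", "@jit(nopython=True)\n", "def normal_cdf(x, mu, sigma):\n", "    out = np.empty_like(x)\n", "    for i in range(x.shape[0]):\n", "        # pnorm(q, mean, sd, lower_tail, log_p)\n", "        out[i] = pnorm(x[i], mu, sigma, 1, 0)\n", "    return out\n", "```\n", "\n", "Calling `normal_cdf(np.linspace(-3.0, 3.0, 7), 0.0, 1.0)` should then agree with `scipy.stats.norm.cdf` evaluated at the same points, with the work happening entirely in compiled code." ] }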