{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Astrophysics background\n", "\n", "It is very common in Astrophysics to work with sky pixels. The sky is tassellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is [HEALPix](http://healpix.jpl.nasa.gov)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Measurements from telescopes are then represented as an array of pixels that encode the pointing of the instrument at each timestamp and the measurement output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample timeline" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numba\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity let's assume we have a sky with 50K pixels:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "NPIX = 50000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we have 50 million measurement from our instrument:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "NTIME = int(50 * 1e6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pointing of our instrument is an array of pixels, random in our sample case:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pixels = np.random.randint(0, NPIX-1, NTIME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our data are also random:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "timeline = np.random.randn(NTIME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a map of the sky with pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common operations is to sum all of our measurements in a sky map, so the value of each pixel in our sky map will be the sum of each individual measurement.\n", "The easiest way is to use the `groupby` operation in `pandas`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "timeline_pandas = pd.Series(timeline, index=pixels)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11105 -0.498438\n", "16106 -1.103009\n", "44723 0.392057\n", "16687 0.086918\n", "15197 1.946979\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "timeline_pandas.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.49 s, sys: 604 ms, total: 3.09 s\n", "Wall time: 3.1 s\n" ] } ], "source": [ "%time m = timeline_pandas.groupby(level=0).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a map of the sky with numba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would like to improve the performance of this operation using `numba`, which allows to produce automatically C-speed compiled code from pure python functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to develop a pure python version of the code, test it, and then have `numba` optimize it:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def groupby_python(index, value, output):\n", " for i in range(index.shape[0]):\n", " output[index[i]] += value[i]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "m_python = np.zeros_like(m)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 25.8 s, sys: 5 ms, total: 25.8 s\n", "Wall time: 25.8 s\n" ] } ], "source": [ "%time groupby_python(pixels, timeline, m_python)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.testing.assert_allclose(m_python, m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pure Python is slower than the `pandas` version implemented in `cython`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimize the function with numba.jit\n", "\n", "`numba.jit` gets an input function and creates an compiled version with does not depend on slow Python calls, this is enforced by `nopython=True`, `numba` would throw an error if it would not be possible to run in `nopython` mode." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "groupby_numba = numba.jit(groupby_python, nopython=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "m_numba = np.zeros_like(m)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 411 ms, sys: 7 ms, total: 418 ms\n", "Wall time: 441 ms\n" ] } ], "source": [ "%time groupby_numba(pixels, timeline, m_numba)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.testing.assert_allclose(m_numba, m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance improvement is about 50x compared to Python and up to 10x compared to Pandas, pretty good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use numba.jit as a decorator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The exact same result is obtained if we use `numba.jit` as a decorator:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@numba.jit(nopython=True)\n", "def groupby_numba(index, value, output):\n", " for i in range(index.shape[0]):\n", " output[index[i]] += value[i]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }