{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An advanced Mandelbrot Set explorer\n", "\n", "## Introduction\n", "\n", "I came across some very fresh and exciting data science tools in the past few days and felt very inspired to try these.\n", "\n", "One package that I was interested in a lot for a long time is [bokeh](http://bokeh.pydata.org/en/latest/), as it almost perfectly matches the jupyter notebook with its interactive capabilities and I definitly wanted to try an alternative to the excellent [matplotlib](http://matplotlib.org/) package that I am using all the time. Another real neat thing is that *bokeh* based documents stay alive even when they have been persisted into a plain html file, as all data necesary for visualization is contained within that file.\n", "\n", "Another package is [datashader](https://github.com/bokeh/datashader). The examples for visualizing the [taxi travels in Manhattan](https://github.com/bokeh/datashader/blob/master/examples/nyc_taxi.ipynb) as well as the [census data for the U.S.](https://github.com/bokeh/datashader/blob/master/examples/census.ipynb) are very impressive. So I definitly wanted to play around with that as well. \n", "\n", "Finally, as I am one of the big data guys, I was really impressed by [dask](http://dask.pydata.org/en/latest/) and its [distributed](https://github.com/dask/distributed) backend, which allows to spread the work necessary for large computations over a cluster of nodes via a very common and well known mapping api.\n", "\n", "So, how to start exploring all this?\n", "\n", "Following the provided examples is usually one way to learn about the capabilities. However, often they are sort of specialized and minimalist, not directly supporting a transfer of the examples to current real world tasks. What helps me usually is to try to solve a well known problem using the new tools in order to learn.\n", "\n", "As I've done some work on showing how to accelerate numerical computations within python lately using [numba](http://numba.pydata.org/) and did this using the example of speeding up the calculation of the [Mandelbrot Set](https://en.wikipedia.org/wiki/Mandelbrot_set), it was quite obvious that this was the right path here as well.\n", "\n", "So I ended up with a jupyter notebook to calculate the famous fractal again, this time using `bokeh` for displaying the image and adding interaction, the `datashader` as a callback to the zoom events performing the `numba` accelarated calculations for every new selection and the `distributed` backend to spread the work over a cluster of 160 CPUs.\n", "\n", "That was fun!\n", "\n", "And here is how this works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation\n", "\n", "It obviously all starts with importing the necessary modules. There is `numpy` and `numba` for the maths, `bokeh` and `datashader` for the plotting and `distributed` for scaling out work." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\") {\n", " window._bokeh_onload_callbacks = [];\n", " }\n", "\n", " function run_callbacks() {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " delete window._bokeh_onload_callbacks\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };\n", "\n", " var js_urls = ['https://cdn.pydata.org/bokeh/release/bokeh-0.12.0.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.0.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-compiler-0.12.0.min.js'];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " Bokeh.$(\"#cb7869ca-13a0-464e-aab2-2e886da699e0\").text(\"BokehJS successfully loaded\");\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.0.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.0.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.0.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.0.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from functools import partial\n", "\n", "import numpy as np\n", "import numba\n", "\n", "from bokeh.plotting import figure\n", "from bokeh.io import output_notebook, show\n", "\n", "import datashader as ds\n", "import datashader.transfer_functions as tf\n", "from datashader.bokeh_ext import InteractiveImage\n", "\n", "import xarray as xr\n", "\n", "from distributed import Executor\n", "\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next are some definitions. The following coordinates define the standard view of the *Mandelbrot Set*, with the x coordinates defining the boundaries on the real axis and the y coordinates for the complex one. A rather large value for the maximum number of iterations ensures enough work to be distributed on a cluster of nodes and enough detail when zooming in. Feel free to adjust this number to suit your environment." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "xmin=-2.5\n", "xmax=1.0\n", "ymin=-1.25\n", "ymax=1.25\n", "\n", "maxiterations=np.uint64(5000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The executor is the central instance organizing a pool of workers. It consists of a bunch of services, with two of them being most prominent. The first one is the scheduler which delegates the work packages to the cluster and the second is the worker instance running on each node, which is doing all the heavy computations. Whilst workers may come and go within a `distributed` environment the scheduler needs to be up and running, and remain accessible all the time in order to keep the system operable. The `distributed` package comes with a very useful helper script called `dask-ssh` which spawns all necessary components on the nodes, given you have a transparent (passwordless) access to all nodes.\n", "The command\n", "\n", "```bash \n", "dask-ssh linpl00{00,01,02,03,04,05,06,07,08,09,10} --nprocs 16\n", "```\n", "\n", "for example starts the scheduler instance on the first node of the cluster named `linpl0001` and 16 instances of workers on all ten nodes of the cluster resulting in 160 workers all in all. Adjusting the `--nprocs` parameter helps if the code you are distributing is not able to release the GIL. This trick is pretty comparable to using the `multiprocessing` module instead of native `threading`.\n", "\n", "If you're not able to use a cluster of nodes as a backend for the computations, simply call the Executor with no arguments. This will start a scheduler and a worker instance in a separate thread on your computer. But, this will of course remove one dimension of fun from this whole thing. Keep that in mind." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "