{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Datashading a 2.7-billion-point Open Street Map database\n", "\n", "Most [datashader](https://github.com/bokeh/datashader) examples use \"medium-sized\" datasets, because they need to be small enough to be distributed over the internet without racking up huge bandwidth charges for the project maintainers. Even though these datasets can be relatively large (such as the [1-billion point Open Street Map example](https://anaconda.org/jbednar/osm-1billion)), they still fit into memory on a 16GB laptop.\n", "\n", "Because Datashader supports [Dask](http://dask.pydata.org) dataframes, it also works well with truly large datasets, much bigger than will fit in any one machine's physical memory. On a single machine, Dask will automatically and efficiently page in the data as needed, and you can also easily distribute the data and computation across multiple machines. Here we illustrate how to work \"out of core\" on a single machine using a 22GB OSM dataset containing 2.7 billion points.\n", "\n", "The data is taken from Open Street Map's (OSM) [bulk GPS point data](https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/), and is unfortunately too large to distribute with Datashader (8.4GB compressed). The data was collected by OSM contributors' GPS devices, and was provided as a CSV file of `latitude,longitude` coordinates. The data was downloaded from their website, extracted, converted to use positions in Web Mercator format using `datashader.utils.lnglat_to_meters()`, sorted using [spacial indexing](http://datashader.org/user_guide/2_Points.html#Spatial-indexing), and then stored in a [parquet](https://github.com/dask/fastparquet) file for [fast partition-based access](https://github.com/bokeh/datashader/issues/129#issuecomment-300515690). To run this notebook, you would need to do the same process yourself to obtain `osm-3billion.parq`. Once you have it, you can follow the steps below to load and plot the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datashader as ds\n", "import datashader.transfer_functions as tf\n", "import dask.diagnostics as diag\n", "\n", "from datashader import spatial\n", "from datashader.geo import lnglat_to_meters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "df = spatial.read_parquet('./data/osm-3billion.parq')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation\n", "\n", "First, we define some spatial regions of interest and create a canvas to provide pixel-shaped bins in which points can be aggregated:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def bounds(x_range, y_range):\n", " x,y = lnglat_to_meters(x_range, y_range)\n", " return dict(x_range=x, y_range=y)\n", "\n", "Earth = ((-180.00, 180.00), (-59.00, 74.00))\n", "France = (( -12.00, 16.00), ( 41.26, 51.27))\n", "Paris = (( 2.05, 2.65), ( 48.76, 48.97))\n", "\n", "plot_width = 1000\n", "plot_height = int(plot_width*0.5)\n", "\n", "cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds(*Earth))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we aggregate the data to produce a fixed-size aggregate array. This process may take up to a minute, so we provide a progress bar using dask:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:\n", " agg = cvs.points(df, 'x', 'y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now visualize this data very quickly, ignoring low-count noise as described in the [1-billion point OSM version](https://anaconda.org/jbednar/osm-1billion):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.shade(agg.where(agg > 15), cmap=[\"lightblue\", \"darkblue\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at smaller areas is much quicker, thanks to SpatialPointsFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot(x_range, y_range):\n", " cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds(x_range, y_range))\n", " agg = cvs.points(df, 'x', 'y')\n", " return tf.shade(agg, cmap=[\"lightblue\",\"darkblue\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time plot(*France)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time plot(*Paris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Performance Profile\n", "\n", "Dask offers some tools to visualize how memory and processing power are being used during these calculations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bokeh.io import output_notebook\n", "from bokeh.resources import CDN\n", "output_notebook(CDN, hide_banner=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diag.visualize([prof, rprof]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance notes:\n", "- On a 16GB machine, most of the time is spent reading the data from disk (the purple rectangles)\n", "- Reading time includes not just disk I/O, but decompressing chunks of data\n", "- The disk reads don't release the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) (GIL), and so CPU usage (see second chart above) drops to only one core during those periods.\n", "- During the aggregation steps (the green rectangles), CPU usage on this machine with 8 hyperthreaded cores (4 full cores) spikes to nearly 700%, because the aggregation function is implemented in parallel. \n", "- The data takes up 22 GB uncompressed, but only a peak of around 6 GB of physical memory is ever used because the data is paged in as needed." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 1 }