{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datashading a 1-billion-point Open Street Map database\n", "\n", "The [osm.ipynb](https://anaconda.org/jbednar/osm) example shows how to use Dask's out of core support for working with datasets larger than will fit in memory. Here, we consider the largest dataset that will fit comfortably on a 16GB machine, namely 1 billion x,y points (at 32-bit float resolution). You can download it from our site:\n", "\n", "http://s3.amazonaws.com/datashader-data/osm-1billion.snappy.parq.zip\n", "\n", "and unzip it into the `../data/` directory, but because of its size it's not downloaded with the other examples by default, and please try to limit the number of times you download it so that we can continue making it available.\n", "\n", "The data is taken from Open Street Map's (OSM) [bulk GPS point data](https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/). The data was collected by OSM contributors' GPS devices, and was provided as a CSV file of `latitude,longitude` coordinates. The data was downloaded from their website, extracted, converted to use positions in Web Mercator format using `datashader.utils.lnglat_to_meters()`, and then stored in a [parquet](https://github.com/dask/fastparquet) file for [faster disk access](https://github.com/bokeh/datashader/issues/129#issuecomment-300515690). As usual, loading it into memory will be the most time-consuming step, by far:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "import datashader as ds\n", "import datashader.transfer_functions as tf\n", "\n", "from colorcet import fire" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "df = dd.io.parquet.read_parquet('../data/osm-1billion.snappy.parq')\n", "df = df.persist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(len(df))\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation\n", "\n", "To visualize this data, we first create a canvas to provide pixel-shaped bins in which points can be aggregated, and then aggregate the data to produce a fixed-size aggregate array. For data already in memory, this aggregation is the only expensive step, because it requires reading every point in the dataset. For interactive use we are most interested in the performance for a *second* aggregation, after all of the components and data are in memory and the user is interacting with a plot dynamically. So let's first do a dry run, which will be slightly slower, then do the full aggregation to show what performance can be expected during interactive use with a billion datapoints:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bound = 20026376.39\n", "bounds = dict(x_range = (-bound, bound), y_range = (int(-bound*0.4), int(bound*0.6)))\n", "plot_width = 900\n", "plot_height = int(plot_width*0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "bounds2 = dict(x_range = (-bound/2, bound/2), y_range = (int(-bound*0.2), int(bound*0.3)))\n", "cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds2)\n", "agg2=cvs.points(df, 'x', 'y', ds.count())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)\n", "agg = cvs.points(df, 'x', 'y', ds.count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, once everything has been loaded, aggregating this billion-point dataset takes about a second on a 16GB Macbook laptop. The remaining steps for display complete in milliseconds, so those times are not shown below.\n", "\n", "\n", "### Transfer Function\n", "\n", "To actually see the data, we need to render an image of the aggregate array. Let's map small (but nonzero) counts to light blue, the largest counts to dark blue, and interpolate according to a logarithmic function in between:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.shade(agg, cmap=[\"lightblue\", \"darkblue\"], how='log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's some odd, low-count, nearly-uniform noise going on in the tropics. It's worth trying to figure out what that could be, but for now we can filter it out quickly from the aggregated data using the `where` method from xarray. We'll also switch to the parameter-free histogram equalization method for mapping to colors using various colormaps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.shade(agg.where(agg > 15), cmap=[\"lightblue\", \"darkblue\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf.set_background(tf.shade(agg.where(agg > 15), cmap=fire),\"black\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result covers the most-populated areas of the globe, with Europe apparently having particularly many OpenStreetMap contributors. The long great-circle paths visible on the white background are presumably flight or boat trips, from devices that log their GPS coordinates more than 15 times during the space of one pixel in this plot (or else they would have been eliminated by the `where` call)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive plotting\n", "\n", "As long as the data fits into RAM on your machine, performance should be high enough to explore the data interactively, with only a brief pause after zooming or panning to a new region of the data space:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import holoviews as hv\n", "import dask.dataframe as dd\n", "from holoviews.operation.datashader import datashade, dynspread\n", "hv.extension('bokeh')\n", "\n", "sizes = dict(width=plot_width,height=plot_height)\n", "opts = dict(xaxis=None, yaxis=None, bgcolor=\"black\", **sizes)\n", "points = hv.Points(df, ['x', 'y']).redim.range(x=bounds[\"x_range\"],y=bounds[\"y_range\"])\n", "dynspread(datashade(points, cmap=[\"darkblue\", \"lightcyan\"], **sizes)).options(**opts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the plot will be refreshed only if you are running the notebook with a live Python server; for static notebooks like those at anaconda.org you can zoom in but the plot will never be re-rendered at the higher resolution needed at that zoom level." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 1 }