{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datashading a 2.7-billion-point Open Street Map database\n", "\n", "Most [datashader](https://github.com/bokeh/datashader) examples use \"medium-sized\" datasets, because they need to be small enough to be distributed over the internet without racking up huge bandwidth charges for the project maintainers. Even though these datasets can be relatively large (such as the [1-billion point Open Street Map example](https://anaconda.org/jbednar/osm-1billion)), they still fit into memory on a 16GB laptop.\n", "\n", "Because Datashader supports [Dask](http://dask.pydata.org) dataframes, it also works well with truly large datasets, much bigger than will fit in any one machine's physical memory. On a single machine, Dask will automatically and efficiently page in the data as needed, and you can also easily distribute the data and computation across multiple machines. Here we illustrate how to work \"out of core\" on a single machine using a 22GB OSM dataset containing 2.7 billion points.\n", "\n", "The data is taken from Open Street Map's (OSM) [bulk GPS point data](https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/), and is unfortunately too large to distribute with Datashader (8.4GB compressed). The data was collected by OSM contributors' GPS devices, and was provided as a CSV file of `latitude,longitude` coordinates. The data was downloaded from their website, extracted, converted to use positions in Web Mercator format using `datashader.utils.lnglat_to_meters()`, sorted using [spacial indexing](http://datashader.org/user_guide/2_Points.html#Spatial-indexing), and then stored in a [parquet](https://github.com/dask/fastparquet) file for [fast partition-based access](https://github.com/bokeh/datashader/issues/129#issuecomment-300515690). To run this notebook, you would need to do the same process yourself to obtain `osm-3billion.parq`. Once you have it, you can follow the steps below to load and plot the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import datashader as ds\n", "import datashader.transfer_functions as tf\n", "import dask.diagnostics as diag\n", "\n", "from datashader import spatial\n", "from datashader.geo import lnglat_to_meters" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.35 s, sys: 20.9 ms, total: 1.37 s\n", "Wall time: 1.38 s\n" ] } ], "source": [ "%%time\n", "df = spatial.read_parquet('./data/osm-3billion.parq')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | x | \n", "y | \n", "
---|---|---|
0 | \n", "-17219646.0 | \n", "-19219556.0 | \n", "
1 | \n", "-17382392.0 | \n", "-18914976.0 | \n", "
2 | \n", "-16274360.0 | \n", "-17538778.0 | \n", "
3 | \n", "-17219646.0 | \n", "-16627029.0 | \n", "
4 | \n", "-16408889.0 | \n", "-16618700.0 | \n", "
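{ "cell_type": "markdown", "metadata": {}, "source": [ "The remaining imports above (`ds`, `tf`, and `diag`) are what the datashading step itself uses. Below is a minimal sketch of that step, assuming standard `Canvas`/`shade` usage; the canvas size, colormap, and profiler interval are illustrative choices, not values from the original notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of the out-of-core aggregation step: Dask streams the parquet\n", "# partitions through memory while datashader accumulates per-pixel\n", "# counts. Plot size, colormap, and profiler interval are illustrative.\n", "with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:\n", "    canvas = ds.Canvas(plot_width=1000, plot_height=1000)\n", "    agg = canvas.points(df, 'x', 'y')\n", "tf.shade(agg, cmap=['lightblue', 'darkblue'], how='log')" ] }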
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"
\\n\"+\n \"