{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Tutorial 7. Large data

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "hvPlot and HoloViews support even high-dimensional datasets easily, and the standard mechanisms discussed already work well as long as you select a small enough subset of the data to display at any one time. However, some datasets are just inherently large, even for a single frame of data, and cannot safely be transferred for display in any standard web browser. Luckily, HoloViews makes it simple for you to use the separate [Datashader](http://datashader.readthedocs.io) library together with any of the plotting extension libraries it supports, including Bokeh and Matplotlib. Datashader is designed to complement standard plotting libraries by providing faithful visualizations for very large datasets, focusing on revealing the overall distribution, not just individual data points.\n", "\n", "Datashader uses computations accelerated using [Numba](http://numba.pydata.org), making it fast to work with datasets of millions or billions of datapoints stored in [Dask](http://dask.pydata.org/en/latest/) dataframes. Dask dataframes provide an API that is functionally equivalent to Pandas, but allows working with data out of core and scaling out to many processors across compute clusters. Here we will use Dask to load and visualize the entire earthquake dataset.\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
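To make the idea of showing the overall distribution rather than individual points concrete, the sketch below bins points into a small fixed-size grid of counts in pure Python. It is only an illustration of the general approach, not Datashader's actual implementation; the function name `aggregate_counts` and its parameters are hypothetical and chosen for this example.

```python
# Illustrative sketch only: pure-Python binning, not Datashader's implementation.
# The name aggregate_counts and its parameters are hypothetical.
def aggregate_counts(xs, ys, x_range, y_range, width, height):
    '''Count how many (x, y) points fall into each cell of a height-by-width grid.'''
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the current viewport
        # Map each coordinate to a bin index, clamping the upper edge into the last bin.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


xs = [0.1, 0.2, 0.9, 0.95, 1.0]
ys = [0.1, 0.15, 0.9, 0.8, 1.0]
agg = aggregate_counts(xs, ys, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(agg)
```

However many points are supplied, the result is always a `height` by `width` grid, so the amount of data that has to reach the browser stays fixed.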
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does datashader work?\n", "\n", "\n", "\n", "* Tools like Bokeh map **Data** (left) directly into an HTML/JavaScript **Plot** (right)\n", "* datashader instead renders **Data** into a plot-sized **Aggregate** array, from which an **Image** can be constructed then embedded into a Bokeh **Plot**\n", "* Only the fixed-sized **Image** needs to be sent to the browser, allowing millions or billions of datapoints to be used\n", "* Every step automatically adjusts to the data, but can be customized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### When not to use datashader\n", "\n", "* Plotting less than 1e5 or 1e6 data points\n", "* When every datapoint must be resolveable individually; standard Bokeh will render all of them\n", "* For full interactivity (hover tools) with every datapoint\n", "\n", "#### When to use datashader\n", "\n", "* Actual big data; when Bokeh/Matplotlib have trouble\n", "* When the distribution matters more than individual points\n", "* When you find yourself sampling, decimating, or binning to better understand the distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import holoviews as hv\n", "import dask.dataframe as dd\n", "import datashader as ds, datashader.geo\n", "\n", "from holoviews import opts\n", "from holoviews.operation.datashader import datashade, rasterize\n", "\n", "hv.extension('bokeh')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "As a first step we will load the earthquake dataset, with 2.1 million seismological events. Let's load this data using Dask to create a dataframe ``df``:\n", "\n", "
\n", " If you are low on memory (less than 8 GB) or have not downloaded the full data, you can load only a subset of the data by changing 'data/' to 'data/.data_stubs/'.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = dd.read_parquet('../data/earthquakes.parq', engine='fastparquet').persist()\n", "\n", "print('%s Rows' % len(df))\n", "print('Columns:', list(df.columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a HoloViews object\n", "\n", "In previous sections we have already seen how to declare a set of HoloViews [``Points``](http://holoviews.org/reference/elements/bokeh/Points.html) from a Pandas dataframe. Here we do the same for a Dask dataframe passed in with the desired key dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x, y = ds.geo.lnglat_to_meters(df.longitude, df.latitude)\n", "ddf = df.assign(x=x, y=y).persist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "points = hv.Points(ddf, ['x', 'y'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now simply type ``points``, and Bokeh would attempt to display this data as a standard Bokeh plot. Before doing that, however, remember that we have 2 million rows of data, and a normal web browser will be very unhappy with that amount of data! Instead of letting Bokeh see this data, let's convert it to something far more tractable using the ``datashade`` operation. This operation will aggregate the data on a 2D grid, apply shading to assign pixel colors to each bin in this grid, and build an ``RGB`` Element (just a fixed-sized image) we can safely display in a browser:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datashade(points).opts(width=700, height=500, bgcolor=\"lightgray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are the same as if you pass `datashade=True` to hvPlot. 
If you zoom in you will note that the plot rerenders depending on the zoom level, which allows the full dataset to be explored interactively even though only an image of it is ever sent to the browser. The way this works is that ``datashade`` is a dynamic operation that also declares some linked streams. These linked streams are automatically instantiated and dynamically supply the ``plot_size``, ``x_range``, and ``y_range`` from the Bokeh plot to the operation based on your current viewport as you zoom or pan:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datashade.streams" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise: Plot the earthquake locations ('longitude' and 'latitude' columns)\n", "# Warning: Don't try to display hv.Points() directly; it's too big! Use datashade() or rasterize() for any display\n", "# Optional: Change the cmap on the datashade operation to inferno\n", "\n", "# from datashader.colors import inferno\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding a tile source\n", "\n", "Using a publicly available [tiled map service](https://en.wikipedia.org/wiki/Tiled_web_map), we can display a geographic map in the background. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from holoviews.element.tiles import EsriImagery\n", "\n", "tiles = EsriImagery().opts(xaxis=None, yaxis=None, width=700, height=500)\n", "tiles * datashade(points)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise: Overlay the earthquake data on top of the Wikipedia tile source" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating with a variable\n", "\n", "So far we have simply been counting earthquakes, but our dataset is much richer than that. We have information about a number of variables, as listed above. 
Datashader provides a number of ``aggregator`` functions, which you can supply to the ``datashade`` or ``rasterize`` operation. Here we pass the ``ds.mean`` aggregator to ``rasterize`` to compute the average magnitude at each location, for events with a depth below zero:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "selected = points.select(depth=(None, 0))\n", "selected.data = selected.data.persist()\n", "tiles * rasterize(selected, aggregator=ds.mean('mag')).opts(colorbar=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise: Use the ds.min or ds.max aggregator to visualize other fields\n", "# Optional: Eliminate outliers by using select\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping by a variable\n", "\n", "Because datashading happens only just before visualization, you can use any of the techniques shown in previous sections to select, filter, or group your data before visualizing it, such as grouping it by the declared type:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dset = hv.Dataset(ddf)\n", "grouped = dset.to(hv.Points, ['x', 'y'], groupby=['type'], dynamic=True)\n", "tiles.opts(alpha=0.4, bgcolor=\"black\") * datashade(grouped).opts(\n", " opts.RGB(width=600, height=500, xaxis=None, yaxis=None, tools=['hover']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise: Facet a subset of the types as an NdLayout\n", "# Hint: You can reuse the existing grouped variable or select a subset before using the .to method\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, Datashader requires a few extra steps, but it makes it practical to work with even quite large datasets on an ordinary laptop. 
On a 16 GB machine, datasets 100x or 1000x the size of the one used here should be very practical, as illustrated at the [Datashader website](https://datashader.org).\n", "\n", "Here the examples all use point data, but Datashader also supports raster data as shown in earlier sections, along with many other data types (lines, time series, trajectories, areas, trimeshes, quadmeshes, networks, etc.).\n", "\n", "\n", "# Onwards\n", "\n", "* The [HoloViews Large Data](http://holoviews.org/user_guide/Large_Data.html) user guide explains in more detail how to work with large datasets using Datashader.\n", "* HoloViews also contains a [sample Bokeh app](http://holoviews.org/gallery/apps/bokeh/nytaxi_hover.html) that uses Datashader with an additional linked stream and works well as a starting point." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }