{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "\n", "

Tutorial 01. Workflow Introduction

\n", "

\n", "\n", "This notebook is intended to present an overview of PyViz as a set of slides that you can follow without having to run any code. You can use this document either as a talk, using [RISE](https://github.com/damianavila/RISE), or as a regular Jupyter notebook. Even though you do not *need* to execute any code to follow this talk, you can execute cells (with full user interaction) as long as you have a Python server running for the notebook." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "

Revealing your data (nearly) effortlessly,
at every step in your workflow

\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Workflow from data to decision\n", "\n", "
\n", "\n", "
\n", "If there's no visualization at any of these stages, you're flying blind.\n", "

\n", "But visualization is often skipped as too hard to construct, particularly for big data.\n", "

\n", "What if it were simple to visualize anything, anywhere?\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "

Good news/
\n", "Bad news


\n", "Lots of choices!\n", "
\n", "Too hard to
\n", "try them all,
\n", "learn them all, or
\n", "get them to work together.\n", "


\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "\n", "

PyViz:

\n", "

\n", "Seamless interoperability
for browser-based
viz tools\n", "

\n", "Supported by Anaconda, Inc.\n", "


\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# PyViz Goals:\n", "\n", "- Full functionality in browsers (not desktop)\n", "- Full interactivity (inside and out of plots)\n", "- Focus on Python users, not web programmers\n", "- Start with data, not coding\n", "- Work with data of any size\n", "- Exploit general-purpose SciPy/PyData tools\n", "- Focus on 2D primarily, with some 3D\n", "- Avoid entangling your data, code, and viz:\n", " * Same viz/analysis code in Jupyter, Python, HPC, ...\n", " * Widgets/apps in Jupyter, standalone servers, web pages\n", " * Jupyter as a tool, not part of the results" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Exploring Pandas Dataframes\n", "\n", "If your data is in a Pandas dataframe, it's natural to explore it using the ``.plot()`` method (based on Matplotlib). Let's look at a [dataset of the number of cases of measles and pertussis](http://graphics.wsj.com/infectious-diseases-and-vaccines/#b02g20t20w15) (per 100,000 people) over time in each state:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/diseases.csv.gz')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Just calling ``.plot()`` won't give anything meaningful, because it doesn't know what should be plotted against what:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "df.plot();" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "But with some Pandas operations we can pull out parts of the data that make sense to plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "by_year = df[[\"Year\",\"measles\"]].groupby(\"Year\").aggregate(np.sum)\n", "by_year.plot();" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here it is easy to see that the 1963 introduction of a measles vaccine brought the cases down to negligible levels." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Exploring Data with hvPlot and Bokeh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The above plots are just static images, but if you import the `hvplot` package, you can use the same plotting API to get fully interactive plots with hover, pan, and zoom in a web browser:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import hvplot.pandas\n", "\n", "by_year.hvplot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Only needed here because we use matplotlib much later on...\n", "import holoviews as hv\n", "from holoviews import opts\n", "hv.extension('bokeh', 'matplotlib', width=\"100\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here the interactive features are provided by the [Bokeh](http://bokeh.pydata.org) JavaScript-based plotting library. But what's actually returned by this call is something called a [HoloViews](http://holoviews.org) object, here specifically a HoloViews [Curve](http://holoviews.org/reference/elements/bokeh/Curve.html). HoloViews objects *display* as a Bokeh plot, but they are actually much richer objects that make it easy to capture your understanding as you explore the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import holoviews as hv\n", "vline = hv.VLine(1963).opts(color='black')\n", "\n", "m = by_year.hvplot() * vline * \\\n", " hv.Text(1963, 27000, \" Vaccine introduced\", halign='left')\n", "m" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "while still always being able to access the original data involved for further analysis:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "print(m)\n", "m.Curve.I.data.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For other plotting libraries, a given visualization that you construct is a dead end -- if you want to change it in some way, you'll need to reconstruct it from scratch with different settings. \n", "\n", "Because HoloViews objects preserve your original data, you can now do *more* with your data than you could before, including anything you could do with the raw data, plus overlaying (as above), laying out in subfigures, slicing, sampling, setting options, and many other operations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For instance, with HoloViews it's simple to break down the data in different ways. You can inspect each state individually:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "measles_agg = df.groupby(['Year', 'State'])['measles'].sum()\n", "by_state = measles_agg.hvplot('Year', groupby='State', width=500, dynamic=False)\n", "\n", "by_state * vline" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Or pull out a couple of those to put side by side:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "by_state[\"Texas\"].relabel('Texas') + by_state[\"New York\"].relabel('New York')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Or to compare four states over time by overlaying:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "states = ['New York', 'New Jersey', 'California', 'Texas']\n", "measles_agg.loc[1930:2005, states].hvplot(by='State') * vline" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Or by faceting:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "measles_agg.loc[1930:2005, states].hvplot('Year', col='State', width=200, height=150, rot=90) * vline" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Or as a different type of plot, such as a bar chart:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "measles_agg.loc[1980:1990, states].hvplot.bar('Year', by='State', rot=90)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Or with additional information, such as error bars:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df_error = df.groupby('Year').agg({'measles': [np.mean, np.std]}).xs('measles', axis=1)\n", "df_error.hvplot(y='mean') * hv.ErrorBars(df_error, 'Year').redim.range(mean=(0, None)) * vline" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If we really want to invest a lot of time in making a fancy plot, we can customize it to try to show *all* the yearly data about measles at once:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def nansum(a, **kwargs):\n", " return np.nan if np.isnan(a).all() else np.nansum(a, **kwargs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "heatmap = df.hvplot.heatmap('Year', 'State', 'measles', reduce_function=nansum,\n", " logz=True, height=500, width=900, xaxis=None, flip_yaxis=True)\n", "\n", "aggregate = hv.Dataset(heatmap).aggregate('Year', np.mean, np.std)\n", "agg = hv.ErrorBars(aggregate) * hv.Curve(aggregate).opts(xrotation=90)\n", "agg = agg.options(height=200, show_title=False)\n", "\n", "marker = hv.Text(1963, 800, u'\\u2193 Vaccine introduced', halign='left')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "(heatmap + (agg * marker).opts(width=900)).cols(1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If you prefer, you can choose Matplotlib to render your HoloViews plots, though you give up the interactive pan, zoom, and hover from Bokeh:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mpl = by_state * hv.VLine(1963).opts(color=\"black\") * \\\n", " hv.Text(1963, 1000, \" Vaccine introduced\", halign='left')\n", "hv.output(mpl, backend='matplotlib')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As you can see, these tools make it very quick to explore your data in a browser, and if you choose HoloViews+Bokeh plots, you can have full interactivity with very little code even for quite complex datasets." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Interactive statistical plots\n", "\n", "For high-dimensional datasets with additional data variables, we can compose all the above faceting methods as needed.\n", "\n", "For instance, let's look at the Iris dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from bokeh.sampledata.iris import flowers as iris\n", "\n", "iris.tail()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can now look at all these relationships at once, interactively: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "hvplot.scatter_matrix(iris, c='species')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Branching out: large data, geo data, custom controls\n", "\n", "PyViz is a modular suite of tools, and when you need capabilities not handled by Bokeh and HoloViews (and optionally hvPlot) as above, you can bring those in:\n", " \n", "* [**GeoViews**](http://geoviews.org): Visualizable geographic HoloViews objects\n", "* [**Datashader**](http://datashader.org): Rasterizing huge HoloViews objects to images quickly\n", "* [**Param**](https://param.pyviz.org): Declarative parameters\n", "* [**Panel**](https://panel.pyviz.org): Making it simple to work with widgets inside and outside of a notebook context\n", "* [**Colorcet**](https://colorcet.pyviz.org): perceptually uniform colormaps for big data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We'll look at a large(ish) dataset of 10 million taxi trips on a map, with the following caveat:\n", "\n", "
The following examples rely on dynamic updates from a live Python server, and will not update fully when viewed statically on a standard website.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import dask.dataframe as dd, geoviews as gv, cartopy.crs as crs\n", "from colorcet import palette\n", "from holoviews.operation.datashader import datashade\n", "from geoviews.tile_sources import EsriImagery\n", "\n", "topts = hv.opts.WMTS(width=700, height=600, bgcolor='black', \n", " xaxis=None, yaxis=None, show_grid=False)\n", "tiles = EsriImagery.clone(crs=crs.GOOGLE_MERCATOR).opts(topts) \n", "taxi = dd.read_parquet('../data/nyc_taxi_wide.parq').persist()\n", "colormaps = {n: palette[n] for n in ['fire','bgy','bgyw','bmy','gray','kbc']}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def view(location='dropoff', cmap=colormaps['fire'], alpha=1):\n", " pts = hv.Points(taxi, [location+'_x', location+'_y'])\n", " trips = datashade(pts, cmap=cmap)\n", " return tiles.options(alpha=alpha) * trips\n", "\n", "view()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As you can see, you can specify geo plots easily with GeoViews, and if your HoloViews objects are too big to visualize in a browser directly, you can add `datashade()` to render them into images dynamically on zooming, etc.\n", "\n", "You can also easily add widgets to control filtering, selection, and other options interactively, either here in the notebook or by putting the same code in a separate file and running it as a standalone server:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import panel as pn\n", "explorer = pn.interact(view, cmap=colormaps, location=['dropoff', 'pickup'], alpha=(0, 1.))\n", "\n", "pn.Row(pn.Column('# NYC Taxi Explorer', explorer[0]), explorer[1]).servable()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here we used the Panel `interact` function to create a simple app based on the `view` function, and then we mixed and matched some of its components to lay it out in rows and columns as you see above. \n", "\n", "In this simple app, the `view` function is called whenever *any* of the parameters change (alpha, colormap, or location), triggering a full rerender, but you can get a more responsive interface if you take the time to declare which computations depend on which parameters (see the [Deploying Bokeh Apps tutorial](./13_Deploying_Bokeh_Apps.ipynb)). \n", "\n", "Either way, the app should work the same here in the notebook (if you have a live Python process) or as a standalone server by calling `panel serve` with either the name of a Python file with the above code or simply the name of this notebook (where it will run the notebook code and serve any objects marked `.servable()`).)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As you can see, the PyViz tools let you integrate visualization into everything you do, using a small amount of code that reveals your data's properties and captures your understanding of it. The rest of these tutorials will break down each of the topics covered above, showing you step by step how to work with your own data using these tools.\n", "\n", "Thanks to all of the PyViz contributors!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }