{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Tutorial 04. Working with tabular data

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have already discovered, HoloViews elements are simple wrappers around your data that provide a semantically meaningful representation. The real power of HoloViews becomes most evident when working with larger, multi-dimensional datasets, whether they are tabular like in a database or CSV file, or gridded like large datasets of images.\n", "\n", "Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having [**tidy data**](http://www.jeannicholashould.com/tidy-data-in-python.html) provides flexibility and simplicity. Here we will show how to make your data tidy as a first step, but see [hvPlot](http://hvplot.pyviz.org) for convenient ways to work with non-tidy data directly.\n", "\n", "In this tutorial all the information you have learned in the previous sections will finally really pay off. We will discover how to facet data and use different element types to explore and visualize the data contained in a real dataset, using many of the same libraries introduced earlier along with some statistics methods from SciPy:\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy.stats as ss\n", "import pandas as pd\n", "import holoviews as hv\n", "hv.extension('bokeh')\n", "%opts Curve Scatter Bars [tools=['hover']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is tabular, tidy data?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "macro_df = pd.read_csv('../data/macro.csv')\n", "macro_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For tidy data, the **columns** of the table represent **variables** or **dimensions** and the **rows** represent **observations**. \n", "\n", "The opposite of tidy data is so-called **wide** data. To see what wide data looks like, we can use the pandas ``pivot_table`` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wide = macro_df.pivot_table('unem', 'year', 'country')\n", "wide.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this wide format we can see that each column represents the unemployment figures for one country indexed, and each row a particular year. A wide table can represent data very concisely in some cases, but it is difficult to work with in practice because it does not make dimensions easily accessible for plotting or analysis. To go from wide to tidy data you can use the ``pd.melt`` function: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "melted = pd.melt(wide.reset_index(), id_vars='year', value_name='unemployment')\n", "melted.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Declaring Dimensions on a Dataset\n", "\n", "A HoloViews `Dataset` is similar to a HoloViews Element, without having any specific metadata that lets it visualize itself. A Dataset is useful for specifying a set of `Dimension`s that apply to the data, which will later be inherited by any visualizable elements that you create from the Dataset.\n", "\n", "A HoloViews Dimension is the same concept as a **dependent** or **independent** variable in mathematics. In HoloViews such variables are called value dimensions and key dimensions (respectively). In `macro_df`, the ``'country'`` and ``'year'`` are the independent variables, so when we create a Dataset we will declare those as key dimensions. The remaining columns will automatically be inferred to be value dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "macro = hv.Dataset(macro_df, ['country', 'year'])\n", "macro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the first things we'll want to do with our Dimensions is to give them more sensible labels using ``redim.label``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "macro = macro.redim.label(growth='GDP Growth', unem='Unemployment', year='Year', country='Country')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how HoloViews differs from using a plotting library directly in this case -- here we can annotate the data *once* to capture our knowledge about what those columns represent, and those annotations will then feed directly into any plots later derived from this data. In a plotting program, you would normally need to supply such metadata every single time you make a new plot, which is tedious, error prone, and discourages easy exploration. In the rest of this tutorial we'll see how we can explore and reveal this annotated data.\n", "\n", "\n", "## Groupby\n", "\n", "The great thing about a tidy tabular Dataset is that we can easily group the data by a particular variable, allowing us to plot or analyze each subset separately. Let's say for instance that we want to break the macro-economic data down by 'year'. Using the groupby method we can easily split the Dataset into subsets by year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(macro.groupby('Year'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting object has a top-level data structure called a [``HoloMap``](http://holoviews.org/reference/containers/bokeh/HoloMap.html), indexed by year. A HoloMap (like its dynamically generated equivalent [``DynamicMap``](http://holoviews.org/reference/containers/bokeh/DynamicMap.html)) is a potentially many-dimensional indexable container for Elements and Datasets, allowing them to be explored easily. \n", "\n", "However, we cannot visualize this particular HoloMap, because a Dataset has no visual representation. We haven't yet told HoloViews whether the various dependent variables here are continuous, discrete, binned, or any of the other properties the data could have that would allow a specific Element to be chosen." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mapping dimensions to elements\n", "\n", "Luckily, as soon as you choose what sort of Element you want a given column to be, you can make this data visualizable using the convenient ``.to`` method, which allows us to group the dataset and map dimensions to elements in a single step.\n", "\n", "The ``.to`` method of a Dataset takes up to four main arguments:\n", "\n", "1. The element you want to convert to\n", "2. The key dimensions (i.e., independent variables) to display\n", "3. The dependent variables to display, if any\n", "4. The dimensions to group by, if any\n", "\n", "As a first simple example let's go through such a declaration:\n", "\n", "1. We will use a ``Curve``, to declare that the variables are continuous\n", "2. Our independent variable will be the 'year'\n", "3. Our dependent variable will be 'unem'\n", "4. We will ``groupby`` the 'country'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "curves = macro.to(hv.Curve, 'year', 'unem', groupby='country')\n", "print(curves)\n", "curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look at the printed output you will see that instead of a simple ``Curve`` we got a ``HoloMap`` of ``Curve`` Elements for each country. Each Curve is now visualizable, but we haven't told HoloViews what to do with the ``Country``, and so to make the entire structure visualizable, HoloViews creates a widget where you can select which country you want to view. Additional value dimensions would result in additional widgets here.\n", "\n", "Alternatively we could also group by the year and view the unemployment rate by country as Bars instead. If we simply want to groupby all remaining key dimensions (in this case just the year) we can leave out the groupby argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts Bars [width=600 xrotation=45]\n", "bars = macro.sort('country').to(hv.Bars, 'country', 'unem')\n", "bars" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Exercise: Create a hv.HeatMap using ``macro.to``, declaring kdims 'year' and 'country', and vdims 'growth'\n", "# You'll need to declare ``width`` and/or ``xrotation`` plot options for HeatMap to make the plot readable\n", "# You can also add ``tools=['hover']`` to get more info on each cell\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Displaying distributions\n", "\n", "Often we want to summarize the distribution of values, e.g. to reveal the distribution of unemployment rates for each OECD country across all measurements. This means we want to ignore the 'year' dimension in our dataset, letting it be summarized instead. To stop HoloViews from grouping by the extra variable, we pass an empty list to the groupby argument. In this case we can easily declare the ``BoxWhisker`` directly, but omitting a key dimension from the ``groupby`` can be useful in cases when there are more dimensions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts BoxWhisker [width=800 xrotation=30] (box_fill_color=Palette('Category20'))\n", "macro.to(hv.BoxWhisker, 'country', 'growth', groupby=[])\n", "# Is equivalent to:\n", "hv.BoxWhisker(macro, kdims=['country'], vdims=['growth'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Exercise: Display the distribution of GDP growth by year using the BoxWhisker element\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Faceting dimensions\n", "\n", "Once the data has been grouped into a ``HoloMap`` as we did above, we can further use the grouping capabilities by using the ``.grid``, ``.layout`` and ``.overlay`` methods to lay the groups out on the page rather than flipping through them with a set of widgets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NdOverlay" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts Scatter [width=800 height=400 size_index='growth'] (color=Palette('Category20') size=5)\n", "%%opts NdOverlay [legend_position='left']\n", "ndoverlay = macro.to(hv.Scatter, 'year', ['unem', 'growth']).overlay()\n", "print(ndoverlay)\n", "ndoverlay.relabel('OECD Unemployment 1960 - 1990')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### GridSpace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts GridSpace [shared_yaxis=True]\n", "subset = macro.select(country=['Austria', 'Belgium', 'Netherlands', 'West Germany'])\n", "grid = subset.to(hv.Bars, 'year', 'unem').grid()\n", "print(grid)\n", "grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand what is actually going on here, let's rewrite this example in a slightly different way. Instead of using the convenient ``.to`` or ``.groupby`` methods, we can express the same thing by explicitly iterating over the countries we want to look at, selecting the subset of the data for that country using the ``.select`` and then passing these plots to the container we want.\n", "\n", "In the example above that means we ``select`` by 'country' on the macro ``Dataset``, pass the selection to ``Bars`` elements, and declare the key and value dimension to display. We then pass the dictionary of ``Bars`` elements to the ``GridSpace`` container and declare the kdim of the container as 'Country':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries = ['Austria', 'Belgium', 'Netherlands', 'West Germany']\n", "hv.GridSpace({country: hv.Bars(macro.select(country=country), 'year', 'unem') for country in countries},\n", " kdims=['Country'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, ``.to`` is much simpler and less error-prone in practice.\n", "\n", "#### NdLayout" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts Curve [width=200 height=200]\n", "ndlayout = subset.to(hv.Curve, 'year', 'unem').layout()\n", "print(ndlayout)\n", "ndlayout" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Exercise: Recreate the plot above using hv.NdLayout and using macro.select just as we did for the GridSpace above\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating\n", "\n", "Another common operation is computing aggregates (summary transformations that collapse some dimensions of the data down to scalar values like a mean or a variance). We can compute and visualize these easily using the ``aggregate`` method. The aggregate method lets you declare the dimension(s) to aggregate by and a function to aggregate with (with an optional secondary function to compute the spread if desired). Once we have computed the aggregate we can simply pass it to the [``Curve``](http://holoviews.org/reference/elements/bokeh/Curve.html) and [``ErrorBars``](http://holoviews.org/reference/elements/bokeh/ErrorBars.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%opts Curve [width=600]\n", "agg = macro.reindex(vdims=['growth']).aggregate('year', function=np.mean, spreadfn=np.std)\n", "hv.Curve(agg) * hv.ErrorBars(agg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Exercise: Display aggregate GDP growth by country, building it up in a series of steps\n", "# Step 1. First, aggregate the data by country rather than by year, using\n", "# np.mean and ss.sem as the function and spreadfn, respectively, then\n", "# make a `Bars` element from the resulting ``agg``\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Step 2: You should now have a bars plot, but with no error bars. Now add ErrorBars as above. \n", "# Hint: You'll want to make the plot wider and use an xrotation to see the labels clearly\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Onward\n", "\n", "* Go through the Tabular Data [getting started](http://holoviews.org/getting_started/Tabular_Datasets.html) and [user guide](http://holoviews.org/user_guide/Tabular_Datasets.html).\n", "* Learn about slicing, indexing and sampling in the [Indexing and Selecting Data](http://holoviews.org/user_guide/Indexing_and_Selecting_Data.html) user guide.\n", "\n", "The [next section](./05_Working_with_Gridded_Data.ipynb) shows a similar approach, but for working with gridded data, in multidimensional array formats." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }