{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bokeh Charts Deep Dive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Charts Purpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Charts provides a few things, but the core purpose is to **help you understand your data by separating it into unique visual groups *before* you know what those groups are**.\n", "\n", "What Charts ultimately does is build Glyph Renderers to represent the groups of data and add them to a Chart.\n", "\n", "It does this by adapting data into a consistent format, understanding metadata about the specific chart, deriving the unique groups of data based on the chart configuration, then assigning attributes to each group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The topics covered in this deep dive are:\n", "\n", "1. **Data** - a consistent chart metadata-aware data structure for charts\n", "2. **Attributes** - fault resistant assignment of chart attributes to distinct groups of data\n", "3. **Builders** - glyph renderer factories (data driven glyphs)\n", "4. **New Charts** - how to build custom charts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Charts interface is built around the use of a table data structure, pandas. There are a few primary reasons for doing this.\n", "\n", "1. Other smart people have already considered how a graphics system would work on this type of data (Grammar of Graphics)\n", "2. A typical analyst will likely encounter data that is in this format, or can easily be transformed into it.\n", "3. The table data structure can provide useful information about the data, which aids in building the chart\n", "4. Interactive app generation can be greatly simplified" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "from bokeh.charts import Scatter, show, output_notebook\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is table-like data?\n", "Data that contains like data grouped and labeled by a column and new records listed as rows, or data that can be inferred into this format. This format is pretty flexible and an analyst will commonly encounter this format in databases, so we will go further into certain styles of structuring the data and the reasoning behind them.\n", "\n", "When encountering databases, you will likely be accessing one to many tables, joining them, then performing some exploratory analysis. This joined dataset will likely contain columns with dates and strings that describe the values in the record. These columns which uniquely identify the columns that contain numerical measurements are often called `dimensions`, and the numerical measurements are called `values`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Formats Transformed" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from bokeh.charts.data_source import ChartDataSource" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Array (list/pd.series/np.array)\n", "Arrays are assumed to be like values that would be assigned to a column. Passing in `Chart([1, 2], [3, 4])` would be creating the following data source internally for the chart to use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ds = ChartDataSource.from_data([1, 2], [3, 4])\n", "ds.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In cases where there isn't enough metadata, column names will be automatically assigned to the array-like data in order received." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Record Data\n", "This data would be encountered more when dealing with json data, or serialized objects. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "records = [\n", " {'name': 'bob', 'age': 25},\n", " {'name': 'susan', 'age': 22}\n", "]\n", "\n", "ds = ChartDataSource.from_data(records)\n", "ds.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Structuring Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example**:\n", "\n", "Imagine we had some sample data. The data is sampled over *time* at two different *weather stations*, each with a status *raining*. The sampled values are **temperature**. Dimensions are in *italics* and values are **bolded**. We will look at two different approaches to storing this data.\n", "\n", "For the example we will assume the two weather stations each record a temperature on three different days, where it is raining on the first day for station a, and the second day for station b." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tall Data\n", "\n", "*(preferred format for Charts)*\n", "\n", "Table-like data can be thought of as observations about the world or some process or system. As new observations are added, you will want to add new rows, and avoid having to add new columns, because you must carefully consider all other rows when that occurs.\n", "\n", "For the example, a tall data source will minimize the number of columns that contain like information. For Tall data, the format will be the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = dict(sample_time=['2015-12-01', '2015-12-02', '2015-12-03', '2015-12-01', '2015-12-02', '2015-12-03'],\n", " temperature=[68, 67, 77, 45, 50, 43],\n", " location=['station a', 'station a', 'station a', 'station b', 'station b', 'station b'],\n", " raining=[True, False, False, False, True, False]\n", " )\n", "tall = pd.DataFrame(data)\n", "tall.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wide Data\n", "\n", "*(supported for some Charts, or with transformations)*\n", "\n", "It is often in scientific use cases or in pivoted data that you will find where multiple columns contain the same class of measurement. For instance, when sampling temperature for two weather stations, wide data would contain the weather_station dimension encoded into the column names. This is simple if we only have temperature data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = dict(sample_time=['2015-12-01', '2015-12-02', '2015-12-03'],\n", " station_a_temp=[68, 67, 77],\n", " station_b_temp=[45, 50, 43]\n", " )\n", "wide = pd.DataFrame(data)\n", "wide.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if we need to add the raining flag, we must add two new columns, because the flag can be different between the two stations, and the columns are containing the station dimension." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['station_a_raining'] = [True, False, False]\n", "data['station_b_raining'] = [False, False, False]\n", "\n", "wide = pd.DataFrame(data)\n", "wide.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why is Tall Data Preferred?\n", "\n", "Tall data is better suited towards interactive analysis. While wide data is fine for simple data that can only be viewed against one or two dimensions, highly dimensional data will be much easier to use to reconfigure charts and/or adding new values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tall_line1 = Scatter(tall, x='sample_time', y='temperature', color='location', legend=True)\n", "tall_line2 = Scatter(tall, x='sample_time', y='temperature', color='raining', legend=True)\n", "\n", "show(tall_line1)\n", "show(tall_line2)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Imagine if a new station was added. There are two problems with the wide data:\n", "\n", "1. More modifications required to the data structure and function call.\n", " * A value must be listed for all stations, no matter if there exists a sample for the date\n", " * Any reference on the function to each station would require adding the new station name. This list of names could grow quite large and you may have columns that aren't stations, and aren't temperature measurements.\n", " \n", "2. Difficult to build interactive applications when required to reference multiple series. Must handle multiple selections for some of the fields, which adds complexity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming Data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The ChartDataSource can process any Charts data transformations that are added when creating the chart." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from bokeh.charts import bins, blend\n", "\n", "data = {\n", " 'temperature_a': [32, 23, 95, 90, 23, 58, 90],\n", " 'temperature_b': [45, 34, 23, 88, 67, 34, 23]\n", "}\n", "\n", "ds = ChartDataSource.from_data(data, x=blend('temperature_a', 'temperature_b')) \n", "\n", "ds.df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ds = ChartDataSource.from_data(data, x=bins('temperature_a'))\n", "\n", "ds.df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ds = ChartDataSource.from_data(data, x=bins('temperature_a'), y=bins('temperature_b'))\n", "\n", "ds.df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }