{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Binning and Aggregation\n", "\n", "We have discussed **data**, **marks**, **encodings**, and **encoding types**.\n", "The next essential piece of Altair's API is its approach to binning and aggregating data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import altair as alt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Miles_per_Gallon</th><th>Cylinders</th><th>Displacement</th><th>Horsepower</th><th>Weight_in_lbs</th><th>Acceleration</th><th>Year</th><th>Origin</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>chevrolet chevelle malibu</td><td>18.0</td><td>8</td><td>307.0</td><td>130.0</td><td>3504</td><td>12.0</td><td>1970-01-01</td><td>USA</td></tr>\n",
"    <tr><th>1</th><td>buick skylark 320</td><td>15.0</td><td>8</td><td>350.0</td><td>165.0</td><td>3693</td><td>11.5</td><td>1970-01-01</td><td>USA</td></tr>\n",
"    <tr><th>2</th><td>plymouth satellite</td><td>18.0</td><td>8</td><td>318.0</td><td>150.0</td><td>3436</td><td>11.0</td><td>1970-01-01</td><td>USA</td></tr>\n",
"    <tr><th>3</th><td>amc rebel sst</td><td>16.0</td><td>8</td><td>304.0</td><td>150.0</td><td>3433</td><td>12.0</td><td>1970-01-01</td><td>USA</td></tr>\n",
"    <tr><th>4</th><td>ford torino</td><td>17.0</td><td>8</td><td>302.0</td><td>140.0</td><td>3449</td><td>10.5</td><td>1970-01-01</td><td>USA</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "cars = data.cars()\n", "\n", "cars.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group-By in Pandas\n", "\n", "One key operation in data exploration is the *group-by*, discussed in detail in [Chaper 4](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html) of the *Python Data Science Handbook*.\n", "In short, the group-by *splits* the data according to some condition, *applies* some aggregation within those groups, and then *combines* the data back together:\n", "\n", "![Split Apply Combine figure](split-apply-combine.png)\n", "[Figure source](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the cars data, you might split by Origin, compute the mean of the miles per gallon, and then combine the results.\n", "In Pandas, the operation looks like this:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Origin\n", "Europe 27.891429\n", "Japan 30.450633\n", "USA 20.083534\n", "Name: Miles_per_Gallon, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cars.groupby('Origin')['Miles_per_Gallon'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "In Altair, this sort of split-apply-combine can be performed by passing an aggregation operator within a string to any encoding. For example, we can display a plot representing the above aggregation as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", "    y='Origin',\n", "    x='mean(Miles_per_Gallon)'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the grouping is done implicitly within the encodings: here we group only by Origin, then compute the mean over each group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-dimensional Binnings: Histograms\n", "\n", "One of the most common uses of binning is the creation of histograms. For example, here is a histogram of miles per gallon, with the bars colored by country of origin:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", "    alt.X('Miles_per_Gallon', bin=True),\n", "    alt.Y('count()'),\n", "    alt.Color('Origin')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One interesting thing about Altair's declarative approach is that we can assign these same values to different encodings to see other views of the exact same data.\n", "\n", "So, for example, if we assign the binned miles per gallon to the color, we get this view of the data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", "    color=alt.Color('Miles_per_Gallon', bin=True),\n", "    x='count()',\n", "    y='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a better appreciation of how MPG values are distributed *within* each country of origin.\n", "\n", "If we wish, we can normalize the counts on the x-axis to compare proportions directly:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", "    color=alt.Color('Miles_per_Gallon', bin=True),\n", "    x=alt.X('count()', stack='normalize'),\n", "    y='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that well over half of US cars were in the \"low mileage\" category.\n", "\n", "Changing the encoding again, let's map the color to the count instead:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", "    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),\n", "    color='count()',\n", "    y='Origin',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see the same dataset as a heatmap!\n", "\n", "This is one of the beautiful things about Altair: its API grammar reveals the relationships between different chart types. For example, a 2D heatmap encodes the same data as a stacked histogram!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Aggregates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregates can also be used with data that is only implicitly binned.\n", "For example, look at this plot of MPG over time:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", "    x='Year:T',\n", "    color='Origin',\n", "    y='Miles_per_Gallon'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The heavy overlap between points makes it difficult to see important parts of the data. We can make the trend clearer by plotting the mean in each group (here, the mean for each Year/Origin combination):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_line().encode(\n", "    x='Year:T',\n", "    color='Origin',\n", "    y='mean(Miles_per_Gallon)'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``mean`` aggregate only tells part of the story, though: Altair also provides built-in aggregates, ``ci0`` and ``ci1``, that compute the lower and upper bounds of a confidence interval on the mean.\n", "\n", "We can use ``mark_area()`` here, and specify the lower and upper bounds of the area using ``y`` and ``y2``:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_area(opacity=0.3).encode(\n", "    x='Year:T',\n", "    color='Origin',\n", "    y='ci0(Miles_per_Gallon)',\n", "    y2='ci1(Miles_per_Gallon)'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time Binnings\n", "\n", "One special kind of binning is the grouping of temporal values by aspects of the date: for example, month of year, or day of month.\n", "To explore this, let's look at a simple dataset consisting of hourly average temperatures in Seattle:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>date</th><th>temp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2010-01-01 00:00:00</td><td>39.4</td></tr>\n",
"    <tr><th>1</th><td>2010-01-01 01:00:00</td><td>39.2</td></tr>\n",
"    <tr><th>2</th><td>2010-01-01 02:00:00</td><td>39.0</td></tr>\n",
"    <tr><th>3</th><td>2010-01-01 03:00:00</td><td>38.9</td></tr>\n",
"    <tr><th>4</th><td>2010-01-01 04:00:00</td><td>38.8</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date temp\n", "0 2010-01-01 00:00:00 39.4\n", "1 2010-01-01 01:00:00 39.2\n", "2 2010-01-01 02:00:00 39.0\n", "3 2010-01-01 03:00:00 38.9\n", "4 2010-01-01 04:00:00 38.8" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temps = data.seattle_temps()\n", "temps.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we try to plot this data with Altair, we will get a ``MaxRowsError``:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "ename": "MaxRowsError", "evalue": "The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mMaxRowsError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/altair/vegalite/v4/api.py\u001b[0m in \u001b[0;36mto_dict\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 353\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdeep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0moriginal_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'data'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mUndefined\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 355\u001b[0;31m \u001b[0mcopy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_prepare_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moriginal_data\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mcontext\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 356\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 357\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0moriginal_data\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mUndefined\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/altair/vegalite/v4/api.py\u001b[0m in \u001b[0;36m_prepare_data\u001b[0;34m(data, context)\u001b[0m\n\u001b[1;32m 82\u001b[0m \u001b[0;31m# convert dataframes or objects with __geo_interface__ to dict\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 83\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'__geo_interface__'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 84\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpipe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata_transformers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 85\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 86\u001b[0m \u001b[0;31m# convert string input to a URLData\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/toolz/functoolz.py\u001b[0m in \u001b[0;36mpipe\u001b[0;34m(data, *funcs)\u001b[0m\n\u001b[1;32m 632\u001b[0m 
\"\"\"\n\u001b[1;32m 633\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mfuncs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 634\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 635\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/toolz/functoolz.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 301\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 302\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 303\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_partial\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 304\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypeError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_should_curry\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/altair/vegalite/data.py\u001b[0m in \u001b[0;36mdefault_data_transformer\u001b[0;34m(data, max_rows)\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mcurry\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdefault_data_transformer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmax_rows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m5000\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mpipe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlimit_rows\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmax_rows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmax_rows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mto_values\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/toolz/functoolz.py\u001b[0m in \u001b[0;36mpipe\u001b[0;34m(data, *funcs)\u001b[0m\n\u001b[1;32m 632\u001b[0m \"\"\"\n\u001b[1;32m 633\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mfunc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mfuncs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 634\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 635\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mdata\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/toolz/functoolz.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 301\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 302\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 303\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_partial\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 304\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypeError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_should_curry\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/lib/python3.7/site-packages/altair/utils/data.py\u001b[0m in \u001b[0;36mlimit_rows\u001b[0;34m(data, max_rows)\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[0;34m'than the maximum allowed ({}). 
'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 77\u001b[0m \u001b[0;34m'For information on how to plot larger datasets '\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 78\u001b[0;31m 'in Altair, see the documentation'.format(max_rows))\n\u001b[0m\u001b[1;32m 79\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 80\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mMaxRowsError\u001b[0m: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation" ] }, { "data": { "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_line().encode(\n", "    x='date:T',\n", "    y='temp:Q'\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8759" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(temps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aside: How Altair Encodes Data\n", "\n", "We chose to raise a ``MaxRowsError`` for datasets larger than 5000 rows after observing students using Altair: unless you think about how your data is being represented, it's quite easy to end up with **very** large notebooks in which performance will suffer.\n", "\n", "When you pass a pandas DataFrame to an Altair chart, the data is converted to JSON and stored in the chart specification. This specification is then embedded in the output of your notebook, and if you make a few dozen charts this way with a large enough dataset, it can significantly slow down your machine.\n", "\n", "So how can we get around the error? There are a few ways:\n", "\n", "1) Use a smaller dataset. 
For example, we could use Pandas to aggregate the temperatures by day:\n", "   ```python\n", "   import pandas as pd\n", "   temps = temps.groupby(pd.DatetimeIndex(temps.date).date).mean().reset_index()\n", "   ```\n", "\n", "2) Disable the ``MaxRowsError`` using\n", "   ```python\n", "   alt.data_transformers.enable('default', max_rows=None)\n", "   ```\n", "   But note that this can lead to *very* large notebooks if you're not careful.\n", "\n", "3) Serve your data from a local threaded server. The [altair data server](https://github.com/altair-viz/altair_data_server) package makes this easy.\n", "   ```python\n", "   alt.data_transformers.enable('data_server')\n", "   ```\n", "   Note that this approach may not work on some cloud-based Jupyter notebook services.\n", "\n", "4) Use a URL that points to the data source. Creating a [gist](https://gist.github.com) is a quick and easy way to store frequently used data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll take the last approach here, which is the most convenient and leads to the best performance. All of the sources in `vega_datasets` have a `url` property. 
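\n", "\n",
"As a rough illustration of why this matters, the sketch below compares the size of a specification that embeds its data with one that only references a URL. The DataFrame `small` is a made-up stand-in invented for this example, and the exact sizes will depend on your Altair version:\n",
"\n",
"```python\n",
"import json\n",
"\n",
"import altair as alt\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in data, invented for this sketch\n",
"small = pd.DataFrame({'x': range(100), 'y': range(100)})\n",
"\n",
"# Passing a DataFrame stores every row inside the chart specification\n",
"inline_spec = alt.Chart(small).mark_point().encode(x='x', y='y').to_dict()\n",
"\n",
"# Passing a URL string stores only a reference to the data\n",
"url = 'https://vega.github.io/vega-datasets/data/seattle-temps.csv'\n",
"url_spec = alt.Chart(url).mark_line().encode(x='date:T', y='temp:Q').to_dict()\n",
"\n",
"# Compare the two specifications once serialized to JSON\n",
"print(len(json.dumps(inline_spec)), len(json.dumps(url_spec)))\n",
"```\n",
"\n",
"Only the first specification should grow as rows are added, since the second stores nothing but the reference. 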
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "temps = data.seattle_temps.url" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}},\n", " 'data': {'url': 'https://vega.github.io/vega-datasets/data/seattle-temps.csv'},\n", " 'mark': 'line',\n", " '$schema': 'https://vega.github.io/schema/vega-lite/v4.0.2.json'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_line().to_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that instead of including the entire dataset only the url is used.\n", "\n", "Now lets try again with our plot" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_line().encode(\n", "    x='date:T',\n", "    y='temp:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot is a little crowded; suppose we would like to bin the data by month. We'll do this by applying a ``TimeUnit Transform`` to the date:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_point().encode(\n", "    x=alt.X('month(date):T'),\n", "    y='temp:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This might be clearer if we now aggregate the temperatures:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_bar().encode(\n", "    x=alt.X('month(date):O'),\n", "    y='mean(temp):Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also split dates in two different ways at once (here, by day of month and by month) to produce interesting views of the data:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_rect().encode(\n", "    x=alt.X('date(date):O'),\n", "    y=alt.Y('month(date):O'),\n", "    color='mean(temp):Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can look at the hourly average temperature as a function of month:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(temps).mark_rect().encode(\n", "    x=alt.X('hours(date):O'),\n", "    y=alt.Y('month(date):O'),\n", "    color='mean(temp):Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kind of transform can be quite useful when working with temporal data.\n", "\n", "More information on the ``TimeUnit Transform`` is available here: https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }