{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simple Charts: Core Concepts\n", "\n", "The goal of this section is to teach you the core concepts required to create a basic Altair chart; namely:\n", "\n", "- **Data**, **Marks**, and **Encodings**: the three core pieces of an Altair chart\n", "\n", "- **Encoding Types**: ``Q`` (quantitative), ``N`` (nominal), ``O`` (ordinal), ``T`` (temporal), which drive the visual representation of the encodings\n", "\n", "- **Binning and Aggregation**: which let you control aspects of the data representation within Altair.\n", "\n", "With a good understanding of these core pieces, you will be well on your way to making a variety of charts in Altair." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start by importing Altair, and (if necessary) enabling the appropriate renderer:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Basic Altair Chart\n", "\n", "The essential elements of an Altair chart are the **data**, the **mark**, and the **encoding**.\n", "\n", "The format by which these are specified will look something like this:\n", "\n", "```python\n", "alt.Chart(data).mark_point().encode(\n", " encoding_1='column_1',\n", " encoding_2='column_2',\n", " # etc.\n", ")\n", "```\n", "\n", "Let's take a look at these pieces, one at a time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Data\n", "\n", "Data in Altair is built around the [Pandas Dataframe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).\n", "For this section, we'll use the cars dataset that we saw before, which we can load using the [vega_datasets](https://github.com/altair-viz/vega_datasets) package:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0chevrolet chevelle malibu18.08307.0130.0350412.01970-01-01USA
1buick skylark 32015.08350.0165.0369311.51970-01-01USA
2plymouth satellite18.08318.0150.0343611.01970-01-01USA
3amc rebel sst16.08304.0150.0343312.01970-01-01USA
4ford torino17.08302.0140.0344910.51970-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "cars = data.cars()\n", "\n", "cars.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in Altair is expected to be in a [tidy format](http://vita.had.co.nz/papers/tidy-data.html); in other words:\n", "\n", "- each **row** is an observation\n", "- each **column** is a variable\n", "\n", "See [Altair's Data Documentation](https://altair-viz.github.io/user_guide/data.html) for more information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The *Chart* object\n", "\n", "With the data defined, you can instantiate Altair's fundamental object, the ``Chart``. 
Fundamentally, a ``Chart`` is an object which knows how to emit a JSON dictionary representing the data and visualization encodings, which can be sent to the notebook and rendered by the Vega-Lite JavaScript library.\n", "Let's take a look at what this JSON representation looks like, using only the first row of the data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}},\n", " 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},\n", " 'mark': 'point',\n", " '$schema': 'https://vega.github.io/schema/vega-lite/v4.0.2.json',\n", " 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',\n", " 'Miles_per_Gallon': 18.0,\n", " 'Cylinders': 8,\n", " 'Displacement': 307.0,\n", " 'Horsepower': 130.0,\n", " 'Weight_in_lbs': 3504,\n", " 'Acceleration': 12.0,\n", " 'Year': '1970-01-01T00:00:00',\n", " 'Origin': 'USA'}]}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cars1 = cars.iloc[:1]\n", "alt.Chart(cars1).mark_point().to_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point the chart includes a JSON-formatted representation of the dataframe, the type of mark to use, and some metadata that is included in every chart output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Mark\n", "\n", "We can decide what sort of *mark* we would like to use to represent our data.\n", "As in the previous example, we can choose the ``point`` mark to represent each row of the data as a point on the plot:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a visualization with one point per row in the data, though it is not particularly interesting: all the points are stacked right on top of each other!\n", "\n", "It is useful to again examine the JSON output here:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}},\n", " 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},\n", " 'mark': 'point',\n", " '$schema': 'https://vega.github.io/schema/vega-lite/v4.0.2.json',\n", " 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',\n", " 'Miles_per_Gallon': 18.0,\n", " 'Cylinders': 8,\n", " 'Displacement': 307.0,\n", " 'Horsepower': 130.0,\n", " 'Weight_in_lbs': 3504,\n", " 'Acceleration': 12.0,\n", " 'Year': '1970-01-01T00:00:00',\n", " 'Origin': 'USA'}]}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars1).mark_point().to_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that, in addition to the data, the specification includes information about the mark type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a number of available marks that you can use; some of the more common are the following:\n", "\n", "* ``mark_point()`` \n", "* ``mark_circle()``\n", "* ``mark_square()``\n", "* ``mark_line()``\n", "* ``mark_area()``\n", "* ``mark_bar()``\n", "* ``mark_tick()``\n", "\n", "You can get a complete list of ``mark_*`` methods using Jupyter's tab-completion feature: in any cell just type:\n", "\n", " alt.Chart.mark_\n", " \n", "followed by the tab key to see the available options." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encodings\n", "\n", "The next step is to add *visual encoding channels* (or *encodings* for short) to the chart. An encoding channel specifies how a given data column should be mapped onto the visual properties of the visualization.\n", "Some of the more frequently used visual encodings are listed here:\n", "\n", "* ``x``: x-axis value\n", "* ``y``: y-axis value\n", "* ``color``: color of the mark\n", "* ``opacity``: transparency/opacity of the mark\n", "* ``shape``: shape of the mark\n", "* ``size``: size of the mark\n", "* ``row``: row within a grid of facet plots\n", "* ``column``: column within a grid of facet plots\n", "\n", "For a complete list of these encodings, see the [Encodings](https://altair-viz.github.io/user_guide/encoding.html) section of the documentation.\n", "\n", "Visual encodings can be created with the `encode()` method of the `Chart` object. For example, we can start by mapping the `y` axis of the chart to the `Origin` column:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " y='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a one-dimensional visualization representing the values taken on by `Origin`, with the points in each category on top of each other.\n", "As above, we can view the JSON data generated for this visualization:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}},\n", " 'data': {'name': 'data-36a712fbaefa4d20aa0b32e160cfd83a'},\n", " 'mark': 'point',\n", " 'encoding': {'y': {'type': 'nominal', 'field': 'Origin'}},\n", " '$schema': 'https://vega.github.io/schema/vega-lite/v4.0.2.json',\n", " 'datasets': {'data-36a712fbaefa4d20aa0b32e160cfd83a': [{'Name': 'chevrolet chevelle malibu',\n", " 'Miles_per_Gallon': 18.0,\n", " 'Cylinders': 8,\n", " 'Displacement': 307.0,\n", " 'Horsepower': 130.0,\n", " 'Weight_in_lbs': 3504,\n", " 'Acceleration': 12.0,\n", " 'Year': '1970-01-01T00:00:00',\n", " 'Origin': 'USA'}]}}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars1).mark_point().encode(\n", " y='Origin'\n", ").to_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is the same as above with the addition of the `'encoding'` key, which specifies the visualization channel (`y`), the name of the field (`Origin`), and the type of the variable (`nominal`).\n", "We'll discuss these data types in a moment." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualization can be made more interesting by adding another channel to the encoding: let's encode the `Miles_per_Gallon` as the `x` position:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " y='Origin',\n", " x='Miles_per_Gallon'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add as many encodings as you wish, with each encoding mapped to a column in the data.\n", "For example, here we will color the points by *Origin*, and plot *Miles_per_Gallon* vs *Year*:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " color='Origin',\n", " y='Miles_per_Gallon',\n", " x='Year'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Exploring Data\n", "\n", "Now that you know the basics (data, marks, encodings), take some time and try making a few plots!\n", "\n", "In particular, I'd suggest trying various combinations of the following:\n", "\n", "- Marks: ``mark_point()``, ``mark_line()``, ``mark_bar()``, ``mark_text()``, ``mark_rect()``...\n", "- Data Columns: ``'Acceleration'``, ``'Cylinders'``, ``'Displacement'``, ``'Horsepower'``, ``'Miles_per_Gallon'``, ``'Name'``, ``'Origin'``, ``'Weight_in_lbs'``, ``'Year'``\n", "- Encodings: ``x``, ``y``, ``color``, ``shape``, ``row``, ``column``, ``opacity``, ``text``, ``tooltip``...\n", "\n", "Work with a partner to use various combinations of these options, and see what you can learn from the data! In particular, think about the following:\n", "\n", "- Which encodings go well with continuous, quantitative values?\n", "- Which encodings go well with discrete, categorical (i.e. nominal) values?\n", "\n", "After about 10 minutes, we'll ask for a couple of volunteers to share their combination of marks, columns, and encodings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the central ideas of Altair is that the library will **choose good defaults for your data type**.\n", "\n", "The basic data types supported by Altair are as follows:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Data TypeCodeDescription
quantitativeQNumerical quantity (real-valued)
nominalNName / Unordered categorical
ordinalOOrdered categorical
temporalTDate/time
\n", "\n", "When you specify data as a pandas dataframe, these types are **automatically determined** by Altair.\n", "\n", "When you specify data as a URL, you must **manually specify** data types for each of your columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a simple plot containing three of the columns from the cars data:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_tick().encode(\n", " x='Miles_per_Gallon',\n", " y='Origin',\n", " color='Cylinders'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions:\n", "\n", "- what data type best goes with ``Miles_per_Gallon``?\n", "- what data type best goes with ``Origin``?\n", "- what data type best goes with ``Cylinders``?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add the shorthands for each of these data types to our specification, using the one-letter codes above\n", "(for example, change ``\"Miles_per_Gallon\"`` to ``\"Miles_per_Gallon:Q\"`` to explicitly specify that it is a quantitative type):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_tick().encode(\n", " x='Miles_per_Gallon:Q',\n", " y='Origin:N',\n", " color='Cylinders:O'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the plot changes when we set the data type for ``'Cylinders'`` to ordinal.\n", "\n", "As you use Altair, it is useful to get into the habit of always specifying these types explicitly, because this is *mandatory* when working with data loaded from a file or a URL." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Adding Explicit Types\n", "\n", "Following are a few simple charts made with the cars dataset. For each one, try to add explicit types to the encodings (e.g. change ``\"Horsepower\"`` to ``\"Horsepower:Q\"``) so that the plot doesn't change.\n", "\n", "Are there any plots that can be made better by changing the type?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " y='Origin',\n", " x='mean(Horsepower)'\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_line().encode(\n", " x='Year',\n", " y='mean(Miles_per_Gallon)',\n", " color='Origin'\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " y='Cylinders',\n", " x='count()',\n", " color='Origin'\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", " x='Cylinders',\n", " y='Origin',\n", " color='count()'\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }