{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Visualization in Python: Altair" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Altair is a declarative statistical visualization library for Python. It offers a powerful and concise visualization grammar that enables users to build a wide range of statistical visualizations quickly and simply.\n", "\n", "Here are some benefits of using Altair for visualization:\n", "\n", "- Graphs can easily be made interactive\n", "\n", "- Every visualization generated by Altair can be downloaded as a PNG file by clicking the three dots at the upper right of the graph\n", "\n", "- The coding grammar is consistently structured, which makes it easy to add features\n", "\n", "(Note: the materials included here are common, useful visualization methods collected from https://altair-viz.github.io/. If this notebook does not cover a specific visualization problem, please visit the website for further reference.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "If you are using pip: \n", "`! pip install altair vega_datasets`\n", "\n", "If you are using conda: \n", "`! conda install -c conda-forge altair vega_datasets`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Package\n", "\n", "Loading the Altair package is similar to loading other packages." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Graph Format\n", "\n", "### Chart\n", "The fundamental object in Altair is the **Chart**, which takes a **dataframe** as a single argument.\n", "\n", "`alt.Chart(dataframe)`\n", "\n", "However, on its own, it will not draw anything because we have not yet told the chart to do anything with the data. We need to specify a mark to successfully draw the graph.\n", "\n", "### Marks\n", "The mark property lets you specify how the data needs to be represented on the plot.\n", "\n", "`alt.Chart(dataframe).mark_point()`\n", "\n", "### Encodings\n", "Once we have the data and have determined how it is represented, we want to specify which columns of the dataframe drive that representation. That is, we need to set up the x and y data, size, color, etc. This is where we use encodings.\n", "\n", "`alt.Chart(dataframe).mark_point().encode()`\n", "\n", "**Now that we know the general structure of the command, we can explore each category in more depth.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Marks\n", "The mark property lets you specify how the data needs to be represented on the plot. 
Following are the common mark methods provided by Altair:\n", "\n", "(For detailed marks, visit https://altair-viz.github.io/user_guide/marks.html)\n", "\n", "| Mark Type | Command | Description |\n", "| :-: | :-: | :-: |\n", "| **area** | `mark_area()` | A filled area |\n", "| **line** | `mark_line()` | A line plot |\n", "| **bar** | `mark_bar()` | A bar plot |\n", "| **point** | `mark_point()` | A scatter plot with hollow points |\n", "| **circle** | `mark_circle()` | A scatter plot with solid points |\n", "| **text** | `mark_text()` | A scatter plot with text as points |\n", "| **square** | `mark_square()` | A scatter plot with square points |\n", "| **rect** | `mark_rect()` | A heatmap |\n", "| **box plot** | `mark_boxplot()` | A box plot |\n", "\n", "#### Example 1. Scatter Plot of Acceleration vs. Horsepower in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "cars = data.cars()\n", "alt.Chart(cars).mark_circle().encode(\n", " x='Horsepower:Q',\n", " y='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2. Line Plot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'x':np.array([-1,0,1,2,3]),'y':np.array([-1,0,1,2,3])**2})\n", "alt.Chart(df).mark_line().encode(\n", " x='x:Q',\n", " y='y:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 3. Box Plot of Acceleration vs. Cylinders in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_boxplot().encode(\n", " y='Cylinders:O',\n", " x='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 4. Heatmap of Cylinders vs. Origin in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", " y='Cylinders:O',\n", " x='Origin:O',\n", " color = 'count():Q'\n", ").properties(\n", " width=200,\n", " height=200\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Encodings\n", "\n", "Following are some common encoding channels to put inside `.encode()`:\n", "\n", "(For detailed encoding channels, visit https://altair-viz.github.io/user_guide/encoding.html)\n", "\n", "| Encoding | Command | Description |\n", "| :-: | :-: | :-: |\n", "| **x-axis** | `alt.X()` | x-axis data |\n", "| **y-axis** | `alt.Y()` | y-axis data |\n", "| **size** | `alt.Size()` | size varies with data |\n", "| **shape** | `alt.Shape()` | shape varies with data |\n", "| **color** | `alt.Color()` | color varies with data |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Encoding Data Type\n", "\n", "Sometimes we need to deal with different types of data. For example, for continuous data, we want the color to change gradually, but for categorical data, we want the color to be distinct for each category.\n", "\n", "The data type of each field can be specified inside `encode()`, either with a shorthand suffix:\n", "\n", "`alt.Chart(df).mark_point().encode(x = '<column>:Q')`\n", "\n", "or with explicit keyword arguments:\n", "\n", "`alt.Chart(df).mark_point().encode(alt.X(field = '<column>', type = 'quantitative'))`\n", "\n", "Here are the possible data types:\n", "\n", "| Data Type | Shorthand Code | Description |\n", "| :-: | :-: | :-: |\n", "| quantitative | `Q` | a continuous real-valued quantity |\n", "| ordinal | `O` | a discrete ordered quantity |\n", "| nominal | `N` | a discrete unordered category |\n", "| temporal | `T` | a time or date value |\n", "| geojson | `G` | a geographic shape |\n", "\n", "#### Example 1. Scatter plot of the acceleration vs. 
horsepower colored by number of cylinders in three different ways, with the color encoded as a quantitative, ordinal, and nominal type." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = alt.Chart(cars).mark_point().encode(\n", " alt.X('Horsepower',\n", " type = 'quantitative',\n", " title = 'Horsepower'),\n", " alt.Y('Acceleration',\n", " type = 'quantitative',\n", " title = 'Acceleration')\n", ").properties(\n", " width=150,\n", " height=150\n", ")\n", "\n", "# horizontally concatenate three graphs\n", "alt.hconcat(\n", " base.encode(alt.Color('Cylinders', type = 'quantitative')).properties(title='quantitative'),\n", " base.encode(alt.Color('Cylinders', type = 'ordinal')).properties(title='ordinal'),\n", " base.encode(alt.Color('Cylinders', type = 'nominal')).properties(title='nominal')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Tooltip\n", "A tooltip lets us show the details of a data point when the cursor hovers over it.\n", "\n", "`tooltip = [alt.Tooltip('<column>')]`\n", "\n", "### 2.3 Interactive\n", "To make the graph an interactive plot, we can add `.interactive()` after all the marks and encodings.\n", "\n", "`alt.Chart(dataframe).mark_point().encode().interactive()`\n", "\n", "#### Example. Interactive scatter plot of Acceleration vs. Horsepower in cars data, colored by number of cylinders and shaped by origin." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point(size = 50).encode(\n", " alt.X('Horsepower',\n", " type = 'quantitative',\n", " title = 'Horsepower'),\n", " alt.Y('Acceleration',\n", " type = 'quantitative',\n", " title = 'Acceleration'),\n", " alt.Color('Cylinders', type = 'ordinal'),\n", " alt.Shape('Origin', type = 'nominal'),\n", " # show horsepower, acceleration, cylinders and origin for each point\n", " tooltip = [alt.Tooltip('Horsepower'),\n", " alt.Tooltip('Acceleration'),\n", " alt.Tooltip('Cylinders'),\n", " alt.Tooltip('Origin')\n", " ]\n", ").interactive() # make the plot interactive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data Transformation\n", "There are several ways to transform the original data during visualization. Here are a few of the most useful transformations.\n", "\n", "(For detailed transformation methods, visit https://altair-viz.github.io/user_guide/transform/index.html)\n", "\n", "### 3.1 Bin Transform (Histogram)\n", "\n", "#### Example. Histogram of Acceleration distribution in Cars dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_area(interpolate='step').encode(\n", " alt.X(\"Acceleration:Q\",\n", " axis = alt.Axis(title = \"Acceleration\"),\n", " bin = alt.Bin(maxbins=10)),\n", " alt.Y(\"count():Q\",\n", " axis = alt.Axis(title = \"Count\"),\n", " stack=None)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Convert wide-form data into long-form data\n", "\n", "There are two common conventions for storing data in a dataframe, sometimes called long-form and wide-form.\n", "\n", "- wide-form data has one row per independent variable, with different features recorded in different columns.\n", "- long-form data has one row per observation, with features recorded within the table as values.\n", "\n", "Altair’s grammar works best with long-form data, in which each row corresponds to a single observation along with its features. Hence, we can convert wide-form data to the long-form data that Altair expects:\n", "\n", "`.transform_fold([features], as_=[key, value])`\n", "\n", "(For detailed information, visit: https://altair-viz.github.io/user_guide/transform/fold.html#user-guide-fold-transform)\n", "\n", "#### Example. Wide-form Data of Daily Fruit Price" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
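<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Date</th>\n", " <th>Orange</th>\n", " <th>Apple</th>\n", " <th>Peach</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2021-08-01</td>\n", " <td>5</td>\n", " <td>3</td>\n", " <td>5.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2021-09-01</td>\n", " <td>6</td>\n", " <td>4</td>\n", " <td>5.5</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2021-10-01</td>\n", " <td>7</td>\n", " <td>4</td>\n", " <td>5.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>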
" ], "text/plain": [ " Date Orange Apple Peach\n", "0 2021-08-01 5 3 5.0\n", "1 2021-09-01 6 4 5.5\n", "2 2021-10-01 7 4 5.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wide_form = pd.DataFrame({'Date': ['2021-08-01', '2021-09-01', '2021-10-01'],\n", " 'Orange': [5,6,7],\n", " 'Apple': [3,4,4],\n", " 'Peach': [5,5.5,5]})\n", "wide_form" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(wide_form).transform_fold(\n", " ['Orange', 'Apple', 'Peach'],\n", " as_=['fruit', 'price']\n", ").mark_line().encode(\n", " x='Date:T',\n", " y='price:Q',\n", " color='fruit:N'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 LOESS transform\n", "\n", "The LOESS transform (LOcally Estimated Scatterplot Smoothing) uses a locally-estimated regression to produce a trend line. LOESS performs a sequence of local weighted regressions over a sliding window of nearest-neighbor points.\n", "\n", "#### Example. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(42)\n", "\n", "df = pd.DataFrame({\n", " 'x': range(100),\n", " 'y': np.random.randn(100).cumsum()\n", "})\n", "\n", "chart = alt.Chart(df).mark_point().encode(\n", " x='x',\n", " y='y'\n", ")\n", "\n", "chart + chart.transform_loess('x', 'y').mark_line()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Compound Charts\n", "(For detailed ways to compound charts, visit https://altair-viz.github.io/user_guide/compound_charts.html)\n", "\n", "### 4.1 Layered Charts\n", "Layered charts allow users to overlay two or more charts on the same set of axes.\n", "\n", "(The LOESS example above already layers two graphs with the `+` operator: a scatter plot and a line plot.)\n", "\n", "#### Example. Using `alt.layer`" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from altair.expr import datum\n", "stocks = data.stocks()\n", "\n", "base = alt.Chart(stocks).encode(\n", " x='date:T',\n", " y='price:Q',\n", " color='symbol:N'\n", ").transform_filter(\n", " datum.symbol == 'GOOG'\n", ")\n", "alt.layer(\n", " base.mark_line(),\n", " base.mark_point(),\n", " base.mark_rule()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Horizontal/Vertical Concatenation\n", "\n", "Two plots can be displayed side by side using the `hconcat()` function or the `|` operator.\n", "\n", "Similarly, two plots can be vertically combined via the `vconcat()` function or the `&` operator.\n", "\n", "#### Example. Horizontal concatenation of Cars Data" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chart1 = alt.Chart(cars).mark_point().encode(\n", " x='Horsepower:Q',\n", " y='Miles_per_Gallon:Q',\n", " color='Origin:N'\n", ").properties(\n", " height=300,\n", " width=300\n", ")\n", "\n", "chart2 = alt.Chart(cars).mark_bar().encode(\n", " x='count():Q',\n", " y=alt.Y('Miles_per_Gallon:Q', bin=alt.Bin(maxbins=10)),\n", " color='Origin:N'\n", ").properties(\n", " height=300,\n", " width=100\n", ")\n", "\n", "chart1 | chart2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Faceted Charts\n", "Calling the chart's `.facet()` method splits the data into facets, drawing one sub-chart per value of the chosen field:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.FacetChart(...)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Horsepower:Q',\n", " y='Acceleration:Q',\n", " color='Origin:N'\n", ").properties(\n", " width=180,\n", " height=180\n", ").facet(\n", " column='Origin:N'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Customizing Visualizations\n", "(https://altair-viz.github.io/user_guide/customization.html)\n", "\n", "### 5.1 Global Config\n", "Acts on an entire chart object.\n", "\n", "Every chart type has a \"config\" property at the top level that acts as a sort of theme for the whole chart and all of its sub-charts. Here you can specify things like axes properties, mark properties, selection properties, and more. Altair allows you to access these through the `configure_*()` methods of the chart.\n", "\n", "E.g. `alt.Chart().mark_point().encode().configure_mark(opacity=0.2, color='red')`\n", "\n", "- By design configurations will affect every mark used within the chart\n", "- The global configuration is only permissible at the top-level; so, for example, if you tried to layer the above chart with another, it would result in an error.\n", "\n", "(Detailed ways to change global configuration: https://altair-viz.github.io/user_guide/configuration.html)\n", "\n", "### 5.2 Local Config \n", "Acts on one mark of the chart.\n", "\n", "If you would like to configure the look of the mark locally, such that the setting only affects the particular chart property you reference, this can be done via a local configuration setting. In the case of mark properties, the best approach is to set the property as an argument to the `mark_*()` method.\n", "\n", "E.g. 
`alt.Chart().mark_point(opacity=0.2, color='red').encode()`\n", "\n", "- Unlike when using the global configuration, here it is possible to use the resulting chart as a layer or facet in a compound chart.\n", "\n", "- Local config settings like this one will always override global settings.\n", "\n", "### 5.3 Encoding channels \n", "Encoding channels can also be used to set some chart properties.\n", "\n", "- Encoding settings will always override local or global configuration settings.\n", "\n", "### 5.4 Adjusting Axis Limits\n", "\n", "#### 5.4.1. Not Starting from Zero\n", "\n", "Add a Scale property to the X encoding that specifies `zero=False`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " alt.X('Horsepower:Q',\n", " scale=alt.Scale(zero=False)\n", " ),\n", " y='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.4.2. Rescale\n", "To specify exact axis limits, you can use the `domain` property of the scale." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " alt.X('Horsepower:Q',\n", " scale=alt.Scale(domain=(40, 200))\n", " ),\n", " y='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One problem with rescaling is that some data points may fall outside the domain and be drawn beyond the axes, so we need to tell Altair what to do with them. One option is to “clip” the data by setting the `clip` property of the mark to `True`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point(clip=True).encode(\n", " alt.X('Horsepower:Q',\n", " scale=alt.Scale(domain=(40, 200))\n", " ),\n", " y='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**In addition to the properties and methods selected in this notebook, the Altair website https://altair-viz.github.io/ covers a wide array of more advanced visualizations. Its left-hand column contains a detailed user guide that addresses many of the problems that can arise during the visualization process.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Possible Template to Use" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Template: replace every <...> placeholder and example value with your own\n", "alt.Chart(df).mark_point().encode(\n", "    alt.X('<x_column>:Q',\n", "        scale=alt.Scale(zero=False), # set True to force the axis to include zero\n", "        title='<x-axis name>'\n", "    ),\n", "    alt.Y('<y_column>:Q',\n", "        scale=alt.Scale(domain=(0, 100)), # example axis limits (min, max)\n", "        aggregate='mean', # example aggregate; remove if not needed\n", "        title='<y-axis name>'\n", "    ),\n", "    alt.Color('<color_column>:Q',\n", "        title='<color legend name>'\n", "    ),\n", "    alt.Size('<size_column>:Q',\n", "        title='<size legend name>'\n", "    ),\n", "    alt.Shape('<shape_column>:N', # shape expects a nominal or ordinal field\n", "        title='<shape legend name>'\n", "    ),\n", "    tooltip = [alt.Tooltip('<column>')] # column shown when hovering over a point\n", ").properties(\n", "    width=300, # example size in pixels\n", "    height=300,\n", "    title='<chart title>'\n", ").interactive()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }