{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Visualization in Python: Altair" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Altair is a declarative statistical visualization library for Python. It offers a powerful and concise visualization grammar that enables users to build a wide range of statistical visualizations quickly and simply.\n", "\n", "Here are some benefits of using Altair for visualization:\n", "\n", "- Graphs can easily be made interactive\n", "\n", "- Every visualization generated by Altair can be saved as a PNG file by clicking the three dots at the upper right of the graph\n", "\n", "- The syntax is consistently structured, so features are easy to add\n", "\n", "(Note: the material here covers commonly useful visualization methods collected from https://altair-viz.github.io/. If it does not cover a specific visualization problem, please visit the website for further reference.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "If you are using pip: \n", "`! pip install altair vega_datasets`\n", "\n", "If you are using conda: \n", "`! conda install -c conda-forge altair vega_datasets`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Package\n", "\n", "Loading the Altair package is similar to loading other packages." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Graph Format\n", "\n", "### Chart\n", "The fundamental object in Altair is the **Chart**, which takes a **dataframe** as a single argument.\n", "\n", "`alt.Chart(dataframe)`\n", "\n", "However, on its own, the chart will not draw anything, because we have not yet told it what to do with the data. We need to specify a mark to draw the graph.\n", "\n", "### Marks\n", "The mark property lets you specify how the data needs to be represented on the plot.\n", "\n", "`alt.Chart(dataframe).mark_point()`\n", "\n", "### Encodings\n", "Once we have the data and have decided how it is represented, we need to specify which columns of the dataframe map to which visual properties. That is, we need to set up the x and y data, size, color, etc. This is where we use encodings.\n", "\n", "`alt.Chart(dataframe).mark_point().encode()`\n", "\n", "**Now that we know the general structure of the command, we can explore each category in more depth.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Marks\n", "The mark property lets you specify how the data needs to be represented on the plot. 
The following are the common mark types provided by Altair:\n", "\n", "(For details on all marks, visit https://altair-viz.github.io/user_guide/marks.html)\n", "\n", "| Mark Type | Command | Description |\n", "| :-: | :-: | :-: |\n", "| **area** | `mark_area()` | A filled area |\n", "| **line** | `mark_line()` | A line plot |\n", "| **bar** | `mark_bar()` | A bar plot |\n", "| **point** | `mark_point()` | A scatter plot with hollow points |\n", "| **circle** | `mark_circle()` | A scatter plot with solid points |\n", "| **text** | `mark_text()` | A scatter plot with text as points |\n", "| **square** | `mark_square()` | A scatter plot with square points |\n", "| **rect** | `mark_rect()` | A filled rectangle, used for heatmaps |\n", "| **box plot** | `mark_boxplot()` | A box plot |\n", "\n", "#### Example 1. Scatter Plot of Acceleration vs. Horsepower in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "cars = data.cars()\n", "alt.Chart(cars).mark_circle().encode(\n", " x='Horsepower:Q',\n", " y='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2. Line Plot" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'x':np.array([-1,0,1,2,3]),'y':np.array([-1,0,1,2,3])**2})\n", "alt.Chart(df).mark_line().encode(\n", " x='x:Q',\n", " y='y:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 3. Box Plot of Acceleration vs. Cylinders in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_boxplot().encode(\n", " y='Cylinders:O',\n", " x='Acceleration:Q'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 4. Heatmap of Cylinders vs. Origin in Cars Dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", " y='Cylinders:O',\n", " x='Origin:O',\n", " color = 'count():Q'\n", ").properties(\n", " width=200,\n", " height=200\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Encodings\n", "\n", "The following are some common encoding channels to use inside `.encode()`:\n", "\n", "(For details on all encoding channels, visit https://altair-viz.github.io/user_guide/encoding.html)\n", "\n", "| Encoding Channel | Command | Description |\n", "| :-: | :-: | :-: |\n", "| **x-axis** | `alt.X()` | x-axis data |\n", "| **y-axis** | `alt.Y()` | y-axis data |\n", "| **size** | `alt.Size()` | size varies with data |\n", "| **shape** | `alt.Shape()` | shape varies with data |\n", "| **color** | `alt.Color()` | color varies with data |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Encoding Data Type\n", "\n", "Sometimes we need to deal with different types of data. For example, for continuous data, we want the color to change gradually, but for categorical data, we want the color to be distinct for each category.\n", "\n", "The data type can be specified in `.encode()`, either with a shorthand code appended to the column name:\n", "\n", "`alt.Chart(df).mark_point().encode(x = ':Q')`\n", "\n", "or with an explicit `type` argument:\n", "\n", "`alt.Chart(df).mark_point().encode(alt.X(field = '', type = 'quantitative'))`\n", "\n", "Here are the possible data types:\n", "\n", "| Data Type | Shorthand Code | Description |\n", "| :-: | :-: | :-: |\n", "| quantitative | `Q` | a continuous real-valued quantity |\n", "| ordinal | `O` | a discrete ordered quantity |\n", "| nominal | `N` | a discrete unordered category |\n", "| temporal | `T` | a time or date value |\n", "| geojson | `G` | a geographic shape |\n", "\n", "#### Example 1. Scatter plot of acceleration vs. horsepower, colored by number of cylinders in three different ways: with the color encoded as a quantitative, an ordinal, and a nominal type." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base = alt.Chart(cars).mark_point().encode(\n", " alt.X('Horsepower',\n", " type = 'quantitative',\n", " title = 'Horsepower'),\n", " alt.Y('Acceleration',\n", " type = 'quantitative',\n", " title = 'Acceleration')\n", ").properties(\n", " width=150,\n", " height=150\n", ")\n", "\n", "# horizontally concatenate the three charts\n", "alt.hconcat(\n", " base.encode(alt.Color('Cylinders', type = 'quantitative')).properties(title='quantitative'),\n", " base.encode(alt.Color('Cylinders', type = 'ordinal')).properties(title='ordinal'),\n", " base.encode(alt.Color('Cylinders', type = 'nominal')).properties(title='nominal')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Tooltip\n", "A tooltip lets us show the details of a data point when hovering over it.\n", "\n", "`tooltip = [alt.Tooltip('')]`\n", "\n", "### 2.3 Interactive\n", "To make the graph interactive, we can add `.interactive()` after all the marks and encodings.\n", "\n", "`alt.Chart(dataframe).mark_point().encode().interactive()`\n", "\n", "#### Example. Interactive scatter plot of Acceleration vs. Horsepower in cars data, colored by number of cylinders and shaped by origin." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point(size = 50).encode(\n", " alt.X('Horsepower',\n", " type = 'quantitative',\n", " title = 'Horsepower'), \n", " alt.Y('Acceleration',\n", " type = 'quantitative',\n", " title = 'Acceleration'), \n", " alt.Color('Cylinders', type = 'ordinal'),\n", " alt.Shape('Origin', type = 'nominal'),\n", " # show horsepower, acceleration, cylinders and origin for each point\n", " tooltip = [alt.Tooltip('Horsepower'),\n", " alt.Tooltip('Acceleration'),\n", " alt.Tooltip('Cylinders'),\n", " alt.Tooltip('Origin')\n", " ]\n", ").interactive() # make the plot interactive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data Transformation\n", "There are several ways to transform the original data during visualization. Here I have selected a few useful transformations.\n", "\n", "(For details on all transformation methods, visit https://altair-viz.github.io/user_guide/transform/index.html)\n", "\n", "### 3.1 Bin Transform (Histogram)\n", "\n", "#### Example. 
Histogram of Acceleration distribution in Cars dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_area(interpolate='step').encode(\n", " alt.X(\"Acceleration:Q\",\n", " axis = alt.Axis(title = \"Acceleration\"),\n", " bin = alt.Bin(maxbins=10)),\n", " alt.Y(\"count():Q\",\n", " axis = alt.Axis(title = \"Count\"), \n", " stack=None)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Convert wide-form data into long-form data\n", "\n", "There are two common conventions for storing data in a dataframe, sometimes called long-form and wide-form.\n", "\n", "- wide-form data has one row per independent variable, with different features recorded in different columns.\n", "- long-form data has one row per observation, with features recorded within the table as values.\n", "\n", "Altair’s grammar works best with long-form data, in which each row corresponds to a single observation along with its features. Hence, we can convert wide-form data to the long-form data used by Altair:\n", "\n", "`.transform_fold([features], as_ = [key, value])`\n", "\n", "(For detailed information, visit: https://altair-viz.github.io/user_guide/transform/fold.html#user-guide-fold-transform)\n", "\n", "#### Example. Wide-form Data of Daily Fruit Price" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " | Date | \n", "Orange | \n", "Apple | \n", "Peach | \n", "
---|---|---|---|---|
0 | \n", "2021-08-01 | \n", "5 | \n", "3 | \n", "5.0 | \n", "
1 | \n", "2021-09-01 | \n", "6 | \n", "4 | \n", "5.5 | \n", "
2 | \n", "2021-10-01 | \n", "7 | \n", "4 | \n", "5.0 | \n", "