--- title: "A Vocabulary of Marks" description: | Examples of marks and their encodings in both ggplot2 and vega-lite. author: - name: Kris Sankaran affiliation: UW Madison date: 01-29-2021 output: distill::distill_article: self_contained: false --- ```{r, echo = FALSE} library("knitr") knitr::opts_chunk$set(cache = TRUE, message = FALSE, warning = FALSE, echo = TRUE) ``` _[Reading](https://observablehq.com/@uwdata/data-types-graphical-marks-and-visual-encoding-channels?collection=@uwdata/visualization-curriculum), [Recording](https://mediaspace.wisc.edu/media/1_svnu2d3u), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-18-week1-5/week1-5.Rmd), [Notebook](https://observablehq.com/@krisrs1128/vocabulary-of-marks)_ The choice of encodings can have a strong effect on (1) the types of comparisons that a visualization suggests and (2) the accuracy of the conclusions that readers leave with. With this in mind, it's in our best interest to build a rich vocabulary of potential visual encodings. The more kinds of marks and encodings that are at your fingertips, the better your chances are that you’ll arrive at a configuration that helps you achieve your purpose. So, let’s look at a few different types of marks and encodings in both ggplot2 and vega-lite. Before we get started, let's load up the libraries that will be used in these notes. ```{r} library("readr") library("ggplot2") library("dplyr") library("robservable") theme_set(theme_bw()) ``` ### Point Marks Let's read in the gapminder dataset that we used in the introduction to vega-lite. ```{r} # load data gapminder <- read_csv("https://uwmadison.box.com/shared/static/dyz0qohqvgake2ghm4ngupbltkzpqb7t.csv", col_types = cols()) %>% mutate(cluster = as.factor(cluster)) # specify that cluster is nominal gap2000 <- gapminder %>% filter(year == 2000) # keep only year 2000 ``` Point marks can encode data fields using their $x$ and $y$ positions, color, size, and shape. Below, each mark is a country, and we're using shape and the $y$ position to distinguish between country clusters. ```{r} ggplot(gap2000) + geom_point(aes(x = fertility, y = cluster, shape = cluster)) ``` Aside from some small differences in the syntax and styling, the result is almost identical in vega-lite. ````js vl.markPoint() .data(data2000) .encode( vl.x().fieldQ('fertility'), vl.y().fieldN('cluster'), vl.shape().fieldN('cluster') ) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 5, height = 170) ``` We can specify different types of shapes using the "shape" parameter outside of the `aes` encoding. ```{r} ggplot(gap2000) + geom_point(aes(x = fertility, y = cluster), shape = 15) ``` For the equivalent in vega-lite, use `markSquare`, ````js vl.markSquare({size: 100}) .data(data2000) .encode( vl.x().fieldQ('fertility'), vl.y().fieldN('cluster') ) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 6, height = 170) ``` ### Bar Marks Bar marks let us associate a continuous field with a nominal one. ```{r} ggplot(gap2000) + geom_bar(aes(x = country, y = pop), stat = "identity") + theme( axis.text.x = element_text(angle = 90), panel.grid.major.x = element_blank() ) ``` ````js vl.markBar() .data(data2000) .encode( vl.x().fieldN('country'), vl.y().fieldQ('pop') ) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 7) ``` To make comparisons between countries with similar populations easier, we can order them by population (alphabetical ordering is not that meaningful). To compare clusters, we can color in the bars. ```{r} ggplot(gap2000) + geom_bar( aes(x = reorder(country, -pop, mean), y = pop, fill = cluster), stat = "identity" ) + scale_fill_brewer(palette = "Set2") + scale_y_continuous(expand = c(0, 0, .1, .1)) + theme( axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2, size = 8), # more readable axes panel.grid.major.x = element_blank() ) ``` Here is the same plot in vega-lite. ````js vl.markBar() .data(data2000) .encode( vl.x().fieldN('country').sort("-y"), vl.y().fieldQ('pop'), vl.color().field("cluster").scale({"scheme": "set2"}) ) .width(500) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 8) ``` In the plot above, each bar is anchored at 0. Instead, we could have each bar encode *two* continuous values, a top and bottom. To illustrate, let's compare the minimum and maximimum life expectancies within each country cluster. vega-lite lets you compute summary statistics within the encoding step (see below). For ggplot2, we'll need to create a new `data.frame` with just the summary information. For this, we `group_by` each cluster, so that a `summarise` call finds the minimum and maximum life expectancies restricted to each cluster. ```{r} # find summary statistics life_ranges <- gap2000 %>% group_by(cluster) %>% summarise( min_life = min(life_expect), max_life = max(life_expect) ) # look at a few rows head(life_ranges) # plot them ggplot(life_ranges) + geom_segment( aes(x = min_life, xend = max_life, y = cluster, yend = cluster), size = 4 ) + xlim(0, 85) ``` Notice here that we can just specify `.min()` and `.max()` within the encoding, rather than precomputing the summary statistics. ````js vl.markBar() .data(data2000) .encode( vl.x().min('life_expect'), vl.x2().max('life_expect'), vl.y().fieldN('cluster') ) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 9, height = 170) ``` ### Line Marks Line marks are useful for comparing changes. Our eyes naturally focus on rates of change when we see lines. Below, we'll plot the fertility over time, colored in by country cluster. ```{r} ggplot(gapminder) + geom_line( aes(x = year, y = fertility, col = cluster, group = country), alpha = 0.7, size = 0.9 ) + scale_color_brewer(palette = "Set2") ``` This is the equivalent code for vega-lite. You can get a country name by hovering over the lines. ````js vl.markLine({opacity: 0.7, size: 3}) .data(data) .encode( vl.x().fieldO('year'), vl.y().fieldQ('fertility'), vl.color().fieldN('cluster').scale({"scheme": "set2"}), vl.detail().fieldN("country"), vl.tooltip().fieldN('country') ) .width(500) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 10, height = 330) ``` ### Area Marks Area marks have a flavor of both bar and line marks. The filled area supports absolute comparisons, while the changes in shape suggest derivatives. ```{r} population_sums <- gapminder %>% group_by(year, cluster) %>% summarise(total_pop = sum(pop)) head(population_sums) ggplot(population_sums) + geom_area( aes(x = year, y = total_pop, fill = cluster) ) + scale_fill_brewer(palette = "Set2") + scale_y_continuous(expand = c(0, 0, .1, .1)) + scale_x_continuous(expand = c(0, 0)) ``` In the vega-lite version, we can add a tooltip that shows country names on mouseover. It's not so clear where one country begins and another starts, though -- it would be clearer if there were small lines demarcating the boundaries between countries. ````js vl.markArea() .data(data) .encode( vl.x().fieldO('year'), vl.y().sum('pop'), vl.color().fieldN('cluster').scale({"scheme": "set2"}), vl.detail().fieldN("country"), vl.tooltip().fieldN('country') ) .width(600) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 11) ``` Just like in bar marks, we don't necessarily need to anchor the $y$-axis at 0. For example, here the bottom and top of each area mark is given by the 30% and 70% quantiles of population within each country cluster. ```{r} population_ranges <- gapminder %>% group_by(year, cluster) %>% summarise(min_pop = quantile(pop, 0.3), max_pop = quantile(pop, 0.7)) head(population_ranges) ggplot(population_ranges) + geom_ribbon( aes(x = year, ymin = min_pop, ymax = max_pop, fill = cluster), alpha = 0.8 ) + scale_fill_brewer(palette = "Set2") + scale_y_continuous(expand = c(0, 0, .1, .1)) + scale_x_continuous(expand = c(0, 0)) ``` For these quantiles, we can't use a built-in vega-lite summary functions, like the `.min()` and `.max()` from above. Instead, we need the javascript equivalent of dplyr's `group_by` + `summarise`, which are called `groupby` and `rollup`. ````js population_ranges = data.groupby("year", "cluster") .rollup({ // equivalent of dplyr summarise pop_low: op.quantile("pop", 0.3), pop_high: op.quantile("pop", 0.7) }) ```` These data can now be used to generate the equivalent vega-lite plot. ````js vl.markArea({"opacity": 0.7}) .data(population_ranges) .encode( vl.x().fieldO('year'), vl.y().fieldQ('pop_low').title("Population"), vl.y2().fieldQ('pop_high'), vl.color().fieldN('cluster').scale({"scheme": "set2"}) ) .width(600) .render() ```` ```{r} robservable("@krisrs1128/vocabulary-of-marks", include = 13) ```