--- title: "Introduction to Vega-Lite" description: | Learn the basic concepts for creating vega-lite plots, and see how the library supports interactivity. author: - name: Kris Sankaran affiliation: UW Madison date: 01-27-2021 output: rmarkdown::pdf_document --- _[Reading](https://observablehq.com/@uwdata/data-types-graphical-marks-and-visual-encoding-channels?collection=@uwdata/visualization-curriculum), [Recording](https://mediaspace.wisc.edu/media/Week+1+%5B3%5D+Introduction+to+Vega-Lite/1_mm4yv2c9), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-16-week1-3/week1-3.Rmd), [Notebook](https://observablehq.com/@krisrs1128/introduction-to-vega-lite)_ Vega-Lite is the javascript equivalent of ggplot2. It provides a grammar of graphics, where layers can be composed on top of one another. The main reason for knowing vega-lite (opposed to only ggplot2) is that it can support flexible interaction. Also, it’s a good gateway toward understanding richer data visualization packages, like `d3`. ```{r} library("robservable") library("htmlwidgets") ``` ### Components of a Graph We’re going to construct this graph step-by-step, just like in the ggplot2 lecture. I realize that this is a bit repetitive, but I want to make it obvious just how similar ggplot2 and vega-lite are. ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 4, height = 325) ``` ## Data and Geometry The first thing to do is import the necessary libraries. The arquero library (`aq`) makes it easier to manage data in javascript -- the filtering and printing steps would be more difficult without it. ````js import {vl} from '@vega/vega-lite-api' import { aq, op } from '@uwdata/arquero' ```` Now we can preview and filter the raw [data](https://uwmadison.box.com/s/dyz0qohqvgake2ghm4ngupbltkzpqb7t). \ ````js data = aq.fromCSV(await FileAttachment("gapminder.csv").text()) // I used the file upload button to add this file data2000 = data.filter(d => d.year == 2000) // filter to only 2000 data2000.view() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 6, height = 250) ``` Let's create a visual mark, without any encodings. Like in our first `ggplot2` figure, the mark is collapsed into a single point at the origin. Notice the analogy betwen vega-lite's `markPoint()` and ggplot2's `geom_point()`. ````js vl.markPoint() .data(data2000) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 7, height = 20) ``` ## Encodings We can deconstruct the original figure to see data fields are mapped to visual properties, * Fertility → $x$-axis coordinate * Life expectancy → $y$-axis coordinate * Country cluster → Point color * Population → Point size We can build this up one step at a time. To encode the fertility and life expectancy fields, we use `vl.x()` and `vl.y()` in an `encode()` call. One subtlety is that it's necessary to specify the data type of each field. `fieldQ` means that the field is "Quantitative". We'll review data types in the [next lecture](../2021-01-18-week1-4/index.html). ````js vl.markPoint() .data(data2000) .encode( vl.x().fieldQ("fertility"), // what happens if you change this to fieldN()? vl.y().fieldQ("life_expect") ) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 8, height = 340) ``` The same idea can be used to encode the country cluster and population fields. `fieldN` specifies that the cluster field is a Nominal (i.e., categorical) variable. Hint: How do you know which commands encode which fields? It's often good to keep the [vega-lite-api](https://vega.github.io/vega-lite-api/api/) and [vega-lite](https://vega.github.io/vega-lite/) documentation close at hand. ````js vl.markPoint() .data(data2000) .encode( vl.x().fieldQ("fertility"), vl.y().fieldQ("life_expect"), vl.color().fieldN("cluster"), // what happens if you change this to fieldO()? vl.size().fieldQ("pop") ) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 9, height = 340) ``` ## Finishing touches We've made headway in our visualization task, but a more critical eye will help us improve the display. It will also give us an initial look at interactivity. The main things we ought to improve in this display are, 1. The open rings are a bit distracting. They violate the principle of [Smallest Effective Difference](https://sites.google.com/site/tufteondesign/home/practical-design-strategies). 2. Some of the points are impossibly small. 3. The axis labels are not proper English words. 4. There are distractingly many values in the legend for population size. 5. We can't actually tell which country each point corresponds to. It would be nice to add a tooltip (a label that appears when you mouseover the point). For (1), let's replace the open rings with filled circles. We can pass options to `markPoint` in the same way that we passed options to ggplot `geom` layers. For (2), we can modify the default size range by adding on a `.scale()` to the `vl.size()` encoding -- this is similar to adding a custom scale in ggplot2. ````js vl.markPoint({filled: true}) .data(data2000) .encode( vl.x().fieldQ("fertility"), vl.y().fieldQ("life_expect"), vl.color().fieldN("cluster"), vl.size().fieldQ("pop").scale({range: [25, 1000]}) // try changing the ranges ) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 10, height = 340) ``` For (3), we can add a `.title()` to fields for which we want to customize titles, and for (4), we can modify the tick count in the `.legend()` for `size`. ````js vl.markPoint({filled: true}) .data(data2000) .encode( vl.x().fieldQ("fertility").title("Fertility"), vl.y().fieldQ("life_expect").title("Life Expectancy"), vl.color().fieldN("cluster"), vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population"), ) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 11, height = 340) ``` Now to (5). The idea is to think of a tooltip as another type of encoding -- just like coordinate positions and colors, a label is a way of transforming a field in an abstract dataset into a visual property we can observe on a screen. ````js vl.markPoint({filled: true}) .data(data2000) .encode( vl.x().fieldQ("fertility").title("Fertility"), vl.y().fieldQ("life_expect").title("Life Expectancy"), vl.color().fieldN("cluster"), vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population"), vl.tooltip().fieldN("country"), vl.order().fieldQ('pop').sort('descending') // if you remove this, large points prevent mouseover on the points below them ) .render() ```` ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 12, height = 340) ``` One point I want you to take away from both this and the ggplot2 introductions is that building a plot is an iterative process. Once you develop your eye for good visual design, you will be able to take any initial plot and make it more effective. Finally, for a taste of what's to come, here's a version of the same plot that let's us compare many years over time. ```{r} robservable("@krisrs1128/introduction-to-vega-lite", include = 13) ``` Don't worry about the code implementation, but in case you're curious, it's only a few extra lines over what we've already implemented. ````js { // define a slider selector object let selectYear = vl.selectSingle("select").fields("year") .init({year: 1955}) .bind(vl.slider().min(1955).max(2005).step(5).name("Year")) // update the plot depending on the value of the slider return vl.markPoint({filled: true}) .data(data) .select(selectYear) // refers to the slider bar .transform(vl.filter(selectYear)) .encode( vl.x().fieldQ("fertility").title("Fertility").scale({domain: [0, 9]}), // hard code the axes ranges vl.y().fieldQ("life_expect").title("Life Expectancy").scale({domain: [0, 90]}), vl.color().fieldN("cluster"), vl.size().fieldQ("pop").scale({domain: [0, 1200000000], range: [25, 1000]}).legend({tickCount: 3}).title("Population"), vl.tooltip().fieldN("country"), vl.order().fieldQ('pop').sort('descending') ) .render() } ````