---
title: "Introduction to Vega-Lite"
description: | 
  Learn the basic concepts for creating Vega-Lite plots, and see how the library
  supports interactivity.
author:
  - name: Kris Sankaran
    affiliation: UW Madison
date: 01-27-2021
output:
  rmarkdown::pdf_document
---

_[Reading](https://observablehq.com/@uwdata/data-types-graphical-marks-and-visual-encoding-channels?collection=@uwdata/visualization-curriculum), [Recording](https://mediaspace.wisc.edu/media/Week+1+%5B3%5D+Introduction+to+Vega-Lite/1_mm4yv2c9),  [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-16-week1-3/week1-3.Rmd),  [Notebook](https://observablehq.com/@krisrs1128/introduction-to-vega-lite)_


Vega-Lite is the JavaScript counterpart to ggplot2. It provides a grammar of
graphics, where layers can be composed on top of one another. The main reason
to learn Vega-Lite (as opposed to only ggplot2) is that it supports flexible
interaction. It's also a good gateway toward understanding richer data
visualization packages, like `d3`.

```{r}
library("robservable")
library("htmlwidgets")
```

## Components of a Graph

We're going to construct the graph below step by step, just like in the ggplot2
lecture. I realize that this is a bit repetitive, but I want to make it obvious
just how similar ggplot2 and Vega-Lite are.

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 4, height = 325)
```

## Data and Geometry

The first thing to do is import the necessary libraries. The arquero library
(`aq`) makes it easier to manage data in JavaScript -- the filtering and
printing steps would be more difficult without it.

````js
import {vl} from '@vega/vega-lite-api'
import { aq, op } from '@uwdata/arquero'
````

Now we can preview and filter the raw
[data](https://uwmadison.box.com/s/dyz0qohqvgake2ghm4ngupbltkzpqb7t).

````js
data = aq.fromCSV(await FileAttachment("gapminder.csv").text()) // I used the file upload button to add this file
data2000 = data.filter(d => d.year == 2000) // filter to only 2000
data2000.view()
````
```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 6, height = 250)
```

Let's create a visual mark, without any encodings. As in our first `ggplot2`
figure, every mark collapses onto a single point at the origin, because nothing
yet ties its position to the data. Notice the analogy between Vega-Lite's
`markPoint()` and ggplot2's `geom_point()`.

````js
vl.markPoint()
  .data(data2000)
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 7, height = 20)
```
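
It can help to know that these API calls just assemble a JSON specification,
which Vega-Lite then renders. As a rough sketch, the chart above corresponds to
something like the plain object below. The exact object the API emits may
differ in detail, and the two rows are made-up placeholders rather than the
gapminder data.

```js
// Made-up stand-in rows; the real chart uses data2000
const rows = [
  { country: "A", fertility: 2.1, life_expect: 75 },
  { country: "B", fertility: 5.3, life_expect: 58 }
];

// Approximate JSON counterpart of vl.markPoint().data(rows)
const spec = {
  mark: "point",
  data: { values: rows }
};

// There is no "encoding" key yet, so no field is tied to position, color, or size
console.log(JSON.stringify(spec, null, 2));
```

Everything we add in the next section ends up under an `encoding` key in this
object.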

## Encodings

We can deconstruct the original figure to see how data fields are mapped to
visual properties:

* Fertility &rarr; $x$-axis coordinate
* Life expectancy &rarr; $y$-axis coordinate
* Country cluster &rarr; Point color
* Population &rarr; Point size

We can build this up one step at a time. To encode the fertility and life
expectancy fields, we use `vl.x()` and `vl.y()` in an `encode()` call. One
subtlety is that it's necessary to specify the data type of each field. `fieldQ`
means that the field is "Quantitative". We'll review data types in the [next
lecture](../2021-01-18-week1-4/index.html).

````js
vl.markPoint()
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility"), // what happens if you change this to fieldN()?
    vl.y().fieldQ("life_expect")
  )
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 8, height = 340)
```

The same idea can be used to encode the country cluster and population fields.
`fieldN` specifies that the cluster field is a Nominal (i.e., categorical)
variable.

Hint: How do you know which commands encode which fields? It's often good to
keep the [vega-lite-api](https://vega.github.io/vega-lite-api/api/) and
[vega-lite](https://vega.github.io/vega-lite/) documentation close at hand.

````js
vl.markPoint()
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility"),
    vl.y().fieldQ("life_expect"),
    vl.color().fieldN("cluster"), // what happens if you change this to fieldO()?
    vl.size().fieldQ("pop")
  )
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 9, height = 340)
```
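
The one-letter suffixes are shorthand for Vega-Lite's data types:
"quantitative", "nominal", "ordinal", and "temporal". Here is a minimal sketch
of the encoding block that the four calls above describe. `TYPE_NAMES` and
`buildEncoding` are hypothetical helpers written for this illustration; they
are not part of the vega-lite-api.

```js
// Hypothetical lookup from the one-letter suffix to the Vega-Lite type name
const TYPE_NAMES = { Q: "quantitative", N: "nominal", O: "ordinal", T: "temporal" };

// Hypothetical helper: expand [channel, field, letter] triples into an encoding object
function buildEncoding(triples) {
  const encoding = {};
  for (const [channel, field, letter] of triples) {
    encoding[channel] = { field, type: TYPE_NAMES[letter] };
  }
  return encoding;
}

// The same four mappings as in the chart above
const encoding = buildEncoding([
  ["x", "fertility", "Q"],
  ["y", "life_expect", "Q"],
  ["color", "cluster", "N"],
  ["size", "pop", "Q"]
]);

console.log(encoding.color); // { field: 'cluster', type: 'nominal' }
```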

## Finishing touches

We've made headway in our visualization task, but a more critical eye will help
us improve the display. It will also give us an initial look at interactivity.
The main things we ought to improve in this display are:

1. The open rings are a bit distracting. They violate the principle of [Smallest
Effective
Difference](https://sites.google.com/site/tufteondesign/home/practical-design-strategies).
2. Some of the points are impossibly small.
3. The axis labels are not proper English words.
4. There are distractingly many values in the legend for population size.
5. We can't actually tell which country each point corresponds to. It would be
nice to add a tooltip (a label that appears when you hover over a point).

For (1), let's replace the open rings with filled circles. We can pass options
to `markPoint` in the same way that we passed options to ggplot `geom` layers.
For (2), we can modify the default size range by adding on a `.scale()` to the
`vl.size()` encoding -- this is similar to adding a custom scale in ggplot2.

````js
vl.markPoint({filled: true})
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility"),
    vl.y().fieldQ("life_expect"),
    vl.color().fieldN("cluster"),
    vl.size().fieldQ("pop").scale({range: [25, 1000]}) // try changing the ranges
  )
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 10, height = 340)
```
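
To get a feel for what the `range` option does, here is a sketch of a linear
scale written by hand. For point marks, size is measured as area in square
pixels, so the range `[25, 1000]` sets the smallest and largest areas. The
`linearScale` helper is hypothetical, and the domain is an assumption: I've
borrowed the `[0, 1200000000]` that the slider example at the end of these
notes hard-codes, whereas the chart above lets Vega-Lite pick the domain from
the data.

```js
// Hypothetical helper: linear interpolation from a domain onto a range
function linearScale([d0, d1], [r0, r1]) {
  return value => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Assumed population domain, mapped onto the size range used above
const sizeScale = linearScale([0, 1200000000], [25, 1000]);

console.log(sizeScale(0));          // 25
console.log(sizeScale(600000000));  // 512.5
console.log(sizeScale(1200000000)); // 1000
```

Raising the lower end of the range is what keeps the least populous countries
from shrinking to invisible dots.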

For (3), we can add a `.title()` to fields for which we want to customize
titles, and for (4), we can modify the tick count in the `.legend()` for `size`.

````js
vl.markPoint({filled: true})
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility").title("Fertility"),
    vl.y().fieldQ("life_expect").title("Life Expectancy"),
    vl.color().fieldN("cluster"),
    vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population")
  )
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 11, height = 340)
```

Now to (5). The idea is to think of a tooltip as another type of encoding --
just like coordinate positions and colors, a label is a way of transforming a
field in an abstract dataset into a visual property we can observe on a screen.

````js
vl.markPoint({filled: true})
  .data(data2000)
  .encode(
    vl.x().fieldQ("fertility").title("Fertility"),
    vl.y().fieldQ("life_expect").title("Life Expectancy"),
    vl.color().fieldN("cluster"),
    vl.size().fieldQ("pop").scale({range: [25, 1000]}).legend({tickCount: 3}).title("Population"),
    vl.tooltip().fieldN("country"),
    vl.order().fieldQ('pop').sort('descending') // if you remove this, large points prevent mouseover on the points below them
  )
  .render()
````

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 12, height = 340)
```
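
The `order` encoding is easiest to understand as a sort on the drawing
sequence: marks drawn later sit on top of marks drawn earlier. Here is a small
sketch of that idea in plain JavaScript, with made-up rows standing in for the
gapminder data.

```js
// Made-up rows; only the relative population sizes matter here
const points = [
  { country: "Small", pop: 1000000 },
  { country: "Large", pop: 1000000000 },
  { country: "Medium", pop: 50000000 }
];

// Descending sort on pop: the largest point is drawn first (at the bottom),
// and the smallest is drawn last (on top), so it stays reachable by the mouse
const drawOrder = [...points].sort((a, b) => b.pop - a.pop);

console.log(drawOrder.map(d => d.country)); // [ 'Large', 'Medium', 'Small' ]
```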

One point I want you to take away from both this and the ggplot2 introductions
is that building a plot is an iterative process. Once you develop your eye for
good visual design, you will be able to take any initial plot and make it more
effective.

Finally, for a taste of what's to come, here's a version of the same plot that
lets us compare many years over time.

```{r}
robservable("@krisrs1128/introduction-to-vega-lite", include = 13)
```

Don't worry about the code implementation, but in case you're curious, it's only
a few extra lines over what we've already implemented.

````js
{
  // define a slider selector object
  let selectYear = vl.selectSingle("select").fields("year")
    .init({year: 1955})
    .bind(vl.slider().min(1955).max(2005).step(5).name("Year"))
  
  // update the plot depending on the value of the slider
  return vl.markPoint({filled: true})
    .data(data)
    .select(selectYear) // refers to the slider bar
    .transform(vl.filter(selectYear))
    .encode(
      vl.x().fieldQ("fertility").title("Fertility").scale({domain: [0, 9]}), // hard code the axes ranges
      vl.y().fieldQ("life_expect").title("Life Expectancy").scale({domain: [0, 90]}),
      vl.color().fieldN("cluster"),
      vl.size().fieldQ("pop").scale({domain: [0, 1200000000], range: [25, 1000]}).legend({tickCount: 3}).title("Population"),
      vl.tooltip().fieldN("country"),
      vl.order().fieldQ('pop').sort('descending')
    )
    .render()
}
````
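
If the selection machinery feels opaque, it may help to see the same logic
without Vega-Lite. The sketch below lists the values the slider can take and
shows which rows survive the filter for one of them. The rows are made up, and
`rowsForYear` is a hypothetical helper; Vega-Lite does the equivalent work
internally each time the slider moves.

```js
// Made-up rows standing in for the full gapminder table
const rows = [
  { country: "A", year: 1955, fertility: 6.1 },
  { country: "A", year: 2000, fertility: 3.2 },
  { country: "B", year: 1955, fertility: 2.9 },
  { country: "B", year: 2000, fertility: 1.8 }
];

// Values the slider can take: 1955, 1960, ..., 2005
const sliderYears = [];
for (let year = 1955; year <= 2005; year += 5) sliderYears.push(year);

// Hypothetical helper: the rows that survive the filter for a given slider value
function rowsForYear(table, year) {
  return table.filter(d => d.year === year);
}

console.log(sliderYears.length);                          // 11
console.log(rowsForYear(rows, 2000).map(d => d.country)); // [ 'A', 'B' ]
```

Hard-coding the axis and size domains in the chart matters for the same reason:
the scales stay fixed while the filtered rows change, so positions remain
comparable from year to year.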