--- title: "Deriving Variables" description: | Using `separate`, `mutate`, and `summarise` to derive new variables for downstream visualization. author: - name: Kris Sankaran affiliation: UW Madison date: 02-17-2021 output: distill::distill_article: self_contained: false --- _[Reading](https://r4ds.had.co.nz/tidy-data.html), [Recording](https://mediaspace.wisc.edu/media/Week%204%20%5B3%5D%20Deriving%20Variables/1_b40dfhtt), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-27-week4-3/week4-3.Rmd)_ ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` ```{r} library("tidyr") library("dplyr") ``` 1. It’s easiest to define visual encodings when the variables we want to encode are contained in their own columns. After sketching out a visualization of interest, we may find that these variables are not explicitly represented among the columns of the raw dataset. In this case, we may need to derive them based on what is available. The `dplyr` and `tidyr` packages provide functions for deriving new variables, which we review in these notes. 2. Sometimes a single column is used to implicitly store several variables. To make the data tidy, `separate` can be used to split that single column into several columns, each of which corresponds to exactly one variable. 3. The block below separates our earlier `table3`, which stored rate as a fraction in a character column. The original table was, ```{r} table3 ``` and the separated version is, ```{r} table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) # try without convert, and compare the data types of the columns ``` 4. Note that this function has an inverse, called `unite`, which can merge several columns into one. This is sometimes useful, but not as often as `separate`, since it isn’t needed to tidy a dataset. 5. Separating a single column into several is a special case of a more general operation, `mutate`, which defines new columns as functions of existing ones. We have used this is in previous lectures, but now we can philosophically justify it: the variables we want to encode need to be defined in advance. 6. For example, we may want to create a column `rate` that includes cases over population, ```{r} table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) %>% mutate(rate = cases / year) ``` 7. Sometimes, the variables of interest are functions of several rows. For example, perhaps we want to visualize averages of a variable across age groups. In this case, we can derive a summary across groups of rows using the `group_by`-followed-by-`summarise` pattern. 8. For example, perhaps we want the average rate over both years. ```{r} table3 %>% separate(rate, into = c("cases", "population"), convert = TRUE) %>% mutate(rate = cases / year) %>% group_by(country) %>% summarise(avg_rate = mean(rate)) ```