--- title: "Pivoting" description: | Tools for reshaping data into tidy format. author: - name: Kris Sankaran affiliation: UW Madison date: 02-16-2021 output: distill::distill_article: self_contained: false --- _[Reading](https://r4ds.had.co.nz/tidy-data.html), [Recording](https://mediaspace.wisc.edu/media/Week%204%20%5B2%5D%20Pivoting/1_dp7cqka3), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-27-week4-2/week4-2.Rmd)_ ```{r setup, include=FALSE} knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE) ``` ```{r} library("dplyr") library("tidyr") library("ggplot2") theme_set(theme_bw()) ``` 1. Pivoting refers to the process of changing the interpretation of each row in a data frame. It is useful for addressing problems 1 - 2 in the previous lecture, which we repeat here for completeness. a. A variable might be implicitly stored within column names, rather than explicitly stored in its own column. b. The same observation may appear in multiple rows, where each instance of the row is associated with a different variable. 2. To address (a), we can use the `pivot_longer` function in `tidyr`. It takes an implicitly stored variable and explicitly stores it in a column defined by the `names_to` argument. In 3. The example below shows `pivot_longer` being used to tidy one of the non-tidy tuberculosis datasets. Note that the data has doubled in length, because there are now two rows per country (one per year). For reference, these are the original data. ```{r} table4a ``` This step lengthens the data, ```{r} table4a_longer <- table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases") table4a_longer ``` ```{r} dim(table4a) dim(table4a_longer) ``` 4. We can pivot both the population and the cases table, then combine them using a join operation. A join operation matches rows across two tables according to their shared columns. ```{r} # helper function, to avoid copying and pasting code pivot_fun <- function(x, value_column = "cases") { x %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = value_column) } table4 <- left_join( pivot_fun(table4a), # look for all country x year combinations in left table pivot_fun(table4b, "population") # and find matching rows in right table ) table4 ``` This lets us make the year vs. rate plot that we had tried to put together in the last lecture. It's much easier to recognize trends when comparing the rates, than when looking at the raw case counts. ```{r} ggplot(table4, aes(x = year, y = cases / population, col = country)) + geom_point() + geom_line(aes(group = country)) ``` 5. To address (b), we can use the `pivot_wider` function. It spreads the column in the `values_from` argument across new columns specified by the `names_from` argument. 6. The example below shows `pivot_wider` being used to tidy one of the other non-tidy datasets. Note when there are more than two levels in the `names_from` column, this will always be wider than the starting data frame, which is why this operation is called `pivot_wider`. For reference, here is `table2` before pivoting. ```{r} table2 ``` Now, we spread the `cases` and `population` variables into their own columns. ```{r} table2 %>% pivot_wider(names_from = type, values_from = count) ```