---
title: "Tidy Data"
description: |
  The definition of tidy data, and why it's often helpful for visualization.
author:
  - name: Kris Sankaran
    affiliation: UW Madison
date: 02-15-2021
output:
  distill::distill_article:
    self_contained: false
---

_[Reading](https://r4ds.had.co.nz/tidy-data.html), [Recording](https://mediaspace.wisc.edu/media/Week%204%20%5B1%5D%20Tidy%20Data/1_13526lye), [Rmarkdown](https://raw.githubusercontent.com/krisrs1128/stat479/master/_posts/2021-01-27-week4-1/week4-1.Rmd)_


```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = FALSE, message = FALSE, warning = FALSE, echo = TRUE)
```

```{r}
library("tidyr")
library("ggplot2")
theme_set(theme_bw())
```

1. A dataset is called tidy if rows correspond to distinct observations and columns correspond to distinct variables.

![](week4-1_files/tidy-1.png)

2. For visualization, it is important that data be in tidy format. This is because (a) each visual mark will be associated with a row of the dataset and (b) properties of the visual marks will be determined by values within the columns. A plot that is easy to create when the data are in tidy format might be very hard to create otherwise.
3. Tidy data might seem like an idea so natural that it’s not worth teaching (let alone formalizing). However, exceptions are encountered frequently, and it’s important that you be able to spot them. Further, there are now many utilities for "tidying" data, and they are worth becoming familiar with.
4. Here is an example of a tidy dataset.

```{r}
table1
```

It is easy to visualize the tidy dataset.

```{r}
ggplot(table1, aes(x = year, y = cases, col = country)) +
  geom_point() +
  geom_line()
```
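
To make the row-to-mark correspondence from point 2 concrete, the sketch below rebuilds the scatterplot and inspects the data that ggplot2 computed for its point layer. It assumes `layer_data()` from ggplot2 and the `table1` dataset that ships with tidyr; the names `p` and `marks` are introduced here only for illustration.

```{r}
library("tidyr")   # provides table1
library("ggplot2")

p <- ggplot(table1, aes(x = year, y = cases, col = country)) +
  geom_point()

# layer_data() returns the data computed for one layer (here, the points)
marks <- layer_data(p, 1)

# Each drawn point corresponds to one row of the tidy dataset
nrow(marks) == nrow(table1)

# Positions and colors of the points are derived from the mapped columns
marks[, c("x", "y", "colour")]
```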

5. Below are three non-tidy versions of the same dataset. They are
representative of more general classes of problems that may arise:

a. A variable might be implicitly stored within column names, rather than
explicitly stored in its own column. Here, the years are stored as column
names, so there is no `year` column to map to the x-axis, and the plot above
cannot be created directly from data in this format.

```{r}
table4a # cases
table4b # population
```
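
As a preview of the tidying tools covered in later lectures, one possible way to recover a plottable version of `table4a` is sketched below. It assumes `pivot_longer()` from tidyr; the name `tidy4a` is hypothetical and used only for this example.

```{r}
library("tidyr")
library("ggplot2")

# Move the year column names into an explicit `year` variable
tidy4a <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
tidy4a$year <- as.integer(tidy4a$year) # column names arrive as strings

# With year in its own column, the earlier plot can be recreated
ggplot(tidy4a, aes(x = year, y = cases, col = country)) +
  geom_point() +
  geom_line()
```

The same reshaping could be applied to `table4b` to obtain a tidy `population` column.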

b. The same observation may appear in multiple rows, with each row holding the
value of a different variable. Here, the observations are the country-by-year
combinations, and each one is split across a `cases` row and a `population` row.
	
```{r}
table2
```
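
A minimal sketch of one way to collapse these duplicated observations, assuming `pivot_wider()` from tidyr (the name `tidy2` is hypothetical):

```{r}
library("tidyr")

# Spread the values of `type` into their own columns,
# leaving one row per country-year combination
tidy2 <- pivot_wider(table2, names_from = type, values_from = count)
tidy2
```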

c. A single column actually stores multiple variables. Here, `rate` is being
used to store both the population and case count variables.
	
```{r}
table3
```

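The trouble is that this variable has to be stored as a character; otherwise, we
lose access to the original population and case variables. But a character
column cannot be placed on a numeric axis: ggplot2 treats each distinct string
as its own category, which makes the plot useless.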

```{r}
ggplot(table3, aes(x = year, y = rate)) +
  geom_point() +
  geom_line(aes(group = country))
```

The next few lectures provide tools for addressing these three problems.
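
For the third problem, a small sketch of one possible fix is given here. It assumes `separate()` from tidyr (newer tidyr versions also offer `separate_wider_delim()` as an alternative); the name `tidy3` is hypothetical.

```{r}
library("tidyr")

# Split the "cases/population" strings at the slash;
# convert = TRUE parses the resulting pieces as numbers
tidy3 <- separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/",
  convert = TRUE
)

# With both variables numeric, a genuine rate can be computed
tidy3$rate <- tidy3$cases / tidy3$population
tidy3
```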

6. A few caveats are in order. It’s easy to become a tidy-data purist and lose
sight of the bigger data-analytic picture. To prevent that, first, remember that
what is or is not tidy may be context dependent. For instance, you might want to
treat each week, rather than each day, as an observation. Second, know that there are
sometimes computational reasons to prefer non-tidy data. For example, "long"
data often require more memory, since column names that were originally stored
once now have to be copied onto each row. Certain statistical models are also
sometimes best framed as matrix operations on non-tidy datasets.
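
The memory and matrix points can be checked on a small simulated example. The sketch below is illustrative only: the dataset is made up (1000 observations with 50 numeric measurements), and the names `wide`, `long`, and `X` are hypothetical.

```{r}
library("tidyr")

set.seed(1)
# Simulated wide dataset: 1000 observations, 50 numeric measurements each
wide <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))
wide$id <- seq_len(1000)

# Long version: each column name is repeated once per value it labels
long <- pivot_longer(wide, cols = -id, names_to = "variable", values_to = "value")

# Compare the memory footprint of the two layouts
wide_size <- object.size(wide)
long_size <- object.size(long)
format(wide_size, units = "Kb")
format(long_size, units = "Kb")

# The wide layout converts directly into a matrix for linear algebra
X <- as.matrix(wide[, setdiff(names(wide), "id")])
dim(crossprod(X)) # cross-product matrix of the 50 measurements
```

Here the long layout stores an identifier and a variable name alongside every value, which is where the extra memory goes.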