--- title: "Practical R: Data Munging" author: Abhijit Dasgupta date: BIOF 339 --- ```{r setup, include=F, child = here::here('slides/templates/setup.Rmd')} ``` ```{r setup1, include=FALSE} library(countdown) library(fontawesome) ``` class: middle, center # Data munging --- background-image: url(../img/tidyverse_celestial.png) background-size: contain --- ## What is the tidyverse? > The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. -- Dr. Hadley Wickham - A human-friendly syntax and semantics to make code more understandable - The functions in the tidyverse often wraps harder-to-understand functions into simpler, more understandable forms - We're taking an opinionated choice here - Covers maybe 85% of the cases you'll ever face - Takes a particular viewpoint about how data _should_ be organized - But this makes things easier and simpler ---- The [tidyverse.org](https://www.tidyverse.org) site and the [R4DS book](https://r4ds.had.co.nz) are the definitive sources for tidyverse information. The packages are united in a common philosophy of how data analysis should be done. --- class:middle, center # Tidying data --- ## Tidy data
Separating columns into different variables
Country | Year | Gender | Age | Count
---------|--------|---------|--------|-------
---
## Tidying data
The typical steps are
+ Transforming data from wide to tall (`pivot_longer`) and from tall to wide (`pivot_wider`)
+ Separating columns into different columns (`separate`)
+ Putting columns together into new variables (`unite`)
----
>The functions `pivot_longer` and `pivot_wider` supercede the older functions `gather` and `spread`,
which I have used in previous iterations of this class. However, if you are familiar with `gather` and `spread`, they aren't gone and can still be used in the current **tidyr** package.
---
## Tidy data
A first step in the tidyverse is to activate the `tidyverse` meta-package
```{r munging-1, message = F, warning=F}
library(tidyverse)
```
.pull-left[
```{r munging-2, echo = F, results = 'asis'}
library(tidyverse)
library(glue)
tidy_pkgs <-
tibble('pkgs' = c('ggplot2','purrr','readr','tidyr',
'dplyr','forcats','lubridate','stringr')) %>%
mutate(
descr = map_chr(pkgs, ~packageDescription(., fields='Title')))
glue_data(tidy_pkgs[1:4,],
"- [**{pkgs}**](http://{pkgs}.tidyverse.org): {descr}")
```
]
.pull-right[
```{r munging-3, echo = F, results = 'asis'}
library(tidyverse)
library(glue)
glue_data(tidy_pkgs[5:8,],
"- [**{pkgs}**](http://{pkgs}.tidyverse.org): {descr}")
```
]
---
## Tidy data
The common feature of all these packages is that their functions take a
data frame (which the tidyverse calls a `tibble`) as their first argument.
So the starting point for any analysis is the data set.
---
## Tidy data
```{r munging-4 }
table1
```
Is this tidy?
---
## Tidy data
```{r munging-5 }
table2
```
Is this tidy?
---
## Tidy data
```{r munging-6 }
table3
```
Is this tidy?
---
## Tidy data
```{r munging-7 }
table4a # cases
table4b # population
```
Are these tidy?
---
## Can we make datasets tidy?
Sometimes. The functions in the `tidyr` package can help
- `separate` is a function that can split a column into multiple columns
- When there are multiple variables together in a column
```{r munging-8 }
table3
```
We need to separate `rate` into two variables, cases and population
---
## Can we make datasets tidy?
```{r munging-9 }
separate(table3, col = rate, into = c("cases", "population"),
sep = "/",
convert = TRUE) # convert type if possible #<<
```
> I've been explicit about naming all the options. R functions can work by
position as well, so `separate(table3, rate, c('cases','population'), '/')` would work, but it's not very clear, is it?
---
## Can we make datasets tidy?
```{r munging-10 }
table2
```
Here there are observations on two variables in successive rows
---
## Can we make datasets tidy?
We need to `spread` these rows out into different columns. This function is now called `pivot_wider`.
.left-column30[
![](../img/tidyr-spread-gather.gif)
]
.right-column70[
```{r munging-11 }
pivot_wider(table2, names_from = type, values_from = count)
```
]
---
## Can we make datasets tidy?
```{r munging-12 }
table4a
```
Here, the variable for year is stored as a header, not as data in a cell.
We need to `gather` that data and put it into a column. This function is now called `pivot_longer`
---
## Can we make datasets tidy?
.left-column30[
![](../img/tidyr-spread-gather.gif)
]
.right-column70[
```{r munging-13 }
pivot_longer(table4a, names_to = 'year', values_to = 'cases',
cols = c(`1999`, `2000`))
```
]
---
## Making data tidy
Admittedly, `pivot_wider` and `pivot_longer` are not easy concepts, but we'll practice with them more.
1. `pivot_longer` collects multiple columns into 2, and only 2 columns
- One column represents the data in the column headers
- One column represents the values in the column
- All other columns are repeated to keep all the data properly associated
1. `pivot_wider` takes two columns and makes them multiple columns
- The values in one column form the headers to different new columns
- The values in the other column represent the values in the corresponding cells
- The other columns are repeated to start with, but reduce repetitions to make all associated data stay together
---
## Tidying the weather data
```{r munging-15 }
library(tidyverse)
weather_data <- rio::import('../data/weather.csv')
```
.pull-left[
```{r munging-16, echo=F}
head(weather_data)
```
]
.pull-right[
1. Days are in separate columns
1. Temperatures for each day is in two rows, max and min
1. Don't worry about missing values. Just work on getting the shape right
]
---
## Tidying the weather data
```{r munging-17, eval = T, echo = T}
weather1 <- pivot_longer(weather_data, names_to='day', values_to='temp',
cols = c(-(1:4))) #<<
head(weather1,5)
```
---
## Tidying the weather data
```{r munging-19, eval = T, echo = T}
weather1 <- pivot_longer(weather_data, names_to='day', values_to='temp',
cols = c(-(1:4)))
weather2 <- pivot_wider(weather1, names_from='element', values_from = 'temp')
head(weather2, 5)
```
---
## Tidying the weather data
```{r munging-21, eval = T, echo = T}
weather1 <- pivot_longer(weather_data, names_to='day', values_to='temp',
cols = c(-(1:4)))
weather2 <- pivot_wider(weather1, names_from='element', values_from = 'temp')
weather3 <- separate(weather2, col='day', into=c('symbol','day'), sep=1)
head(weather3,5)
```
This gets us into the right shape for the data.
There still is some work to do, but the format is tidy
---
## Data transformation (dplyr)
The `dplyr` package gives us a few verbs for data manipulation
```{r munging-23, echo = F, results='asis'}
dat <- tribble(
~Function, ~Purpose,
'select', "Select columns based on name or position",
'mutate', 'Create or change a column',
'filter', 'Extract rows based on some criteria',
'arrange', 'Re-order rows based on values of variable(s)',
'group_by', 'Split a dataset by unique values of a variable',
'summarize', 'Create summary statistics based on columns')
knitr::kable(dat, format='markdown')
```
---
## `select`
You can select columns by name or position, of course, e.g., `select(weather, month)` or `select(weather, 3)`
You can select consecutive columns using `:` notation, e.g. `select(weather, d1:d31)`
You can also select columns based on some criteria, which are encapsulated in functions.
- `starts_with("___")`, `ends_with("___")`, `contains("____")`
- `one_of("____","_____","______")`
- `everything()`
There are others; see `help(starts_with)`.
These selection methods work in all tidyverse functions
> Note that for `select` the names of the columns don't need to be quoted. This is called *non-standard evaluation* and
is a convenience. However for the criteria-based selectors within `select`, you **do** need to quote the criteria
---
## select
.pull-left[
```{r munging-24, eval = F, echo = T}
weather1 <- select(weather_data, year, month, d1:d31) #<<
head(weather1, 20)
```
]
.pull-right[
```{r munging-25, eval=T, echo = F, ref.label="munging-24"}
```
]
---
## select
.pull-left[
```{r munging-26, eval = F, echo = T}
weather1 <- select(weather_data, starts_with('d')) #<<
head(weather1, 20)
```
]
.pull-right[
```{r munging-27, eval=T, echo = F, ref.label="munging-26"}
```
]
---
## select
The flexibility of the `select` function, which is also used in other tidyverse functions like `pivot_longer`, and others we'll see presently, is quite powerful.
Suppose you have a large genomic data where the columns are different genes, and suppose that the housekeeping genes all start with "HK". Then, in order to _remove_ the housekeeping genes, you could just do
```{r munging-54, eval=F}
new_data <- select(old_data, -starts_with("HK"))
```
Here, the `-` sign means, remove those columns.
Also note that we have to assign the selected dataset to a new (or old) name in order to
preserve it in the workspace.
---
## select
I always prefer naming my columns well and using the capabilities of `select` to grab columns.
However, you can use `select` with column numbers. For example, if you wanted to grab the
first 4 columns of a dataset, you could do
```{r munging-55, eval=F}
new_data <- select(old_data, 1:4)
```
.footnote[The notation `1:4` is a short hand for the sequence `1,2,3,4`. Generally, the notation `m:n` means the set of consecutive integers between `m` and `n`.]
---
## mutate
`mutate`, as the name suggests, either creates a new column in your data set or transforms an existing column.
.pull-left[
```{r munging-28, eval = F, echo = T}
weather4 <- mutate(weather3,
num_day = as.numeric(day))
as_tibble(weather4)
```
]
.pull-right[
```{r munging-29, eval=T, echo = F, ref.label="munging-28"}
```
]
---
## mutate
`mutate` can either transform a column in place or create a new column in a dataset
.pull-left[
```{r munging-30, eval = F, echo = T}
weather4 <- mutate(weather3, day = as.numeric(day))
as_tibble(weather4)
```
]
.pull-right[
```{r munging-31, eval=T, echo = F, ref.label="munging-30"}
```
]
---
## mutate
`mutate` can also be used to deal with missing values, by replacing them with a value, for example
.pull-left[
```{r munging-32, eval = F, echo = T}
mutate(weather4, tmax = replace_na(tmax, 0))
```
You wouldn't want to do exactly this, of course
]
.pull-right[
```{r munging-33, eval=T, echo = F, ref.label="munging-32"}
```
]
---
## across
**dplyr** version 1.0 introduced a new verb, `across` to allow functions like `mutate` (and `summarize`, which we shall see in the statistics module) to act on a selection of columns
which can be chosen using the same syntax as `select`, or by condition.
.pull-left[
```{r, eval=F}
mutate(mpg,
cty = cty * 1.6/3.8,
hwy = hwy * 1.6/3.8)
```
]
.pull-right[
```{r munging-59, echo=T, eval=F}
mutate(mpg,
across(c(cty, hwy),
function(x) {x * 1.6/3.8}))
```
]
-----
```{r munging-60,eval=F}
mutate(mpg,
across(is.character, as.factor)) # select based on condition
```
---
## filter
`filter` extracts **rows** based on criteria
So if we wanted to just grab January data, we could use
.pull-left[
```{r munging-34, eval = F, echo = T}
january <- filter(weather4, month==1)
head(january)
```
]
.pull-right[
```{r munging-35, eval=T, echo = F, ref.label="munging-34"}
```
]
---
## filter
Some comparison operators for filtering
| Operator | Meaning |
|----------|----------------------------------|
| == | Equals |
| != | Not equals |
| > / < | Greater / less than |
| >= / <= | Greater or equal / Less or equal |
| ! | Not |
| %in% | In a set |
Combining comparisons
| Operator | Meaning |
|------------|---------|
| & | And |
| | | Or |
---
## filter
Some comparison operators for filtering
Strings: `str_detect(