---
title: "Data Wrangling I"
output:
  revealjs::revealjs_presentation:
    theme: white
    center: true
    transition: none
    incremental: true
---


```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```

# Data Wrangling

## Data Wrangling
As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill. 

- These skills are generally referred to as data wrangling or munging.
- In a data analysis or visualization setting, they will undoubtedly require a majority of your time.
- Wrangling data can be a painful process.
- This lecture will provide some tools and example of organizing data.

## Data Wrangling Concepts

- Wide and thin datasets
- Merging and joining relational data
- Dealing with strings
- Dealing with date/time objects


# Tidy Data

## Rules for Tidy Data
The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. [Tidy Data Reference](https://vita.had.co.nz/papers/tidy-data.pdf)

1. Each variable must have its own column,
2. Each observation must have its own row, and
3. Each value must have its own cell.

## Rules for Tidy Data

![Visual Representation of Tidy Data. Source: R4DS ](images/tidy.png)


## Why use Tidy Data

*Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham*

- Storing data in a consistent way gives familiarity with methods for manipulating data.
- Tidy data structure takes advantage of vectorised operations in R
- Many useful packages: such as `dplyr` and `ggplot2` require tidy data.


# Merge / Join

## Merge
An important skill is merging or combining data sets.

Consider the two data frames, how can we merge them and what should be the dimensions of the merged data frame.
```{r, echo = FALSE}
df1 <- data.frame(school = c('MSU','VT','Mines'),
            state= c('MT','VA','CO'), stringsAsFactors = F)
kable(df1)
df2 <- data.frame(school = c('Mines','MSU','VT'),
            enrollment = c(5794,15688,30598), stringsAsFactors = F)
kable(df2)
```

## pre-sort
One possibility is to use the arrange the data frames first.

```{r}
df1 <- df1[order(df1$school),]
df2 <- df2[order(df2$school),]
```

## pre-sort
One possibility is to use the arrange the data frames first.

```{r}
df1 
df2 
```


## rbind() and cbind()

Now, given that the data frames are both sorted the same way, we can bind the rows together.

```{r}
comb_df <- cbind(df1,df2)
comb_df
comb_df <- comb_df[,-3]
```

## rbind() and cbind()
Now assume we want to add another school to the data frame.

```{r, error=TRUE}
new.school <- c('Luther', 'IA',2337)
rbind(comb_df, new.school)
```
Note: if your strings are saved as factors, this chunk of code will give you an error.

## bind_rows() / bind_cols()

`dplyr` also contains functions for binding rows and columns.

## Exercise

Now combine the following information into a single table sorted alphabetically by the name of the ski hill.
```{r, echo = F}
ski_acres <- data.frame( ski.hill = c('Big Sky','Bridger Bowl', 'Jackson', 'Steamboat' ), 
                        skiable.acres = c(5800,2000, "2500+",2965))
kable(ski_acres)

df_cost <- data.frame( ski.resort = c('Bridger Bowl', 'Big Sky', 'Steamboat', 'Jackson'), 
                       ticket.cost = c(60, 'depends',145, 130))
df_cost
disco <- c('Discovery', 2200, 20); disco
```

## Solution part 1

Now combine the following information into a single table sorted alphabetically by the name of the ski hill.
```{r, echo = F}
ski_acres
df_cost <- df_cost %>% arrange(ski.resort)
df_comb <- ski_acres %>% bind_cols(df_cost) %>% select(-ski.resort)
str(df_comb)
```

## Solution part 2

```{r}
df_comb <- df_comb %>% mutate(ski.hill = as.character(ski.hill),
                              skiable.acres = as.character(skiable.acres),
                              ticket.cost = as.character(ticket.cost))
str(df_comb)
```

## Solution part 3

```{r}
df_comb <- df_comb %>% rbind(disco) %>% arrange(ski.hill)
df_comb
```

## join()
We could have also used some of the more advanced merge (join) features from dplyr.

```{r}
new.df <- full_join(df1,df2, by='school')
new.df
```

## more about join

The dplyr join command includes the following options:

- left join
- right join
- inner join
- full join
- anti join

## Exercise

Combine the two data sets
```{r,echo = F}
ski_acres
df_cost
disco_df <- tibble(ski.hill = 'Discovery', skiable.acres = '2200', ticket.cost = '20'); disco_df
```

## Solution

```{r}
full_join(ski_acres, df_cost, by = c("ski.hill"= "ski.resort")) %>%
  mutate(ski.hill = as.character(ski.hill),
  skiable.acres = as.character(skiable.acres),
  ticket.cost = as.character(ticket.cost)) %>% 
  full_join(disco_df) %>%
  arrange(ski.hill)
```


# wide / long data

## Hoops Data

This is a _wide_ dataset

```{r, message = F}
hoops <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/TourneyDetailedResults.csv')
head(hoops)
```

## Plot Creation
Consider a plot where the number of points is colored or faceted by winning or losing. 

This requires a dataset that  is _thin_ and looks something like

|      Game Id     | Points |win/lose | 
|:----------------:|-------|:---------:|
| 1 | 92   | win       |
| 1   | 84   | lose       |
| 2  | 80   | win       |
| 2 | 51    | lose      |

## Data Wrangling

```{r}
lose_score <- hoops %>% select(score = Lscore) %>% 
  mutate(outcome = 'lose', game_id = row_number()) 
win_score <- hoops %>% select(score = Wscore) %>% 
  mutate(outcome = 'win', game_id = row_number())
thin_scores <- bind_rows(win_score, lose_score) %>% 
  arrange(game_id)
thin_scores
```


## Plot Creation

```{r, echo = F, message = F}
thin_scores %>% ggplot(aes(x = score)) + geom_histogram() + facet_wrap(.~outcome) +
  ggtitle('Points scored in NCAA games for Winning and Losing Teams')
```

## tidyr package

The `tidyr` package contains additional tools for converting files from long to wide formats.

Consider the `billboard` dataset which contains the rank of the song (in 2000) for each week after it first entered the list.

```{r}
billboard
```

## billboard data

If we want to identify songs that reach number 1 quickly, the data needs to manipulated.
```{r}
billboard %>% 
  select(artist, track, date.entered, wk1, wk2) %>%
  pivot_longer(
  cols= c('wk1', 'wk2'),
  names_to = 'week',
  values_to = 'rank',
  values_drop_na = T)
```

## Billboard Data Analysis

```{r}
billboard %>% 
  pivot_longer(
  cols= starts_with('wk'),
  names_to = 'week',
  values_to = 'rank',
  values_drop_na = T) %>%
  mutate(week_numb= as.numeric(str_replace(week, 'wk',''))) %>%
  filter(rank == 1) %>%
  arrange(week_numb) %>% 
  slice(1) %>%
  kable()
```

## Exercise

Determine which song in this dataset spent the most time at #1

## Solution

```{r}
billboard %>% 
  pivot_longer(
  cols= starts_with('wk'),
  names_to = 'week',
  values_to = 'rank',
  values_drop_na = T) %>%
  mutate(week_numb= as.numeric(str_replace(week, 'wk',''))) %>%
  filter(rank == 1) %>%
  group_by(track) %>% 
  tally() %>%
  arrange(desc(n)) %>%
  kable()
```


## pivot_wider()

The `pivot_wider()` function allows transformation in the other direction from thin to wide.