--- title: "Data Wrangling I" output: revealjs::revealjs_presentation: theme: white center: true transition: none incremental: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) knitr::opts_chunk$set(echo = TRUE) ``` # Data Wrangling ## Data Wrangling As a statistician or more generally a data scientist the ability to manipulate, process, clean, and merge datasets is an essential skill. - These skills are generally referred to as data wrangling or munging. - In a data analysis or visualization setting, they will undoubtedly require a majority of your time. - Wrangling data can be a painful process. - This lecture will provide some tools and example of organizing data. ## Data Wrangling Concepts - Wide and thin datasets - Merging and joining relational data - Dealing with strings - Dealing with date/time objects # Tidy Data ## Rules for Tidy Data The concept of tidy data can be attributed to Hadley Wickham and has three principles for organizing data. [Tidy Data Reference](https://vita.had.co.nz/papers/tidy-data.pdf) 1. Each variable must have its own column, 2. Each observation must have its own row, and 3. Each value must have its own cell. ## Rules for Tidy Data ![Visual Representation of Tidy Data. Source: R4DS ](images/tidy.png) ## Why use Tidy Data *Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham* - Storing data in a consistent way gives familiarity with methods for manipulating data. - Tidy data structure takes advantage of vectorised operations in R - Many useful packages: such as `dplyr` and `ggplot2` require tidy data. # Merge / Join ## Merge An important skill is merging or combining data sets. Consider the two data frames, how can we merge them and what should be the dimensions of the merged data frame. ```{r, echo = FALSE} df1 <- data.frame(school = c('MSU','VT','Mines'), state= c('MT','VA','CO'), stringsAsFactors = F) kable(df1) df2 <- data.frame(school = c('Mines','MSU','VT'), enrollment = c(5794,15688,30598), stringsAsFactors = F) kable(df2) ``` ## pre-sort One possibility is to use the arrange the data frames first. ```{r} df1 <- df1[order(df1$school),] df2 <- df2[order(df2$school),] ``` ## pre-sort One possibility is to use the arrange the data frames first. ```{r} df1 df2 ``` ## rbind() and cbind() Now, given that the data frames are both sorted the same way, we can bind the rows together. ```{r} comb_df <- cbind(df1,df2) comb_df comb_df <- comb_df[,-3] ``` ## rbind() and cbind() Now assume we want to add another school to the data frame. ```{r, error=TRUE} new.school <- c('Luther', 'IA',2337) rbind(comb_df, new.school) ``` Note: if your strings are saved as factors, this chunk of code will give you an error. ## bind_rows() / bind_cols() `dplyr` also contains functions for binding rows and columns. ## Exercise Now combine the following information into a single table sorted alphabetically by the name of the ski hill. ```{r, echo = F} ski_acres <- data.frame( ski.hill = c('Big Sky','Bridger Bowl', 'Jackson', 'Steamboat' ), skiable.acres = c(5800,2000, "2500+",2965)) kable(ski_acres) df_cost <- data.frame( ski.resort = c('Bridger Bowl', 'Big Sky', 'Steamboat', 'Jackson'), ticket.cost = c(60, 'depends',145, 130)) df_cost disco <- c('Discovery', 2200, 20); disco ``` ## Solution part 1 Now combine the following information into a single table sorted alphabetically by the name of the ski hill. ```{r, echo = F} ski_acres df_cost <- df_cost %>% arrange(ski.resort) df_comb <- ski_acres %>% bind_cols(df_cost) %>% select(-ski.resort) str(df_comb) ``` ## Solution part 2 ```{r} df_comb <- df_comb %>% mutate(ski.hill = as.character(ski.hill), skiable.acres = as.character(skiable.acres), ticket.cost = as.character(ticket.cost)) str(df_comb) ``` ## Solution part 3 ```{r} df_comb <- df_comb %>% rbind(disco) %>% arrange(ski.hill) df_comb ``` ## join() We could have also used some of the more advanced merge (join) features from dplyr. ```{r} new.df <- full_join(df1,df2, by='school') new.df ``` ## more about join The dplyr join command includes the following options: - left join - right join - inner join - full join - anti join ## Exercise Combine the two data sets ```{r,echo = F} ski_acres df_cost disco_df <- tibble(ski.hill = 'Discovery', skiable.acres = '2200', ticket.cost = '20'); disco_df ``` ## Solution ```{r} full_join(ski_acres, df_cost, by = c("ski.hill"= "ski.resort")) %>% mutate(ski.hill = as.character(ski.hill), skiable.acres = as.character(skiable.acres), ticket.cost = as.character(ticket.cost)) %>% full_join(disco_df) %>% arrange(ski.hill) ``` # wide / long data ## Hoops Data This is a _wide_ dataset ```{r, message = F} hoops <- read_csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/TourneyDetailedResults.csv') head(hoops) ``` ## Plot Creation Consider a plot where the number of points is colored or faceted by winning or losing. This requires a dataset that is _thin_ and looks something like | Game Id | Points |win/lose | |:----------------:|-------|:---------:| | 1 | 92 | win | | 1 | 84 | lose | | 2 | 80 | win | | 2 | 51 | lose | ## Data Wrangling ```{r} lose_score <- hoops %>% select(score = Lscore) %>% mutate(outcome = 'lose', game_id = row_number()) win_score <- hoops %>% select(score = Wscore) %>% mutate(outcome = 'win', game_id = row_number()) thin_scores <- bind_rows(win_score, lose_score) %>% arrange(game_id) thin_scores ``` ## Plot Creation ```{r, echo = F, message = F} thin_scores %>% ggplot(aes(x = score)) + geom_histogram() + facet_wrap(.~outcome) + ggtitle('Points scored in NCAA games for Winning and Losing Teams') ``` ## tidyr package The `tidyr` package contains additional tools for converting files from long to wide formats. Consider the `billboard` dataset which contains the rank of the song (in 2000) for each week after it first entered the list. ```{r} billboard ``` ## billboard data If we want to identify songs that reach number 1 quickly, the data needs to manipulated. ```{r} billboard %>% select(artist, track, date.entered, wk1, wk2) %>% pivot_longer( cols= c('wk1', 'wk2'), names_to = 'week', values_to = 'rank', values_drop_na = T) ``` ## Billboard Data Analysis ```{r} billboard %>% pivot_longer( cols= starts_with('wk'), names_to = 'week', values_to = 'rank', values_drop_na = T) %>% mutate(week_numb= as.numeric(str_replace(week, 'wk',''))) %>% filter(rank == 1) %>% arrange(week_numb) %>% slice(1) %>% kable() ``` ## Exercise Determine which song in this dataset spent the most time at #1 ## Solution ```{r} billboard %>% pivot_longer( cols= starts_with('wk'), names_to = 'week', values_to = 'rank', values_drop_na = T) %>% mutate(week_numb= as.numeric(str_replace(week, 'wk',''))) %>% filter(rank == 1) %>% group_by(track) %>% tally() %>% arrange(desc(n)) %>% kable() ``` ## pivot_wider() The `pivot_wider()` function allows transformation in the other direction from thin to wide.