--- title: "Tidying data" editor: markdown: wrap: 72 --- ## Tidying data {.scrollable} - Data rarely come to us as we want to use them. - Before we can do analysis, typically have organizing to do. - This is typical of ANOVA-type data, "wide format": ``` pig feed1 feed2 feed3 feed4 1 60.8 68.7 92.6 87.9 2 57.0 67.7 92.1 84.2 3 65.0 74.0 90.2 83.1 4 58.6 66.3 96.5 85.7 5 61.7 69.8 99.1 90.3 ``` - 20 pigs randomly allocated to one of four feeds. At end of study, weight of each pig is recorded. - Are any differences in mean weights among the feeds? - Problem: want all weights in one column, with 2nd column labelling which feed. Untidy! ## Tidy and untidy data (Wickham) {.scrollable} - Data set easier to deal with if: - each observation is one row - each variable is one column - each type of observation unit is one table - Data arranged this way called "tidy"; otherwise called "untidy". - For the pig data: - response variable is weight, but scattered over 4 columns, which are levels of a factor `feed`. - Want all the weights in one column, with a second column `feed` saying which feed that weight goes with. - Then we can run `aov`. ## Packages for this section ```{r tidying-R-1} library(tidyverse) ``` ## Reading in the pig data ```{r tidying-R-2, message=F} my_url <- "http://ritsokiguess.site/datafiles/pigs1.txt" pigs1 <- read_delim(my_url, " ") pigs1 ``` ## Making it longer - We wanted all the weights in one column, labelled by which feed they went with. - This is a very common reorganization, and the magic "verb" is `pivot_longer`: ```{r tidying-R-3} pigs1 %>% pivot_longer(feed1:feed4, names_to="feed", values_to="weight") -> pigs2 pigs2 ``` ## Alternatives Any way of choosing the columns to pivot longer is good, eg: ```{r} pigs1 %>% pivot_longer(-pig, names_to="feed", values_to="weight") ``` or ```{r} pigs1 %>% pivot_longer(starts_with("feed"), names_to="feed", values_to="weight") ``` ## Comments - `pigs2` now in "long" format, ready for analysis. - Anatomy of `pivot_longer`: - columns to combine - a name for column that will contain groups ("names") - a name for column that will contain measurements ("values") ## Identifying the pigs - Values in `pig` identify pigs *within each group*: pig 1 is four different pigs! - Create unique pig IDs by gluing pig number onto feed: ```{r tidying-R-6} pigs2 %>% mutate(pig_id=str_c(feed, "_", pig)) -> pigs2 pigs2 ``` ## ...and finally, the analysis - which is just what we saw before: ```{r tidying-R-7} weight.1 <- aov(weight ~ feed, data = pigs2) summary(weight.1) ``` - The mean weights of pigs on the different feeds are definitely not all equal. - So we run Tukey to see which ones differ (over). ## Tukey ```{r tidying-R-8} TukeyHSD(weight.1) ``` All of the feeds differ! ## Mean weights by feed To find the best and worst, get mean weight by feed group. I borrowed an idea from earlier to put the means in descending order: ```{r tidying-R-9} pigs2 %>% group_by(feed) %>% summarize(mean_weight = mean(weight))%>% arrange(desc(mean_weight)) ``` Feed 3 is best, feed 1 worst. ## Should we have any concerns about the ANOVA? ```{r pigfeedplot} ggplot(pigs2, aes(x = feed, y = weight)) + geom_boxplot() ``` ## Comments - Feed 2 has an outlier - But there are only 5 pigs in each group - The conclusion is so clear that I am OK with this. ## Tuberculosis - The World Health Organization keeps track of number of cases of various diseases, eg. tuberculosis. - Some data: ```{r tidying-R-10} my_url <- "http://ritsokiguess.site/datafiles/tb.csv" tb <- read_csv(my_url) ``` ## The data (randomly chosen rows) ```{r} #| echo: false options(width = 72) ``` ```{r tidying-R-11} tb %>% slice_sample(n = 10) ``` Many rows: ```{r} nrow(tb) ``` ## What we have - Variables: country (abbreviated), year. Then number of cases for each gender and age group, eg. `m1524` is males aged 15--24. Also `mu` and `fu`, where age is unknown. - Lots of missings. Want to get rid of. - Abbreviations [here](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2). ```{r tidying-R-13} tb %>% pivot_longer(m04:fu, names_to = "genage", values_to = "freq", values_drop_na = TRUE) -> tb2 ``` - Code for `pivot_longer`: - columns to make longer - column to contain the names (categorical) - column to contain the values (quantitative) - drop missings in the values ## Results (some) ```{r tidying-R-14} tb2 ``` ## Separating - 4 columns, but 5 variables, since `genage` contains both gender and age group. Split that up using separate. - `separate` needs 3 things: - what to separate (no quotes needed), - what to separate into (here you do need quotes), - how to split. - For "how to split", here "after first character": ```{r tidying-R-15} tb2 %>% separate_wider_position(genage, widths = c("gender" = 1, "age" = 4), too_few = "align_start") -> tb3 tb3 ``` ## Tidied tuberculosis data (some) ```{r tidying-R-16} tb3 ``` ```{r tidying-R-17, echo=FALSE} write_rds(tb3, "tb3.rds") ``` ## In practice... {.smaller} - instead of doing the pipe one step at a time, you *debug* it one step at a time, and when you have each step working, you use that step's output as input to the next step, thus: ```{r tidying-R-18} tb %>% pivot_longer(m04:fu, names_to = "genage", values_to = "freq", values_drop_na = TRUE) %>% separate_wider_position(genage, widths = c("gender" = 1, "age" = 4), too_few = "align_start") ``` - You can split the R code over as many lines as you like, as long as each line is incomplete, so that R knows more is to come. - I like to put the pipe symbol on the end of the line. ## Total tuberculosis cases by year (some of the years) ```{r tidying-R-19} tb3 %>% filter(between(year, 1991, 1998)) %>% group_by(year) %>% summarize(total=sum(freq)) ``` - Something very interesting happened between 1994 and 1995. ## To find out what - try counting up total cases by country: ```{r tidying-R-20} tb3 %>% group_by(iso2) %>% summarize(total=sum(freq)) %>% arrange(desc(total)) ``` ## What years do I have for China? China started recording in 1995, which is at least part of the problem: ```{r tidying-R-21} tb3 %>% filter(iso2=="CN") %>% group_by(year) %>% summarize(total=sum(freq)) ``` ## First year of recording by country? - A lot of countries started recording in about 1995, in fact: ```{r tidying-R-22, echo=FALSE} options(dplyr.summarise.inform = FALSE) ``` ```{r tidying-R-23} tb3 %>% group_by(iso2) %>% summarize(first_year=min(year)) %>% count(first_year) ``` - So the reason for the big jump in cases is that so many countries started recording then, not that there really were more cases. ## Some Toronto weather data ```{r tidying-R-24} my_url <- "http://ritsokiguess.site/STAC32/toronto_weather.csv" weather <- read_csv(my_url) weather ``` ## The columns - Daily weather records for "Toronto City" weather station in 2018: - `station`: identifier for this weather station (always same here) - `Year`, `Month` - `element`: whether temperature given was daily max or daily min - `d01`, `d02`,... `d31`: day of the month from 1st to 31st. ## Off we go Numbers in data frame all temperatures (for different days of the month), so first step is ```{r tidying-R-26} weather %>% pivot_longer(d01:d31, names_to="day", values_to="temperature", values_drop_na = TRUE) ``` ## `Element` - Column `element` contains names of two different variables, that should each be in separate column. - Distinct from eg. `m1524` in tuberculosis data, that contained levels of two different factors, handled by separate. - Untangling names of variables handled by `pivot_wider`. ## Handling `element` ```{r tidying-R-28} weather %>% pivot_longer(d01:d31, names_to="day", values_to="temperature", values_drop_na = TRUE) %>% pivot_wider(names_from=element, values_from=temperature) ``` ## Further improvements 1/2 - We have tidy data now, but can improve things further. - `mutate` creates new columns from old (or assign back to change a variable). - Would like numerical dates. `separate` works, or pull out number as below. - `select` keeps columns (or drops, with minus). Station name has no value to us. ## Further improvements 2/2 ```{r tidying-R-30} weather %>% pivot_longer(d01:d31, names_to="day", values_to="temperature", values_drop_na = TRUE) %>% pivot_wider(names_from=element, values_from=temperature) %>% mutate(Day = parse_number(day)) %>% select(-station) ``` ## Final step(s) - Make year-month-day into proper date. - Keep only date, tmax, tmin: ```{r tidying-R-32} weather %>% pivot_longer(d01:d31, names_to="day", values_to="temperature", values_drop_na = T) %>% pivot_wider(names_from=element, values_from=temperature) %>% mutate(Day = parse_number(day)) %>% select(-station) %>% unite(datestr, c(Year, Month, Day), sep = "-") %>% mutate(date = as.Date(datestr)) %>% select(c(date, tmax, tmin)) -> weather_tidy ``` ## Our tidy data frame ```{r tidying-R-33} weather_tidy ``` ## Plotting the temperatures - Plot temperature against date joined by lines, but with separate lines for max and min. `ggplot` requires something like \footnotesize ```{r tidying-R-34, eval=F} ggplot(..., aes(x = date, y = temperature)) + geom_point() + geom_line() ``` \normalsize only we have two temperatures, one a max and one a min, that we want to keep separate. - The trick: combine `tmax` and `tmin` together into one column, keeping track of what kind of temp they are. (This actually same format as untidy `weather`.) Are making `weather_tidy` untidy for purposes of drawing graph only. - Then can do something like \footnotesize ```{r tidying-R-35, eval=F} ggplot(d, aes(x = date, y = temperature, colour = maxmin)) + geom_point() + geom_line() ``` \normalsize to distinguish max and min on graph. ## Setting up plot - Since we only need data frame for plot, we can do the column-creation and plot in a pipeline. - For a `ggplot` in a pipeline, the initial data frame is omitted, because it is whatever came out of the previous step. - To make those "one column"s: `pivot_longer`. I save the graph to show overleaf: ```{r tidying-R-36} weather_tidy %>% pivot_longer(tmax:tmin, names_to="maxmin", values_to="temperature") %>% ggplot(aes(x = date, y = temperature, colour = maxmin)) + geom_point() + geom_line() -> g ``` ## The plot ```{r temp-plot} g ``` ## Summary of tidying "verbs" | Verb | Purpose | |:---------------|:---------------------------------------------------------------------------------------------| | `pivot_longer` | Combine columns that measure same thing into one | | `pivot_wider` | Take column that measures one thing under different conditions and put into multiple columns | | `separate` | Turn a column that encodes several variables into several columns | | `unite` | Combine several (related) variables into one "combination" variable | `pivot_longer` and `pivot_wider` are opposites; `separate` and `unite` are opposites.