--- title: "Tidying data: extras" --- ## Packages ```{r tidy-extra-R-1} library(tidyverse) ``` ## The pig feed data again ```{r tidy-extra-R-2, message=F} my_url <- "http://ritsokiguess.site/datafiles/pigs1.txt" pigs <- read_table(my_url) pigs ``` ## Make longer (as before) ```{r tidy-extra-R-3} pigs %>% pivot_longer(-pig, names_to="feed", values_to="weight") -> pigs_longer pigs_longer ``` ## Make wider two ways 1/2 `pivot_wider` is inverse of `pivot_longer`: ```{r tidy-extra-R-4} pigs_longer %>% pivot_wider(names_from=feed, values_from=weight) ``` we are back where we started. ## Make wider 2/2 Or ```{r tidy-extra-R-5} pigs_longer %>% pivot_wider(names_from=pig, values_from=weight) ``` ## Disease presence and absence at two locations Frequencies of plants observed with and without disease at two locations: ``` Species Disease present Disease absent Location X Location Y Location X Location Y A 44 12 38 10 B 28 22 20 18 ``` This has two rows of headers, so I rewrote the data file: ``` Species present_x present_y absent_x absent_y A 44 12 38 10 B 28 22 20 18 ``` Read into data frame called `prevalence`. ```{r tidy-extra-R-6, message=F} my_url <- "http://ritsokiguess.site/STAC32/disease.txt" prevalence <- read_table(my_url) prevalence ``` ## Lengthen and separate ```{r tidy-extra-R-7} prevalence %>% pivot_longer(-Species, names_to = "column", values_to = "freq") %>% separate_wider_delim(column, "_", names = c("disease", "location")) ``` ## Making longer, the better way ```{r tidy-extra-R-8} prevalence prevalence %>% pivot_longer(-Species, names_to=c("disease", "location"), names_sep="_", values_to="frequency") -> prevalence_longer prevalence_longer ``` ## Making wider, different ways ```{r tidy-extra-R-9} prevalence_longer %>% pivot_wider(names_from=c(Species, location), values_from=frequency) ``` ```{r tidy-extra-R-10} prevalence_longer %>% pivot_wider(names_from=location, values_from=frequency) ``` ## Interlude {.smaller} ```{r tidy-extra-R-11} pigs_longer pigs_longer %>% group_by(feed) %>% summarize(weight_mean=mean(weight)) ``` ## What if summary is more than one number? eg. quartiles: ```{r tidy-extra-R-12, error=T} pigs_longer %>% group_by(feed) %>% summarize(r=quantile(weight, c(0.25, 0.75))) ``` ## Following the hint... ```{r} pigs_longer %>% group_by(feed) %>% reframe(r=quantile(weight, c(0.25, 0.75))) ``` ## this also works ```{r tidy-extra-R-13, error=T} pigs_longer %>% group_by(feed) %>% summarize(r=quantile(weight, c(0.25, 0.75))) pigs_longer %>% group_by(feed) %>% summarize(r=list(quantile(weight, c(0.25, 0.75)))) %>% unnest(r) ``` ## or, even better, use `enframe`: ```{r tidy-extra-R-14} quantile(pigs_longer$weight, c(0.25, 0.75)) enframe(quantile(pigs_longer$weight, c(0.25, 0.75))) ``` ## A nice look Run this one line at a time to see how it works: ```{r tidy-extra-R-15, warning=FALSE} pigs_longer %>% group_by(feed) %>% summarize(r=list(enframe(quantile(weight, c(0.25, 0.75))))) %>% unnest(r) %>% pivot_wider(names_from=name, values_from=value) -> d d ``` ## A hairy one 18 people receive one of three treatments. At 3 different times (pre, post, followup) two variables `y` and `z` are measured on each person: ```{r tidy-extra-R-16, message=F} my_url <- "http://ritsokiguess.site/STAC32/repmes.txt" repmes0 <- read_table(my_url) repmes0 repmes0 %>% mutate(id=str_c(treatment, ".", rep)) %>% select(-rep) %>% select(id, everything()) -> repmes repmes ``` ```{r} repmes %>% pivot_longer(contains("_"), names_to=c("time", ".value"), names_sep="_" ) ``` ## Attempt 1 ```{r tidy-extra-R-17} repmes %>% pivot_longer(contains("_"), names_to=c("time", "var"), names_sep="_", values_to = "vvv" ) ``` This is *too* long! We wanted a column called `y` and a column called `z`, but they have been pivoted-longer too. ## Attempt 2 ```{r tidy-extra-R-18} repmes repmes %>% pivot_longer(contains("_"), names_to=c("time", ".value"), names_sep="_" ) -> repmes3 repmes3 ``` This has done what we wanted. ## make a graph ```{r tidy-extra-R-19} ggplot(repmes3, aes(x=fct_inorder(time), y=y, colour=treatment, group = id)) + geom_point() + geom_line() ``` A so-called spaghetti plot. The three measurements for each person are joined by lines, and the lines are coloured by treatment. ## or do the plot with means ```{r tidy-extra-R-20} repmes3 %>% group_by(treatment, ftime=fct_inorder(time)) %>% summarize(mean_y=mean(y)) %>% ggplot(aes(x=ftime, y=mean_y, colour=treatment, group=treatment)) + geom_point() + geom_line() ``` On average, the two real treatments go up and level off, but the control group is very different.