--- title: "When pivot-wider goes wrong" --- ## Packages The inevitable: ```{r} library(tidyverse) ``` ## Some long data that should be wide ```{r wider-wrong-1, echo=FALSE} d <- tribble( ~obs, ~time, ~y, 1, "pre", 19, 2, "post", 18, 3, "pre", 17, 4, "post", 16, 5, "pre", 15, 6, "post", 14 ) d ``` - Six observations of variable `y`, but three measured before some treatment and three measured after. - Really matched pairs, so want column of `y`-values for `pre` and for `post`. - `pivot_wider`. ## What happens here? ```{r wider-wrong-2} d %>% pivot_wider(names_from = time, values_from = y) ``` - Should be *three* `pre` values and *three* `post`. Why did this happen? - `pivot_wider` needs to know which *row* to put each observation in. - Uses combo of columns *not* named in `pivot_wider`, here `obs` (only). ## The problem ```{r wider-wrong-3} d %>% pivot_wider(names_from = time, values_from = y) ``` - There are 6 different `obs` values, so 6 different rows. - No data for `obs` 2 and `pre`, so that cell missing (`NA`). - Not enough data (6 obs) to fill 12 ($= 2 \times 6$) cells. - `obs` needs to say which subject provided which *2* observations. ## Fixing it up ```{r wider-wrong-4, echo=FALSE} d2 <- tribble( ~subject, ~time, ~y, 1, "pre", 19, 1, "post", 18, 2, "pre", 17, 2, "post", 16, 3, "pre", 15, 3, "post", 14 ) d2 ``` - column `subject` shows which subject provided each `pre` and `post`. - when we do `pivot_wider`, now only *3* rows, one per subject. ## Coming out right ```{r wider-wrong-5} d2 %>% pivot_wider(names_from = time, values_from = y) ``` - row each observation goes to determined by other column `subject`, and now a `pre` and `post` for each `subject`. - right layout for matched pairs $t$ or to make differences for sign test or normal quantile plot. - "spaghetti plot" needs data longer, as `d2`. ## Spaghetti plot ```{r wider-wrong-6} d2 %>% mutate(time = fct_inorder(time)) %>% ggplot(aes(x = time, y = y, group = subject)) + geom_point() + geom_line() ``` - each subject's `y` decreases over time, with subject 1 highest overall. ## Another example - Two independent samples this time ```{r wider-wrong-8, echo=FALSE} d3 <- tribble( ~group, ~y, "control", 8, "control", 11, "control", 13, "control", 14, "treatment", 12, "treatment", 15, "treatment", 16, "treatment", 17, ) d3 ``` - These should be arranged like this - but what if we make them wider? ## Wider ```{r wider-wrong-9} d3 %>% pivot_wider(names_from = group, values_from = y) ``` - row determined by what not used for `pivot_wider`: nothing! - everything smooshed into *one* row! - this time, too *much* data for the layout. - Four data values squeezed into each of the two cells: "list-columns". ## Get the data out - To expand list-columns out into the data values they contain, can use `unnest`: ```{r wider-wrong-10} d3 %>% pivot_wider(names_from = group, values_from = y) %>% unnest(c(control, treatment)) ``` - in this case, wrong layout, because data values not paired. ## A proper use of list-columns ```{r wider-wrong-11} d3 %>% nest_by(group) %>% summarize(n = nrow(data), mean_y = mean(data$y), sd_y = sd(data$y)) ``` - another way to do `group_by` and `summarize` to find stats by group. - run this one piece at a time to see what it does.