--- title: "Intro to Faux" output: html_document: df_print: paged toc: true toc_float: true --- ```{r, include = FALSE} # control the appearance of the knitted result knitr::opts_chunk$set( collapse = TRUE, out.width = "100%", fig.width = 5, fig.height = 3, dpi = 144 ) ``` In this tutorial, we'll learn how to simulate data for factorial designs using {faux}. There are more extensive examples at . ## Setup We'll be using 4 packages in this tutorial. ```{r libs, message=FALSE} library(tidyverse) # for data wrangling library(faux) # for simulation library(broom) # for tidy analysis results library(afex) # for ANOVA set.seed(8675309) # Jenny, I've got your number ``` A seed makes randomness reproducible. Run the following code several times. Change the seed to your favourite integer. If the seed is the same, the random numbers after it will be the same, as long as the code is always executed in the same order. ```{r} set.seed(0) rnorm(1) ``` ## Normal Let's start with a normal distribution using the base R function `rnorm()`, which returns `n` values from a normal distribution with a mean of 0 and a standard deviation of 1. ```{r} rnorm(n = 10) ``` You can change the mean and SD. Simulate a lot of values (1e5 == 100,000), save them in a variable, and visualise them with `hist()`. ```{r} x <- rnorm(1e5, mean = 30, sd = 5) hist(x) ``` ## Multivariate normal But how do you create correlated values? You can do this with `MASS::mvrnorm()`, but you need to construct the `Sigma` argument yourself from the correlation matrix and the standard deviations of the populations, and then you need to turn the resulting matrix into a data frame for many use cases. This isn't very difficult, but can be tedious with larger numbers of variables. ```{r} n = 1e5 # this is a large number to demonstrate that the result is as expected mu = c(A = 1, B = 2, C = 3) sd = c(0.5, 1, 1.5) r = c(0, .25, .5) cor_mat <- matrix(c(1, r[1], r[2], r[1], 1, r[3], r[2], r[3], 1), nrow = 3) Sigma <- (sd %*% t(sd)) * cor_mat vars <- MASS::mvrnorm(n, mu, Sigma) |> as.data.frame() cor(vars) |> round(2) ``` ### rnorm_multi In faux, you can create sets of correlated normally distributed values using `rnorm_multi()`. ```{r} dat3 <- rnorm_multi( n = 50, mu = c(A = 1, B = 2, C = 3), sd = c(0.5, 1, 1.5), r = c(0, .25, .5) ) ``` The function `get_params()` gives you a quick way to see the means, SDs and correlations in the simulated data set to make sure you set the parameters correctly. ```{r} get_params(dat3) ``` If you set `empirical` to `TRUE`, the values you set will be the **sample** parameters, not the **population** parameters. This isn't usually what you want for a simulation, but can be useful to check you set the parameters correctly. ```{r} dat3 <- rnorm_multi( n = 50, mu = c(A = 1, B = 2, C = 3), sd = c(0.5, 1, 1.5), r = c(0, .25, .5), empirical = TRUE ) get_params(dat3) ``` ### Setting r You can set the `r` argument for correlations in a few different ways. If all correlations have the same value, just set r equal to a single number. ```{r} # all correlations the same value rho_same <- rnorm_multi(50, 4, r = .5, empirical = TRUE) get_params(rho_same) ``` You can set `r` to a vector or matrix of the full correlation matrix. This is convenient when you're getting the values from an existing dataset, where you can just use the output of the `cor()` function. ```{r} rho <- cor(iris[1:4]) round(rho, 2) ``` Notice how, since we didn't specify the names of the 4 variables anywhere else, `rnorm_multi()` will take them from the named correlation matrix. ```{r} rho_cormat <- rnorm_multi(50, 4, r = rho, empirical = TRUE) get_params(rho_cormat) ``` Alternatively, you can just specify the values from the upper right triangle of a correlation matrix. This might be easier if you're reading the values out of a paper. ```{r} # upper right triangle # X2 X3 X4 rho <- c(0.5, 0.4, 0.3, # X1 0.2, 0.1, # X2 0.0) # X3 rho_urt <- rnorm_multi(50, 4, r = rho, empirical = TRUE) get_params(rho_urt) ``` ## Factorial Designs You can use `rnorm_multi()` to simulate data for each between-subjects cell of a factorial design and manually combine the tables, but faux has a function that better maps onto how we usually think and teach about factorial designs. The default design is 100 observations of one variable (named `y`) with a mean of 0 and SD of 1. Unless you set `plot = FALSE` or run `faux_options(plot = FALSE)`, this function will show you a plot of your design so you can check that it looks like you expect. ```{r} simdat1 <- sim_design() ``` ### Factors Use named lists to set the names and levels of `within` and `between` subject factors. ```{r} pettime <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")) ) ``` You can set `mu` and `sd` with unnamed vectors, but getting the order right can take some trial and error. ```{r} pettime <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")), mu = 1:6 ) ``` You can set values with a named vector for a single type of factor. The values do not have to be in the right order if they're named. ```{r} pettime <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")), mu = c(cat = 1, ferret = 5, dog = 3), sd = c(pre = 1, post = 2) ) ``` Or use a data frame for within- and between-subject factors. ```{r} pettime <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")), mu = data.frame( pre = c(1, 3, 5), post = c(2, 4, 6), row.names = c("cat", "dog", "ferret") ) ) ``` If you have within-subject factors, set the correlations for each between-subject cell like this. ```{r} pettime <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")), r = list(cat = 0.5, dog = 0.25, ferret = 0), empirical = TRUE, plot = FALSE ) get_params(pettime) ``` You can also change the name of the `dv` and `id` columns and output the data in long format. If you do this, you also need to tell `get_params()` what columns contain the between- and within-subject factors, the dv, and the id. ```{r} dat_long <- sim_design( between = list(pet = c("cat", "dog", "ferret")), within = list(time = c("pre", "post")), id = "subj_id", dv = "score", long = TRUE, plot = FALSE ) get_params(dat_long, digits = 3) ``` ### Multiple Factors Set more than one within-or between-subject factor like this: ```{r} dat_multi <- sim_design( between = list(pet = c("cat", "dog", "ferret"), country = c("UK", "NL")), within = list(time = c("pre", "post"), condition = c("ctl", "exp")), mu = data.frame( cat_UK = 1:4, cat_NL = 5:8, dog_UK = 9:12, dog_NL = 13:16, ferret_UK = 17:20, ferret_NL = 21:24, row.names = c("pre_ctl", "pre_exp", "post_ctl", "post_exp") ) ) ``` Because faux uses an underscore for the separator, you have to set the `sep` argument to something different if you want to use underscores in your variable names (or set the separator globally with `faux_options`). ```{r} # faux_options(sep = ".") dat_multi <- sim_design( between = list(pet = c("cat", "dog", "ferret"), country = c("Glasgow_UK", "Rotterdam_NL")), within = list(time = c("pre", "post"), condition = c("ctl", "exp")), mu = data.frame( cat.Glasgow_UK = 1:4, cat.Rotterdam_NL = 5:8, dog.Glasgow_UK = 9:12, dog.Rotterdam_NL = 13:16, ferret.Glasgow_UK = 17:20, ferret.Rotterdam_NL = 21:24, row.names = c("pre.ctl", "pre.exp", "post.ctl", "post.exp") ), sep = "." ) ``` ### Anonymous Factors If you need to make a quick demo, you can set factors anonymously with integer vectors. For example, the following code makes 3B\*2B\*2W mixed design. ```{r} dat_anon <- sim_design( n = 50, between = c(3, 2), within = 2, mu = 1:12 ) ``` Faux has a quick plotting function for visualising data made with faux. The plot created by `sim_design()` shows the *design*, while this function shows the simulated *data*. ```{r} plot(dat_anon) ``` You can change the order of plotting and the types of geoms plotted. This takes a little trial and error, so this function will probably be refined in later versions. ```{r} plot(dat_anon, "B1", "B2", "W1", geoms = c("violin", "pointrangeSD")) ``` ## Replications You often want to simulate data repeatedly to do things like calculate power. The `sim_design()` function has a lot of overhead for checking if a design makes sense and if the correlation matrix is possible, so you can speed up the creation of multiple datasets with the same design using the `rep` argument. This will give you a nested data frame with each dataset in the `data` column. ```{r} dat_rep <- sim_design( within = 2, n = 20, mu = c(0, 0.25), rep = 5, plot = FALSE ) ``` ### Analyse each replicate You can run analyses on the nested data by wrapping your analysis code in a function then using `map()` to run the analysis on each data set and `unnest()` to expand the results into a data table. ```{r} # define function analyse <- function(data) { t.test(data$W1a, data$W1b, paired = TRUE) %>% broom::tidy() } # get one test data set data <- dat_rep$data[[1]] # check function returns what you want analyse(data) ``` ```{r} # run the function on each data set dat_rep |> mutate(analysis = map(data, analyse)) |> select(-data) |> unnest(analysis) ``` ### ANOVA Use the same pattern to run an ANOVA on a version of the `pettime` dataset. First, simulate 100 datasets in long format. These data will have small main effects of pet and time, but no interaction. ```{r} pettime100 <- sim_design( between = list(pet = c("cat", "dog")), within = list(time = c("pre", "post")), n = c(cat = 50, dog = 40), mu = data.frame( pre = c(1, 1.2), post = c(1.2, 1.4), row.names = c("cat", "dog") ), sd = 1, id = "pet_id", dv = "score", r = 0.5, long = TRUE, rep = 100 ) ``` Then set up your analysis. We'll use the `aov_ez()` function from the {afex} package because its arguments match those of `sim_design()`. There's a little setup to run first to get rid of annoying messages and make this run faster by omitting calculations we won't need. ```{r} afex::set_sum_contrasts() # avoids annoying afex message afex_options(include_aov = FALSE) # runs faster afex_options(es_aov = "pes") # changes effect size measure to partial eta squared ``` This custom function takes the data frame as input and runs our ANOVA on it. The code at the end just cleans up the resulting table a bit. ```{r} analyse <- function(data) { a <- afex::aov_ez( id = "pet_id", dv = "score", between = "pet", within = "time", data = data ) # return anova_table for GG-corrected DF as_tibble(a$anova_table, rownames = "term") |> mutate(term = factor(term, levels = term)) |> # keeps terms in order rename(p.value = `Pr(>F)`) # fixes annoying p.value name } ``` Test the analysis code on the first simulated data frame. ```{r} analyse( pettime100$data[[1]] ) ``` Use the same code we used in the first example to make a table of the results of each analysis: ```{r} pettime_sim <- pettime100 |> mutate(analysis = map(data, analyse)) |> select(-data) |> unnest(analysis) ``` ```{r, echo = FALSE} # show the first 6 rows head(pettime_sim) |> mutate(across(5:8, \(x) round(x, 3))) ``` Then you can summarise the data to calculate things like power for each effect or mean effect size. ```{r} pettime_sim |> group_by(term) |> summarise(power = mean(p.value < 0.05), mean_pes = mean(pes) |> round(3), .groups = "drop") ``` The power for the between-subjects effect of pet is smaller than for the within-subjects effect of time. What happens if you reduce the correlation between pre and post? ## Non-normal Distributions The newest version of faux has a new function for simulating non-normal distributions using the NORTA method (NORmal To Anything). The `dist` argument lists the variables with their distribution names (e.g., "norm", "pois", unif", "truncnorm", or anything that has an "rdist" function). The `params` argument lists the distribution function argument values for each variable (e.g., arguments to `rnorm`, `rpois`, `runif`, `rtruncnorm`). This function simulates multivariate non-normal distributions by using simulation to work out the correlations for a multivariate normal distribution that will produce the desired correlations after the normal distributions are converted to the desired distributions. This simulation can take a while if you have several variables and should warn you if you're requesting an impossible combination (but is still an experimental function, so let Lisa know if you have any problems). ```{r} dat_norta <- rmulti( n = 1000, dist = c(U = "unif", T = "truncnorm", L = "likert"), params = list( U = list(min = 0, max = 10), T = list(a = 1, b = 7, mean = 3.5, sd = 2.1), L = list(prob = c(`much less` = .10, `less` = .20, `equal` = .35, `more` = .25, `much more` = .10)) ), r = c(-0.5, 0, 0.5) ) ``` The "likert" type is a set of distribution functions provided by faux to make creating Likert scale variables easier (see `?rlikert`). You may need to convert Likert-scale variables to numbers before analysis or calculating descriptives. ```{r} # convert likert-scale variable to integer dat_norta$L <- as.integer(dat_norta$L) get_params(dat_norta) ``` ## Exercises ### Multivariate normal Sample 40 values of three variables named `J`, `K` and `L` from a population with means of 10, 20 and 30, and SDs of 5. `J` and `K` are correlated 0.5, `J` and `L` are correlated 0.25, and `K` and `L` are not correlated. ```{r} ``` ### From existing data Using the data from the built-in dataset `attitude`, simulate a new set of 20 observations drawn from a population with the same means, SDs and correlations for each column as the original data. ```{r} ``` ### 2b Create a dataset with a between-subject factor of "pet" having two levels, "cat", and "dog". The DV is "happiness" score. There are 20 cat-owners with a mean happiness score of 10 (SD = 3) and there are 30 dog-owners with a mean happiness score of 11 (SD = 3). ```{r} ``` ### 3w Create a dataset of 20 observations with 1 within-subject variable ("condition") having 3 levels ("A", "B", "C") with means of 10, 20 and 30 and SD of 5. The correlations between each level have r = 0.4. The dataset should look like this: | id | condition | score | |:---|:----------|------:| |S01 | A | 9.17 | |... | ... | ... | |S20 | A | 11.57 | |S01 | B | 18.44 | |... | ... | ... | |S20 | B | 20.04 | |S01 | C | 35.11 | |... | ... | ... | |S20 | C | 29.16 | ```{r} ``` ### 2w*2w Create a dataset with 50 subjects and 2 within-subject variables ("W1" and "W2") each having 2 levels. The mean for all cells is 10 and the SD is 2. The correlations look like this: | | W1a_W2a | W1a_W2b | W1b_W2a | W1b_W2b | |:--------|------:|------:|------:|------:| | W1a_W2a | 1.0 | 0.5 | 0.5 | 0.2 | | W1a_W2b | 0.5 | 1.0 | 0.2 | 0.5 | | W1b_W2a | 0.5 | 0.2 | 1.0 | 0.5 | | W1b_W2b | 0.2 | 0.5 | 0.5 | 1.0 | ```{r} ``` ### 2w*3b Create a dataset with a between-subject factor of "pet" having 3 levels ("cat", "dog", and "ferret") and a within-subject factor of "time" having 2 levels ("pre" and "post"). The N in each group should be 10. Means are: * cats: pre = 10, post = 12 * dogs: pre = 14, post = 16 * ferrets: pre = 18, post = 20 SDs are all 5 and within-cell correlations are all 0.25. ```{r} ``` ### Replications Create 5 datasets with a 2b*2b design, 30 participants in each cell. Each cell's mean should be 0, except B1a:B2a, which should be 0.5. The SD should be 1. Make the resulting data in long format. ```{r} ``` ### Power Simulate 100 datasets like the one above and use `lm()` or `afex::aov_ez()` to look at the interaction between B1 and B2. What is the power of this design? ```{r} ```