---
title: "Fixed Effects"
output:
  html_document:
    df_print: kable
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  out.width = "100%",
  fig.width = 5,
  fig.height = 3,
  dpi = 144
)
```

```{r libs, message=FALSE}
library(tidyverse)
library(faux)
library(afex)        # for anova and lmer
library(broom)
library(broom.mixed) # to make tidy tables of lmer output

theme_set(theme_minimal(base_size = 14))
```

## Simulation functions

The functions below are commonly used when you're setting up a simulated dataset.

### Repeating

The function `rep()` lets you repeat the first argument a number of times.

Use `rep()` to create a vector of alternating `"A"` and `"B"` values of length 24.

```{r rep1-times}
rep(c("A", "B"), times = 12)
```

If the second argument is a vector that is the same length as the first argument, each element in the first vector is repeated that many times. Use `rep()` to create a vector of 11 `"A"` values followed by 3 `"B"` values.

```{r rep-vector}
rep(c("A", "B"), c(11, 3))
```

You can repeat each element of the vector a specified number of times using the `each` argument. Use `rep()` to create a vector of 12 `"A"` values followed by 12 `"B"` values.

```{r rep-each}
rep(c("A", "B"), each = 12)
```

What do you think will happen if you set `times` to 3 and `each` to 2?

```{r rep-times-each}
rep(c("A", "B"), times = 3, each = 2)
```

### Sequences

The function `seq()` is useful for generating a sequence of numbers with some pattern.

Use `seq()` to create a vector of the integers 0 to 10.

```{r seq1-10}
seq(0, 10)
```

You can set the `by` argument to count by numbers other than 1 (the default). Use `seq()` to create a vector of the numbers 0 to 100 by 10s.

```{r seq-by}
seq(0, 100, by = 10)
```

The argument `length.out` is useful if you know how many steps you want to divide something into. Use `seq()` to create a vector that starts with 0, ends with 100, and has 12 equally spaced steps (hint: how many numbers would be in a vector with 2 *steps*?).

```{r seq-length-out}
seq(0, 100, length.out = 13)
```

### Uniform Distribution

The uniform distribution is the simplest distribution: all numbers in the range have an equal probability of being sampled. Use `runif()` to sample from a continuous uniform distribution.

```{r runif}
runif(n = 10, min = 0, max = 1)
```

Pipe the result to `hist()` to make a quick histogram of your simulated data.

```{r runif-hist}
runif(100000, min = 0, max = 1) %>% hist()
```

### Discrete Distribution

You can use `sample()` to simulate events like rolling dice or choosing from a deck of cards. The code below simulates rolling a 6-sided die 10000 times. We set `replace` to `TRUE` so that each event is independent. See what happens if you set `replace` to `FALSE`.

```{r sample-replace, fig.cap = "Distribution of dice rolls."}
rolls <- sample(1:6, 10000, replace = TRUE)

# plot the results
as.factor(rolls) %>% plot()
```

You can also use `sample()` to sample from a list of named outcomes.

```{r sample-list}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
sample(pet_types, 10, replace = TRUE)
```

Ferrets, while the best pets, are much less common than cats and dogs, so our sample isn't very realistic. You can set the probability of each item in the list with the `prob` argument.
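As a quick check that `prob` behaves the way you expect (the weights below are arbitrary, made-up values), you can take a large weighted sample and tabulate the proportions:

```{r prob-check}
# observed proportions should be close to the prob weights
draws <- sample(c("common", "rare"), 10000, replace = TRUE, prob = c(0.9, 0.1))
prop.table(table(draws))
```

The pet example below does the same thing with more plausible (but still invented) probabilities.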
```{r sample-prob}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob <- c(0.3, 0.4, 0.1, 0.1, 0.1)
pet_data <- sample(pet_types, 100, replace = TRUE, prob = pet_prob)

as.factor(pet_data) %>% plot()
```

### Binomial Distribution

The `rbinom()` function generates random samples from a binomial distribution. It takes three arguments:

* `n` = number of observations
* `size` = number of trials
* `prob` = probability of success on each trial

Coin flips are a typical example of a binomial distribution, where we can assign heads to 1 and tails to 0.

```{r rbinom-fair}
# 20 individual coin flips of a fair coin
rbinom(20, 1, 0.5)
```

```{r rbinom-bias}
# 20 individual coin flips of a biased (0.75) coin
rbinom(20, 1, 0.75)
```

You can generate the total number of heads in 1 set of 20 coin flips by setting `size` to 20 and `n` to 1.

```{r rbinom-size}
# 1 set of 20 coin flips of a biased (0.75) coin
rbinom(1, 20, 0.75)
```

You can generate more sets of 20 coin flips by increasing `n`.

```{r rbinom-n}
# 10 sets of 20 fair coin flips
rbinom(10, 20, 0.5)
```

### Normal Distribution

We can simulate a normal distribution of size `n` if we know the `mean` and standard deviation (`sd`).

```{r rnorm}
# 10 samples from a normal distribution with a mean of 0 and SD of 1
rnorm(10, 0, 1)
```

A density plot is usually the best way to visualise this type of data.

```{r rnorm-plot}
# 100 samples from a normal distribution with a mean of 10 and SD of 2
dv <- rnorm(100, 10, 2)

# use sample to get a random colour
fill_colour <- sample(colours(), 1)

ggplot() +
  geom_density(aes(dv), fill = fill_colour) +
  scale_x_continuous(
    limits = c(0, 20),
    breaks = seq(0, 20)
  )
```

Run the simulation above several times, noting how the density plot changes. Try changing the values of `n`, `mean`, and `sd`.

## Independent samples

Now we're ready to start simulating some data. Let's start with a simple independent-samples design where the variables are from a normal distribution. Each subject produces one score (in condition A or B). What we need to know about these scores is:

* How many subjects are in each condition?
* What are the score means?
* What are the score variances (or SDs)?

### Parameters

First, set parameters for these values. This way, you can use these variables wherever you need them in the rest of the code and you can easily change them.

```{r ind-vars}
A_sub_n <- 50
B_sub_n <- 50
A_mean <- 10
B_mean <- 11
A_sd <- 2.5
B_sd <- 2.5
```

### Scores

We can then generate the scores using the `rnorm()` function.

```{r ind-dat}
A_scores <- rnorm(A_sub_n, A_mean, A_sd)
B_scores <- rnorm(B_sub_n, B_mean, B_sd)
```

You can stop here and just analyse your simulated data with `t.test(A_scores, B_scores)`, but usually you want to get your simulated data into a data table that looks like what you might eventually import from a CSV file with your actual experimental data.

```{r ind-tibble}
dat <- tibble(
  sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
  score = c(A_scores, B_scores)
)
```

If you're simulating data for a script where you will eventually import data from a CSV file, you can save these data to a CSV file and then re-read them in, so when you get your real data, all you need to do is comment out the simulation steps.
```{r}
# make a data directory if there isn't one already
if (!dir.exists("data")) dir.create("data")

# save your simulated data
write_csv(dat, "data/sim-data-ind-samples.csv")

# start your analysis here
dat <- read_csv("data/sim-data-ind-samples.csv")
```

### Check your data

Always examine your simulated data after you generate it to make sure it looks like you want.

```{r ind-check}
dat %>%
  group_by(sub_condition) %>%
  summarise(n = n(),
            mean = mean(score),
            sd = sd(score),
            .groups = "drop")
```

### Analysis

```{r ind-test}
t.test(score ~ sub_condition, dat)
```

### Function

You can wrap all this in a function so you can run it many times to do a power calculation. Put all your parameters as arguments to the function.

```{r ind-func}
ind_sim <- function(A_sub_n, B_sub_n, A_mean, B_mean, A_sd, B_sd) {
  # simulate data for groups A and B
  A_scores <- rnorm(A_sub_n, A_mean, A_sd)
  B_scores <- rnorm(B_sub_n, B_mean, B_sd)

  # put the data into a table
  dat <- tibble(
    sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
    score = c(A_scores, B_scores)
  )

  # analyse the data
  t <- t.test(score ~ sub_condition, dat)

  # return a list of the values you care about
  # the double brackets ([[]]) get rid of the name of named numbers
  list(
    t = t$statistic[[1]],
    ci_lower = t$conf.int[[1]],
    ci_upper = t$conf.int[[2]],
    p = t$p.value[[1]],
    estimate = t$estimate[[1]] - t$estimate[[2]]
  )
}
```

Now run your new function with the values you used above.

```{r}
# str() prints the resulting list in a shorter format
ind_sim(50, 50, 10, 11, 2.5, 2.5) %>% str()
```

Now you can use this function to run many simulations. The function `map_df()` from the `purrr` package (loaded with `tidyverse`) is one of many ways to run a function many times and organise the results into a table.

```{r}
mysim <- map_df(1:1000, ~ind_sim(50, 50, 10, 11, 2.5, 2.5))
```

Now you can graph the data from your simulations.

```{r sim-p-fig}
# set boundary = 0 when plotting p-values
ggplot(mysim, aes(p)) +
  geom_histogram(binwidth = 0.05, boundary = 0,
                 fill = "white", colour = "black")
```

```{r ind-sim-fig, fig.cap = "Distribution of results from simulated independent samples data"}
mysim %>%
  gather(stat, value, t:estimate) %>%
  ggplot() +
  geom_density(aes(value, color = stat), show.legend = FALSE) +
  facet_wrap(~stat, scales = "free")
```

You can calculate power as the proportion of simulations on which the p-value was less than your alpha.

```{r}
alpha <- 0.05
power <- mean(mysim$p < alpha)
power
```

## Paired samples

Now let's try a paired-samples design where the variables are from a normal distribution. Each subject produces two scores (in conditions A and B). What we need to know about these two scores is:

* How many subjects?
* What are the score means?
* What are the score variances (or SDs)?
* What is the correlation between the scores?

### Parameters {#paired-params}

```{r paired-vars}
sub_n <- 100
A_mean <- 10
B_mean <- 11
A_sd <- 2.5
B_sd <- 2.5
AB_r <- 0.5
```

### Correlated Scores

You can then use `rnorm_multi()` to generate a data table with simulated values for correlated scores:

```{r sim-design}
dat <- faux::rnorm_multi(
  n = sub_n,
  vars = 2,
  r = AB_r,
  mu = c(A_mean, B_mean),
  sd = c(A_sd, B_sd),
  varnames = c("A", "B")
)
```

You can also do this using the `MASS::mvrnorm()` function, but `faux::rnorm_multi()` is easier when you have more variables to simulate.
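`MASS::mvrnorm()` wants a covariance matrix (`Sigma`) rather than separate correlations and SDs, so you have to build one yourself. The covariance between A and B implied by the parameters above is just the correlation times the product of the two SDs (a quick check, using the variables already defined):

```{r ab-cov}
# covariance between A and B implied by AB_r, A_sd, and B_sd
AB_r * A_sd * B_sd
```

The chunk below builds the full covariance matrix the same way: a correlation matrix multiplied, cell by cell, by a matrix of the SD products.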
```{r}
# make the correlation matrix
cormat <- matrix(c(   1, AB_r,
                   AB_r,    1),
                 nrow = 2, byrow = TRUE)

# make a corresponding matrix of the variance
# (multiply the SDs for each cell)
varmat <- matrix(c(A_sd * A_sd, A_sd * B_sd,
                   A_sd * B_sd, B_sd * B_sd),
                 nrow = 2, byrow = TRUE)

# create correlated variables with the specified parameters
S <- MASS::mvrnorm(n = sub_n,
                   mu = c(A_mean, B_mean),
                   Sigma = cormat * varmat)

dat <- data.frame(
  A = S[, 1],
  B = S[, 2]
)
```

### Check your data

Now check your data; `faux` has a function `get_params()` that gives you the correlation table, means, and SDs for each numeric column in a data table.

```{r paired-check}
faux::get_params(dat)
```

### Analysis

```{r paired-test}
# paired-samples t-test
t.test(dat$A, dat$B, paired = TRUE)
```

### Function

```{r paired-func}
paired_sim <- function(sub_n, A_mean, B_mean, A_sd, B_sd, AB_r) {

  dat <- faux::rnorm_multi(
    n = sub_n,
    vars = 2,
    r = AB_r,
    mu = c(A_mean, B_mean),
    sd = c(A_sd, B_sd),
    varnames = c("A", "B")
  )

  t <- t.test(dat$A, dat$B, paired = TRUE)

  # return just the values you care about
  list(
    t = t$statistic[[1]],
    ci_lower = t$conf.int[[1]],
    ci_upper = t$conf.int[[2]],
    p = t$p.value[[1]],
    estimate = t$estimate[[1]]
  )
}
```

Run 1000 simulations and graph the results.

```{r}
mysim_p <- map_df(1:1000, ~paired_sim(100, 10, 11, 2.5, 2.5, .5))
```

```{r pair-sim-fig, fig.cap = "Distribution of results from simulated paired samples data"}
mysim_p %>%
  gather(stat, value, t:estimate) %>%
  ggplot() +
  geom_density(aes(value, color = stat), show.legend = FALSE) +
  facet_wrap(~stat, scales = "free")
```

```{r}
alpha <- 0.05
power <- mean(mysim_p$p < alpha)
power
```

## Intercept model

Now I'm going to show you a different way to simulate the same design. This might seem excessively complicated, but you will need this pattern when you start simulating data for mixed effects models.

### Parameters

Remember, we used the following parameters to set up our simulation above:

```{r paired-vars2}
sub_n <- 100
A_mean <- 10
B_mean <- 11
A_sd <- 2.5
B_sd <- 2.5
AB_r <- 0.5
```

From these, we can calculate the grand intercept (the overall mean regardless of condition) and the effect of condition (the mean of B minus A).

```{r}
grand_i <- (A_mean + B_mean)/2
AB_effect <- B_mean - A_mean
```

We also need to think about variance a little differently. First, calculate the pooled variance as the mean of the variances for A and B (remember, variance is SD squared).

```{r}
pooled_var <- (A_sd^2 + B_sd^2)/2
```

The variance of the subject intercepts is the correlation (`AB_r`) times this pooled variance, and the error variance is what is left over. We take the square root (`sqrt()`) to get the subject intercept and error SDs for simulation later.

```{r}
sub_sd <- sqrt(pooled_var * AB_r)
error_sd <- sqrt(pooled_var * (1 - AB_r))
```

### Subject intercepts

Now we use these variables to create a data table for our subjects. Each subject gets an ID and a **random intercept** (`sub_i`). The intercept is simulated from a random normal distribution with a mean of 0 and an SD of `sub_sd`. This represents how much higher or lower than the average score each subject tends to be (regardless of condition).

```{r}
sub <- tibble(
  sub_id = 1:sub_n,
  sub_i = rnorm(sub_n, 0, sub_sd)
)
```

### Observations

Next, set up a table where each row represents one observation. We'll use one of my favourite functions for simulation: `crossing()`. This creates every possible combination of the listed factors (it works the same as `expand.grid()`, but the results are in a more intuitive order).
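For example, with two made-up factors you can see that `crossing()` returns rows sorted by the first column, then the second, whereas `expand.grid()` varies the first column fastest:

```{r crossing-demo}
# crossing(): sorted by group, then time
crossing(group = c("A", "B"), time = 1:2)

# expand.grid(): group varies fastest
expand.grid(group = c("A", "B"), time = 1:2)
```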
Here, we're using it to create a row for each subject in each condition, since this is a fully within-subjects design.

```{r}
obs <- crossing(
  sub_id = 1:sub_n,
  condition = c("A", "B")
)
```

### Calculate the score

Next, we join the subject table so each row has the information about the subject's random intercept and then calculate the score. I've done it in a few steps below for clarity. The score is just the sum of:

* the overall mean (`grand_i`)
* the subject-specific intercept (`sub_i`)
* the effect (`effect`): the numeric code for condition (`condition.e`) multiplied by the effect of condition (`AB_effect`)
* the error term (simulated from a normal distribution with mean of 0 and SD of `error_sd`)

```{r im-data}
dat <- obs %>%
  left_join(sub, by = "sub_id") %>%
  mutate(
    condition.e = recode(condition, "A" = -0.5, "B" = 0.5),
    effect = AB_effect * condition.e,
    error = rnorm(nrow(.), 0, error_sd),
    score = grand_i + sub_i + effect + error
  )
```

Use `get_params()` to check the data. With data in long format, you need to specify the columns that contain the id, dv, and within-id variables.

```{r im-get-params}
# check the data
faux::get_params(dat, id = "sub_id", dv = "score", within = "condition")
```

You can use the following code to put the data table into a more familiar "wide" format.

```{r im-wide}
dat_wide <- dat %>%
  select(sub_id, condition, score) %>%
  spread(condition, score)
```

### Analyses

You can analyse the data with a paired-samples t-test from the wide format:

```{r im-wide-t}
# paired-samples t-test from dat_wide
t.test(dat_wide$A, dat_wide$B, paired = TRUE)
```

Or in the long format:

```{r im-long-t}
# paired-samples t-test from dat (long)
t.test(score ~ condition, dat, paired = TRUE)
```

You can analyse the data with ANOVA using the `aov_4()` function from `afex`. (Notice how the F-value is the square of the t-value above.)

```{r im-afex}
# anova using afex::aov_4
aov <- afex::aov_4(score ~ (condition | sub_id), data = dat)

aov$anova_table
```

You can even analyse the data with a mixed effects model using the `lmer()` function (the `afex` version gives you p-values, but the `lme4` version does not).

```{r im-lmer}
# mixed effect model using afex::lmer
lmem <- afex::lmer(score ~ condition.e + (1 | sub_id), data = dat)

# displays a tidy table of the fixed effects
broom.mixed::tidy(lmem, effects = "fixed")
```

## Simulate a dataset from an analysis

You can also simulate a dataset from the parameters of an existing analysis. We'll use the built-in dataset `mtcars` to predict miles per gallon (`mpg`) from transmission type (`am`) and engine type (`vs`).

```{r}
model <- lm(mpg ~ am * vs, data = mtcars)

broom::tidy(model)
```

### Simulate

We can now simulate a dataset with 50 observations from each transmission type (`am`) and engine type (`vs`) combination, then use the model parameters to generate predicted values for `mpg`.
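One way to do this, sketched here with invented names (`newdat`, `sim_alt`), is to get the predicted mean for each `am`/`vs` cell with `predict()` and add normally distributed error using `sigma(model)` as the SD:

```{r sim-predict}
# predicted cell means plus simulated error (a sketch, not used later)
newdat <- tibble(
  am = rep(c(0, 0, 1, 1), each = 50),
  vs = rep(c(0, 1, 0, 1), each = 50)
)

sim_alt <- newdat %>%
  mutate(mpg = predict(model, newdata = newdat) + rnorm(200, 0, sigma(model)))
```

The chunk below does the same thing by writing the prediction equation out by hand from the saved coefficients, which makes it easier to change individual effect sizes later.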
```{r}
err_sd <- sigma(model)        # SD of the error term from the model
fx <- coefficients(model)     # fixed effect coefficients

sim_mtcars <- tibble(
  am = rep(c(0, 0, 1, 1), each = 50),
  vs = rep(c(0, 1, 0, 1), each = 50)
) %>%
  mutate(err = rnorm(200, 0, err_sd),
         mpg = fx[1] + fx["am"]*am + fx["vs"]*vs + fx["am:vs"]*am*vs + err)
```

Analyse the simulated data with `lm()` and output the results as a table using `broom::tidy()`.

```{r}
sim_model <- lm(mpg ~ am * vs, data = sim_mtcars)

broom::tidy(sim_model)
```

### Function

```{r}
carsim <- function(n, b0, b_am, b_vs, b_am_vs, err_sd) {
  sim_mtcars <- tibble(
    am = rep(c(0, 0, 1, 1), each = n),
    vs = rep(c(0, 1, 0, 1), each = n)
  ) %>%
    mutate(err = rnorm(n*4, 0, err_sd),
           mpg = b0 + b_am*am + b_vs*vs + b_am_vs*am*vs + err)

  sim_model <- lm(mpg ~ am * vs, data = sim_mtcars)

  broom::tidy(sim_model)
}
```

Run the function with the values from the original model, but cut the fixed effect sizes in half.

```{r}
err_sd <- sigma(model)
fx2 <- coefficients(model)/2

carsim(50, fx2[1], fx2[2], fx2[3], fx2[4], err_sd)
```

Repeat this 100 times and calculate power for each effect.

```{r}
simstats <- map_df(1:100, ~carsim(50, fx2[1], fx2[2], fx2[3], fx2[4], err_sd))

simstats %>%
  group_by(term) %>%
  summarise(power = mean(p.value < .05), .groups = "drop")
```

## Exercises

Using the dataset below, predict `moral` disgust from the interaction between `pathogen` and `sexual` disgust using `lm()`.

```{r}
disgust <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust_scores.csv")
```

```{r}
```

Simulate a new dataset of 100 people with a similar pathogen and sexual disgust distribution to the original dataset. Remember that these are likely to be correlated and that scores can only range from 0 to 6. (Hint: look at the help for `norm2trunc`)

```{r}
```

Write a function to simulate data, analyse it, and return a table of results. Make sure you can vary the important parameters using arguments.

```{r}
```

Calculate power for the same fixed effects as in the original analysis. Adjust the N until the design has around .80 power to detect a main effect of pathogen disgust.

```{r}
```