---
title: "Getting started with simulating data in R: some helpful functions and how to use them"
author: "Ariel Muldoon"
date: "August 28, 2018"
output: github_document
---

# Overview

Here's what we'll do today:

1. Simulate quantitative variables with `rnorm()` and `runif()`.
2. Generate character variables that represent groups via `rep()`.
3. Simulate data with both quantitative and categorical variables.
4. Use `replicate()` to repeat the data simulation process many times.

# Generating random numbers

An easy way to generate numeric data is to pull random numbers from some distribution. The functions that do this in R always start with the letter `r` (for "random").

## `rnorm()` to generate random numbers from the normal distribution

Pull 5 random numbers from a standard normal distribution.

```{r}
rnorm(5)
```

### Writing out arguments for clearer code

Using the defaults makes for quick coding but does not make the parameters of the generating distribution clear.

```{r}
rnorm(n = 5, mean = 0, sd = 1)
```

### Setting the random seed for reproducible random numbers

To reproduce the random numbers, set the seed via `set.seed()`.

Set the seed and generate 5 numbers.

```{r}
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

Reset the seed and we get the same 5 numbers.

```{r}
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```

### Change parameters in `rnorm()`

We can pull from different normal distributions by changing the parameters.

First, using a mean of 0 and standard deviation of 2 (so a variance of 4).

```{r}
rnorm(n = 5, mean = 0, sd = 2)
```

Using a large mean and relatively small standard deviation can give values that are strictly positive.

```{r}
rnorm(n = 5, mean = 50, sd = 20)
```

### Using vectors of values for the parameter arguments

Both `mean` and `sd` will take vectors of values.

Let's pull 10 values from distributions with different means but the same standard deviation.

```{r}
rnorm(n = 10, mean = c(0, 5, 20), sd = 1)
```

We could pass a vector to `sd`, as well, but not to `n`. If `n` is given a vector, its length is used as the number of values desired.

```{r}
rnorm(n = c(2, 10, 10),
      mean = c(0, 5, 20),
      sd = c(1, 5, 20) )
```

## Example of using the simulated numbers from `rnorm()`

Exploring how related two unrelated vectors can appear.

```{r}
x = rnorm(n = 10, mean = 0, sd = 1)
y = rnorm(n = 10, mean = 0, sd = 1)
plot(y ~ x)
```

## `runif()` pulls from the uniform distribution

I like `runif()` to generate continuous data within a set range.

The default is numbers between 0 and 1.

```{r}
runif(n = 5, min = 0, max = 1)
```

But we can do any range.

```{r}
runif(n = 5, min = 50, max = 100)
```

## Example of using the simulated numbers from `runif()`

I like `runif()` to demonstrate how the relative size of the explanatory variable affects the estimated coefficient.

Make a response variable via `rnorm()` and then generate an explanatory variable between 1 and 2.

```{r}
set.seed(16)
y = rnorm(n = 100, mean = 0, sd = 1)
x1 = runif(n = 100, min = 1, max = 2)
head(x1)
```

Then make an explanatory variable between 200 and 300.

```{r}
x2 = runif(n = 100, min = 200, max = 300)
head(x2)
```

We'll use the data in a regression model fit via `lm()`, making note of the coefficients.

```{r}
lm(y ~ x1 + x2)
```

# Generate character vectors with `rep()`

Simulations often involve categorical variables as well, and these usually need to be repeated in a pattern.

## Using `letters` and `LETTERS`

These are *built-in constants* in R, and convenient for making simple character vectors.

The first two lowercase letters.

```{r}
letters[1:2]
```

The last 17 uppercase letters.

```{r}
LETTERS[10:26]
```

## Repeat each element of a vector with `each`

With `each` we repeat each unique character in the vector some number of times.

```{r}
rep(letters[1:2], each = 3)
```

## Repeat a whole vector with the `times` argument

Repeating is different with `times`: the whole vector is repeated, rather than each element.

```{r}
rep(letters[1:2], times = 3)
```

## Set the output vector length with the `length.out` argument

Using `length.out` is similar to `times` but the groups can be imbalanced.

```{r}
rep(letters[1:2], length.out = 5)
```

## Repeat each element a different number of `times`

We can get unbalanced data with `times` if we use a vector for the argument.

```{r}
rep(letters[1:2], times = c(2, 4) )
```

## Combining `each` with `times`

When combined with `each`, a single value for `times` repeats the whole expanded vector that many times.

```{r}
rep(letters[1:2], each = 2, times = 3)
```

## Combining `each` with `length.out`

This is another way to impart imbalance.

```{r}
rep(letters[1:2], each = 2, length.out = 7)
```

Note you can't use `length.out` and `times` together (if you try, `length.out` will be given priority and `times` ignored).

# Creating datasets with quantitative and categorical variables

## Simulate data with no differences between two groups

We want to simulate a two-level grouping variable and a "response" variable where there are no differences between the two groups.

```{r}
group = rep(letters[1:2], each = 3)
response = rnorm(n = 6, mean = 0, sd = 1)
data.frame(group, response)
```

We don't have to make each variable separately before putting them in a data.frame.

```{r}
data.frame(group = rep(letters[1:2], each = 3),
           response = rnorm(n = 6, mean = 0, sd = 1) )
```

Now let's add another categorical variable to this dataset that's *crossed* with the first.

The new factor will have three values.

```{r}
LETTERS[3:5]
```

Remember the `group` factor is repeated elementwise.

```{r}
rep(letters[1:2], each = 3)
```

So what argument do we use for the new variable?

```{r, eval = FALSE}
rep(LETTERS[3:5], ?)
```

To repeat the whole vector twice we can use `times = 2` or `length.out = 6`. You can see that every level of this new variable occurs with every level of `group`.

```{r}
data.frame(group = rep(letters[1:2], each = 3),
           factor = rep(LETTERS[3:5], times = 2),
           response = rnorm(n = 6, mean = 0, sd = 1) )
```

What if we tried to use `each` instead?

```{r}
data.frame(group = rep(letters[1:2], each = 3),
           factor = rep(LETTERS[3:5], each = 2),
           response = rnorm(n = 6, mean = 0, sd = 1) )
```

Now not every level of `factor` occurs with every level of `group`, so the two variables are no longer crossed.

## Simulate data with a difference in means between two groups

We can use a vector for `mean` in `rnorm()` for this.

```{r}
response = rnorm(n = 6, mean = c(5, 10), sd = 1)
response
```

How do we get the `group` pattern correct?

```{r, eval = FALSE}
rep(letters[1:2], ?)
```

We need `times` or `length.out` to repeat the whole vector to match the output of `rnorm()`.

```{r}
group = rep(letters[1:2], length.out = 6)
group
```

Getting the order correct is one reason to build vectors separately before binding them into a data.frame.

```{r}
data.frame(group, response)
```

# Repeatedly simulating data with `replicate()`

The `replicate()` function is a real workhorse when making repeated simulations as it is for *repeated evaluation of an expression (which will usually involve random number generation)*.

It takes three arguments:

* `n`, which is the number of replications to perform. This sets the number of repeated runs we want.
* `expr`, the expression that should be run repeatedly. This is often a function call.
* `simplify`, which controls the type of output the results of `expr` are saved into. Use `simplify = FALSE` to get output saved into a list instead of in an array.

## Simple example of `replicate()`

Generate 5 values from a standard normal distribution 3 times.

```{r}
set.seed(16)
replicate(n = 3,
          expr = rnorm(n = 5, mean = 0, sd = 1),
          simplify = FALSE )
```

Without `simplify = FALSE` we get a matrix.

```{r}
set.seed(16)
replicate(n = 3,
          expr = rnorm(n = 5, mean = 0, sd = 1) )
```

## An equivalent `for()` loop example

The same thing can be done with a `for()` loop, which I've found can be easier code to follow for R beginners.

```{r}
set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
    list1[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1
```

## Using `replicate()` to repeatedly make a dataset

Let's replicate the "two groups with no difference in means" dataset from earlier.

```{r}
data.frame(group = rep(letters[1:2], each = 3),
           response = rnorm(n = 6, mean = 0, sd = 1) )
```

This can be put as the `expr` in `replicate()`.

```{r}
simlist = replicate(n = 3,
                    expr = data.frame(group = rep(letters[1:2], each = 3),
                                      response = rnorm(n = 6, mean = 0, sd = 1) ),
                    simplify = FALSE)
```

We can see this result is a list of three data.frames.

```{r}
str(simlist)
```

Here is the first of the three.

```{r}
simlist[[1]]
```

# What's the next step?

By saving our generated variables or data.frames into a list, we can now loop through them with list looping functions like `lapply()` or `purrr::map()`.

You can see a few examples of `replicate()` followed by `map()` in my blog post [A closer look at replicate() and purrr::map() for simulations](https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/).
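
As a rough sketch of what looping through the list could look like with base R, the chunk below rebuilds a list of three simulated datasets and fits the same two-group model to each one with `lapply()`. The object names `fits` and `diffs` and the choice of `lm(response ~ group)` are illustrative assumptions, not something prescribed by the material above; any analysis function could take their place.

```{r}
set.seed(16)

# Rebuild the list of three simulated datasets so this chunk runs on its own
simlist = replicate(n = 3,
                    expr = data.frame(group = rep(letters[1:2], each = 3),
                                      response = rnorm(n = 6, mean = 0, sd = 1) ),
                    simplify = FALSE)

# Fit the same model to every dataset in the list (hypothetical analysis step)
fits = lapply(simlist, function(dat) lm(response ~ group, data = dat) )

# Extract the estimated difference between group "b" and group "a" from each fit
diffs = sapply(fits, function(fit) coef(fit)[["groupb"]] )
diffs
```

Because the data were simulated with no true difference between groups, the values in `diffs` reflect only sampling variability. Increasing `n` in `replicate()` gives a picture of how much such estimates vary by chance.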