---
title: "Getting started with simulating data in R: some helpful functions and how to use them"
author: "Ariel Muldoon"
date: "August 28, 2018"
output: github_document
---
# Overview
Here's what we'll do today:
1. Simulate quantitative variables with `rnorm()` and `runif()`
2. Generate character variables that represent groups via `rep()`.
3. Simulate data with both quantitative and categorical variables.
4. Use `replicate()` to repeat the data simulation process many times
# Generating random numbers
An easy way to generate numeric data is to pull random numbers from some distribution. The functions that do this in R always start with the letter `r` (for "random").
## `rnorm()` to generate random numbers from the normal distribution
Pull 5 random numbers from a standard normal distribution.
```{r}
rnorm(5)
```
### Writing out arguments for clearer code
Using the defaults makes for quick coding but does not make the parameters of the generating distribution clear.
```{r}
rnorm(n = 5, mean = 0, sd = 1)
```
### Setting the random seed for reproducible random numbers
To reproduce the random numbers, set the seed via `set.seed()`.
Set the seed and generate 5 numbers.
```{r}
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```
Reset the seed and we get the same 5 numbers.
```{r}
set.seed(16)
rnorm(n = 5, mean = 0, sd = 1)
```
### Change parameters in `rnorm()`
We can pull from different normal distributions by changing the parameters.
First, using a mean of 0 and standard deviation of 2 (so a variance of 4).
```{r}
rnorm(n = 5, mean = 0, sd = 2)
```
Using a large mean and relatively small standard deviation can give values that are strictly positive.
```{r}
rnorm(n = 5, mean = 50, sd = 20)
```
### Using vectors of values for the parameter arguments
Both `mean` and `sd` will take vectors of values.
Let's pull 10 values from distributions with different means but the same standard deviation.
```{r}
rnorm(n = 10, mean = c(0, 5, 20), sd = 1)
```
We could pass a vector to `sd`, as well, but not `n`. The `n` argument uses the length of the vector to indicate the number of values desired.
```{r}
rnorm(n = c(2, 10, 10), mean = c(0, 5, 20), sd = c(1, 5, 20) )
```
## Example of using the simulated numbers from `rnorm()`
Exploring how related two unrelated vectors can appear.
```{r}
x = rnorm(n = 10, mean = 0, sd = 1)
y = rnorm(n = 10, mean = 0, sd = 1)
plot(y ~ x)
```
## `runif()` pulls from the uniform distribution
I like `runif()` to to generate continuous data within a set range.
The default is numbers between 0 and 1.
```{r}
runif(n = 5, min = 0, max = 1)
```
But we can do any range.
```{r}
runif(n = 5, min = 50, max = 100)
```
## Example of using the simulated numbers from `runif()`
I like `runif()` to demonstrate how the relative size of the explanatory variable affects the estimated coefficient.
Make a response variable via `rnorm()` and then generate an explanatory variable between 1 and 2.
```{r}
set.seed(16)
y = rnorm(n = 100, mean = 0, sd = 1)
x1 = runif(n = 100, min = 1, max = 2)
head(x1)
```
Then make an explanatory variable between 200 and 300.
```{r}
x2 = runif(n = 100, min = 200, max = 300)
head(x2)
```
We'll use the data in a regression model fit via `lm()`, making note of the coefficients.
```{r}
lm(y ~ x1 + x2)
```
# Generate character vectors with `rep()`
Simulations involve categorical variables, as well, that often need to be repeated in a pattern.
## Using `letters` and `LETTERS`
These are *built in constants* in R, and convenient for making a simple character vectors.
The first two lowercase letters.
```{r}
letters[1:2]
```
The last 17 uppercase letters.
```{r}
LETTERS[10:26]
```
## Repeat each element of a vector with `each`
With `each` we repeat each unique character in the vector some number of times.
```{r}
rep(letters[1:2], each = 3)
```
## Repeat a whole vector with the `times` argument
Repeating is different with `times`.
```{r}
rep(letters[1:2], times = 3)
```
## Set the output vector length with the `length.out` argument
Using `length.out` is similar to `times` but the groups can be imbalanced.
```{r}
rep(letters[1:2], length.out = 5)
```
## Repeat each element a different number of `times`
We can get unbalanced data with `times` if we use a vector for the argument.
```{r}
rep(letters[1:2], times = c(2, 4) )
```
## Combining `each` with `times`
When using `times` this way it will only take a single value and not a vector.
```{r}
rep(letters[1:2], each = 2, times = 3)
```
## Combining `each` with `length.out`
This is another way to impart imbalance.
```{r}
rep(letters[1:2], each = 2, length.out = 7)
```
Note you can't use `length.out` and `times` together (if you try, `length.out` will be given priority and `times` ignored).
# Creating datasets with quantiative and categorical variables
## Simulate data with no differences among two groups
We want to simulate a two level grouping variable and a "response" variable where there are no differences among the two groups.
```{r}
group = rep(letters[1:2], each = 3)
response = rnorm(n = 6, mean = 0, sd = 1)
data.frame(group,
response)
```
We don't have to make each variable separately before putting in a data.frame.
```{r}
data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) )
```
Now let's add another categorical variable to this dataset that's *crossed* with the first.
The new factor will have three values.
```{r}
LETTERS[3:5]
```
Remember the `group` factor is repeated elementwise.
```{r}
rep(letters[1:2], each = 3)
```
So what argument do we use for the new variable?
```{r, eval = FALSE}
rep(LETTERS[3:5], ?)
```
To repeat the whole vector twice can use the `times` argument or `length.out = 6`.
You can see that every level of this new variable occurs with every level of `group`.
```{r}
data.frame(group = rep(letters[1:2], each = 3),
factor = rep(LETTERS[3:5], times = 2),
response = rnorm(n = 6, mean = 0, sd = 1) )
```
What if we tried to use `each` instead?
```{r}
data.frame(group = rep(letters[1:2], each = 3),
factor = rep(LETTERS[3:5], each = 2),
response = rnorm(n = 6, mean = 0, sd = 1) )
```
## Simulate data with a difference in means among two groups
We can use a vector for `mean` in `rnorm()` for this.
```{r}
response = rnorm(n = 6, mean = c(5, 10), sd = 1)
response
```
How do we get the `group` pattern correct?
```{r, eval = FALSE}
rep(letters[1:2], ?)
```
We need `times` or `length.out` to repeat the whole vector to match the output of `rnorm()`.
```{r}
group = rep(letters[1:2], length.out = 6)
group
```
Getting the order correct is one reason to build vectors separately before binding them into a data.frame.
```{r}
data.frame(group,
response)
```
# Repeatedly simulating data with `replicate()`
The `replicate()` function is a real workhorse when making repeated simulations as it is for *repeated evaluation of an expression (which will usually involve random number generation)*.
It takes three arguments:
* `n`, which is the number of replications to perform. This is to set the number of repeated runs we want.
* `expr`, the expression that should be run repeatedly. This is often a function.
* `simplify`, which controls the type of output the results of `expr` are saved into. Use `simplify = FALSE` to get output saved into a list instead of in an array.
## Simple example of `replicate()`
Generate 5 values from a standard normal distribution 3 times.
```{r}
set.seed(16)
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1),
simplify = FALSE )
```
Without `simplify = FALSE` we get a matrix.
```{r}
set.seed(16)
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1) )
```
## An equivalent `for()` loop example
The same thing can be done with a `for()` loop, which I've found can be easier code to follow for R beginners.
```{r}
set.seed(16)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
list1[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1
```
## Using `replicate()` to repeatedly make a dataset
Let's replicate the "two groups with no difference in means" dataset from earlier.
```{r}
data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) )
```
This can be put as the `expr` in `replicate()`.
```{r}
simlist = replicate(n = 3,
expr = data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) ),
simplify = FALSE)
```
We can see this result is a list of three data.frames.
```{r}
str(simlist)
```
Here is the first of the three.
```{r}
simlist[[1]]
```
# What's the next step?
By saving our generated variables or data.frames into a list we've made it so we can loop via list looping functions like `lapply()` or `purrr::map()`.
You can see a few examples of `replicate()` followed by `map()` in my blog post [A closer look at replicate() and purrr::map() for simulations](https://aosmith.rbind.io/2018/06/05/a-closer-look-at-replicate-and-purrr/).