--- title: "8. Introduction to randomization, Part 1" author: "Put your name here" date: "Put the date here" output: html_notebook: toc: yes toc_float: yes --- `r options(scipen=999)`

`r intToUtf8(c(50,46,48))`

::: {.summary}

### Functions introduced in this chapter

`set.seed`, `rflip`, `do`

:::


## Introduction

In this chapter, we'll learn about randomization and simulation. When we want to understand how sampling works, it's helpful to simulate the process of drawing samples repeatedly from a population. In the days before computing, this was very difficult to do. Now, a few simple lines of computer code can generate thousands (even millions) of random samples, often in a matter of seconds or less.


### Install new packages

If you are using RStudio Workbench, you do not need to install any packages. (Any packages you need should already be installed by the server administrators.)

If you are using R and RStudio on your own machine instead of accessing RStudio Workbench through a browser, you'll need to type the following command at the Console:

```
install.packages("mosaic")
```

### Download the R notebook file

Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).

Once the file is downloaded, move it to your project folder in RStudio and open it there.

### Restart R and run all chunks

In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.


### Load packages

We load the `tidyverse` package as usual. We also load the `mosaic` package, which contains some tools that make it easier to learn about randomization and simulation.

```{r}
library(tidyverse)
library(mosaic)
```


## Sample and population

The goal of the next few chapters is to help you think about the process of sampling from a population. What do these terms mean?

A *population* is a group of objects we would like to study. If that sounds vague, that's because it is. A population can be a group of any size and of any type of thing in which we're interested. Often, populations refer to groups of people. For example, in an election, the population of interest is all voters. But if you're a biologist, you might study populations of other kinds of organisms. If you're an engineer, you might study populations of bolts on bridges. If you're in finance, you might study populations of loans.

Populations are usually inaccessible in their entirety. It is impossible to survey every voter in any reasonably sized election, for example. Therefore, to study them, we have to collect a *sample*. A sample is a subset of the population. We might conduct a poll of 2000 voters to try to learn about voting intentions for the entire population. Of course, for that to work, the sample has to be *representative* of its population. We'll have more to say about that in the future.


## Flipping a coin

Before we talk about how samples are obtained from populations in the real world, we're going to perform some simulations. One of the simplest acts to simulate is flipping a coin. We could get an actual coin and physically flip it over and over again, but that is time-consuming and annoying. It is much easier to flip a "virtual" coin inside the computer.

One way to accomplish this in R is to use the `rflip` command from the `mosaic` package. To make sure we're flipping a fair coin, we'll say that we want a 50% chance of heads by including the parameter `prob = 0.5`.

One more bit of technical detail: since there will be some randomness involved here, we will need to include an R command to ensure that we all get the same results every time this code runs. This is called "setting the seed". Don't worry too much about what this is doing under the hood.
The basic idea is that two people who start with the same seed will generate the same sequence of "random" numbers. The seed `1234` in the chunk below is totally arbitrary. It could have been any number at all. (And, in fact, we'll use different numbers just for fun.) If you change the seed, you will get different output, so we all need to use the same seed. But the actual common value we all use for the seed is irrelevant.

Here is one coin flip:

```{r}
set.seed(1234)
rflip(1, prob = 0.5)
```

Here are ten coin flips:

```{r}
set.seed(1234)
rflip(10, prob = 0.5)
```

Just to confirm that this is a random process, let's flip ten coins again (but without setting the seed again):

```{r}
rflip(10, prob = 0.5)
```

If we return to the previous seed of 1234, we should obtain the same ten coin flips we did at first:

```{r}
set.seed(1234)
rflip(10, prob = 0.5)
```

And just to see the effect of setting a different seed:

```{r}
set.seed(9999)
rflip(10, prob = 0.5)
```

##### Exercise 1

In ten coin flips, how many would you generally expect to come up heads? Is that the actual number of heads you saw in the simulations above? Why aren't the simulations coming up with the expected number of heads each time?

::: {.answer}

Please write up your answer here.

:::


## Multiple simulations

Suppose now that you are not the only person flipping coins. Suppose a bunch of people in a room are all flipping coins. We'll start with ten coin flips per person, a task that could reasonably be done even without a computer. You might observe three heads in ten flips. Fine, but what about everyone else in the room? What numbers of heads will they see?

The `do` command from `mosaic` is a way of doing something multiple times. Imagine there are twenty people in the room, each flipping a coin ten times, each time with a 50% probability of coming up heads. Observe:

```{r}
set.seed(12345)
do(20) * rflip(10)
```

The syntax could not be any simpler: `do(20) *` means, literally, "do twenty times." In other words, this command is telling R to repeat an action twenty times, where the action is flipping a single coin ten times.

You'll notice that in place of a list of outcomes (H or T) of all the individual flips, we have instead a summary of the number of heads and tails each person sees. Each row represents a person, and the columns give information about each person's flips. (There are `n = 10` flips for each person, but the number of heads/tails---and the corresponding "proportion" of heads---changes from person to person.)

Looking at the above rows and columns, we see that the output of our little coin-flipping experiment is actually stored in a data frame! Let's give it a name and work with it.

```{r}
set.seed(12345)
coin_flips_20_10 <- do(20) * rflip(10, prob = 0.5)
coin_flips_20_10
```

It is significant that we can store our outcomes this way. Because we have a data frame, we can apply all our data analysis tools (graphs, charts, tables, summary statistics, etc.) to the "data" generated from our set of simulations.

For example, what is the mean number of heads these twenty people observed?

```{r}
mean(coin_flips_20_10$heads)
```
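Since the simulation results live in an ordinary data frame, other summary tools work on them as well. As a purely optional sketch (not needed for the exercises), the `favstats` function from the `mosaic` package---already loaded at the top of this chapter---computes several summary statistics at once:

```{r}
# Optional: a fuller numerical summary of the simulated head counts.
# favstats() comes from the mosaic package loaded earlier.
favstats(~ heads, data = coin_flips_20_10)
```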
##### Exercise 2

The data frame `coin_flips_20_10` contains four variables: `n`, `heads`, `tails`, and `prop`. In the code chunk above, we calculated `mean(coin_flips_20_10$heads)`, which gave us the mean count of heads for all people flipping coins. Instead of calculating the mean count of heads, change the variable from `heads` to `prop` to calculate the mean *proportion* of heads. Then explain why your answer makes sense in light of the mean count of heads calculated above.

::: {.answer}

```{r}
# Add code here to calculate the mean proportion of heads.
```

Please write up your answer here.

:::

*****

Let's look at a histogram of the number of heads we see in the simulated flips. (The fancy stuff in `scale_x_continuous` is just making sure that the x-axis goes from 0 to 10 and that the tick marks appear on each whole number.)

```{r}
ggplot(coin_flips_20_10, aes(x = heads)) +
    geom_histogram(binwidth = 0.5) +
    scale_x_continuous(limits = c(-1, 11), breaks = seq(0, 10, 1))
```

Let's do the same thing, but now let's consider the *proportion* of heads.

```{r}
ggplot(coin_flips_20_10, aes(x = prop)) +
    geom_histogram(binwidth = 0.05) +
    scale_x_continuous(limits = c(-0.1, 1.1), breaks = seq(0, 1, 0.1))
```


## Bigger and better!

With only twenty people, it was possible that, for example, nobody would get all heads or all tails. Indeed, in `coin_flips_20_10` there were no people who got all heads or all tails. Also, there were more people with six and seven heads than with five heads, even though we "expected" the average to be five heads. There is nothing particularly significant about that; it happened by pure chance alone. Another run through the above commands would generate a somewhat different outcome. That's what happens when things are random.

Instead, let's imagine that we recruited way more people to flip coins with us. Let's try it again with 2000 people:

```{r}
set.seed(1234)
coin_flips_2000_10 <- do(2000) * rflip(10, prob = 0.5)
coin_flips_2000_10
```

This is the same idea as before, but now there are 2000 rows in the data frame instead of 20.

```{r}
mean(coin_flips_2000_10$heads)
```

```{r}
ggplot(coin_flips_2000_10, aes(x = heads)) +
    geom_histogram(binwidth = 0.5) +
    scale_x_continuous(limits = c(-1, 11), breaks = seq(0, 10, 1))
```

This is helpful. In contrast with the set of simulations with twenty people, the last histogram gives us something closer to what we expect. The mode is at five heads, and every possible number of heads is represented, with decreasing counts as one moves away from five. With 2000 people flipping coins, all possible outcomes---including rare ones---are better represented.

Here is the same histogram, but this time with the proportion of heads instead of the count of heads:

```{r}
ggplot(coin_flips_2000_10, aes(x = prop)) +
    geom_histogram(binwidth = 0.05) +
    scale_x_continuous(limits = c(-0.1, 1.1), breaks = seq(0, 1, 0.1))
```
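To see the point about rare outcomes concretely, here is a small optional check (a sketch, not part of the exercises): we can count how many of the 2000 simulated people got all heads or all tails.

```{r}
# Optional check: with 2000 people, even the rarest outcomes usually show up.
# Count how many simulated people got 10 heads, and how many got 0 heads.
sum(coin_flips_2000_10$heads == 10)
sum(coin_flips_2000_10$heads == 0)
```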
##### Exercise 3

Do you think the shape of the distribution would be appreciably different if we used 20,000 or even 200,000 people? Why or why not? (Normally, I would encourage you to test your theory by trying it in R. However, it takes a *long* time to simulate that many flips and I don't want you to tie up resources and memory. Think through this in your head.)

::: {.answer}

Please write up your answer here.

:::

*****

From now on, we will insist on using at least a thousand simulations---if not more---to make sure that we represent the full range of possible outcomes.^[There is some theory behind choosing the number of times we need to simulate, but we're not going to get into all that.]


## More flips

Now let's increase the number of coin flips each person performs. We'll still use 2000 simulations (imagine 2000 people all flipping coins), but this time, each person will flip the coin 1000 times instead of only 10 times. The first code chunk below accounts for a substantial amount of the time it takes to run the code in this document.

```{r}
set.seed(1234)
coin_flips_2000_1000 <- do(2000) * rflip(1000, prob = 0.5)
coin_flips_2000_1000
```

```{r}
mean(coin_flips_2000_1000$heads)
```

```{r}
ggplot(coin_flips_2000_1000, aes(x = heads)) +
    geom_histogram(binwidth = 10, boundary = 500)
```

And now the same histogram, but with proportions:

```{r}
ggplot(coin_flips_2000_1000, aes(x = prop)) +
    geom_histogram(binwidth = 0.01, boundary = 0.5)
```

##### Exercise 4

Comment on the histogram above. Describe its shape using the vocabulary of the three important features (modes, symmetry, outliers). Why do you think it's shaped like this?

::: {.answer}

Please write up your answer here.

:::

##### Exercise 5

Given the amount of randomness involved (each person is tossing coins which randomly come up heads or tails), why do we see so much structure and orderliness in the histograms?

::: {.answer}

Please write up your answer here.

:::


## But who cares about coin flips?

It's fair to ask why we go to all this trouble to talk about coin flips. The most pressing research questions of our day do not involve people sitting around and flipping coins, either physically or virtually.

But now substitute "heads" and "tails" with "cancer" and "no cancer". Or "guilty" and "not guilty". Or "shot" and "not shot". The fact is that many important issues are measured as variables with two possible outcomes. There is some underlying "probability" of seeing one outcome over the other. (It doesn't have to be 50% like the coin.) Statistical methods---including simulation---can say a lot about what we "expect" to see if these outcomes are truly random. More importantly, when we see outcomes that *aren't* consistent with our simulations, we may wonder if there is some underlying mechanism that is not so random after all. It may not look like it at first blush, but this idea is at the core of the scientific method.

For example, let's suppose that 85% of U.S. adults support some form of background checks for gun buyers.^[This is likely close to the truth. See this article: https://iop.harvard.edu/get-involved/harvard-political-review/vast-majority-americans-support-universal-background-checks]

Now, imagine we went out and surveyed a random group of people and asked them a simple yes/no question about their support for background checks. What might we see? Let's simulate.

Imagine flipping a coin, but instead of coming up heads 50% of the time, suppose it were possible for the coin to come up heads 85% of the time.^[The idea of a "weighted" coin that can do this comes up all the time in probability and statistics courses, but it seems that it's not likely one could actually manufacture a coin that came up heads more or less than 50% of the time when flipped. See this paper for more details: http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf] A sequence of heads and tails with this weird coin would be much like randomly surveying people and asking them about background checks.

We can make a "virtual" weird coin with the `rflip` command by specifying how often we want heads to come up.

```{r}
set.seed(1234)
rflip(1, prob = 0.85)
```

If we flip our weird coin a bunch of times, we can see that our coin is not fair. Indeed, it appears to come up heads way more often than not:

```{r}
set.seed(1234)
rflip(100, prob = 0.85)
```
The results from the above code can be thought of as a survey of 100 random U.S. adults about their support for background checks for purchasing guns. "Heads" means "supports" and "tails" means "opposes." If the majority of Americans support background checks, then we will come across more people in our survey who tell us they support background checks. This shows up in our simulation as the appearance of more heads than tails.

Note that there is no guarantee that our sample will have exactly 85% heads. In fact, it doesn't; it has 90% heads. Again, keep in mind that we're simulating the act of obtaining a random sample of 100 U.S. adults. If we get a different sample, we'll get different results. (We set a different seed here so that this code chunk gives different results from the one above.)

```{r}
set.seed(123456)
rflip(100, prob = 0.85)
```

See, this time, only 81% came up heads, even though we expected 85%. That's how randomness works.
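As a purely optional aside---not needed for the exercises below---the same idea can be phrased directly in terms of survey responses using base R's `sample` function. This is just a sketch of the analogy; in this book we'll stick with `rflip`. (The seed `4321` here is as arbitrary as the others.)

```{r}
# Optional sketch: the same idea as rflip(100, prob = 0.85), but phrased as
# survey responses instead of coin flips. Each simulated respondent answers
# "supports" with probability 0.85 and "opposes" with probability 0.15.
set.seed(4321)
responses <- sample(c("supports", "opposes"), size = 100, replace = TRUE,
                    prob = c(0.85, 0.15))
table(responses)
```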
##### Exercise 6(a)

Now imagine that 2000 people all go out and conduct surveys of 100 random U.S. adults, asking them about their support for background checks. Write some R code that simulates this. Plot a histogram of the results. (Hint: you'll need `do(2000) *` in there.) Use the proportion of supporters (`prop`), not the raw count of supporters (`heads`).

::: {.answer}

```{r}
set.seed(1234)
# Add code here to simulate 2000 surveys of 100 U.S. adults.
```

```{r}
# Plot the results in a histogram using proportions.
```

:::

##### Exercise 6(b)

Run another simulation, but this time, have each person survey 1000 adults and not just 100.

::: {.answer}

```{r}
set.seed(1234)
# Add code here to simulate 2000 surveys of 1000 U.S. adults.
```

```{r}
# Plot the results in a histogram using proportions.
```

:::

##### Exercise 6(c)

What changed when you surveyed 1000 people instead of 100?

::: {.answer}

Please write up your answer here.

:::


## Sampling variability

We've seen that taking repeated samples (using the `do` command) leads to lots of different outcomes. That is randomness in action. We don't expect the results of each survey to be exactly the same every time the survey is administered.

But despite this randomness, there is an interesting pattern that we can observe. It has to do with the number of times we flip the coin. Since we're using coin flips to simulate the act of conducting a survey, the number of coin flips is playing the role of the *sample size*. In other words, if we want to simulate a survey of U.S. adults with a sample size of 100, we simulate that by flipping 100 coins.

##### Exercise 7

Go back and look at all the examples above. What do you notice about the range of values on the x-axis when the sample size is small versus large? (In other words, in what way are the histograms different when using `rflip(10, prob = ...)` or `rflip(100, prob = ...)` versus `rflip(1000, prob = ...)`? It's easier to compare histograms to one another when looking at the proportions instead of the raw head counts because proportions are always on the same scale from 0 to 1.)

::: {.answer}

Please write up your answer here.

:::


## Conclusion

Simulation is a tool for understanding what happens when a statistical process is repeated many times in a randomized way. The availability of fast computer processing makes simulation easy and accessible. Eventually, the goal will be to use simulation to answer important questions about data and the processes in the world that generate data. This is possible because, despite the ubiquitous presence of randomness, a certain order emerges when the number of samples is large enough. Even though there is sampling variability (different random outcomes each time we sample), there are patterns in that variability that can be exploited to make predictions.