This page was last updated on October 09, 2019.
In this tutorial we will learn about the following:
NOTE: In order to ensure we all get the same results when using random sampling routines, please run the following line of code every time you open an R/RStudio session (and disregard the warning it gives you):
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-
## uniform 'Rounding' sampler used
tigerstats
packagebinom
packageYou may need to install the binom
package, and if you do, copy this code into your console and press enter:
install.packages("binom", dependencies = T, quiet = T)
Then don’t forget to load the packages, by typing the following in your console:
library(tigerstats, warn.conflicts = FALSE, quietly = TRUE)
library(binom, warn.conflicts = FALSE, quietly = TRUE)
We previously learned that when the population characteristic of interest (i.e. the variable of interest) is numeric, such as height, the key population parameter of interest is typically the mean.
In the Sampling, Estimation, and Uncertainty tutorial we learned about the sampling distribution of the mean. We learned that different random samples from the population will each yield different estimates of the population mean, due simply to sampling error.
We also learned that the standard error of the mean provides a measure of uncertainty around our estimate of the population mean.
Here we are dealing a population characteristic, or variable, that is categorical - “hair colour” for example, with categories brown, black, blonde, and red.
The key population parameter of interest when dealing with a categorical variable is the proportion, but first we need to define what category we’re focusing on.
Let’s consider “red” to be the category value of interest. Doing so, we have simplified our categorical variable into a binary variable in which having red hair is considered a success, and any other colour is a failure.
We can now ask, for example: What proportion of students in the class has red hair?
This is analagous to asking: What is the mean height of students in the class?
Just as our estimate of the population mean has a sampling distribution, so too does our estimate of the population proportion, a sampling distribution for a proportion.
We can simulate the sampling distribution for a proportion as follows:
Here are the steps in R.
We will use the function rep
from the base R package. Look at the help file:
?rep
First let’s create the population from which we’ll sample individuals. We’ll make it a population of 10000 individuals, and we’ll set the proportion of red-haired individuals (i.e. the true proportion of “successes”) to 0.18:
Create a vector that stores the category names (i.e. all 4 potential outcomes), and their respective proportions in the population, noting that we already defined the proportion of red-haired individuals above as true.prop.red = 0.18
. We’ll make up the proportions for brown, black, and blonde hair.
Now we’re going to create a new vector object pop.hair
that repeats each of the four hair colour names according to their respective frequencies.
First, let’s take the hair.props
vector of proportions and multiply these by the pop.size
; the products of these multiplications will give us the frequency of each hair colour in the population. For example, multiplying the first value in hair.props
, the proportion of brown hair individuals 0.5, by 10000 yields 5000 individuals with brown hair.
## [1] 5000 2000 1800 1200
Now let’s create a new vector that repeats the category names (brown, black, red, blonde) the correct number of times:
Double-check that you have created a vector object of length 10000:
## [1] 10000
And have a look at the first handful of values in the vector:
## [1] "brown" "brown" "brown" "brown" "brown" "brown"
Now, let’s refresh our memory on how to calculate proportions of categories from a given vector of observations.
First we create a table of frequencies using the xtabs
function (from the tigerstats package), then by using the prop.table
function (functions we learned about in an earlier tutorial):
true.hair.col.freqs <- xtabs(~ pop.hair)
true.hair.col.props <- prop.table(true.hair.col.freqs)
true.hair.col.props
## pop.hair
## black blonde brown red
## 0.20 0.12 0.50 0.18
So, we verify that the true proportion of red-haired individuals in our population of 10000 is 0.18.
NOTE: notice that the ordering of the categories from this xtabs
output differs from what we entered in the hair.colours
object. This is because xtabs
re-ordered the categories alphabetically.
Now for our sampling exercise that will yield a sampling distribution:
Define a sample size to use, and the number of trials (i.e. the number of times we wish to sample the population):
To start, sample 10 individuals from the population (without replacement) and calculate the observed proportion of red-haired individuals:
set.seed(28)
samp.hair <- sample(pop.hair, size = sampsize, replace = FALSE)
obs.prop.red <- sum(samp.hair == "red") / sampsize
Let’s look at our result:
## [1] 0.2
In this example we got a proportion of 0.2, meaning 2 of the 10 people sampled had red hair.
Repeat the sampling number.of.trials
times using the familiar do
function, and store the resulting proportions (NOTE: this might take a few seconds).
TIP: In the following code, we tell R to first take a sample of size 10 from the vector object pop.hair
(without replacement), and we assign that resulting sample to a temporary object we call temp
. Next, after the semicolon, we use the sum
function to tally the number of observations in our sample that equal the category red
. Finally, we divide this number by 10 to calculate the sample proportion. All of these steps are repeated number.of.trials
times, and all the commands are placed inside curly brackets { }
.
set.seed(200)
red.prop.trials <- do(number.of.trials) *
{
temp <- sample(pop.hair, size = sampsize, replace = FALSE);
sum(temp == "red")/sampsize
}
The resulting object red.prop.trials
is a data frame with a variable called result
:
## Classes 'do.data.frame' and 'data.frame': 10000 obs. of 1 variable:
## $ result: num 0 0.1 0 0.2 0 0.2 0 0.3 0 0.1 ...
## - attr(*, "lazy")=Class 'formula' language ~{ temp <- sample(pop.hair, size = sampsize, replace = FALSE) ...
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "culler")=function (object, ...)
Think carefully about what the variable result
is storing: it stores the proportion of the sample of 10 people that had red hair, and holds 10000 such proportions.
Let’s visualize the resulting sampling distribution we created above using the barplot
function from the base package (in a previous tutorial, we learned how to use the barchartGC
function from the tigerstats
package; here we’re using the base
package function barplot
instead).
Let’s look at the help file for the barplot
function:
?barplot
Now, let’s tabulate the proportion values that we observed in our 10000 trials:
red.samp.freqs <- xtabs(~ result, data = red.prop.trials)
red.samp.props <- prop.table(red.samp.freqs)
red.samp.props
## result
## 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
## 0.1377 0.3017 0.2991 0.1718 0.0676 0.0184 0.0029 0.0008
This is a discrete probability distribution, because it deals with discrete outcomes, each with a known probability. Thus, a bar chart is appropriate here.
barplot(red.samp.props,
col = "grey",
xlab = "Proportion of people with red hair",
ylab = "Relative frequency",
las = 1,
ylim = c(0, 0.4))
Figure 1: Bar chart of the outcomes of 1000 samples from the population.
In our simulation, we see that the most common outcomes in the random trials were observing proportions of 0.1 and 0.2, corresponding to 1 or 2 out of 10 people having red hair, respectively. However, we did occassionally get as many as 6 or 7 people out of 10 with red hair. We never got more than 7 out of 10 with red hair, even though it was possible in theory.
This distribution is know as a binomial distribution, because it dealt with a binary outcome (success/failure). We’ll learn more about the binomial distribution the next tutorial.
In the Sampling, Estimation, and Uncertainty tutorial we learned that the sampling distribution for a mean was centred on the true population mean \(\mu\), and had a spread that we quantified using a special kind of standard deviation, the standard error of the mean.
Likewise, the sampling distribution for a proportion is centred on the true population proportion p, and has a spread that we quantify using the standard error of a proportion.
Let’s verify that our sampling distribution created above, and stored in the sum
variable of the data frame object prop.red.samps
, is indeed centred on the true population proportion:
## [1] 0.18007
It sure is!
NOTE: Most of the time one would simply report an estimate of a proportion with a 95% confidence interval instead of the standard error for the estimate. Nevertheless, we’ll demonstrate here how you can calculate the standard error of a proportion in R.
Let’s take a new random sample of n = 40 individuals from the population (without replacement):
Now, using this sample, let’s calculate the standard error of the proportion, analogous to the “standard error of the mean”:
In R, it can be calculated long-hand as follows:
Calculate the observed proportion of people with red hair, or \(\hat{p}\):
number.red <- sum(samp40.counts == "red") # frequency of successes
pHat.red <- number.red / sampsize
pHat.red
## [1] 0.175
Thus, 0.17 is our sample-based estimate of the true proportion of red-haired people in the population. Due to sampling error (something we expect), this is a little off from the known, true proportion of 0.18.
NOTE: In practice, we typically don’t know the true propulation value… we estimate it using a sample!
Now calculate the sample-based standard error of the proportion, using appropriate nesting of functions for expediency:
## [1] 0.06007807
When estimating a proportion, we should report both the estimate itself and the standard error of the estimate as follows:
The estimated proportion of red-haired people in the population, \(\hat{p}\), is 0.17 \(\pm\) 0.06.
Use the binom
package and the binom.confint
function to calculate the Agresti-Coull 95% confidence interval for a proportion.
First have a look at the function details:
?binom.confint
And here’s the code:
ac.conf <- binom.confint(x = number.red,
n = sampsize,
conf.level = 0.95,
methods = "ac") # the "ac" stands for "Agresti-Coull"
ac.conf
## method x n mean lower upper
## 1 agresti-coull 7 40 0.175 0.08430826 0.3226458
This returns our observed proportion (0.17, somewhat confusingly under the heading “mean”!), and our lower and upper 95% confidence limits for the proportion.
We would write this as 0.084 \(\leq\) p \(\leq\) 0.32.
In this example, our calculated 95% confidence interval does indeed encompass our known, true proportion of 0.18.
NOTE: The proptestGC
function available in the tigerstats
package calculates the standard error for a proportion, but the confidence interval that it provides is not the appropriate Agresti-Coull confidence interval. We therefore do not use this function.
sample
function as above, sample 20 individuals at random from this populationGetting started:
library
Data frame structure:
str
inspect
Tabulation:
xtabs
prop.table
Simulation:
rep
set.seed
sample
do
Math:
sum
Graphs:
barplot
Binomial distribution:
binom.confint
(from the binom
package)