This page was last updated on October 09, 2019.


Getting started

In this tutorial we will learn about the following:

  • The Sampling Distribution for a Proportion
  • Properties of the sampling distribution for a proportion
  • Calculating the Standard Error for a Proportion
  • Calculating a Confidence interval for a Proportion

NOTE: In order to ensure we all get the same results when using random sampling routines, please run the following line of code every time you open an R/RStudio session (and disregard the warning it gives you):

## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-
## uniform 'Rounding' sampler used

Required packages

  • the tigerstats package
  • the binom package

You may need to install the binom package, and if you do, copy this code into your console and press enter:

install.packages("binom", dependencies = T, quiet = T)

Then don’t forget to load the packages, by typing the following in your console:

library(tigerstats, warn.conflicts = FALSE, quietly = TRUE)
library(binom, warn.conflicts = FALSE, quietly = TRUE)

Required data

  • none

The sampling distribution for a proportion

A refresher

We previously learned that when the population characteristic of interest (i.e. the variable of interest) is numeric, such as height, the key population parameter of interest is typically the mean.

In the Sampling, Estimation, and Uncertainty tutorial we learned about the sampling distribution of the mean. We learned that different random samples from the population will each yield different estimates of the population mean, due simply to sampling error.

We also learned that the standard error of the mean provides a measure of uncertainty around our estimate of the population mean.

Here we are dealing a population characteristic, or variable, that is categorical - “hair colour” for example, with categories brown, black, blonde, and red.

The key population parameter of interest when dealing with a categorical variable is the proportion, but first we need to define what category we’re focusing on.

Let’s consider “red” to be the category value of interest. Doing so, we have simplified our categorical variable into a binary variable in which having red hair is considered a success, and any other colour is a failure.

We can now ask, for example: What proportion of students in the class has red hair?

This is analagous to asking: What is the mean height of students in the class?

Just as our estimate of the population mean has a sampling distribution, so too does our estimate of the population proportion, a sampling distribution for a proportion.


Simulating the sampling distribution for a proportion

We can simulate the sampling distribution for a proportion as follows:

  • define a population whose individuals have brown, black, red, or blonde hair in some defined proportions
  • define “red” hair as our category of interest, i.e. the “success”
  • draw a sample of size n from this population and calculate the proportion of successes
  • repeat the sampling (random trial) many times, each time tallying the number and corresponding proportion of successes
  • plot the resulting sampling distribution

Here are the steps in R.

We will use the function rep from the base R package. Look at the help file:

?rep

First let’s create the population from which we’ll sample individuals. We’ll make it a population of 10000 individuals, and we’ll set the proportion of red-haired individuals (i.e. the true proportion of “successes”) to 0.18:

Create a vector that stores the category names (i.e. all 4 potential outcomes), and their respective proportions in the population, noting that we already defined the proportion of red-haired individuals above as true.prop.red = 0.18. We’ll make up the proportions for brown, black, and blonde hair.

Now we’re going to create a new vector object pop.hair that repeats each of the four hair colour names according to their respective frequencies.

First, let’s take the hair.props vector of proportions and multiply these by the pop.size; the products of these multiplications will give us the frequency of each hair colour in the population. For example, multiplying the first value in hair.props, the proportion of brown hair individuals 0.5, by 10000 yields 5000 individuals with brown hair.

## [1] 5000 2000 1800 1200

Now let’s create a new vector that repeats the category names (brown, black, red, blonde) the correct number of times:

Double-check that you have created a vector object of length 10000:

## [1] 10000

And have a look at the first handful of values in the vector:

## [1] "brown" "brown" "brown" "brown" "brown" "brown"

Now, let’s refresh our memory on how to calculate proportions of categories from a given vector of observations.

First we create a table of frequencies using the xtabs function (from the tigerstats package), then by using the prop.table function (functions we learned about in an earlier tutorial):

## pop.hair
##  black blonde  brown    red 
##   0.20   0.12   0.50   0.18

So, we verify that the true proportion of red-haired individuals in our population of 10000 is 0.18.

NOTE: notice that the ordering of the categories from this xtabs output differs from what we entered in the hair.colours object. This is because xtabs re-ordered the categories alphabetically.

  1. Refresher: create a bar chart showing the frequency distribution of hair colours (relevant tutorial found here)

Now for our sampling exercise that will yield a sampling distribution:

Define a sample size to use, and the number of trials (i.e. the number of times we wish to sample the population):

To start, sample 10 individuals from the population (without replacement) and calculate the observed proportion of red-haired individuals:

Let’s look at our result:

## [1] 0.2

In this example we got a proportion of 0.2, meaning 2 of the 10 people sampled had red hair.

Repeat the sampling number.of.trials times using the familiar do function, and store the resulting proportions (NOTE: this might take a few seconds).

TIP: In the following code, we tell R to first take a sample of size 10 from the vector object pop.hair (without replacement), and we assign that resulting sample to a temporary object we call temp. Next, after the semicolon, we use the sum function to tally the number of observations in our sample that equal the category red. Finally, we divide this number by 10 to calculate the sample proportion. All of these steps are repeated number.of.trials times, and all the commands are placed inside curly brackets { }.

The resulting object red.prop.trials is a data frame with a variable called result:

## Classes 'do.data.frame' and 'data.frame':    10000 obs. of  1 variable:
##  $ result: num  0 0.1 0 0.2 0 0.2 0 0.3 0 0.1 ...
##  - attr(*, "lazy")=Class 'formula'  language ~{     temp <- sample(pop.hair, size = sampsize, replace = FALSE) ...
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  - attr(*, "culler")=function (object, ...)

Think carefully about what the variable result is storing: it stores the proportion of the sample of 10 people that had red hair, and holds 10000 such proportions.


Visualizing the sampling distribution for a proportion

Let’s visualize the resulting sampling distribution we created above using the barplot function from the base package (in a previous tutorial, we learned how to use the barchartGC function from the tigerstats package; here we’re using the base package function barplot instead).

Let’s look at the help file for the barplot function:

?barplot

Now, let’s tabulate the proportion values that we observed in our 10000 trials:

## result
##      0    0.1    0.2    0.3    0.4    0.5    0.6    0.7 
## 0.1377 0.3017 0.2991 0.1718 0.0676 0.0184 0.0029 0.0008

This is a discrete probability distribution, because it deals with discrete outcomes, each with a known probability. Thus, a bar chart is appropriate here.

Figure 1: Bar chart of the outcomes of 1000 samples from the population.

Figure 1: Bar chart of the outcomes of 1000 samples from the population.

In our simulation, we see that the most common outcomes in the random trials were observing proportions of 0.1 and 0.2, corresponding to 1 or 2 out of 10 people having red hair, respectively. However, we did occassionally get as many as 6 or 7 people out of 10 with red hair. We never got more than 7 out of 10 with red hair, even though it was possible in theory.

This distribution is know as a binomial distribution, because it dealt with a binary outcome (success/failure). We’ll learn more about the binomial distribution the next tutorial.


Properties of the sampling distribution for a proportion

In the Sampling, Estimation, and Uncertainty tutorial we learned that the sampling distribution for a mean was centred on the true population mean \(\mu\), and had a spread that we quantified using a special kind of standard deviation, the standard error of the mean.

Likewise, the sampling distribution for a proportion is centred on the true population proportion p, and has a spread that we quantify using the standard error of a proportion.

Let’s verify that our sampling distribution created above, and stored in the sum variable of the data frame object prop.red.samps, is indeed centred on the true population proportion:

## [1] 0.18007

It sure is!


Calculating the standard error for a proportion

NOTE: Most of the time one would simply report an estimate of a proportion with a 95% confidence interval instead of the standard error for the estimate. Nevertheless, we’ll demonstrate here how you can calculate the standard error of a proportion in R.

Let’s take a new random sample of n = 40 individuals from the population (without replacement):

Now, using this sample, let’s calculate the standard error of the proportion, analogous to the “standard error of the mean”:

In R, it can be calculated long-hand as follows:

Calculate the observed proportion of people with red hair, or \(\hat{p}\):

## [1] 0.175

Thus, 0.17 is our sample-based estimate of the true proportion of red-haired people in the population. Due to sampling error (something we expect), this is a little off from the known, true proportion of 0.18.

NOTE: In practice, we typically don’t know the true propulation value… we estimate it using a sample!

Now calculate the sample-based standard error of the proportion, using appropriate nesting of functions for expediency:

## [1] 0.06007807

When estimating a proportion, we should report both the estimate itself and the standard error of the estimate as follows:

The estimated proportion of red-haired people in the population, \(\hat{p}\), is 0.17 \(\pm\) 0.06.


Calculating the 95% confidence interval for a proportion

Use the binom package and the binom.confint function to calculate the Agresti-Coull 95% confidence interval for a proportion.

First have a look at the function details:

?binom.confint

And here’s the code:

##          method x  n  mean      lower     upper
## 1 agresti-coull 7 40 0.175 0.08430826 0.3226458

This returns our observed proportion (0.17, somewhat confusingly under the heading “mean”!), and our lower and upper 95% confidence limits for the proportion.

We would write this as 0.084 \(\leq\) p \(\leq\) 0.32.

In this example, our calculated 95% confidence interval does indeed encompass our known, true proportion of 0.18.


NOTE: The proptestGC function available in the tigerstats package calculates the standard error for a proportion, but the confidence interval that it provides is not the appropriate Agresti-Coull confidence interval. We therefore do not use this function.


  1. Challenge: simulate estimating a proportion
  • The proportion of males in a population of 100000 people is known to be 0.52
  • Using the sample function as above, sample 20 individuals at random from this population
  • Calculate the sample-based estimate of the proportion of males, and the Agresti-Coull 95% confidence interval
  • Set your seed with number 133 at the beginning of your code

List of functions

Getting started:

  • library

Data frame structure:

  • str
  • inspect

Tabulation:

  • xtabs
  • prop.table

Simulation:

  • rep
  • set.seed
  • sample
  • do

Math:

  • sum

Graphs:

  • barplot

Binomial distribution:

  • binom.confint (from the binom package)