This page was last updated on September 17, 2019.

For students in BIOL202, you can find R code associated with all chapters of the textbook here.


Getting started

In this tutorial you will learn how to:

  • generate random samples from a population, to explore concepts such as sampling error
  • quantify uncertainty around estimates: standard error, confidence intervals
  • create and visualize sampling distributions of the mean
  • generate a vector of random numbers drawn from a normal distribution

Required packages

  • the tigerstats package
  • the dplyr package

You may need to install the ‘dplyr’ package:

install.packages("dplyr")
library(dplyr)

Import data

Import in the human gene length data used in the text book (Chapter 4) directly from the text website:


Inspect the data

Let’s look at the first handful of rows:

##   geneLength
## 1       2069
## 2       1303
## 3       1067
## 4       2849
## 5       3378
## 6       2391

And use inspect to get some descriptive statistics:

## 
## quantitative variables:  
##         name   class min      Q1 median   Q3   max     mean       sd     n
## 1 geneLength integer  60 1312.25 2226.5 3444 99631 2622.027 2036.967 20290
##   missing
## 1       0

There are 20290 observations in the data set, each representing a gene length.

Let’s visualize the data with a histogram:

Figure 1: Histogram of gene lengths in the human genome (n = 20290).

Figure 1: Histogram of gene lengths in the human genome (n = 20290).

Problem: there is a substantial skew to the data. If we consult Fig. 4.1-1 in the text, we see that there are 26 genes excluded from the histogram because they are beyond 15000 nucleotides long, and thus render the histogram uninformative. Let’s exclude these here also.

To do so, we first introduce a new package dplyr, which is used to manipulate data sets within R.

There are many webpages about the dplyr package, including this excellent one here.

We will use the filter function from the dplyr package, to create a new data frame that holds only those genes less than 15000 nucleotides long:

This same outcome could have been achieved with the subset function in the base R package:

shortgenes <- subset(humanGeneLengths, geneLength < 15000)

But we’ll make more use of the dplyr package later, so introduce it here.

Now let’s re-plot the data after having excluded the extreme values:

Figure 1: Histogram of gene lengths in the human genome (n = 20264).

Figure 1: Histogram of gene lengths in the human genome (n = 20264).

Recall that this is the entire population of genes in the human genome. We are going to draw random samples from this dataset to explore the concepts of sampling error and estimating with uncertainty.

  1. Try reproducing the histogram above using the hist function instead of the histogram function. It doesn’t need to be reproduced exactly, but do your best to get close.

HINT: Go to Tutorial_01 and the section “creating a histogram”, and you’ll see some arguments used in the hist function. For instance, the argument las = 1 rotates the y-axis tick labels so they’re horizontal, and the main = "" makes sure that there is no “main” title on the graph (because we provide a figure heading instead). Consult the help on the hist function to find out about the other arguments you can use.


Random samples

There is a function sample in the base R package that is easy to use for the purpose collecting random samples. Look at the help function for it (copy and paste the code between the “```” into your command console):

?sample

This may show you multiple help options (depending on the number of different R packages that include a sample function). Choose the one that says “Random samples and permutations” from the base package.

The replace argument in the dictates whether the sampling we conduct occurs with or without replacement. For instance, the following code will not work (try copying and pasting into your console):

sample(1:6, 10, replace = F)

Here we’re telling R to sample 10 values from a vector that includes only 6 values, and with the replace = F argument we told R not to put the sampled integers back in the pile for future sampling.

If instead we specified replace = T, as shown below, then we be conducting “sampling with replacement”, and sampled values are placed back in the pool each time, thus enabling the code to work (try it):

sample(1:6, 10, replace = T)

TIP: Most of the time we wish to “sample without replacement”, hence replace = F.


Setting the “seed”

Before we make use of any function in R that uses a built-in random number generator, including the sample function, we need to set the seed number in order to ensure that our results are reproducible. For this, we use the set.seed function.

You can pick any number you want as the seed number. Here we’ll use 12, and if you do too, you’ll get the same results as shown here.


Sampling error

Now that the seed number has been set, we can draw a random sample from our population.

Let’s sample n = 20 gene lengths at random from the humanGeneLengths data frame, and store them in an object:

Now let’s calculate the mean and standard deviation using our sample:

## [1] 2684.05
## [1] 1853.788

Let’s now draw another random sample of the same size (20):

And calculate the mean and sd:

## [1] 3416.95
## [1] 3550.181

Are they the same as we saw using the first random sample? NO! This reflects sampling error.

  1. Follow this online tutorial that helps visualize the construction of a sampling distribution of the mean

Sampling distribution of the mean

Now let’s learn code to conduct our own resampling exercise.

Let’s repeat the sampling exercise above, but do it many time, say 1000 times.

To do this we’re going to use the do function from the mosaic package (which is automatically loaded as part of the tigerstats package).

?do

TIP: if you want to ensure that you’re looking up the help file for the correct function (i.e. when multiple packages may have a function of the same name), then you can type in the name of the source package followed by two colons as follows:

?mosaic::do

The do function simply tells R to repeat whatever is after the *.

Let’s first create an object n to hold our sample size, and another object num.samps to hold a number refering to the number of samples we wish to draw:

Now we can use the do function to create a vector of length num.samps, wherein each element of the vector represents one sample mean, calculated using a random sample of “n” gene lengths.

Note that we nest the sample function within the mean function. When provided with commands nested within other commands, R completes the inner-most command first, then works its way outwards. Thus, R first completes the sample function first (it generates the desired sample), then it completes the mean function, then it completes the do function (which tells it to repeat the commands after the "*" num.samps times).

Let’s visualize this sampling distribution of the mean that we’ve created:

Figure 1: Sampling distribution of mean gene lengths from the human genome (n = 1000 samples of size 20).

Figure 1: Sampling distribution of mean gene lengths from the human genome (n = 1000 samples of size 20).

  1. Change the value of the object “n” (the sample size variable we defined earlier) from 20 to 50, and re-do the do sampling routine above, and also the histogram. Is the sampling distribution of the mean narrower or wider than the one associated with sample size n = 20? (TIP: Be sure to set the x-axis lower and upper limits to be the same as those above, to facilitate direct comparison)

Standard error of the mean

This is the formula for calculating the standard error of the mean when the population parameter is unknown:

This is one measure of uncertainty that we report alongside our estimate of the population mean.

It represents the standard deviation of the sampling distribution of the mean. Thus, it is a measure of spread for the sampling distribution of the mean.


Calculating the SEM in R

We know that the inspect function can be used to tell us the number of observations “n” in a sample (and importantly, it takes account of missing values or “NA” values), and also the standard deviation of the variable.

But let’s learn how to calculate sample size also using the length function, ensuring to take account of any potential missing values by using the na.omit function to exclude any “NA” values:

## [1] 20

So, randsamp1.n now holds a single number that represents the number of observations (i.e. sample size “n”) in our sample.

NOTE: For some functions, such as mean and sd, one can include the na.rm = T argument, which removes “NA” values before calculating. Others, such as the length function do not accommodate the na.rm = T argument. Hence our use of the na.omit function above.

Now let’s calclulate the standard deviation, again using the na.omit function nested in the sd function:

## [1] 1853.788
  1. Calculate the standard deviation of the randsamp1 sample again, this time using the na.rm = T argument in lieu of the na.omit function.

Now we have the ingredients to calculate the standard error. We need to use the sqrt function for the denominator:

## [1] 414.5196

Calculating the rule of thumb 95% confidence interval

Here again we use simply mathematical operations in R to calculate the Rule of Thumb 95% confidence interval.

The lower confidence limit is calculated as the sample mean minus 2 times the standard error.

First, let’s calculate the sample mean, and store it in an object:

## [1] 2684.05

Now for the lower confidence limit:

## [1] 1855.011

And the upper confidence limit:

## [1] 3513.089

So the rule of thumb 95% confidence interval is:

## [1] 1855.011 3513.089

NOTE: In a later tutorial you’ll learn how to calculate actual confidence intervals, rather than “rule of thumb” confidence intervals

List of functions (and the source packages) used in tutorial

Getting started:

  • read.csv
  • url
  • library

Data frame structure:

  • head
  • str
  • inspect (tigerstats / mosaic packages)
  • length

Sampling:

  • set.seed
  • sample
  • do (mosaic package)

Data management and exploration:

  • filter (from the dplyr package)
  • subset
  • na.omit
  • length
  • mean
  • sd
  • sqrt

Graphs:

  • histogram
  • hist