This page was last updated on September 17, 2019.
For students in BIOL202, you can find R code associated with all chapters of the textbook here.
In this tutorial you will learn how to:
tigerstats
packagedplyr
packageYou may need to install the ‘dplyr’ package:
install.packages("dplyr")
library(dplyr)
Import in the human gene length data used in the text book (Chapter 4) directly from the text website:
Let’s look at the first handful of rows:
## geneLength
## 1 2069
## 2 1303
## 3 1067
## 4 2849
## 5 3378
## 6 2391
And use inspect
to get some descriptive statistics:
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n
## 1 geneLength integer 60 1312.25 2226.5 3444 99631 2622.027 2036.967 20290
## missing
## 1 0
There are 20290 observations in the data set, each representing a gene length.
Let’s visualize the data with a histogram:
Figure 1: Histogram of gene lengths in the human genome (n = 20290).
Problem: there is a substantial skew to the data. If we consult Fig. 4.1-1 in the text, we see that there are 26 genes excluded from the histogram because they are beyond 15000 nucleotides long, and thus render the histogram uninformative. Let’s exclude these here also.
To do so, we first introduce a new package dplyr
, which is used to manipulate data sets within R.
There are many webpages about the dplyr
package, including this excellent one here.
We will use the filter
function from the dplyr
package, to create a new data frame that holds only those genes less than 15000 nucleotides long:
This same outcome could have been achieved with the subset
function in the base R package:
shortgenes <- subset(humanGeneLengths, geneLength < 15000)
But we’ll make more use of the dplyr
package later, so introduce it here.
Now let’s re-plot the data after having excluded the extreme values:
histogram(~ geneLength, data = shortgenes,
type = "count",
col = "firebrick",
ylab = "Number of genes",
xlab = "Gene length (number of nucleotides)",
nint = 25)
Figure 1: Histogram of gene lengths in the human genome (n = 20264).
Recall that this is the entire population of genes in the human genome. We are going to draw random samples from this dataset to explore the concepts of sampling error and estimating with uncertainty.
hist
function instead of the histogram
function. It doesn’t need to be reproduced exactly, but do your best to get close.HINT: Go to Tutorial_01 and the section “creating a histogram”, and you’ll see some arguments used in the hist
function. For instance, the argument las = 1
rotates the y-axis tick labels so they’re horizontal, and the main = ""
makes sure that there is no “main” title on the graph (because we provide a figure heading instead). Consult the help on the hist
function to find out about the other arguments you can use.
There is a function sample
in the base R package that is easy to use for the purpose collecting random samples. Look at the help function for it (copy and paste the code between the “```” into your command console):
?sample
This may show you multiple help options (depending on the number of different R packages that include a sample
function). Choose the one that says “Random samples and permutations” from the base
package.
The replace
argument in the dictates whether the sampling we conduct occurs with or without replacement. For instance, the following code will not work (try copying and pasting into your console):
sample(1:6, 10, replace = F)
Here we’re telling R to sample 10 values from a vector that includes only 6 values, and with the replace = F
argument we told R not to put the sampled integers back in the pile for future sampling.
If instead we specified replace = T
, as shown below, then we be conducting “sampling with replacement”, and sampled values are placed back in the pool each time, thus enabling the code to work (try it):
sample(1:6, 10, replace = T)
TIP: Most of the time we wish to “sample without replacement”, hence replace = F
.
Before we make use of any function in R that uses a built-in random number generator, including the sample
function, we need to set the seed number in order to ensure that our results are reproducible. For this, we use the set.seed
function.
You can pick any number you want as the seed number. Here we’ll use 12, and if you do too, you’ll get the same results as shown here.
Now that the seed number has been set, we can draw a random sample from our population.
Let’s sample n = 20 gene lengths at random from the humanGeneLengths
data frame, and store them in an object:
Now let’s calculate the mean and standard deviation using our sample:
## [1] 2684.05
## [1] 1853.788
Let’s now draw another random sample of the same size (20):
And calculate the mean and sd:
## [1] 3416.95
## [1] 3550.181
Are they the same as we saw using the first random sample? NO! This reflects sampling error.
Now let’s learn code to conduct our own resampling exercise.
Let’s repeat the sampling exercise above, but do it many time, say 1000 times.
To do this we’re going to use the do
function from the mosaic
package (which is automatically loaded as part of the tigerstats
package).
?do
TIP: if you want to ensure that you’re looking up the help file for the correct function (i.e. when multiple packages may have a function of the same name), then you can type in the name of the source package followed by two colons as follows:
?mosaic::do
The do
function simply tells R to repeat whatever is after the *
.
Let’s first create an object n
to hold our sample size, and another object num.samps
to hold a number refering to the number of samples we wish to draw:
n <- 20 # our initial sample size for each sample
num.samps <- 1000 # number of different samples to draw from the population
Now we can use the do
function to create a vector of length num.samps
, wherein each element of the vector represents one sample mean, calculated using a random sample of “n” gene lengths.
sample.means <- do(num.samps) * mean(sample(humanGeneLengths$geneLength, size = n, replace = FALSE))
Note that we nest the sample
function within the mean
function. When provided with commands nested within other commands, R completes the inner-most command first, then works its way outwards. Thus, R first completes the sample
function first (it generates the desired sample), then it completes the mean
function, then it completes the do
function (which tells it to repeat the commands after the "*" num.samps
times).
Let’s visualize this sampling distribution of the mean that we’ve created:
histogram(~ sample.means,
type = "count",
col = "firebrick",
ylab = "Frequency",
xlab = "Mean gene length (number of nucleotides)",
xlim = c(0, 10000),
nint = 25)
Figure 1: Sampling distribution of mean gene lengths from the human genome (n = 1000 samples of size 20).
do
sampling routine above, and also the histogram. Is the sampling distribution of the mean narrower or wider than the one associated with sample size n = 20? (TIP: Be sure to set the x-axis lower and upper limits to be the same as those above, to facilitate direct comparison)This is the formula for calculating the standard error of the mean when the population parameter
This is one measure of uncertainty that we report alongside our estimate of the population mean.
It represents the standard deviation of the sampling distribution of the mean. Thus, it is a measure of spread for the sampling distribution of the mean.
We know that the inspect
function can be used to tell us the number of observations “n” in a sample (and importantly, it takes account of missing values or “NA” values), and also the standard deviation of the variable.
But let’s learn how to calculate sample size also using the length
function, ensuring to take account of any potential missing values by using the na.omit
function to exclude any “NA” values:
## [1] 20
So, randsamp1.n
now holds a single number that represents the number of observations (i.e. sample size “n”) in our sample.
NOTE: For some functions, such as mean
and sd
, one can include the na.rm = T
argument, which removes “NA” values before calculating. Others, such as the length
function do not accommodate the na.rm = T
argument. Hence our use of the na.omit
function above.
Now let’s calclulate the standard deviation, again using the na.omit
function nested in the sd
function:
## [1] 1853.788
randsamp1
sample again, this time using the na.rm = T
argument in lieu of the na.omit
function.Now we have the ingredients to calculate the standard error. We need to use the sqrt
function for the denominator:
## [1] 414.5196
Here again we use simply mathematical operations in R to calculate the Rule of Thumb 95% confidence interval.
The lower confidence limit is calculated as the sample mean minus 2 times the standard error.
First, let’s calculate the sample mean, and store it in an object:
## [1] 2684.05
Now for the lower confidence limit:
## [1] 1855.011
And the upper confidence limit:
## [1] 3513.089
So the rule of thumb 95% confidence interval is:
## [1] 1855.011 3513.089
NOTE: In a later tutorial you’ll learn how to calculate actual confidence intervals, rather than “rule of thumb” confidence intervals
Getting started:
read.csv
url
library
Data frame structure:
head
str
inspect
(tigerstats
/ mosaic
packages)length
Sampling:
set.seed
sample
do
(mosaic
package)Data management and exploration:
filter
(from the dplyr
package)subset
na.omit
length
mean
sd
sqrt
Graphs:
histogram
hist