## EPID 7500 - Binomial Distribution Tutorial
## Fall 2017
## UGA
## Steve Bellan
################################################################################
## Adapted from the below
################################################################################
## Fun with one of my favorite distributions
## Clinic on Dynamical Approaches to Infectious Disease Data
## International Clinics on Infectious Disease Dynamics and Data Program
## University of Florida, Gainesville, FL, USA
##
## Jim Scott, 2012
## Some Rights Reserved
## CC BY-NC 3.0 (http://creativecommons.org/licenses/by-nc/3.0/)
################################################################################
################################################################################
##
## By the end of this tutorial you should:
##
## * understand the terms: sample space, discrete, random variable, probability
## distribution
## * know the characteristics of a binomial process
## * know the binomial formula and how to use it
## * be able to utilize the functions dbinom, pbinom, and rbinom
## * be able to plot a binomial distribution in R
## * visualize how the parameters of the binomial distribution affect it.
# The binomial probability model is a discrete probability model that is
# commonly used in many applications.
# But, what IS it, exactly?
# The "discrete" part means that it produces random variables that have a finite
# number of possible values. The "probability model" part means
# that it provides us with two important pieces of information: the possible
# outcomes that may occur and the probabilities with which each outcome occurs.
# There are a number of common discrete probability models, but the
# binomial model is used to model the number of successes that occur in a
# fixed number of independent trials. It's a good model for a process that:
# 1) has a fixed number of independent trials, 2) has outcomes that can be
# categorized as either "success" or "failure", and 3) has a constant
# probability of "success" (i.e. the probability of a "success" doesn't
# change from trial to trial).
# Coin flipping is a common example of a binomial process. Suppose I flip a
# coin 10 times. One flip represents one trial. The number of trials is fixed
# (n=10). There are only two outcomes. The probability of a success (e.g.
# getting "tails") remains constant. The number of tails in 10 trials is a
# binomially distributed random variable.
# What's a random variable? Well, before we talk about that, let's define one
# other term - sample space. The sample space of a random phenomenon is the set
# of all possible outcomes that could occur. Note that an outcome need not be
# numeric. For example, consider tossing a coin one time. The sample space is
# {heads, tails}. If you were to
# toss the coin 2 times the sample space would consist of:
# S = {heads heads, heads tails, tails heads, tails tails}
# and we would say the size of the sample space is 4.
# A random variable maps each outcome in the sample space to a numeric value.
# For example, if I defined the random variable X as the number of tails that
# occurs in two flips of a coin, I would observe the following:
#
# outcome X
# heads heads 0
# heads tails 1
# tails heads 1
# tails tails 2
#
# I could describe the distribution of X in the following manner:
#
# outcome probability
# X = 0 0.25
# X = 1 0.50
# X = 2 0.25
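# The table above can be reproduced in R by enumerating the sample space
# directly; here is a quick sketch using base R's expand.grid():
flips <- expand.grid(flip1 = c('heads','tails'), flip2 = c('heads','tails'))
numTailsPerOutcome <- rowSums(flips == 'tails') ## X for each outcome
table(numTailsPerOutcome) / nrow(flips)         ## matches the probabilities above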
# Also, notice that the random variable X is binomial because we have: 1) fixed
# number of independent trials (n=2), 2) either a "success" (tails) or failure
# occurs (not tails), and 3) the probability of a "success" (getting tails)
# remains constant. Here the probability of a success in any one trial is p=0.5
# If we suspect that a random variable is binomial, we can use the binomial
# formula to calculate the probability that X equals some specific value. For
# example, if we didn't know that the probability of getting 0 tails on 2 flips
# is 0.25, we could have used the following formula to obtain it:
#
# Binomial formula:
# P(X = x) = N!/[(x!)(N-x)!] * p^x * (1-p)^(N-x)
#
# You might be familiar with seeing N!/[(x!)(N-x)!] written as "N choose x"
#
# In the above equation:
# x = number of successes
# N = number of trials
# p = the probability of a success
# In our example, N = 2 and p = 0.5. To calculate the probability that X = 0
# use:
# P(X = 0) = 2!/[(0!)(2-0)!] * 0.5^0 * (1-0.5)^(2-0)
# = 1 * 1 * 0.25 = 0.25 (recall that 0! = 1)
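# R's choose() function computes "N choose x", so you can check this
# arithmetic directly:
choose(2, 0) * 0.5^0 * (1 - 0.5)^(2 - 0) ## = 0.25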
# This has all been a preamble toward getting you prepared to use the binomial
# distribution in R. To obtain the calculation above in R, you need only
# execute the following command:
dbinom(0,2,0.5)
# The syntax is dbinom(x, N, p)
# That one isn't too hard to calculate in your head, but suppose you wanted to
# know the probability of getting exactly 20 tails if you flipped a coin 50
# times. Instead of calculating it with the formula (or a calculator), just use
# R. Here, N=50, p is still 0.5. Run:
dbinom(20, 50, 0.5)
# If you'd like to determine the entire distribution when N=50 and p=0.5 you
# could type:
dbinom(0:50, 50, 0.5)
# Nice, but not that visually stimulating. Let's plot it!
library(tibble)  ## for tibble()
library(ggplot2) ## for ggplot()
numtails <- 0:50
probdens <- dbinom(numtails, size=50, prob=1/2)
binomdat <- tibble(numtails, probdens)
ggplot(binomdat, aes(numtails, probdens)) + geom_bar(stat='identity', fill='blue') +
ylab('Probability') + xlab('Number of Tails') +
ggtitle('Binomial Distribution, N=50, p=1/2')
# The binomial distribution has 2 parameters, N, the number of fixed trials, and
# p, the probability of a success. The distribution is completely determined by
# these values, just like a normal distribution is completely determined by its
# parameters, the mean and the standard deviation. To see how the distribution
# changes as p changes, try highlighting and running the following code (here
# we're holding N constant at 50):
binomBarPlot <- function(N=50, p=.5) {
  ## build the distribution for this N and p (rather than reusing the global
  ## binomdat, whose numtails column is fixed at 0:50)
  binomdat <- tibble(numtails = 0:N,
                     probdens = dbinom(numtails, size = N, prob = p))
  bp <- ggplot(binomdat, aes(numtails, probdens)) + geom_bar(stat='identity', fill='blue') +
    ylab('Probability') + xlab('Number of Tails') +
    ggtitle(paste0('Binomial Distribution, N=', N, ', p=', p))
  print(bp)
}
binomBarPlot(p=.2)
binomBarPlot(p=.5)
binomBarPlot(p=.8)
binomBarPlot(p=.9)
# If you didn't see all the plots, try clicking the back arrow a few times in
# the plot window to see all the plots that R produced.
# To get an idea of how the binomial distribution can change when N varies try
# running the following code. Here p is set to 1/2, after you've run the code
# once, try changing p to a different value.
binomBarPlot(N=3)
binomBarPlot(N=5)
binomBarPlot(N=20)
binomBarPlot(N=100)
# Again, if you didn't see all the plots, try clicking the back arrow a few
# times in the plot window to see all the plots that R produced.
# The binomial formula provides you with just the height of one bar,
# which equates to the probability of observing exactly that many tails.
# To determine the probability of flipping a certain number of tails OR
# fewer, we can sum up the bar heights. For example, to get the
# probability of 20 or fewer tails in 50 flips, you could type:
dbinom(0:20, 50, 0.5)
sum(dbinom(0:20, 50, 0.5))
# or, equivalently
?pbinom
pbinom(q=20, size=50, prob=0.5)
# The difference between these two commands is that dbinom gives the height of
# each bar you specify, whereas pbinom sums up the bars that are less than or
# equal to its first argument (q)
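# One more option worth knowing: pbinom() takes a lower.tail argument.
# Setting lower.tail=FALSE gives the probability of MORE than q successes:
pbinom(q=20, size=50, prob=0.5, lower.tail=FALSE) ## P(X > 20)
1 - pbinom(q=20, size=50, prob=0.5)               ## same value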
# To create a visual of this you could execute the following:
numtails <- 0:50
probdens <- dbinom(numtails, size=50, prob=1/2)
binomdat <- tibble(numtails, probdens)
ggplot(binomdat, aes(numtails, probdens)) + geom_bar(stat='identity', fill=ifelse(numtails<=20, 'blue', 'light blue')) +
ylab('Probability') + xlab('Number of Tails') +
ggtitle('Binomial Distribution, N=50, p=1/2')
## In case you want to do this in base R plotting, see the below code
barplot(dbinom(numtails, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, N=50, p=1/2",xlab="Number of Tails",
col=ifelse(numtails<=20,"blue","light blue"))
# The area of the dark blue bars is:
pbinom(20,50,0.5)
# You can also use R to simulate binomially distributed random variables. For
# example, suppose N = 50 and p=1/2. You'd expect to get 25 successes, but due
# to random chance you won't always get 25. Sometimes you'll get a few more,
# sometimes a few less. To simulate the outcome of 50 coin tosses, just type:
?rbinom
rbinom(n=1, size=50, prob=1/2)
# If you'd like to simulate the results of 1000 people flipping a coin 50
# times, type:
numTailsSim <- rbinom(1000,50,1/2)
mySim <- tibble(numTailsSim)
mySim
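# As a quick sanity check on the simulation, the sample mean should be
# close to N*p = 25 and the sample standard deviation close to
# sqrt(N*p*(1-p)) = sqrt(12.5), about 3.54 (exact values vary run to run):
mean(numTailsSim)
sd(numTailsSim)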
# the output isn't so useful in that form...try plotting it instead:
p1 <- ggplot(mySim, aes(x=numTailsSim)) + xlab('Number of Tails')
p1 + geom_histogram(binwidth = 1) ## frequency counts (number of observations)
p1 + geom_histogram(binwidth = 1, aes(y=..density..)) ## density (proportion of observations)
## in base R
## hist(numTailsSim, freq=FALSE, breaks=0:51-.5, ylim=c(0,0.2), col="grey",
##      ylab="Density", xlab="Number of tails in 50 flips")
# Your distribution should look similar to the theoretical binomial
# distribution:
numtails <- 0:50
probdens <- dbinom(numtails, size=50, prob=1/2)
binomdat <- tibble(numtails, probdens)
## Plot the two together
p1 + geom_histogram(binwidth = 1, aes(y=..density..)) + ## density (proportion of observations)
geom_line(data=binomdat, aes(numtails, probdens), col='red', lwd=3, alpha = .6)
## What if you have 10 people flipping different coins, 3 of which
## have a 90% chance of tails, and 7 of which have only a 30% chance
## of tails and they each flip it 100 times?
## rbinom() is *vectorized*. This means it can handle multiple values
## for each of its arguments and returns the corresponding answers,
## recycling shorter arguments as needed
coinProbs <- rep(c(.9,.3), c(3,7))
coinProbs
rbinom(n=10, size=100, prob = coinProbs)
## what if we had 10 people who only flipped a coin once?
rbinom(n=10, size=1, prob = coinProbs)
## We could similarly imagine that instead of tails, a 1 indicates having some disease (e.g. cancer)
## imagine in a study of 100 people, that 80 people have low risk and 20 have high risk for cancer
groupSizes <- c(lowRisk=80, highRisk=20)
cancerRisk <- c(lowRisk=.05, highRisk=.15)
cancerRiskByIndividual <- rep(cancerRisk, groupSizes)
sampleSize <- length(cancerRiskByIndividual)
rbinom(n=sampleSize, size = 1, prob = cancerRiskByIndividual)
## Note that this is often what we have with binary outcome
## data. There is some underlying idea of risk differences across
## individuals, but we only see a binary outcome, making it difficult
## to disentangle what's going on.
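## For instance, the data above are hard to distinguish from a single
## binomial whose p equals the group-weighted average risk (a sketch):
weighted.mean(cancerRisk, groupSizes)                       ## = 0.07
mean(rbinom(n=10000, size=1, prob=cancerRiskByIndividual))  ## also near 0.07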