## Fun with one of my favorite distributions
## Clinic on Dynamical Approaches to Infectious Disease Data
## International Clinics on Infectious Disease Dynamics and Data Program
## University of Florida, Gainesveille, FL, USA
##
## Jim Scott, 2012
## Some Rights Reserved
## CC BY-NC 3.0 (http://creativecommons.org/licenses/by-nc/3.0/)
################################################################################
##
## By the end of this tutorial you should:
##
## * understand the terms: sample space, discrete, random variable, probability
## distribution
## * know the characterisitics of a binomial process
## * know the binomial formula and how to use it
## * be able to utilize the functions dbinom, pbinom, and rbinom
## * be able to plot a binomial distribution in R
## * visualize how the parameters of the binomial distribution affect it.
## * realize that this text editor has no spell check capability and forgive me
## for any typos I may have made!
# The binomial probability model is a discrete probability model that is
# commonly used in many applications, including modeling disease transmission.
# But, what IS it, exactly?
# The "discrete" part means that it produces random variables that have a finite
# number of possible values (e.g. integers). The "probability model" part means
# that it provides us with two important pieces of infomation: the possible
# outcomes that may occur and the probabilities with which each outcome occurs.
# There are a number of common discrete probability models that exist, but the
# binomial model is used to model the number of successes that occur in fixed
# number of independent trials. It's a good model for a process that: 1) has a
# fixed number of independent trials 2) has outcomes that can be categorized
# into either "success" or "failure" 3) has constant probability for "success"
# (i.e. the probability of a "success" doesn't change.
# Coin flipping is a common example of a binomial process. Suppose I flip a
# coin 10 times. One flip represents one trial. The number of trials is fixed
# (n=10). There are only two outcomes. The probability of a success (e.g.
# getting "tails") remains constant. The number of tails in 10 trials is a
# binomially distributed random variable.
# What's a random variable? Well, before we talk about that, lets define one
# other term - sample space. The sample space of a random phenomenon is the set
# of all possible outcomes that could occur. Note that an outcome need not be
# numeric. For example, consider tossing a coin one time. The sample space is
# {heads, tails}. If you were to
# toss the coin 2 times the sample space would consist of:
# S = {heads heads, heads tails, tails heads, tails tails}
# and we would say the size of the sample space is 4.
# A random variable maps each outcome in the sample space to a numeric value.
# For example, if I defined the random variable X as the number of tails that
# occurs in two flips of a coin, I would observe the following:
#
# outcome X
# heads heads 0
# heads tails 1
# tails heads 1
# tails tails 2
#
# I could describe the distribution of X in the following manner:
#
# outcome probability
# X = 0 0.25
# X = 1 0.50
# X = 2 0.25
# Also, notice that the random variable X is binomial because we have: 1) fixed
# number of independent trials (n=2), 2) either a "success" (tails) or failure
# occurs (not tails), and 3) the probability of a "success" (getting tails)
# remains constant. Here the probability of a success in any one trial is p=0.5
# If we suspect that a random variable is binomial, we can use the binomial
# formula to calculate the probability that X equals some specific value. For
# example, if we didn't know that the probabiliy of getting 0 tails on 2 flips
# is 0.25, we could have used the following formula to obtain it:
#
# Binomial formula:
# P(X = x) = N!/[(x!)(N-x)!] * p^x * (1-p)^(N-x)
#
# You might be familiar with seeing N!/[(x!)(N-x)!] written as "N choose x"
#
# In the above equation:
# x = number of successes
# N = number of trials
# p = the probability of a success
# In our example, N = 2 and p = 0.5. To calculate the probability that X = 0
# use:
# P(X = 0) = 2!/[(0!)(2-0)!] * 0.5^0 * (1-0.5)^(2-0)
# = 1 * 1 * 0.25 = 0.25 (recall that 0! = 1)
# This has all been a preamble toward getting you prepared to use the binomial
# distribution in R. To obtain the calculation above in R, you need only
# execute the following command:
dbinom(0,2,0.5)
# The syntax is pbinom(x, N, p)
# That one isn't too hard to calculate in your head, but suppose you wanted to
# know the probability of getting exactly 20 tails if you flipped a coin 50
# times. Instead of calculating it with the formula (or a calculator), just use
# R. Here, N=50, p is still 0.5. Run:
dbinom(20, 50, 0.5)
# If you'd like to determine the entire distribution when N=50 and p=0.5 you
# could type:
dbinom(0:50, 50, 0.5)
# Nice, but not that visually stimulating. Let's plot it!
barplot(dbinom(0:50, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, N=50, p=1/2",xlab="Number of Tails",
col="light blue",ylim=c(0,.12))
# The binomial distribution has 2 parameters, N, the number of fixed trials, and
# p, the probability of a success. The distribution is completely determined by
# these values, just like a normal distribution is completely determined by its
# paramters, the mean and the standard deviation. To see how the distriution
# changes as p changes, try highlighting and running the following code (here
# we're holding N constant at 50):
binom <- function(p) {
barplot(dbinom(0:50, size=50, prob=p),names.arg=0:50,ylab="Probability",
main=paste0("Binomial Distribution, N=50, p=",p),xlab="Number of Tails",
col="light blue", ylim=c(0,0.3))
Sys.sleep(0.1)
}
ignore <- sapply((1:19)/20, binom)
# If you didn't see all the plots, try clicking the back arrow a few times in
# the plot window to see all the plots that R produced.
# To get an idea of how the binomial distribution can change when N varies try
# running the following code. Here p is set to 1/2, after you've run the code
# once, try changing p to a different value.
binom <- function(N) {
barplot(dbinom(0:N, size=N, prob=1/2),names.arg=0:N,ylab="Probability",
main=paste0("Binomial Distribution, N=",N,", p=1/2"),xlab="Number of Tails",
col="light blue", ylim=c(0,0.5))
Sys.sleep(0.1)
}
ignore <- sapply(1:20, binom)
# Again, if you didn't see all the plots, try clicking the back arrow a few
# times in the plot window to see all the plots that R produced.
# The binomial formula provides you with just the area of one bar. To determine
# the area that is less than or equal to a specific value, for example, the
# probability that we get 20 or fewer tails in 50 flips, you could type:
sum(dbinom(0:20, 50, 0.5))
# or, equivalently
pbinom(20,50,0.5)
# The difference between these two commands is that dbinom sums up the areas of
# each bar you specify whereas pbinom sums up the bars that are less than or
# equal to the value of the first number in the parentheses
# To create a visual of this you could execute the following:
ints=0:50
barplot(dbinom(ints, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, N=50, p=1/2",xlab="Number of Tails",
col=ifelse(ints<=20,"blue","light blue"))
# The area of the dark blue bars is:
pbinom(20,50,0.5)
# You can also use R to simulate binomially distributed random variables. For
# example, suppose N = 50 and p=1/2. You'd expect to get 25 successes, but due
# to random chance you won't always get 25. Sometimes you'll get a few more,
# sometimes a few less. To simulate the outcome of 50 coin tosses, just type:
rbinom(1,50,1/2)
# If you'd like to similate the results of 1000 people flipping a coin 50 times
# type:
mysims <- rbinom(1000,50,1/2)
mysims
# the output isn't so useful in that form...try plotting it instead:
hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1),ylim=c(0,0.2),col="grey",
ylab="Number of occurences",xlab="Number of tails in 50 flips")
# Your distribution should look similar to the following, theoretical binomial
# distribution:
barplot(dbinom(0:50, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, N=50, p=1/2",xlab="Number of Tails"
,ylim=c(0,0.2),col="light blue")
# click the back arrow in the Plots window to compare and contrast. You should
# also
# notice that this particular binomial distribution is 'bell shaped' like the
# normal
# distribution. As a rule of thumb, whenever N*p > 10 and N*(1-p) > 10, you can
# use a normal distribution to approximate the binomial distribution.
# Use mean = N*p and variance = N*p*(1-p). In this case, mean = 50*0.5 = 25 and
# variance = 50*0.5*0.5 = 12.5. The SD is sqrt(12.5) or 3.54. To compare the
# approximated normal distribution to your simulated binomial data, you could
# type:
hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1),ylim=c(0,0.2),col="grey",
ylab="Number of occurences",xlab="Number of tails in 50 flips")
curve(dnorm(x,mean=25,sd=3.54), add=TRUE, col="blue",lwd=3)
# That concludes the tutorial!
# Refs:
# Introduction to Probability with R, Baclawski, Chapman & Hall, 2008
# Introduction to the Practice of Statistics ed 7, Moore, McCabe, & Craig, W.H.
# Freeman, 2012
# To test your skills try the following problems.
#
#
#
# Roulette - on a roulette wheel, there are 18 red numbers, 18 black numbers,
# and 2 green numbers. The numbers 1 to 36 appear on the wheel along with 0 and # 00.
# The probability that the roulette ball lands on any one number is the same
# (i.e. 1/38). Suppose you always bet on red, and you play 20 times.
#
#
# a) Determine the probability of each outcome (eg, winning 0 times, 1 time,
# etc)
# b) Determine the probability of winning at least 10 times
# C) Produce a plot of the probability distribution of the number of 'wins'
# d) Simulate the number of wins when you play roulette 20 times for 1000
# replications
# e) Make a histogram of your results from part d)
# f) In how many of your replications did you win at least 10 times?