## Fun with one of my favorite distributions
## Clinic on Dynamical Approaches to Infectious Disease Data
## International Clinics on Infectious Disease Dynamics and Data Program
## University of Florida, Gainesville, FL, USA
##
## Jim Scott, 2012
## Some Rights Reserved
## CC BY-NC 3.0 (http://creativecommons.org/licenses/by-nc/3.0/)
################################################################################
##
## By the end of this tutorial you should:
##
## * understand the terms: sample space, discrete, random variable, probability
##   distribution
## * know the characteristics of a binomial process
## * know the binomial formula and how to use it
## * be able to utilize the functions dbinom, pbinom, and rbinom
## * be able to plot a binomial distribution in R
## * visualize how the parameters of the binomial distribution affect it
## * realize that this text editor has no spell check capability and forgive me
##   for any typos I may have made!

# The binomial probability model is a discrete probability model that is
# commonly used in many applications, including modeling disease transmission.
# But, what IS it, exactly?

# The "discrete" part means that it produces random variables that have a
# countable number of possible values (e.g. integers). The "probability model"
# part means that it provides us with two important pieces of information: the
# possible outcomes that may occur and the probabilities with which each
# outcome occurs.

# There are a number of common discrete probability models, but the binomial
# model is used to model the number of successes that occur in a fixed number
# of independent trials. It's a good model for a process that: 1) has a fixed
# number of independent trials, 2) has outcomes that can be categorized as
# either "success" or "failure", and 3) has a constant probability of "success"
# (i.e. the probability of a "success" doesn't change).

# Coin flipping is a common example of a binomial process. Suppose I flip a
# coin 10 times. One flip represents one trial. The number of trials is fixed
# (n=10). There are only two outcomes. The probability of a success (e.g.
# getting "tails") remains constant. The number of tails in 10 trials is a
# binomially distributed random variable.

# What's a random variable? Well, before we talk about that, let's define one
# other term - sample space. The sample space of a random phenomenon is the set
# of all possible outcomes that could occur. Note that an outcome need not be
# numeric. For example, consider tossing a coin one time. The sample space is
# {heads, tails}. If you were to toss the coin 2 times the sample space would
# consist of:
#
# S = {heads heads, heads tails, tails heads, tails tails}
#
# and we would say the size of the sample space is 4.

# A random variable maps each outcome in the sample space to a numeric value.
# For example, if I defined the random variable X as the number of tails that
# occurs in two flips of a coin, I would observe the following:
#
#   outcome        X
#   heads heads    0
#   heads tails    1
#   tails heads    1
#   tails tails    2
#
# I could describe the distribution of X in the following manner:
#
#   outcome    probability
#   X = 0      0.25
#   X = 1      0.50
#   X = 2      0.25

# Also, notice that the random variable X is binomial because we have: 1) a
# fixed number of independent trials (n=2), 2) either a "success" (tails) or a
# failure (not tails) on each trial, and 3) the probability of a "success"
# (getting tails) remains constant. Here the probability of a success in any
# one trial is p=0.5.
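# One quick way to convince yourself of this table is to enumerate the two-flip
# sample space in R and tabulate X directly. This is just a small optional
# sketch using base R's expand.grid; any equivalent enumeration works as well.

flip1 <- c("heads", "tails")
flip2 <- c("heads", "tails")
sample_space <- expand.grid(flip1 = flip1, flip2 = flip2,
                            stringsAsFactors = FALSE)   # all 4 outcomes
X <- rowSums(sample_space == "tails")                   # tails in each outcome
table(X) / nrow(sample_space)                           # P(X=0), P(X=1), P(X=2)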
# If we suspect that a random variable is binomial, we can use the binomial
# formula to calculate the probability that X equals some specific value. For
# example, if we didn't know that the probability of getting 0 tails on 2 flips
# is 0.25, we could have used the following formula to obtain it:
#
# Binomial formula:
# P(X = x) = N!/[(x!)(N-x)!] * p^x * (1-p)^(N-x)
#
# You might be familiar with seeing N!/[(x!)(N-x)!] written as "N choose x"
#
# In the above equation:
#   x = number of successes
#   N = number of trials
#   p = the probability of a success

# In our example, N = 2 and p = 0.5. To calculate the probability that X = 0
# use:
# P(X = 0) = 2!/[(0!)(2-0)!] * 0.5^0 * (1-0.5)^(2-0)
#          = 1 * 1 * 0.25 = 0.25   (recall that 0! = 1)

# This has all been a preamble toward getting you prepared to use the binomial
# distribution in R. To obtain the calculation above in R, you need only
# execute the following command:

dbinom(0, 2, 0.5)

# The syntax is dbinom(x, N, p)

# That one isn't too hard to calculate in your head, but suppose you wanted to
# know the probability of getting exactly 20 tails if you flipped a coin 50
# times. Instead of calculating it with the formula (or a calculator), just use
# R. Here, N=50, and p is still 0.5. Run:

dbinom(20, 50, 0.5)

# If you'd like to determine the entire distribution when N=50 and p=0.5 you
# could type:

dbinom(0:50, 50, 0.5)

# Nice, but not that visually stimulating. Let's plot it!

barplot(dbinom(0:50, size=50, prob=1/2), names.arg=0:50, ylab="Probability",
        main="Binomial Distribution, N=50, p=1/2", xlab="Number of Tails",
        col="light blue", ylim=c(0,0.12))

# The binomial distribution has 2 parameters: N, the fixed number of trials,
# and p, the probability of a success. The distribution is completely
# determined by these values, just like a normal distribution is completely
# determined by its parameters, the mean and the standard deviation. To see how
# the distribution changes as p changes, try highlighting and running the
# following code (here we're holding N constant at 50):

binom <- function(p) {
  barplot(dbinom(0:50, size=50, prob=p), names.arg=0:50, ylab="Probability",
          main=paste0("Binomial Distribution, N=50, p=",p),
          xlab="Number of Tails", col="light blue", ylim=c(0,0.3))
  Sys.sleep(0.1)
}
ignore <- sapply((1:19)/20, binom)

# If you didn't see all the plots, try clicking the back arrow a few times in
# the plot window to see all the plots that R produced.

# To get an idea of how the binomial distribution can change when N varies, try
# running the following code. Here p is set to 1/2; after you've run the code
# once, try changing p to a different value.

binom <- function(N) {
  barplot(dbinom(0:N, size=N, prob=1/2), names.arg=0:N, ylab="Probability",
          main=paste0("Binomial Distribution, N=",N,", p=1/2"),
          xlab="Number of Tails", col="light blue", ylim=c(0,0.5))
  Sys.sleep(0.1)
}
ignore <- sapply(1:20, binom)

# Again, if you didn't see all the plots, try clicking the back arrow a few
# times in the plot window to see all the plots that R produced.
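# A couple of quick, optional sanity checks you might find reassuring before
# moving on: dbinom agrees with the binomial formula written out with choose(),
# and the bars of a binomial distribution always sum to 1.

choose(50, 20) * 0.5^20 * (1 - 0.5)^(50 - 20)   # matches dbinom(20, 50, 0.5)
sum(dbinom(0:50, 50, 0.5))                      # probabilities over 0:50 sum to 1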
# The binomial formula provides you with just the area of one bar. To determine
# the area that is less than or equal to a specific value, for example, the
# probability that we get 20 or fewer tails in 50 flips, you could type:

sum(dbinom(0:20, 50, 0.5))

# or, equivalently

pbinom(20, 50, 0.5)

# The difference between these two commands is that the first sums the areas of
# exactly the bars you specify, whereas pbinom sums all the bars that are less
# than or equal to the value of the first number in the parentheses.

# To create a visual of this you could execute the following:

ints <- 0:50
barplot(dbinom(ints, size=50, prob=1/2), names.arg=0:50, ylab="Probability",
        main="Binomial Distribution, N=50, p=1/2", xlab="Number of Tails",
        col=ifelse(ints<=20, "blue", "light blue"))

# The area of the blue bars is:

pbinom(20, 50, 0.5)

# You can also use R to simulate binomially distributed random variables. For
# example, suppose N = 50 and p = 1/2. You'd expect to get 25 successes, but
# due to random chance you won't always get 25. Sometimes you'll get a few
# more, sometimes a few less. To simulate the outcome of 50 coin tosses, just
# type:

rbinom(1, 50, 1/2)

# If you'd like to simulate the results of 1000 people flipping a coin 50
# times, type:

mysims <- rbinom(1000, 50, 1/2)
mysims

# The output isn't so useful in that form... try plotting it instead:

hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1), ylim=c(0,0.2), col="grey",
     ylab="Proportion of simulations", xlab="Number of tails in 50 flips")

# Your distribution should look similar to the following theoretical binomial
# distribution:

barplot(dbinom(0:50, size=50, prob=1/2), names.arg=0:50, ylab="Probability",
        main="Binomial Distribution, N=50, p=1/2", xlab="Number of Tails",
        ylim=c(0,0.2), col="light blue")

# Click the back arrow in the Plots window to compare and contrast. You should
# also notice that this particular binomial distribution is 'bell shaped' like
# the normal distribution. As a rule of thumb, whenever N*p > 10 and
# N*(1-p) > 10, you can use a normal distribution to approximate the binomial
# distribution. Use mean = N*p and variance = N*p*(1-p). In this case,
# mean = 50*0.5 = 25 and variance = 50*0.5*0.5 = 12.5. The SD is sqrt(12.5), or
# about 3.54. To compare the approximating normal distribution to your
# simulated binomial data, you could type:

hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1), ylim=c(0,0.2), col="grey",
     ylab="Proportion of simulations", xlab="Number of tails in 50 flips")
curve(dnorm(x, mean=25, sd=3.54), add=TRUE, col="blue", lwd=3)

# That concludes the tutorial!

# Refs:
# Introduction to Probability with R, Baclawski, Chapman & Hall, 2008
# Introduction to the Practice of Statistics, 7th ed., Moore, McCabe, & Craig,
#   W.H. Freeman, 2012

# To test your skills, try the following problems. If you get stuck, one
# possible starting sketch for parts a), b), and d) appears after the problem
# list.
#
# Roulette - on a roulette wheel, there are 18 red numbers, 18 black numbers,
# and 2 green numbers. The numbers 1 to 36 appear on the wheel along with 0 and
# 00. The probability that the roulette ball lands on any one number is the
# same (i.e. 1/38). Suppose you always bet on red, and you play 20 times.
#
# a) Determine the probability of each outcome (e.g. winning 0 times, 1 time,
#    etc.)
# b) Determine the probability of winning at least 10 times
# c) Produce a plot of the probability distribution of the number of 'wins'
# d) Simulate the number of wins when you play roulette 20 times for 1000
#    replications
# e) Make a histogram of your results from part d)
# f) In how many of your replications did you win at least 10 times?
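# One possible starting sketch (just a sketch - the object names are my own
# choices, and here I take the win probability on each play to be 18/38, i.e.
# 18 red numbers out of 38 total):

p_red <- 18/38                              # probability a single red bet wins
dbinom(0:20, 20, p_red)                     # part a): P(0 wins), ..., P(20 wins)
1 - pbinom(9, 20, p_red)                    # part b): P(at least 10 wins)
roulette_sims <- rbinom(1000, 20, p_red)    # part d): 1000 replications of 20 plays

# Parts c), e), and f) are left for you - barplot(), hist(), and sum() from
# earlier in the tutorial are all you need.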