## Tutorial 3: Probability Distributions and Control Structures ## Clinic on Meaningful Modeling of Epidemiological Data ## International Clinics on Infectious Disease Dynamics and Data Program ## African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA ## David M. Goehring 2004 ## Juliet R.C. Pulliam 2008, 2009 ## Steve Bellan 2010, 2012 ## ## Last updated by Juliet R.C. Pulliam, May 2019 ## Some Rights Reserved ## CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/) ###################################################################### ## By the end of this tutorial you should... ## * Feel comfortable working with probability distributions ## * Know how to use conditional structures ## * Be warned about the dangers of loop structures in R ## * Have tried your hands at synthesizing all the new material ###################################################################### ## SECTION A. Probability Distributions in R ###################################################################### #################### ## Probability distributions #################### ## So far the calculations we have been performing have been doable on ## a modern calculator, so my assertion that R will make things easier ## for you has not seen any confirmation in experience. ## In terms of solving probability problems, R does not provide many ## tools that extend too far beyond your intuition. If you want to ## know the probability of rolling two sixes on two dice you do not ## need R to tell you that 1/6 times 1/6 is 1/36. The two exceptions ## might be the functions factorial() and choose(), which only appear ## on a minority of modern calculators and which can solve a problem ## for you once you know to reach for them. These functions are ## sufficiently self-explanatory with their help files to render a ## treatment of them here unnecessary. A third exception is ## simulation, which you will try your hand at in the Benchmark ## Questions, and is an important component of mathematical modelling. ## Enter probability distributions. R is able to do in a few ## milliseconds what took hours of hunting in tables not 30 years ## ago. If you understand what a probability distribution is, R will ## provide four different utilities for (nearly) any such distribution, as ## follows: ## * Generating individual density values or a plot of the density function ## * Generating random numbers drawn from the distribution ## * Providing cumulative probability for points along the distribution ## * Providing points along the distribution corresponding to a cumulative probability ## Each of these is an important part of the toolkit for working with ## probability distributions. ## Making the functions meet your needs ## The different distributions have different corresponding functions ## in R, but aside from the particulars of their arguments, they ## consist of four functions of the same general form. ## We will use the binomial distribution (which is a ## discrete distribution) as an example for each of our four methods, ## and you should be able to extend that experience to any ## distribution you like. ## As we have seen hints of before, R can generate random numbers from ## any distribution. For discrete distributions, creating a histogram ## of a large number of these values can give a quick picture of the ## distribution. To generate random numbers from the binomial ## distribution, use rbinom(): hist(rbinom(10000, 10, 1/3), freq=FALSE, breaks=-1:10) ## The arguments of rbinom() tell it how many numbers you want to ## generate, how many trials are used to generate the distribution, ## and what the probability of a success per trial is, ## respectively. The extra arguments of hist() tell it that we want it ## to give proportions rather than counts and then how to divide up ## the data into bars. ## This is an empirical estimation of the distribution. The R function ## that calculates the probability density function of the binomial ## distribution is dbinom(), and that will give the precise ## theoretical values we may need. What is the probability of having 7 ## successes on 10 trials with a probability of success of 1/3? dbinom(7, 10, 1/3) ## Which is the equivalent of about 1.6% of the time. By extension, we ## can generate the complete distribution with vector input to ## dbinom(), as in dbinom(0:10, 10, 1/3) ## A useful application of rbinom() and dbinom() is to plot a curve of ## the density function over the histogram of observed values. For a ## discrete distribution such as this, the “curve” will be slightly ## misleading, but we will know what it means. Generating fewer random ## numbers this time, hist(rbinom(500,10,1/3), freq=FALSE, breaks=-1:10) curve(dbinom(ceiling(x), 10, 1/3), add=TRUE, col="red") ## The argument ceiling(x) resolves the fact that the x-values on the ## plot are continuous, not discrete. The correction is not needed for ## continuous distributions. ## Alternatively, you may use a paired histogram for comparing ## simulations to predicted densities for discrete distributions. The ## visualization used here is less elegant for discrete distributions ## but is more easily extended to continuous distributions. ## Using dbinom() to calculate cumulative probabilities for discrete ## distributions would be tedious, and we cannot calculate the ## cumulative probabilities using the equivalent density values from a ## continuous distribution (because it requires integration). The ## solution in R is to use a function, such as pbinom() for the ## binomial distribution, that calculates the cumulative density ## function (c.d.f.), that is, the function F (which here takes ## integer values). # F(x) = P(X <= x) ## As a quick example, what is the probability of getting 2 or less ## with our chosen binomial distribution? pbinom(2, 10, 1/3) ## The last function we need in our probability-distribution toolkit ## is qbinom(), the R function that calculates the inverse cumulative ## density function, F^-1 of the binomial distribution The c. d. f. and ## the inverse c. d. f. are related by # p = F(x) # x = F^-1(p) ## As a quick example, where is the 80% threshold in the cumulative ## probability distribution? qbinom(.8, 10, 1/3) ## The c. d. f. and inverse c. d. f. are particularly useful when ## performing statistical tests. ## I do not know why the R language chose such counter-intuitive ## naming practices for its probability distribution functions. But ## let’s have some mnemonics: # Function Prefix Mnemonic # # random numbers from dist. r random # density of distribution d density # cumulative probability p probability (is output) # inverse cumulative probability q qumulative (with an “inversed “c”) ## Feel free to remember them in any way you like, of course. #################### ## Probability distributions galore #################### ## You may not be familiar with all of these distributions, but this ## list should prove helpful for future reference. Keep in mind that ## some of these distributions are continuous and some are discrete – ## this will affect the allowed inputs of the functions. # Beta pbeta() qbeta() dbeta() rbeta() # Binomial pbinom() qbinom() dbinom() rbinom() # Cauchy pcauchy() qcauchy() dcauchy() rcauchy() # Chi-square pchisq() qchisq() dchisq() rchisq() # Exponential pexp() qexp() dexp() rexp() # F pf() qf() df() rf() # Gamma pgamma() qgamma() dgamma() rgamma() # Geometric pgeom() qgeom() dgeom() rgeom() # Hypergeometric phyper() qhyper() dhyper() rhyper() # Logistic plogis() qlogis() dlogis() rlogis() # Log Normal plnorm() qlnorm() dlnorm() rlnorm() # Negative Binomial pnbinom() qnbinom() dnbinom() rnbinom() # Normal pnorm() qnorm() dnorm() rnorm() # Poisson ppois() qpois() dpois() rpois() # Student's t pt() qt() dt() rt() # Uniform punif() qunif() dunif() runif() ###################################################################### ## SECTION B. Control structures ###################################################################### ## Now that you have gotten your proverbial feet wet by composing your ## own functions, you are ready to learn how to make your functions ## (or your scripts) more powerful by using control structures. The ## two main varieties of control structures are conditional structures ## and loop structures, but more important than what they are called ## is what they do... ## The semicolon, ";", allows you to put two or more commands ## together on a single line. The commands will be executed ## sequentially, just as if they had been written on separate lines. ## The best use of the semicolon is probably to combine lists of ## assignments into a single line, such as, x <- 8; y <- 4; z <- 3 c(x, y, z) ## Using the semicolon in this way will condense your code ## and make it look cleaner. In most situations, though, clarity is ## more important than compactness, so in general it is best not to ## string long lists of commands together on a single line using ## semicolons. Remember that the amount of whitespace (such as spaces, ## indentation, and empty lines) in your code will not affect its ## efficiency and that adding whitespace (and comments!) will often ## make your code easier to follow. ## Conditional structures ## “If you are under the age of 6, you cannot ride this ride.” ## “If you are available Friday night, please give me your number.” ## Needless to say, we are all familiar with conditional structures in ## casual speech, and conditional structures in R are not much ## different. ## The command ''if'' initiates a conditional structure. (Like function, ## it looks like a function, but is not a function.) What comes in ## parentheses after if is an expression of the condition that must be ## met for the remaining code to be executed. That is, it is a logical ## expression (meaning it should evaluate to TRUE or FALSE) which if ## TRUE tells R to execute the next line (or set) of code. ## A silly, but fully functional, example: age <- 19 if(age >= 6){ can.ride <- TRUE } can.ride ## Here we obviously know the result before we start, but in a ## (just-slightly-less silly) function, we would not: ride.test <- function(age) { can.ride <- FALSE if(age >= 6){ can.ride <- TRUE } return(can.ride) } ride.test(5) ride.test(65) ## Note that the curly brackets, {}, play the same role as for ## conditional statements as they do for the function construct: they ## allow multi-line procedures. They can also make your code easier to ## follow, even for single-line procedures. ## As an aside, the above function can (and perhaps should) be written ## more simply as ride.test <- function(age){ return(age >= 6) } ## The conditional structure becomes important when you want different ## functions or operations to be performed depending on some ## condition. This is something that you will often want. ## As an example, consider a function that measures central ## tendency. While you may want to avoid the influence of extremely ## large values by using the median, you may want to forgo it on very small ## samples, replacing the median with the mean. Let us say if we have fewer ## than 10 measurements in our sample we will resort to the mean: central.tend <- function(x){ my.answer <- median(x) if(length(x) < 10){my.answer <- mean(x)} return(my.answer) } central.tend(c(1,1,3)) ## This last example may have raised a red flag about efficiency – I ## determined the median of the data regardless of whether this value ## will ultimately be used in my output. Here, the cost is ## exceptionally low, because that calculation will only be irrelevant ## if there are fewer than 10 items in the vector, in which case it ## took trivially long to execute. But, you can see where there might ## be a problem. ## One solution is to create separate if statements for each ## condition, such as: # if(x < 0) #followed by what should happen if x < 0 # if(x = 0) #followed by what should happen if x = 0 # if((x > 0) & (x < 1)) # followed by what should happen if x is between 0 and 1, exclusive # if(x >= 1) # followed by what should happen if x >= 1 ## Note that defining conditions which are not mutually exclusive may ## make your code hard to follow and therefore render your results ## unpredictable without detailed examination. ## Another solution, particularly useful for binary (either-or) cases, ## is to use the command else, which follows the if procedure and ## gives what to do if your conditional expression evaluates to ## FALSE. If you are careful about your use of brackets, you can nest ## if and else commands to create more complicated conditional ## structures. ## Here are some additional options for writing the central tendancy ## function above. Using if and else, as described above: central.tend2 <- function(x){ if(length(x) >= 10){ my.answer <- median(x) }else{ my.answer <- mean(x) } return(my.answer) } # and using the ifelse() control structure, which is similar but more compact: central.tend3 <- function(x){ my.answer <- ifelse(length(x) >= 10, median(x), mean(x)) return(my.answer) } ## There are two potential big errors that may occur in implementing a ## conditional structure. One occurs if the conditional expression ## evaluates to NA or NaN – try this out yourself: if(Inf - Inf > 0){ x <- 5 } ## To make sure that your if statements don’t abort like this, you will ## use sometime want to use the is.na() function, which returns TRUE if the ## value being evaluated is NA or NaN. ## The second big mistake that you can make is assuming that R will act ## on the elements of a conditional expression that is a vector. Try this: if(c(T, F, F)){ 1 + 1 } ## As the warning that is generated indicates, R will ignore all but ## the first element of the vector. On reflection, this makes sense ## because R wouldn’t know what to do with a vector of logical values ## – the if statement has no inputs or outputs, just a conditional ## procedure. In other words, you shouldn’t expect the if statement to ## behave like a function because it isn’t one. #################### ## Loop structures #################### ## R employs a number of common loop structures familiar to those of ## you with some programming experience. However, those of you without ## programming experience may actually have an edge here. ## The reason is that, in general, in mathematical programming it is ## much more efficient to perform what traditionally has been done ## with programming loops using vector calculations. You have already ## seen one of the basic methods used to replace loop structures, the ## apply() function. ###################################################################### ###################################################################### ## This concludes Tutorial III. The benchmark questions should help ## solidify your understanding of the material. ## The probability-distribution section of this tutorial was aided by ## information available at ## http://www.stat.umn.edu/geyer/5102/examp/rlook.html. ## If you are unfamiliar with or rusty on your understanding of the ## binomial distribtion, you may also want to work through the introductory ## Binomial Distribution tutorial, available here: ## https://github.com/ICI3D/RTutorials/blob/master/binomialDistribution.R?raw=true ###################################################################### ## SECTION C. Benchmark Questions ###################################################################### #################### ## 1. The standard medication used to treat some disease is known to ## produce a severe side effect in 20% of the patients treated. A ## pharmaceutical company develops a new drug to treat the disease and ## tests it on 9 patients. There are no side effects observed in these ## patients. ## a. What is the probability of treating 9 patients without side ## effects using the standard medication? ## b. Generate a plot of the probability of observing no side effects ## in patients receiving treatment, as a function of the number of ## patients in the study, assuming the new drug causes the severe side ## effect with the same probability as the old. ## c. Add a horizontal red line at p = 0.025, visually depicting the ## number of patients minimally required for a study to show that the ## new medication produces side effects in fewer patients than the ## standard treatment (traditional statistical significance). #################### #################### ## 2. Re-do the suite of calculations and plots found in the Tutorial ## using the Poisson distribution (with lambda = 3) rather than the ## binomial distribution. (Hint: Make sure the histogram has an ## appropriate range with breaks.) ####################