## Tutorial 2: More on Vectors, Data Frames, and Functions
## Clinic on Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data Program
## African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA
## David M. Goehring 2004
## Juliet R.C. Pulliam 2008, 2009
## Steve Bellan 2010, 2012
##
## Last updated by Juliet R.C. Pulliam, May 2023
## Some Rights Reserved
## CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)

######################################################################
## SECTION A. Accessing Vector Elements
######################################################################

## By the end of this tutorial you should…
## * Be able to retrieve useful subsets of your data
## * Understand more about data frames
## * Know the methods and uses of logical values in R
## * Be able to generate and use factors
## * Know how to write your own generic functions

####################
## Beyond Numbers: Relational and Logical Operations in R
####################

## So far everything you have done in R has involved numbers or
## vectors of numbers. To properly exploit R’s complexity, you need to
## become familiar with relational and logical operations in R.

## Relational operations work just like numerical operations, in terms
## of how they are processed. Return for a moment to our first
## calculation from the last tutorial, an addition problem:

3 + 4

## The analogous calculation of a single relational operation is
## something like

5 > 4

## "Is 5 greater than 4?" Yes. And R tells you that this is a TRUE
## statement. Or,

1 + 1 < 1

## Makes sense, right?

## The greater-than, >, and less-than, <, symbols are
## straightforward. Similarly, R has greater-than-or-equal-to and
## less-than-or-equal-to symbols, >= and <=, respectively.

## Slightly less intuitive are the relational operators for equality,
## ==, and inequality, !=.
## Try

x <- 4
x == 1 + 3
y <- x != 4

## This last example demonstrates that variables can hold logical
## values. These relational operators also operate on logical values,
## as in,

y == FALSE

## Logical operations are operations that only make sense when
## performed on TRUE and FALSE values. These will likely be familiar
## to you, the central operations being AND, OR, and NOT.

## The operators used in R are standard: &, |, and !,
## respectively. Let’s see them in action:

!TRUE
to.be <- FALSE
to.be | !to.be
FALSE & (TRUE | FALSE)

## By combining logical and relational operations, we can make complex
## inquiries about values.

## Hands off the keyboard! Pick up a writing implement…
## a <- TRUE != (4 > 3)
## b <- a | 1 + 1 == 4 - 2
## c <- !FALSE & (log(Inf) == Inf + 1)

## What do a, b & c equal? Now execute the commands and compare
## your answers.

## Note that R has special values for infinity (Inf), not-a-number
## (NaN), and not-available (NA). These generally behave sensibly – a
## mathematical operation on not-a-number is obviously not a number
## and so is returned as NaN. Things are less simple when using
## logical and relational operators. Consider 4 != NaN. In one
## respect, the answer perhaps should be TRUE; that is, 4 definitely
## isn’t equal to not-a-number. But, striving for consistency, R
## returns NA, much as it would for a mathematical operation. Even
## worse is the situation

x <- NaN
x == NaN

## You might think that this is a reasonable test for whether x has a
## numerical value, but it won’t work, for the same reason mentioned
## above: the comparison returns NA rather than TRUE.
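
## A minimal sketch (not part of the original tutorial) of how
## comparisons and logical operators treat missing values. The variable
## name nan.check is made up for this illustration, so that x is left
## untouched.

nan.check <- NaN
nan.check == NaN      # NA, not TRUE
4 != nan.check        # NA as well
NA & FALSE            # FALSE: the result is FALSE whatever the missing value is
NA | TRUE             # TRUE, for the same reason
NA & TRUE             # NA: here the missing value decides the result
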
## In general, keep this trickiness in mind and remember there is a special
## function is.nan() for determining whether x is defined as not-a-number:

is.nan(x)

## There is also a special function is.finite() for determining whether
## x is a valid (finite) number:

is.finite(x)

## This is all getting thrown at you in very quick succession,
## especially if you do not have much experience programming in other
## languages. It is worth noting that information about these
## operations can be pulled up at any time by typing help("&") or
## help(">") or using the help() function with any of the other
## symbols used in these operations.

####################
## Vectors of Logical Values
####################

## As a shorthand, TRUE and FALSE can be entered as T and F. This
## allows for rapid entry of vectors of logical values, for example:

logical.vec <- c(T, T, F, T)
logical.vec

## Unfortunately, and rather inexplicably, T and F can be reassigned
## to any arbitrary values. This will render most code utterly
## unpredictable. So, never, never, never do this:

T <- 4 # REALLY BAD, BUT NO ERROR IS PRODUCED

## And, if you ever do something like this (though you shouldn’t!),
## make sure you quickly do this:

rm(T)

## which will set T (or F) back to its default logical value.

## To ensure your code is robust, we recommend always spelling out
## TRUE and FALSE for logical values.

## Relational or logical operations also act on vectors to produce
## vectors of logical values, as in,

x <- rnorm(10)
x < 0
y <- (x > -.5) & (x < .5)
!y

## This will be especially handy when we look at the concept of
## indexing, below.

####################
## Generating Sequences
####################

## There are many occasions in R when you need a patterned sequence of
## numbers. As mentioned in the previous tutorial, most counting can
## be accomplished by use of the seq() function.
## If you haven’t already done so, it is worth taking a look at the
## help-file on seq() because it has a few arguments that can make
## your life easier.

?seq

## For example, seq() can generate a vector of a certain length
## between certain endpoints by typing

x <- seq(0, 1, length.out = 20)

## giving you a vector of length 20 between 0 and 1; to confirm this, type

length(x)

## A very common need in R is to generate vectors with an interval of
## 1 between each element. R has a shorthand for this using the colon
## notation, as follows:

y <- 5:10

## This generates a vector that counts from 5 to 10, inclusive. Note that
## : is evaluated before most arithmetic operators, so 1:3 + 1 means
## (1:3) + 1.

## Don’t underestimate the value of the colon notation. Even for
## typing a vector of length 2, like "(1,2)" or "(2,1)," using the c
## function to generate the vector is pretty tedious (e.g., c(1,2)).
## These vectors can be generated in three quick characters by typing
## 1:2 or 2:1, respectively. I will also point your attention to the
## rep() function, for repeating sequences, which can also save time.

####################
## Indexing
####################

## R has an incredibly useful way of accessing items from a
## data set. Each item in a data set has its own index, or numbered
## location, in the object’s structure. Square brackets are used to
## extract an item or items from a data set, but it is crucial to
## understand that there are two completely distinct ways in which
## brackets are used to access items. I will consider the two methods
## for accessing a vector of length n in turn below.

## The first option: Logical
## Requirements: Logical vector of length n
## Use it for: Finding a subset of data based on a rule

## Logical indexing works as if you’ve asked your indexing vector the
## question, "Do you want this item?" for each of the items in the
## vector.

x <- 1:5
x[c(T, F, F, F, T)]

## If we combine this logical indexing with the relational and logical
## operators you learned above, we have an exceptionally powerful tool
## to retrieve data that meet any set of criteria.

y <- rnorm(10000)
hist(y[!( (y > -2) & (y < 0) )])

## I will give more insight below when I discuss indexing data
## frames. Stay tuned.

## In any operation in R, vectors will be automatically repeated until
## they reach the necessary length for the operation to make
## sense. For example, note the results of

1:6 + 1:2

## The same repetition holds for logical vectors. For this reason, you should
## be very cautious using operations on vectors that differ in length.

## The second option: Numerical
## Requirements: Value or vector of any length with values
##               (1 to n) OR (-n to -1)
## Use it for: Single item retrieval or shuffling, sorting, and repeating

## Accessing single items with brackets and a single index should be
## straightforward

x <- 3 * (0:5)
x[4]

## One tedious way of creating a new vector of values from a vector’s
## elements would be

c(x[2], x[3], x[4]) #TEDIOUS

## So R makes it much easier by allowing a vector of indices to
## generate a vector. Thereby, the command above becomes

x[2:4]

## There is nothing preventing you from accessing any element any
## number of times.

x[c(2, 2, 2, 5, 5, 5)]

## Additionally, R allows you to use negative indices, indicating
## which items you want to exclude, as in,

x[c(-1, -6)]

## This is fine and productive as long as you remember never to mix
## negative and positive indices – R will not know what you want it to
## do:

x[c(-1, 4)] # BADCODE

####################
## Sorting
####################

## In Tutorial 1, you were introduced to the sort() function, which is
## handy.

## Now that you have been introduced to indexing, you may have an
## inkling of how much more powerful the sorting functions of R can
## become.

## As an introduction, let’s say you have a 4-element vector,

my.vector <- 5:8

## Using numerical indexing, we can manually re-order this vector by
## calling each of its indices once in our preferred order, for
## example

my.vector[c(2, 3, 4, 1)]

## or, for a quick reversal

my.vector[4:1]

## Now, manually generating the vector of indices is not monumentally
## useful, which is where the function order() comes in. As a
## demonstration, imagine we have a vector of student names and a
## corresponding vector of student heights (in meters).

stud.names <- c("Dario", "Hloniphile", "Steve", "Innocent",
                "Abigail", "Cynthia")
stud.heights <- rnorm(6, 1.7, .12)

## What we definitely don’t want to do is to perform sort() on each of
## these vectors independently. This will eliminate the pairing of the
## name to the height. So how can we sort one vector and have the
## other vector align correctly? Try order() on the names,

order(stud.names)

## Note that it returns the indices in the right order, not the values
## themselves.

## From what you learned above, you know it is now an easy matter to
## sort both of our vectors, as follows,

stud.names[order(stud.names)] # same effect as sort()
stud.heights[order(stud.names)]

## And, obviously, sorting the names by the heights is exactly
## analogous, and it will make for a pretty plot

barplot(stud.heights[order(stud.heights)],
        names.arg = stud.names[order(stud.heights)],
        ylab = "Height (m)",
        main = "Student Heights")

## I have conveniently skipped over an important concept, because R
## handles it fairly intuitively, but I want to mention the
## terminology. The variable stud.names and the results of ls(), for
## example, are called vectors of strings, or character vectors. R
## handles them conveniently, so we don’t need to worry too much about
## them, but knowing the terminology will improve your understanding
## of R's in-line help documents.

######################################################################
## SECTION B.
## Data Frames, Redux
######################################################################

## Re-introduction to data frames

## Before we cover advanced topics of data frames, I wanted to point out the
## function data.frame(), which puts data together to form data
## frames. This is a key alternative to using the prefab data frames
## that you used in last week’s assignment.

## First I want to generate a vector of student class-years to
## correspond to the stud.names before creating a data frame (Freshmen
## as 1, Sophomores as 2, etc.).

stud.years <- c(4, 2, 2, 3, 1, 3)

## Now making a data frame is easy (each argument will just add more
## columns to the data), the only trick being that we have to assign
## the constructed data frame to a variable, as follows:

student.data <- data.frame(stud.names, stud.heights, stud.years)
student.data

## Voila! Your own data frame.

## You may want to have better column headings than the redundant variable
## names. There are various options to accomplish this. One option is
## to use the names() function with assignment notation. Let’s take a look:

names(student.data)

## What we see is a vector of strings corresponding to the current column
## names. We can change these by assigning replacement strings to the
## indexed values or by substituting our own vector of strings.

names(student.data) <- c("names", "heights", "years")
student.data

## If we think that "years" is ambiguous and might be confused with a student's
## age, we could rename just that column using numerical indexing, e.g.:

names(student.data)[3] <- "class.years"
student.data

## There is also a similar option, row.names(), to access and modify
## the row names.
## By default, the row names are a series of integers indicating
## the row number:

row.names(student.data)

## The assignments above are the first of many examples in R that seem
## to defy logic: it seems as though we’re assigning something to a
## function, which shouldn’t make sense because a function isn’t a
## variable. In fact, you can think of the functions names() and
## row.names() as "access functions" – they do not perform an action,
## but merely grant access to a property of the argument variable, and
## this is why we can make assignments of the sort seen above.

## Indexing data frames

## As with vectors, brackets and logical or numerical vectors are still
## the way to access data frames, but with a slight complication,
## because data frames are multidimensional. The solution (which also
## holds for matrices, etc.) is to separate the two dimensions with a
## comma. R treats the first entry as the row number and the second
## entry as the column number; thus, to access the second column of
## the fourth row, type

student.data[4, 2]

## Or the second column of the last three rows,

student.data[4:6, 2]

## There are two further complications.

## To access an entire row or entire column, leave the index blank, as
## in,

student.data[, 1] # FIRST COLUMN
student.data[3, ] # THIRD ROW
student.data[ , ] # ENTIRE FRAME, equivalent to "student.data"

## The only other complication is the ability to enter the names() or
## row.names() as indices:

student.data["4", ]
student.data[ , "class.years"]

## Putting all of this together, we can quickly generate subsets of
## our data.
## For example, we can create a data frame that includes
## only the students with height greater than the mean height:

tall.students <- student.data[student.data$heights >
                                mean(student.data$heights), ]
tall.students

## Or sort our data by various aspects:

student.data[order(student.data$class.years), ]

## Introduction to factors

## When performing statistical analyses, we often want R to look at a
## set of data and compare groups within the data to one another. For
## example, you have the data frame containing data on students in a
## course. There are columns of data representing the students' heights
## and class.years. How can you look at the means of height by class.year?

## Or, another example, you have sampled a number of rabbits and have
## a column for weights before a diet treatment and a column for
## weights after a diet treatment and a third column stating the diet
## treatment (e.g., "none," "grain diet," and "grapefruit diet"). How
## can you evaluate the change in weight as affected by diet?

## The answer to these questions is to use factors.

## Many of the data sets that come with R already have their data
## interpreted as factors. Let’s take a look at a data set with
## factors:

data(moths, package = "DAAG")
help(moths, package = "DAAG")
moths

## (Note that you may have to install the DAAG package in order to
## load these data. Do you remember how to do this? If not, ask a neighbor
## for help!)

## The help file for the moths data set tells us that our last column,
## habitat, is a factor. What does this mean?

## See what happens when we pull up this column by itself:

moths$habitat

## It looks pretty standard, at first, but then we notice that it is
## more than just a list of habitat names – it has another component,
## levels.

## Factors have levels. Levels are editable, independent of the data
## itself. To see the levels alone, you can type

levels(moths$habitat)

## When called that way, it has the identity of a vector of strings.
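
## As a rough sketch of how a factor groups data, here is the rabbit
## example from above with made-up numbers (the names diet and
## weight.change are hypothetical, not taken from any data set).
## tapply() applies a function to one vector separately for each level
## of a factor.

diet <- factor(c("none", "grain diet", "grapefruit diet",
                 "none", "grain diet", "grapefruit diet"))
weight.change <- c(0.1, 0.4, -0.3, -0.1, 0.6, -0.5)  # invented values
levels(diet)                        # the three diet groups
tapply(weight.change, diet, mean)   # mean change in weight, by diet
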

## The levels() function behaves just like the names() and row.names()
## functions (i.e., weird), and you can make assignments or
## reassignments to the levels - e.g.,

levels(moths$habitat)[1] <- "NEBank"

## Factors come in exceptionally handy when performing statistical
## tests, but the various plot functions can give you an idea of uses
## of a factored variable, such as,

boxplot(moths$meters ~ moths$habitat)

## The tilde, ~, used in a number of contexts in R, can generally be
## read as "by," which gives a general explanation of its use here –
## visualizing transect length ("meters") by habitat type ("habitat").

## Making a factor

## Now that you know how to employ a factored variable,
## the next step is to know how to make a factor out of a
## variable. The general syntax is:

x <- factor(c("A", "B", "A", "A", "A", "B"))

## For vectors of strings, like that one, the results are usually fine
## as is.

## But let’s go back to our student.data data frame. We listed
## class.years as a number 1 through 4, but these are discrete
## categories with well-defined names. A more elegant solution is to
## factor the column of the data frame, much like is seen with moths.

student.data$class.years <- factor(student.data$class.years)
levels(student.data$class.years)

## Not ideal, but we can use reassignment to change the names of the
## years.

levels(student.data$class.years) <- c("Freshman", "Sophomore",
                                      "Junior", "Senior")

## With satisfying (preliminary) results available with:

student.data
boxplot(student.data$heights ~ student.data$class.years)

####################
## Applying functions to data frames
####################

## Many functions you might like to apply to your data frames will
## produce unpredictable results.
## A few work nicely:

nyc.air <- airquality[, c("Wind", "Temp")]
nyc.air
summary(nyc.air)

## But others that you might try do not work as you want:

sum(nyc.air)  # sums wind and temperature together
mean(nyc.air) # returns NA with a warning

## One solution to these troubles is to use the function apply(),
## which performs the function named in the third argument on the
## first argument by the index specified by the second argument
## (in this case, 2 means by column).

apply(nyc.air, 2, sum)
apply(nyc.air, 2, mean)
apply(nyc.air, 2, var)

######################################################################
## SECTION C. Composing your own functions
######################################################################

## A more advanced (and very important) topic

## So far in R we have used the functions that come with R and its various
## packages; however, since you will often want to perform the same series
## of actions on different objects, R makes it relatively straightforward
## to compose your own generic functions and store them in R’s memory.

## Before you start writing a function you need to have your mind set
## on three things:

## * What you want to give the function as input
## * What you want the function to do
## * What you want the function to give as output

####################
## A trivial example
####################

## Imagine you need to repeatedly transform sets of
## data, but your transformation is "non-standard." For this example,
## I’m imagining that you want the natural logarithm of the data, plus
## one. We know how to perform these operations on a number we have
## stored in a variable, no problem,

x <- 1:10
log(x) + 1

## But what we would really like is a named function which will do
## this in one step, log.plus.one().

## What we will do is make an assignment to log.plus.one, but rather
## than assigning a value (or vector, etc.), we assign a function
## which we define on the spot.
## We use the command function, which
## looks like a function but is not a function. Instead, it’s
## a control element of the R language – it isn’t executed like a
## function, but rather it informs R to treat the code around it in a
## special way.

## The command function has an interesting syntax. Its arguments are
## the names of variables which will serve as the arguments for your
## function (the first of three bullets, above). Then, after this
## parenthetical bit, comes the meat of the function – what you want
## it to do and what you want it to give back to you (the last two
## bullets, above). In our log.plus.one() case, what we want it to do
## and what we want it to give back happen to be the same thing,
## therefore we can define it very simply, as follows,

log.plus.one <- function(y) log(y) + 1

## Cool! Let’s test it out:

log.plus.one(x)

## It behaves just like we would want it to.

####################
## A separate little world
####################

## Wait a second. I used y in my function definition but called the
## function with my variable x as the argument. What happened to y?

y

## The variable is untouched by the function.

## In order to keep functions fully generic, when you give the
## function command, R generates a separate variable space for the
## function’s arguments and for any variables assigned within the
## function. Assignments made there do not change your R workspace.
## This means that the names of your function arguments (and any
## variables assigned within your function) can be anything you find
## convenient – they will not overwrite your active variables.

####################
## Longer functions
####################

## Either because the function is too complex to be
## executed on a single line or because you want to make the
## function’s methods clearer, you will often generate functions
## longer than one line. For this purpose, R introduces another type
## of bracket, curly brackets, { }. These are control brackets and
## indicate the contents should be treated as a unit.
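
## A minimal sketch of a multi-line function (range.width is a made-up
## name for this illustration). The value of the last expression
## evaluated inside the curly brackets is what the function gives back.

range.width <- function(v) {
  lowest <- min(v)     # smallest element of the input vector
  highest <- max(v)    # largest element of the input vector
  highest - lowest     # the difference is the output
}
range.width(c(3, 9, 4))   # 6
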

## As a final example,

(function(x, y){z <- x^2 + y^2
  x + y + z })(0:7, 1)

## Note that the function is written on two lines, but this isn’t an
## issue because of the brackets. Note also that this function is
## anonymous. It is never assigned, but used in place.

## A common tendency when first learning to program is to write code
## in a condensed form (such as the anonymous inline function defined
## above) so that it is difficult to follow what is going on when you
## return to the code later on (or when your instructor is helping you
## find a bug that is keeping your code from working correctly). While
## writing code in this way takes a certain amount of cleverness and
## demonstrates that you have understood the concepts, it is better
## practice to write out your code so that it is easy to follow. This
## includes using plenty of whitespace, to make your code easy to
## read, and thoroughly commenting your commands as you go.

## The example above is therefore better written as follows:

## SUM.VALS.PLUS.SUM.SQS() – function that takes two numerical values
## as input and returns the sum of the values plus the sum of their
## squares:

sum.vals.plus.sum.sqs <- function(x, y) {
  z <- x^2 + y^2       # define z as the sum of the values’ squares
  return(x + y + z)    # add the values to the sum of their squares
                       # and return the result as output
}

## Perform the above function with x equal to the numbers from 0 to
## 7 and y equal to 1:

sum.vals.plus.sum.sqs(0:7, 1)

######################################################################
######################################################################

## This concludes Tutorial 2. Because there are some advanced topics
## here that require practice to get your head around, you should
## make sure to work through the benchmark questions before you
## move on to Tutorial 3.
##
## Question 1:
##
## R sometimes uses confusingly similar names for distinct concepts.
## Define for yourself: names, factors, levels.
## When would you use each?
##
## Question 2:
##
## You need a subset of the mtcars data set that has only every other
## row of data included.
## a. Do this with numerical indexing.
## b. Do this with logical indexing.
##
## Question 3:
##
## Write a function, jumble(), that takes a vector as an argument and
## returns a vector with the original elements in random order. (Note: R
## does have a built-in function that can do this, but the point of this
## question is rather for you to build a new function using the tools that
## have been introduced in the tutorials so far.)