#' @title Tutorial 2: More on Vectors, Data Frames, and Functions
#'
#' @author David M. Goehring 2004
#' @author Juliet R.C. Pulliam 2008, 2009
#' @author Steve Bellan 2010, 2012
#'
#' Clinic on Meaningful Modeling of Epidemiological Data
#' International Clinics on Infectious Disease Dynamics and Data Program
#' African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA
#'
#' @license Some Rights Reserved,
#' CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
#' Last updated by Carl A. B. Pearson, June 2024
#'
#' @description
#' By the end of this tutorial you should:
#' * Be able to retrieve useful subsets of your data
#' * Understand more about data frames
#' * Know the methods and uses of logical values in R
#' * Be able to generate and use factors
#' * Know how to write your own generic functions
#'
#' @section A. Accessing Vector Elements
#'
#' @subsection Beyond Numbers: Relational and Logical Operations in R
#'
#' So far everything you have done in R has involved numbers or
#' vectors of numbers. To properly exploit R’s complexity, you need to
#' become familiar with relational and logical operations in R.
#'
#' Relational operations work just like numerical operations, in terms
#' of how they are processed. Return for a moment to our first
#' calculation from the last tutorial, an addition problem:
3 + 4
#' The analogous calculation of a single relational operation is
#' something like
5 > 4
#' "Is 5 greater than 4?" Yes. And R tells you that this is a TRUE
#' statement. Or,
1 + 1 < 1
#' Makes sense, right?
#'
#' The greater-than, >, and less-than, <, symbols are
#' straightforward. Similarly, R has greater-than-or-equal-to and
#' less-than-or-equal-to symbols, >= and <=, respectively.
#'
#' Slightly less intuitive are the relational operators for equality,
#' ==, and inequality, !=. Try
x <- 4
x == 1 + 3
y <- x != 4
#' This last example demonstrates that variables can hold logical
#' values.
#' These relational operators also operate on logical values,
#' as in,
y == FALSE
#' Logical operations are operations that only make sense when
#' performed on TRUE and FALSE values. These will likely be familiar
#' to you, the central operations being AND, OR, and NOT.
#'
#' The operators used in R are standard: &, |, and !,
#' respectively. Let’s see them in action:
!TRUE
to_be <- FALSE
to_be | !to_be
FALSE & (TRUE | FALSE)
#' By combining logical and relational operations, we can make complex
#' inquiries about values.
#'
#' Hands off the keyboard! Pick up a writing implement…
#'
#' a <- TRUE != (4 > 3)
#' b <- a | 1 + 1 == 4 - 2
#' c <- !FALSE & (log(Inf) == Inf + 1)
#'
#' What do a, b & c equal? Now execute the commands and compare
#' your answers.
#' Note that R has special values for infinity (Inf), not-a-number
#' (NaN), and not-available (NA). These generally behave sensibly – a
#' mathematical operation on not-a-number is obviously not a number and so
#' is returned as NaN. Things are less simple when using logical and
#' relational operators. Consider 4 != NaN. In one respect, the answer
#' perhaps should be TRUE; that is, 4 definitely isn’t equal to
#' not-a-number. But, striving for consistency, R returns NA, much as
#' it would for a mathematical operation. Even worse is the situation
x <- NaN
x == NaN
#' You might think that this is a reasonable test for whether x has a
#' numerical value, but it won’t work, for the same reason mentioned above:
#' the comparison returns NA rather than TRUE.
#' In general, keep this trickiness in mind and remember there is a special
#' function is.nan() for determining whether x is defined as not-a-number:
is.nan(x)
#' There is also a special function is.finite() for determining whether
#' x is a valid (finite) number:
is.finite(x)
#' This is all getting thrown at you in very quick succession,
#' especially if you do not have much experience programming in other
#' languages.
#' It is worth noting that information about these
#' operations can be pulled up at any time by typing help("&") or
#' help(">") or using the help() function with any of the other
#' symbols used in these operations.
#'
#' @subsection Vectors of Logical Values
#'
#' As a shorthand, TRUE and FALSE can be entered as T and F. This
#' allows for rapid entry of vectors of logical values, for example:
logical_vec <- c(TRUE, TRUE, FALSE, TRUE)
logical_vec
#' Unfortunately, and rather inexplicably, T and F can be reassigned
#' to any arbitrary values. This will render most code utterly
#' unpredictable. So, never, never, never do this:
T <- 4 # REALLY BAD, BUT NO ERROR IS PRODUCED
#' And, if you ever do something like this (though you shouldn’t!),
#' make sure you quickly do this:
rm(T)
## which will set T (or F) back to its default logical value.
#' Aside: to ensure your code is robust, we recommend always spelling out
#' TRUE and FALSE for logical values. If you have a library like `lintr`
#' installed, you can use it to check your code:
#' `lintr::lint("your_code.R")`
#'
#' Relational or logical operations also act on vectors to produce
#' vectors of logical values, as in,
x <- rnorm(10)
x < 0
y <- (x > -.5) & (x < .5)
!y
#' This will be especially handy when we look at the concept of
#' indexing, below.
#'
#' @subsection Generating Sequences
#'
#' There are many occasions in R when you need a patterned sequence of
#' numbers. As mentioned in the previous tutorial, most counting can
#' be accomplished by use of the seq() function. If you haven’t
#' already done so, it is worth taking a look at the help-file on
#' seq() because it has a few arguments that can make your life
#' easier.
?seq
#' For example, seq() can generate a vector of a certain length
#' between certain endpoints by typing
x <- seq(0, 1, length.out = 20)
#' giving you a vector of length 20 between 0 and 1; to confirm this, type
length(x)
#' A very common need in R is to generate vectors with an interval of
#' 1 between each element. R has a shorthand for this using the colon
#' notation, as follows:
y <- 5:10
#' This generates a vector that counts from 5 to 10, inclusive. Note that
#' : is evaluated before arithmetic operators such as + and * (so
#' 1:3 + 1 means (1:3) + 1), though after ^ and unary minus.
#'
#' Don’t underestimate the value of the colon notation. Even for
#' typing a vector of length 2, like "(1,2)" or "(2,1)," using the c
#' function to generate the vector is pretty tedious (e.g., c(1,2)).
#' These vectors can be generated in three quick characters by typing
#' 1:2 or 2:1, respectively. I will also point your attention to the
#' rep() function, for repeating sequences, which can also save time.
#'
#' @subsection Indexing
#'
#' R has an incredibly useful way of accessing items from a
#' data set. Each item in a data set has its own index, or numbered
#' location, in the object’s structure. Square brackets are used to
#' extract an item or items from a data set, but it is crucial to
#' understand that there are two completely distinct ways in which
#' brackets are used to access items. I will consider the two methods
#' for accessing a vector of length n in turn below.
#'
#' The first option: Logical
#' Requirements: Logical vector of length n
#' Use it for: Finding a subset of data based on a rule
#'
#' Logical indexing works as if you’ve asked your indexing vector the
#' question, "Do you want this item?" for each of the items in the
#' vector.
x <- 1:5
x[c(TRUE, FALSE, FALSE, FALSE, TRUE)]
#' If we combine this logical indexing with the relational and logical
#' operators you learned above, we have an exceptionally powerful tool
#' to retrieve data that meet any set of criteria.
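#' As a small, reproducible sketch of combining relational operators with
#' logical indexing, the block below uses a made-up vector. The names
#' `scores`, `is_large`, `large_scores`, `mid_scores`, and `not_mid_scores`
#' are hypothetical and chosen only for illustration.

```r
# Hypothetical example data: a small, fixed vector so results are reproducible
scores <- c(3, 8, 1, 12, 7)

# A relational operation produces a logical vector of the same length ...
is_large <- scores > 5
is_large

# ... which can be used directly as an index to keep the TRUE positions
large_scores <- scores[is_large]
large_scores

# Conditions can be combined with & (AND) and negated with ! (NOT)
mid_scores <- scores[(scores > 2) & (scores < 10)]
not_mid_scores <- scores[!((scores > 2) & (scores < 10))]
mid_scores
not_mid_scores
```

#' The same pattern scales to much larger vectors, where picking items out
#' by hand would not be practical.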
y <- rnorm(10000)
hist(y[!((y > -2) & (y < 0))])
#' I will give more insight below when I discuss indexing data
#' frames. Stay tuned.
#' In any operation in R, vectors will be automatically repeated until
#' they reach the necessary length for the operation to make
#' sense. For example, note the results of
1:6 + 1:2
#' The same repetition holds for logical vectors. For this reason, you should
#' be very cautious using operations on vectors that differ in length.
#'
#' The second option: Numerical
#' Requirements: Value or vector of any length with values
#' (1 to n) OR (-n to -1)
#' Use it for: Single item retrieval or shuffling, sorting, and repeating
#'
#' Accessing single items with brackets and a single index should be
#' straightforward:
x <- 3 * (0:5)
x[4]
#' One tedious way of creating a new vector of values from a vector’s
#' elements would be
c(x[2], x[3], x[4]) #TEDIOUS
#' So R makes it much easier by allowing a vector of indices to
#' generate a vector. Thereby, the command above becomes
x[2:4]
#' There is nothing preventing you from accessing any element any
#' number of times.
x[c(2, 2, 2, 5, 5, 5)]
#' Additionally, R allows you to use negative indices, indicating
#' which items you want to exclude, as in,
x[c(-1, -6)]
#' This is fine and productive as long as you remember never to mix
#' negative and positive indices – R will not know what you want it to
#' do:
x[c(-1, 4)] # BADCODE
#' @subsection Sorting
#'
#' In Tutorial 1, you were introduced to the sort() function, which is
#' handy.
#'
#' Now that you have been introduced to indexing, you may have an
#' inkling of how much more powerful the sorting functions of R can
#' become.
#'
#' As an introduction, let’s say you have a 4-element vector,
my_vector <- 5:8
#' Using numerical indexing, we can manually re-order this vector by
#' calling each of its indices once in our preferred order, for
#' example
my_vector[c(2, 3, 4, 1)]
#' or, for a quick reversal
my_vector[4:1]
#' Now, manually generating the vector of indices is not monumentally
#' useful, which is where the function order() comes in. As a
#' demonstration, imagine we have a vector of student names and a
#' corresponding vector of student heights (in meters).
student_names <- c(
  "Dario", "Hloniphile", "Steve", "Innocent", "Abigail", "Cynthia"
)
student_heights <- rnorm(6, 1.7, .12)
#' What we definitely don’t want to do is to perform sort() on each of
#' these vectors independently. This will eliminate the pairing of the
#' name to the height. So how can we sort one vector and have the
#' other vector align correctly? Try order() on the names,
order(student_names)
#' Note that it returns the indices in the right order, not the values
#' themselves.
#'
#' From what you learned above, you know it is now an easy matter to
#' sort both of our vectors, as follows,
student_names[order(student_names)] # same effect as sort()
student_heights[order(student_names)]
#' And, obviously, sorting the names by the heights is exactly
#' analogous, and it will make for a pretty plot
barplot(
  student_heights[order(student_heights)],
  names.arg = student_names[order(student_heights)],
  ylab = "Height (m)", main = "Student Heights"
)
#' I have conveniently skipped over an important concept, because R
#' handles it fairly intuitively, but I want to mention the
#' terminology. The variable student_names and the results of ls(), for
#' example, are called vectors of strings, or character vectors
#' (sometimes character arrays). R handles them conveniently, so we don’t
#' need to worry too much about them, but knowing the terminology will
#' improve your understanding of R's in-line help documents.
#'
#' @section B.
#' Data Frames, Redux
#'
#' Before we cover advanced topics of data frames, I want to point out the
#' function data.frame(), which puts data together to form data
#' frames. This is a key alternative to using the prefab data frames
#' that you used in last week’s assignment.
#'
#' First I want to generate a vector of student class-years to
#' correspond to the student_names before creating a data frame (Freshmen
#' as 1, Sophomores as 2, etc.).
student_years <- c(4, 2, 2, 3, 1, 3)
#' Now making a data frame is easy (each argument will just add more
#' columns to the data), the only trick being that we have to assign
#' the constructed data frame to a variable, as follows:
student_df <- data.frame(student_names, student_heights, student_years)
student_df
#' Voila! Your own data frame.
#'
#' You may want to have better column headings than the redundant variable
#' names. There are various options to accomplish this. One option is
#' to use the names() function with assignment notation. Let’s take a look:
names(student_df)
#' What we see is a vector of strings corresponding to the current column
#' names. We can change these by assigning replacement strings to the
#' indexed values or by substituting our own vector of strings.
names(student_df) <- c("names", "heights", "years")
student_df
#' If we think that "years" is ambiguous and might be confused with a
#' student's age, we could rename just that column using numerical indexing,
#' e.g.:
names(student_df)[3] <- "class.years"
student_df
#' There is also a similar option, row.names(), to access and modify
#' the row names. By default, the row names are a series of integers
#' indicating the row number:
row.names(student_df)
#' The assignments above are the first of many examples in R that seem
#' to defy logic: it seems as though we’re assigning something to a
#' function, which shouldn’t make sense because a function isn’t a
#' variable.
#' In fact, you can think of the functions names() and
#' row.names() as "access functions" – they do not perform an action,
#' but merely grant access to a property of the argument variable, and
#' this is why we can make assignments of the sort seen above.
#'
#' @subsection Indexing data frames
#'
#' As with vectors, brackets and logical or numerical vectors are still
#' the way to access data frames, but with a slight complication,
#' because data frames are multidimensional. The solution (which also
#' holds for matrices, etc.) is to separate the two dimensions with a
#' comma. R treats the first entry as the row number and the second
#' entry as the column number; thus, to access the second column of
#' the fourth row, type
student_df[4, 2]
#' Or the second column of the last three rows,
student_df[4:6, 2]
#' There are two further complications.
#'
#' To access an entire row or entire column, leave the index blank, as
#' in,
student_df[, 1] # FIRST COLUMN
student_df[3, ] # THIRD ROW
student_df[, ] # ENTIRE FRAME, equivalent to "student_df"
#' The only other complication is the ability to enter the names() or
#' row.names() as indices:
student_df["4", ]
student_df[, "class.years"]
#' Putting all of this together, we can quickly generate subsets of
#' our data. For example, we can create a data frame that includes
#' only the students with height greater than the mean height:
tall_students <- student_df[student_df$heights > mean(student_df$heights), ]
tall_students
#' Or sort our data by various aspects:
student_df[order(student_df$class.years), ]
#' @subsection Introduction to factors
#'
#' When performing statistical analyses, we often want R to look at a
#' set of data and compare groups within the data to one another. For
#' example, you have the data frame containing data on students in a
#' course. There are columns of data representing the students' heights
#' and class.years. How can you look at the means of heights by class.years?
#'
#' Or, another example, you have sampled a number of rabbits and have
#' a column for weights before a diet treatment and a column for
#' weights after a diet treatment and a third column stating the diet
#' treatment (e.g., "none," "grain diet," and "grapefruit diet"). How
#' can you evaluate the change in weight as affected by diet?
#'
#' The answer to these questions is to use factors.
#'
#' Many of the data sets that come with R already have their data
#' interpreted as factors. Let’s take a look at a data set with
#' factors:
data(moths, package = "DAAG")
help(moths, package = "DAAG")
moths
#' (Note that you may have to install the DAAG package in order to
#' load these data. Do you remember how to do this? If not, ask a neighbor
#' for help!)
#'
#' The help file for the moths data set tells us that our last column,
#' habitat, is a factor. What does this mean?
#'
#' See what happens when we pull up this column by itself:
moths$habitat
#' It looks pretty standard, at first, but then we notice that it is
#' more than just a list of habitat names – it has another component,
#' levels.
#'
#' Factors have levels. Levels are editable, independent of the data
#' itself. To see the levels alone, you can type
levels(moths$habitat)
#' When called that way, it has the identity of a vector of strings.
#'
#' The levels() function behaves just like the names() and row.names()
#' functions (i.e., weird), and you can make assignments or
#' reassignments to the levels - e.g.,
levels(moths$habitat)[1] <- "NEBank"
#' Factors come in exceptionally handy when performing statistical
#' tests, but the various plot functions can give you an idea of uses
#' of a factored variable, such as,
boxplot(moths$meters ~ moths$habitat)
#' The tilde, ~, used in a number of contexts in R, can generally be
#' read as "by," which gives a general explanation of its use here –
#' visualizing transect length ("meters") by habitat type ("habitat").
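#'
#' To make the "means by group" question from the start of this subsection
#' concrete, here is a minimal sketch. It relies on tapply(), a base R
#' function that this tutorial does not otherwise introduce, and on a
#' made-up data frame; `plant_df`, its columns, and `group_means` are
#' hypothetical names used only for illustration.

```r
# Hypothetical data: heights (m) of six plants under two treatments
plant_df <- data.frame(
  height = c(1.0, 1.2, 1.4, 2.0, 2.2, 2.4),
  treatment = factor(c("control", "control", "control", "fert", "fert", "fert"))
)

# The factor's levels are the distinct group labels
levels(plant_df$treatment)

# tapply() applies a function (here mean) to the first argument
# separately within each level of the factor given as second argument
group_means <- tapply(plant_df$height, plant_df$treatment, mean)
group_means

# Renaming a level relabels every observation in that group at once
levels(plant_df$treatment)[2] <- "fertilized"
table(plant_df$treatment)
```

#' The factor is what tells R which observations belong together; the same
#' grouping drives the boxplot() call with ~ shown above.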
#'
#' @subsection Making a factor
#'
#' Now that you know how to employ a factored variable,
#' the next step is to know how to make a factor out of a
#' variable. The general syntax is:
x <- factor(c("A", "B", "A", "A", "A", "B"))
#' For vectors of strings, like that one, the results are usually fine
#' as is.
#'
#' But let’s go back to our student_df data frame. We listed
#' class.years as a number 1 through 4, but these are discrete
#' categories with well-defined names. A more elegant solution is to
#' factor the column of the data frame, much like is seen with moths.
student_df$class.years <- factor(student_df$class.years)
levels(student_df$class.years)
#' Not ideal, but we can use reassignment to change the names of the
#' years.
levels(student_df$class.years) <- c("Freshman", "Sophomore", "Junior", "Senior")
#' With satisfying (preliminary) results available with:
student_df
boxplot(student_df$heights ~ student_df$class.years)
#' @subsection Applying functions to data frames
#'
#' Many functions you might like to apply to your data frames will
#' produce unpredictable results.
#'
#' A few work nicely:
nyc_air <- airquality[, c("Wind", "Temp")]
nyc_air
summary(nyc_air)
#' But others that you might try do not work as you want:
sum(nyc_air) # sums wind and temperature together
mean(nyc_air) # returns NA with a warning, not the column means
#' One solution to these troubles is to use the function apply(),
#' which performs the function named in the third argument on the
#' first argument by the index specified by the second argument
#' (in this case, 2, meaning by column).
apply(nyc_air, 2, sum)
apply(nyc_air, 2, mean)
apply(nyc_air, 2, var)
#' @section C.
#' Composing your own functions
#'
#' A more advanced (and very important) topic
#'
#' So far in R we have used the functions that come with R and its various
#' packages; however, since you will often want to perform the same series
#' of actions on different objects, R makes it relatively straightforward
#' to compose your own generic functions and store them in R’s memory.
#'
#' Before you start writing a function you need to have your mind set
#' on three things:
#'
#' * What you want to give the function as input
#' * What you want the function to do
#' * What you want the function to give as output
#'
#' @subsection A trivial example
#'
#' Imagine you need to repeatedly transform sets of data, but your
#' transformation is "non-standard." For this example, I’m imagining that you
#' want the natural logarithm of the data, plus one. We know how to perform
#' these operations on a number we have stored in a variable, no problem,
x <- 1:10
log(x) + 1
#' But what we would really like is a named function which will do
#' this in one step, log.plus.one().
#'
#' What we will do is make an assignment to log.plus.one, but rather
#' than assigning a value (or vector, etc.), we assign a function
#' which we define on the spot. We use the command function, which
#' looks like a function but is not a function. Instead, it’s
#' a control element of the R language – it isn’t executed like a
#' function, but rather it informs R to treat the code around it in a
#' special way.
#'
#' The command function has an interesting syntax. Its arguments are
#' the names of variables which will serve as the arguments for your
#' function (the first of three bullets, above). Then, after this
#' parenthetical bit, comes the meat of the function – what you want
#' it to do and what you want it to give back to you (the last two
#' bullets, above).
#' In our log.plus.one() case, what we want it to do
#' and what we want it to give back happen to be the same thing,
#' therefore we can define it very simply, as follows,
log.plus.one <- function(y) log(y) + 1
#' Cool! Let’s test it out:
log.plus.one(x)
#' It behaves just like we would want it to.
#'
#' Aside: you may also see short functions defined using the "lambda"
#' syntax (available in R 4.1 and later):
log.plus.one <- \(y) log(y) + 1
#' @subsection A separate little world
#'
#' Wait a second. I used y in my function definition but called the
#' function with my variable x as the argument. What happened to y?
y
#' The variable is untouched by the function.
#'
#' In order to keep functions fully generic, when you give the
#' function command, R generates a separate variable space for the
#' function. Assignments made inside the function do not change
#' variables in your R workspace. This means that the names of your
#' function arguments (and any variables assigned within your function)
#' can be anything you find convenient – there is no risk of
#' overwriting your active variables.
#'
#' @subsection Longer functions
#'
#' Either because the function is too complex to be
#' executed on a single line or because you want to make the
#' function’s methods clearer, you will often generate functions
#' longer than one line. For this purpose, R introduces another type
#' of bracket, curly brackets, { }. These are control brackets and
#' indicate the contents should be treated as a unit.
#'
#' As a final example,
(function(x, y) {
  z <- x^2 + y^2; x + y + z })(0:7, 1)
#' Note that the function is written on two lines, but this isn’t an
#' issue because of the brackets. Note also that this function is
#' anonymous. It is never assigned, but used in place.
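#'
#' The "separate little world" idea from earlier in this section can be
#' checked directly with a multi-line function. This is a minimal sketch;
#' `total`, `add_up()`, `scale_factor`, and `scale_it()` are hypothetical
#' names invented for the illustration.

```r
# Hypothetical workspace variable
total <- 100

# Inside the function, 'total' is a local variable: assigning to it
# does not modify the workspace variable of the same name
add_up <- function(values) {
  total <- sum(values)
  total
}

result <- add_up(1:4)
result # the value computed inside the function
total # the workspace variable, unchanged by the call

# One caveat: a function can still *read* a workspace variable
# that it does not define locally
scale_factor <- 2
scale_it <- function(v) v * scale_factor
scaled <- scale_it(1:3)
scaled
```

#' Passing everything a function needs as arguments, rather than relying
#' on workspace variables, keeps the function generic and easier to debug.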
#'
#' A common tendency when first learning to program is to write code
#' in a condensed form (such as the anonymous inline function defined
#' above) so that it is difficult to follow what is going on when you
#' return to the code later on (or when your instructor is helping you
#' find a bug that is keeping your code from working correctly). While
#' writing code in this way takes a certain amount of cleverness and
#' demonstrates that you have understood the concepts, it is better
#' practice to write out your code so that it is easy to follow. This
#' includes using plenty of whitespace to make your code easy to
#' read, and thoroughly commenting your commands as you go.
#' The example above is therefore better written as follows:

#' @title Sum Values and Sum of Their Squares
#' @description
#' A function that takes two numerical vectors as input and returns the sum
#' of the values plus the sum of their squares
#'
#' @param x A numerical vector
#' @param y A numerical vector
#'
#' @details Note that `x` and `y` must have compatible lengths under R's
#' recycling rules (for example, equal lengths, or one of length 1)
#'
#' @return A numerical vector, x + y + x^2 + y^2
sum_vals_plus_sum_sqs <- function(x, y) {
  z <- x^2 + y^2 # define z as the sum of the values’ squares
  ss <- x + y + z # add the values to the sum of their squares
  return(ss) # and return the result as output
}
#' Perform the above function with x equal to the numbers from 0 to
#' 7 and y equal to 1:
sum_vals_plus_sum_sqs(0:7, 1)
#' @section Benchmark Questions
#'
#' This concludes Tutorial 2. Because there are some advanced topics
#' here that require practice to get your head around, you should
#' make sure to work through the benchmark questions before you
#' move on to Tutorial 3.
#'
#' @question Semantics?
#'
#' R sometimes uses confusingly similar names for distinct concepts.
#' Define for yourself: names, factors, levels. When would you use each?
#'
#' @question Subsetting
#'
#' You need a subset of the mtcars data set that has only every other
#' row of data included.
#' a. Do this with numerical indexing.
#' b. Do this with logical indexing.
#'
#' @question Function Definition
#'
#' Write a function, `jumble()`, that takes a vector as an argument and
#' returns a vector with the original elements in random order. (Note: R
#' does have a built-in function that can do this, but the point of this
#' question is rather for you to build a new function using the tools that
#' have been introduced in the tutorials so far.)