---
title: "Working with lists"
author: Tad Dallas
includes:
  in_header:
    - \usepackage{lmodern}
output:
  pdf_document:
    fig_caption: yes
    fig_height: 6
    fig_width: 6
    toc: yes
  html_document:
    fig_caption: yes
    fig_height: 6
    fig_width: 6
    highlight: tango
    theme: journal
---

### What are lists?

Lists in R are collections of objects that can be of any mix of types (or all the same type). They can be useful when dealing with multiple data.frames that each correspond to a different unit of study (note that we solved this before by considering a single data.frame with a column corresponding to country). Lists tend to be useful in my research when I'm simulating ecological dynamics, where each list item can be the results of a single simulation, or when I'm writing functions that return data that is not really structured as a data.frame. Before we go too far into use cases, let's refresh on how we form and index list objects.

```{r}
lst <- list(a=runif(100),
  df=data.frame(x=runif(100), y=runif(100)),
  letters[1:10], 'a')

# index single list elements
lst[[1]]
lst[[2]]
lst[[3]]
lst[[4]]

# index multiple list elements
lst[1:2]
```

Let's think about how we might use lists. For one, many functions in R output data in list format. For instance, when working with network data in R through `igraph`, most objects are lists. Let's explore this, both as a way to introduce lists and to talk about how to analyze network data in R.

```{r, warning=FALSE, message=FALSE}
#install.packages('igraph')
library(igraph)
g <- igraph::sample_gnm(100, 200)
str(g)
```

Recall that when we introduced the `plot` function, I said that packages build in functionality so that some base functions work with more complex objects (the `igraph` object above is a list). Try it here.

```{r}
plot(g)
```

Nice. That's neat. We can also write a wrapper function (we will not go over function writing now, but will save that for the coming weeks), which can be useful across multiple projects.
This is a function I routinely use for visualizing networks in a prettier way.

```{r}
#' @param g graph object
#' @param lay layout (a matrix of node coordinates)
#' @param colz node color
#'
#' @return a graph plot
plotGraph <- function(g, lay=layout_nicely(g), colz='dodgerblue'){
  plot(g, layout=lay,
    edge.width=E(g)$weight,
    vertex.size=10, directed=FALSE,
    vertex.color=colz,
    vertex.label=NA)
}

plotGraph(g)
```

But let's get back to lists. We've seen that `igraph` graph objects are lists, but many of the outputs of functions within `igraph` are also lists (or even lists of lists!). I will not defend nested lists as being all that useful, but we will see them in a couple of weeks when we talk about APIs and spatial data.

One of the functions built into `igraph` is the `sir` function. This function runs the Susceptible-Infected-Recovered model (or SIR for short) on the network. The SIR model aims to capture how diseases spread through populations.

\begin{align}
\frac{dS}{dt} & = -\beta SI \\
\frac{dI}{dt} & = \beta SI - \gamma I \\
\frac{dR}{dt} & = \gamma I
\end{align}

By default, the function (`?sir`) runs 100 simulations of the SIR model on the graph object that you provide, and stores the output as a list.

```{r}
sims <- igraph::sir(g, beta=0.5, gamma=0.5, no.sim=100)

typeof(sims)
class(sims)

# explore the structure of each list element
sims[[1]]

# each list element is another list
sims[[1]][[1]]
sims[[1]]$times
```

And just to go back to plotting for a quick second, `igraph` has written functionality into the `sir` class to work well with the base R `plot` function.

```{r}
plot(sims)
```

Neat, right? But back to lists. Let's work through the rest of working with lists through some exercises.

Given the simulations above (`sims` list) ...
Calculate the mean number of infected individuals for each simulation.

```{r class.source = 'fold-hide'}

```

Find the time associated with the maximum number of infected nodes.

```{r class.source = 'fold-hide'}

```

Calculate the fraction of all simulations in which fewer than 5 nodes are infected.

```{r class.source = 'fold-hide'}

```

## apply statements

How did we approach the above questions? You almost certainly used a `for` loop, right? This is perfectly fine, but there is another way. `apply` statements allow you to take a function and run it over all elements of a vector, columns/rows of a matrix, or a list. `apply` statements typically have a prefix which gives information about what type of data they work well with. For instance, when working with lists, we will use `lapply`. The base `apply` function works with matrices, where we want to apply a function to every row or column of a matrix (e.g., `apply(matrix, 2, sum)` is the same as `colSums(matrix)`). We will go over more examples of `apply` statements at some point, but for now we will focus on `lapply` and our problem of working with lists.

And here we hit an issue. `lapply` statements take a function argument, where the function needs to take a single list element as an argument and then do something with it. So we'll have to learn a bare minimum of function writing now. Let's use a problem above as a motivating example, where we try to calculate the mean number of infected individuals for each simulation.

```{r}
meanInfections <- function(x){
  # if we consider the mean infecteds as the mean of the infected vector
  return(mean(x$NI))
  # if we instead consider the total number of nodes that were ever infected
  # (initial susceptibles plus the first infected node, minus final susceptibles)
  # return((x$NS[1]+1)-tail(x$NS,1))
}

meanInfs <- lapply(sims, meanInfections)
str(meanInfs)
```

`lapply` is nice in that if we give it a list object, it gives us a list object back.
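
A minimal sketch of that list-in, list-out behavior, using a small made-up stand-in for `sims` so that it runs on its own. The names `toySims` and `toyMeans` and all of the values are hypothetical and only there for illustration.

```{r}
# hypothetical stand-in for `sims`: three tiny "simulations", each a list
# holding an NI (number infected) vector; the values are made up
toySims <- list(
  simA = list(NI = c(1, 3, 2, 0)),
  simB = list(NI = c(1, 1, 0)),
  simC = list(NI = c(2, 4, 6, 3, 1))
)

# lapply calls the function once per list element and returns a list
toyMeans <- lapply(toySims, function(x) mean(x$NI))

is.list(toyMeans)   # the output is a list
length(toyMeans)    # one output element per input element
names(toyMeans)     # element names carry over from the input
```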
This makes analytical pipelines that deal with lists pretty straightforward, but if the output is a single value, we may want this to be a vector instead of a list.

```{r}
unlist(meanInfs)

# or
meanInfs2 <- sapply(sims, meanInfections)
```

`sapply` statements are essentially just `lapply` statements that simplify the result to a vector when possible. This is useful when the output of the function is a single value, and not so useful when the function returns multiple values.

_A side note_: some people will criticize `for` loops in R, and say "just use apply, it's faster". It's not, really. Write however you feel comfortable. For a while, `apply` statements were super confusing to me, so I tended to use `for` loops instead. With more practice, I shifted and now tend to use `apply` statements when they fit, as they are less code and are more intuitive to me for many situations.

### Let's practice a bit.

Calculate the maximum number of infected individuals at any time in the `sims` list using the `apply` approach.

```{r class.source = 'fold-hide'}

```

What is the mean duration (the total time the epidemic took before it stopped) across all the epidemics in `sims`?

```{r class.source = 'fold-hide'}

```

## plyr

The `plyr` package provides `XYply` statements as nice wrappers around the more classic `apply` statements. Here, `X` and `Y` can take values of 'a' (array), 'l' (list), or 'd' (data.frame), depending on the input or output data structure desired. For instance, if we have a list that we would like to apply over and return a data.frame, we would use `ldply`, where the `l` says that the input is a list object, and the `d` says that the output should be formatted as a data.frame. Other examples of this syntax would be `adply`, `ddply`, `laply`, `aaply`, etc.

Below, I provide an example of the aXply syntax (e.g., adply, alply, aaply).
```{r}
arr <- array(1:27, c(3,3,3))
rownames(arr) <- c("Curly", "Larry", "Moe")
colnames(arr) <- c("Groucho", "Harpo", "Zeppo")
dimnames(arr)[[3]] <- c("Bart", "Lisa", "Maggie")

arr
```

Arrays are something that we did not introduce when we talked about `R` basics, and that is because they really are not used _too_ often. Think of a matrix. It has two dimensions (x and y), so it can be viewed as a rectangle of data. Arrays simply add more dimensions. In the example above, there is a third dimension, forming a data cube (to extend the rectangle analogy).

We can use `plyr` functionality to operate on this array and return different forms. For instance, `aaply` takes an array and returns a simplified array (here a vector).

```{r}
plyr::aaply(arr, 1, sum)
```

We can change one letter and now return a data.frame containing two columns. This is also a good time to point out the flexibility of the XYply statements to different margins. The margin (the `.margins` argument) specifies along which axis you would like to operate on the array. If we set `.margins=1`, this corresponds to a row-wise operation, so we calculate the sum across the array for Curly, Larry, and Moe. If we change this to `.margins=2`, we operate on columns, and will return sums for Groucho, Harpo, and Zeppo. And if we use `.margins=3`, we will return sums for Bart, Lisa, and Maggie.

```{r}
plyr::adply(.data=arr, .margins=1, .fun=sum)
plyr::adply(.data=arr, .margins=2, .fun=sum)
plyr::adply(.data=arr, .margins=3, .fun=sum)
```

Finally, we can return a list object. In this use case, this is not super helpful, but in other use cases the list output is pretty helpful.

```{r}
plyr::alply(.data=arr, .margins=1, .fun=sum)
```

A pitch for `plyr::ldply`. I really like this function, as I often find myself with lists of similar structures that I want to operate on and get a single clean object back.
I will not go into an example, but this is a pretty useful function (though all the utility is basically contained in `vapply`).

Finally, you may wonder why I am pushing apply statements so hard. It has nothing to do with speed, and only a bit to do with code clarity. The main advantage is understanding the programmatic nature of apply statements (which is similar to, but less sequential than, a `for` loop), and many parallel computing packages have their own little versions of apply statements ready to go (e.g., `parallel::mclapply`, `parallel::parLapply`, `parallel::clusterApplyLB`).

Let's do one practice problem to showcase the utility of `ldply` specifically.

Calculate the correlation between number of infections and time for each simulation, reporting the estimate, p-value, and confidence intervals around the estimate (you will use `cor.test` to do this, whose output is a list object as well).

```{r class.source = 'fold-hide'}

```

### A note about do.call and Reduce

While a bit opaque and conceptually difficult, `do.call` and `Reduce` are solid base `R` functions that are pretty useful in a variety of situations. `do.call` calls a function once, passing the elements of a list as its arguments. Its output may be similar to that of `Reduce`, which applies a two-argument function iteratively, combining the running result with each successive list element.

```{r}
lst <- list(1:10, 1:10, 1:10, 1:10, 1:10)
lst

# this makes a single rbind call with each element of the list as an argument
str(do.call(rbind, lst))

# this does it iteratively (so makes n-1 rbind calls)
Reduce(rbind, lst)
```

### Practice problems with lists

```{r}
set.seed(123)
mats <- lapply(1:100, function(x){
  matrix(rbinom(5000, 1, 0.3), ncol=rpois(1,20))
})
```

Use either a for loop or an apply statement to calculate the number of rows and columns of each matrix in the `mats` list defined above.

```{r}

```

Calculate the column sums of each matrix in `mats` and return the object as a list.
```{r}

```

Subset `mats` to include only the matrices that have more than 250 rows. How many matrices are left in the subset list?

```{r}

```

Calculate the fraction of values that are 1 for each matrix in `mats`.

```{r}

```

Write code to make each matrix column its own list element, while keeping the structure of the `mats` list. This means that list element 1 would have 16 nested list elements, with each nested element holding the values of one column of `mats[[1]]`.

```{r}

```

## sessionInfo

```{r}
sessionInfo()
```