---
layout: post
title: "Do long tables make my code tidier?"
date: "`r format(Sys.time(), '%d %B, %Y')`"
comments: true
author: Edzer Pebesma
categories: r
---

Last August, at the [geostat summer school](http://www.geostat-course.org) organized by [Barry Rowlingson](http://barry.rowlingson.com/) in Lancaster, I took the opportunity to learn about modern approaches to modelling health data in an excellent workshop held by [Teresa Smith](http://www.lancaster.ac.uk/staff/smithtr/) on [Basic methods for Areal/Spatially Aggregated Data](http://geostat-course.org/node/1287). Slides, data and R scripts can be found on [Teresa's course page](http://www.lancaster.ac.uk/staff/smithtr/).

Let's first download the data, directly from Teresa's site:

```{r}
asUrl = "http://www.lancaster.ac.uk/staff/smithtr/exampledata.RData"
asFile = basename(asUrl)
if (!file.exists(asFile))
	download.file(asUrl, asFile, mode = "wb")
load("exampledata.RData")
```

For a given disease, analyzing disease data typically starts with the raw data, which has as

* $y_{ic}$ the number of cases in area $i$ and age group $c$, and
* $n_{ic}$ the number of people (controls) in area $i$ and age group $c$.

From this, the rate of disease in age group $c$, $p_c$, is computed from the data by

$$p_c = \frac{\sum_i y_{ic}}{\sum_i n_{ic}}$$

This is needed to estimate the expected number of disease cases per area $i$, $E_i = \sum_c n_{ic} p_c$, in order to find the standardized incidence (or mortality) ratio $R_i = y_i / E_i$, where $y_i = \sum_c y_{ic}$. $R_i$ is the variable we'd like to model.

So far, I could follow it -- this looks like a pretty simple weighting exercise. Then came Teresa's script to do all this.

### The hard way

```{r}
library(SpatialEpi)
library(tidyr)
library(dplyr)
Y = gather(NEEexample@data[, c(1, 5:12)], age_band, y, -pc)
N = gather(NEEexample@data[, c(1, 13:20)], age_band, N, -pc)
Y$age_band = rep(1:8, each = length(NEEexample))
N$age_band = rep(1:8, each = length(NEEexample))
NEEmat = left_join(Y, N)
NEEmat$PCnum = match(NEEmat$pc, NEEexample@data$pc)
NEEmat = NEEmat[order(NEEmat$PCnum, NEEmat$age_band), ]
NEE = aggregate(NEEmat$y, by = list(pc = NEEmat$PCnum), FUN = sum)
NEE$E = expected(NEEmat$N, NEEmat$y, n.strata = 8)
```

Here, `NEEexample@data` is the attribute table (`data.frame`) for the spatial data, which has per area (row) the ID in column one, the cases for 8 age classes in columns 5-12, and the controls for 8 age classes in columns 13-20. What is going on here?

* `gather` converts the areas x age class table into a long table,
* `rep` adds numeric codes representing the age class, and assumes we know the order in which `gather` stacks the columns into the long table (here: all areas for the first age class, then the second, and so on), which is not clear to me from its documentation,
* `left_join` then joins the two tables based on the common fields.

(Of course, this is all very safe: we never know in which order records are in a table, so better use `left_join` to join tables. But wait: in the previous step, with `rep`, we assumed we *did* know the order...) Anyway, this could also have been done using

```{r}
NEEmat.cbind = cbind(Y, N = N$N)
```

as shown by

```{r}
NEEmat.orig = left_join(Y, N)
all.equal(NEEmat.cbind, NEEmat.orig)
```

Next,

* `match` is used to match the 920 records (115 areas times 8 age classes) to the original categories; the result of course equals

```{r}
all.equal(NEEmat$PCnum, rep(1:115, each = 8))
```

Then,

* `aggregate` is used to aggregate (`sum`) records by area, and
* `expected` (from package SpatialEpi) is called to compute expected counts.

I read function `expected`, I read its documentation, I read the function again, and didn't understand it. I now do: it tries to compensate for all the lost structure, in a fairly convoluted way.
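To make the input convention concrete: as far as I can tell, `expected` wants its `population` and `cases` vectors stacked with all strata of the first area, then all strata of the second, and so on -- which is what the reordering by `PCnum` and `age_band` above establishes. A minimal toy call, with made-up counts for two areas and two strata (not Teresa's data), illustrates this:

```{r}
# made-up example: 2 areas x 2 strata, strata nested within area, i.e.
# (area 1, stratum 1), (area 1, stratum 2), (area 2, stratum 1), (area 2, stratum 2);
# SpatialEpi was already loaded in the chunk above
pop   = c(100, 200, 300, 400)
cases = c(  1,   4,   3,   8)
expected(pop, cases, n.strata = 2)
# strata rates are (1+3)/(100+300) = 0.01 and (4+8)/(200+400) = 0.02, so this
# should return 100*0.01 + 200*0.02 = 5 and 300*0.01 + 400*0.02 = 11
```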
### The easy way

I see $y$ and $n$ as two matrices, which, when taken as

```{r}
Y = as.matrix(NEEexample@data[, 5:12])
N = as.matrix(NEEexample@data[, 13:20])
```

have areas as rows, and age classes as columns. The equation

$$p_c = \frac{\sum_i y_{ic}}{\sum_i n_{ic}}$$

averages over areas, and calls for the ratio of the column-wise sum vectors of these matrices, which we get by

```{r}
Pc = apply(Y, 2, sum) / apply(N, 2, sum)
```

Next, $E_i = \sum_c n_{ic} p_c$ is obtained as the inner product of the rows of $n$ with $p_c$, and we save it directly to the spatial object:

```{r}
NEEexample$E = as.vector(N %*% Pc)
```

where `as.vector` (or alternatively, `(N %*% Pc)[, 1]`) takes care of storing the column matrix as a vector in the `data.frame`. So, _two lines of code_ which, if one must, can be contracted to one, and no need for external packages. Indeed,

```{r}
all.equal(NEE$E, NEEexample$E)
```

Finally, we can compute and plot the SIR by

```{r fig=TRUE, fig.path = "images/", label = "disease"}
NEEexample$SIR = apply(Y, 1, sum) / NEEexample$E
spplot(NEEexample[, "SIR"], main = "Standardised incidence rate")
```

### Does it matter?

Of course, Teresa's original code could have been written much more elegantly and tidily while keeping the same basic approach. She explained it was inspired by code from people who mainly work with databases. To me, the short two-liner solution looks more like the equations in which the problem was expressed, and that makes it the better approach. What would the math look like that, translated to R, leads to the long solution?

[Tidying data](https://www.jstatsoft.org/article/view/v059i10) is a great idea, but shouldn't be confused with forcing everything into long tables -- this would degrade R to a relational database. I'm fully with [Jeff Leek](http://simplystatistics.org/2016/02/17/non-tidy-data/) that many tidy data sets are not in long tables, and I haven't even started mentioning the analysis of [spatial](http://cran.uni-muenster.de/web/views/Spatial.html), [temporal](http://cran.uni-muenster.de/web/views/TimeSeries.html), or [spatiotemporal data](http://cran.uni-muenster.de/web/views/SpatioTemporal.html). Operating R [the new way, using dplyr, tidyr and friends](https://github.com/ropensci/unconf16/issues/22) is a cool idea. Writing code that obfuscates simplicity is not.

### Is it better to use fewer packages?

The *easy way* solution above used base R without the need to load packages; the *hard way* required three packages to be loaded first. The *new R way* critically relies on, and promotes, using packages for many tasks that can also be solved with base R.

Packages are there to be helpful. Some of them are also highly dynamic: they get modified (improved!) over time, get abandoned, or get replaced by others. I often ask myself "will this code still run 10 years from now, with the R version we will use by then?" -- or 20, or 50 years? Of course, by that time one could try to compile historic R versions and install historic packages, but that might be non-trivial and require historic compilers or run-time environments. I've been using R for nearly two decades, and S-Plus before that, and I know where I put my bets.
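As an aside, the one-line contraction mentioned in the easy-way section could look as follows. This is only a sketch, reusing the `Y`, `N` and `NEEexample` objects from the chunks above; `E2` and `SIR2` are names introduced here merely for the comparison:

```{r}
# the same computation, contracted, with colSums()/rowSums() instead of apply():
E2 = as.vector(N %*% (colSums(Y) / colSums(N)))
SIR2 = rowSums(Y) / E2
all.equal(NEEexample$SIR, SIR2)  # should agree with the two-step version above
```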
This year the geostat summer school will be in [Albacete](http://geostat-course.org/), hosted by Virgilio. Of the 106 applicants, 60 will be selected. I'm looking forward to meeting lots of new people, and learning more and new cool things!