---
title: "Spatial analysis with geostan"
author: "Connor Donegan"
date: "`r format(Sys.time(), '%d %B, %Y')`"
comments: true
layout: post
categories: r
bibliography: "../bibs/geostan-introduction.bib"
---

TOC

[DOWNLOADHERE]

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center')
```

```{r echo = FALSE}
# function for getting colors, breaks, and labels for mapping
map_pars <- function(x,
                     brks = quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                     cols = c("#A8554EFF", "gray95", "#5D74A5FF")) {
  # put x values into bins
  x_cut <- cut(x, breaks = brks, include.lowest = TRUE)
  # labels for each bin
  lbls <- levels( cut(x, brks, include.lowest = TRUE) )
  # colors
  rank <- as.numeric( x_cut )
  max_rank <- max( rank , na.rm = TRUE )
  pal_fun <- colorRampPalette( cols )
  pal <- pal_fun( max_rank )
  colors <- pal[ rank ]
  # return list
  ls <- list(brks = brks, lbls = lbls, pal = pal, col = colors)
  return( ls )
}
```

## Summary

This post introduces the [geostan](https://connordonegan.github.io/geostan) R package for spatial analysis. The package is mainly oriented towards areal data, although some models may also be used for other spatial data types. The package implements the spatial error/simultaneous spatial autoregressive (SAR) model, conditional autoregressive (CAR) model, and eigenvector spatial filter (ESF) models for spatial regression. A version of ESF modelling also appears in the ecology literature as principal coordinate analysis of neighbor matrices (PCNM) [@griffith_2006].

`geostan` also supports the application of the above regression methods to hierarchical models for count data, as is common in analyses of disease incidence or mortality in small areas ('disease mapping'). Additional features of the software include models for sampling/measurement error in covariates and methods for handling censored count data, such as when mortality or disease counts have been censored for privacy.
The models were built using the [Stan](https://mc-stan.org/) modeling language, so all inference is completed using Markov chain Monte Carlo (MCMC) sampling [@stan_2023; @gabry_2024]. The spatial autoregressive models use custom-built Stan functions that speed up MCMC sampling considerably [@donegan_2021].

This post will walk through an example analysis using international data on life expectancy and per capita GDP. Package vignettes can be found with the online [documentation](https://connordonegan.github.io/geostan/articles), including an introduction to spatial weights matrices, exploratory spatial data analysis, spatial measurement error models, raster regression, and using geostan to build custom spatial models with Stan. A paper in the [*Journal of Open Source Software*](https://doi.org/10.21105/joss.04716) reports these and other features and provides the recommended citation when using `geostan` [@donegan_2022].

## Installation

`geostan` is currently on CRAN, although that may not always be the case. You can also install directly from `geostan`'s [GitHub repository](https://github.com/ConnorDonegan/geostan).

### From GitHub

You can install from the package GitHub repository:

```{r eval = FALSE}
if (!require('devtools')) install.packages('devtools')
devtools::install_github("connordonegan/geostan")
```

If you are using Windows and installing using `install_github`, you may need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) first. Rtools is not needed when installing from CRAN. You may also contact the author by e-mail for a pre-compiled version that you can use without Rtools.

If you are using Mac and installing with `install_github` then you may need to install Xcode Command Line Tools first.

### From CRAN

Using your R console, you can install from CRAN as follows:

```{r eval = FALSE}
install.packages("geostan")
```

## Getting started

To begin, load the `geostan` and `sf` packages into your R environment, as well as the `world` data:

```{r message = FALSE}
library(geostan)
library(sf)
data(world, package = "spData")
```

The `world` data contains life expectancy and gross domestic product (GDP) per capita (presumably measured in current $US) for 161 countries as of 2014, gathered from the World Bank. The rest of this post is going to be structured around a bivariate analysis of these variables.

We are going to apply the Robinson map projection for the countries:

```{r}
world <- st_transform(world, crs = 'ESRI:54030')
```

At least a couple of the missing values can be filled in using World Bank data, so we will do that:

```{r}
## https://data.worldbank.org
france <- grep("France", world$name_long)
world$gdpPercap[ france ] <- 43068
world$lifeExp[ france ] <- 82

norway <- grep("Norway", world$name_long)
world$gdpPercap[ norway ] <- 97666
world$lifeExp[ norway ] <- 82.1
```

And we will also remove Antarctica:

```{r}
world <- subset(world, name_long != "Antarctica")
```

Mapping the variables shows the familiar geography of high-, middle-, and low-income countries and a similar geography of longevity:

```{r fig.width = 7, fig.height = 5, fig.cap = "*Choropleth maps of GDP per capita and life expectancy.*"}
# store geometry for countries
world_geom <- st_geometry(world)

# show two maps at once, with no plot margins
ogpar <- par(mfrow = c(2, 1),
             mar = rep(0, 4))

# GDP per capita
pars <- map_pars(world$gdpPercap / 1e3)
plot(world_geom,
     col = pars$col,
     lwd = .2)
legend("bottomleft",
       fill = pars$pal,
       title = 'GDP per capita\n($1,000s)',
       legend = pars$lbls,
       bty = 'n'
)
rm(pars)

# life expectancy
pars <- map_pars(world$lifeExp)
plot(world_geom,
     col = pars$col,
     lwd = .2)
legend("left",
       fill = pars$pal,
       title = 'Life Expectancy',
       legend = pars$lbls,
       bty = 'n'
)
par(ogpar)
```

The `map_pars` function breaks the variables into quantiles and returns breaks, colors, and labels for the maps; it can be found at the end of the post.

There will be no discussion of substantive (non-statistical) issues here, for which one can consult any number of texts on global power and inequality (e.g., Paul Farmer's *Infections and Inequalities*).

By conventional methods, the correlation coefficient for life expectancy and log GDP per capita is 0.81:

```{r}
log_x <- log10( world$gdpPercap )
y <- world$lifeExp
cor.test(log_x, y)
```

The conventional assessment is based on the proposition that we have 161 independent observations. The visible geography of the variables, and any level of social awareness, indicates that these are not independent observations. Rather, there are various functional regions of countries that share basic political-economic conditions. A lot, but not all, of the variation can be described as variation across continents and regions. We will want to account for this dependence using a spatial model [for background see @chun_2012; @donegan_2024]. The first step will be to construct a spatial weights matrix.

## Adjacency matrix

This section will illustrate use of two `geostan` functions for creating and revising a spatial weights matrix: `shape2mat` and `edges`. The `shape2mat` function may be helpful for some users, but one can always do this using `spdep` or other methods, especially if `shape2mat` does not provide the exact method you're looking for.

We are going to start by removing the 15 countries that are missing values:

```{r}
## remove missing values
world <- subset(world, !is.na(gdpPercap) & !is.na(lifeExp))

## leaving 162 observations
nrow(world)
```

Now we can apply the `shape2mat` function to obtain an [adjacency matrix](https://connordonegan.github.io/geostan/articles/spatial-weights-matrix.html) that encodes spatial adjacency relations for countries into a binary N-by-N matrix.
The function uses `spdep` to find adjacency relations and returns results as a sparse matrix (using the `Matrix` package):

```{r fig.width = 5}
A <- shape2mat(world, "B", method = "rook")
```

Visualizing the connections in the matrix is important for uncovering unexpected results. `geostan`'s `edges` function converts the matrix into a list of nodes and edges that we can plot. For this we need to supply the function with the adjacency matrix, `A`, and the associated spatial object, `world`:

```{r fig.height = 3.5}
# edges with geometry
E <- edges(A, shape = world)
graph <- st_geometry(E)

ogpar <- par(mar = rep(0, 4))
# plot countries
plot(world_geom, lwd = .1)
# add graph nodes
plot(graph, add = TRUE, type = 'p')
# add graph edges
plot(graph, add = TRUE, type = 'l')
par(ogpar)
```

This reveals quite a few unexpected results. French Guiana is stored in the `world` data as part of France (a multi-part polygon); this is correct of course but it leads to Brazil and Suriname being listed as neighbors of France, which is not sensible. Besides removing those connections, there are a number of island nations that we might want to connect to nearby places.

To connect Mozambique to Madagascar, we just replace the zeroes with ones in the slots that correspond to those countries. First we grab their index positions in the matrix:

```{r eval = FALSE}
moz_idx <- grep("Mozambique", world$name_long)
mad_idx <- grep("Madagascar", world$name_long)
```

And then we assign the correct slots in the matrix a value of 1 (or `TRUE`), remembering that the adjacency matrix is symmetric:

```{r eval = FALSE}
A[moz_idx, mad_idx] <- A[mad_idx, moz_idx] <- TRUE
```

This can become tedious but it is important. Before moving on, we will make a series of adjustments.
This will be made a bit easier with this convenience function:

```{r}
connect <- function(country_a, country_b,
                    names_vec = world$name_long, matrix = A, add = TRUE) {
  stopifnot( country_a %in% names_vec )
  stopifnot( country_b %in% names_vec )
  a_idx <- which( names_vec == country_a )
  b_idx <- which( names_vec == country_b )
  matrix[a_idx, b_idx] <- matrix[b_idx, a_idx] <- add
  return( matrix )
}
```

The following are at least reasonable changes to make; they also ensure that every country has at least one neighbor:

```{r}
A <- connect("Mozambique", "Madagascar")
A <- connect("Australia", "New Zealand")
A <- connect("Philippines", "Malaysia")
A <- connect("Japan", "Republic of Korea")
A <- connect("Fiji", "Vanuatu")
A <- connect("Solomon Islands", "Vanuatu")
A <- connect("Solomon Islands", "Papua New Guinea")
A <- connect("Australia", "Papua New Guinea")
A <- connect("Haiti", "Jamaica")
A <- connect("Bahamas", "United States")
A <- connect("Dominican Republic", "Puerto Rico")
A <- connect("Trinidad and Tobago", "Venezuela")
A <- connect("Sri Lanka", "India")
A <- connect("Cyprus", "Turkey")
A <- connect("Cyprus", "Lebanon")
A <- connect("Norway", "Iceland")

## remove connections between South America and France
A <- connect("Suriname", "France", add = FALSE)
A <- connect("Brazil", "France", add = FALSE)
```

We should look at the revised adjacency matrix:

```{r fig.width = 5, fig.height = 3.5}
graph <- st_geometry( edges(A, shape = world) )
ogpar <- par(mar = rep(0, 4))
plot(world_geom, lwd = .1)
plot(graph, add = TRUE, type = 'p')
plot(graph, add = TRUE, type = 'l')
par(ogpar)
```

Sometimes it can help to examine the edge list interactively using a proper geographic information system like QGIS.
If you are familiar with (non-R) GIS software, you can save the edge list as a GeoPackage and then open it up in your GIS to examine the connections 'by hand' with a base map or other data:

```{r eval = FALSE}
E <- edges(A, shape = world)
st_write(E, "world.gpkg", layer = "edge list")
```

## Non-spatial regression

Fitting regression models with `geostan` is similar to using base R's `glm` function: the user provides a model formula, data, and the model family or distribution. We can fit a normal linear model using the `stan_glm` function:

```{r}
fit_lm <- stan_glm(lifeExp ~ log(gdpPercap), data = world, quiet = TRUE)
```

And we can examine parameter estimates by printing to the console:

```{r}
print(fit_lm)
```

The output printed to the console provides a summary of the posterior probability distributions of the model parameters. The distributions can also be visualized using `plot(fit_lm)`:

```{r fig.width = 7, fig.height = 3}
plot(fit_lm)
```

The mean of the distribution is reported in the `mean` column. For those who are more familiar with concepts from sampling theory, the `mean` may be understood as the estimate of the parameter. Each distribution's standard deviation is found in the `sd` column; this describes the width of the posterior distribution. The `sd` is analogous to the standard error of the estimate. The quantiles also summarize the width of the posterior distributions; the 2.5\% and 97.5\% values form a 95\% credible interval for the parameter value.

### MCMC output

The effective sample size (ESS), `n_eff`, tells us how many independent MCMC samples the inference is based on after adjusting for serial autocorrelation in the MCMC samples. This is an important quantity to pay attention to, and generally one would like to see these numbers above 400 (or around 100 samples per MCMC chain).

The standard error of the mean, `se_mean`, reports how much MCMC sampling error to expect in the `mean` (`se_mean` is calculated using `n_eff`).
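
As a rough illustration of that relationship, the Monte Carlo standard error is commonly computed as the posterior standard deviation divided by the square root of the effective sample size. The sketch below uses made-up values (`post_sd_demo` and `n_eff_demo` are hypothetical and are not output from `fit_lm`):

```{r}
# hypothetical values for illustration only; not taken from fit_lm
post_sd_demo <- 0.5   # posterior standard deviation of a parameter
n_eff_demo <- 400     # effective sample size

# Monte Carlo standard error of the posterior mean
mcse_demo <- post_sd_demo / sqrt(n_eff_demo)
mcse_demo
```

With values like these, the sampling error in the posterior mean is small relative to the posterior standard deviation, which is the practical reason for wanting a reasonably large `n_eff`.
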
The R-hat statistic, `Rhat`, should always be very close to 1, preferably less than 1.01. The R-hat diagnostic tests that the MCMC chains are all depicting the same distribution. If they diverge from one another, it either means that you need to draw a higher number of MCMC samples (run the chains for longer) or that there is a problem fitting the model to your data.

By default, `geostan` models run four independent MCMC chains for 3,000 iterations each, half of which are discarded as warm-up. The number of iterations is controlled by the `iter` argument, the default being `iter = 3e3`. For some models this may be too low and you will want to increase it. Other times this might be *more* than is needed, in which case you can reduce the computation time by using fewer iterations. What matters most is not your number of iterations but your ESS and R-hat statistics. When it comes to reporting results, it is generally best to use at least the default of four MCMC chains (`chains = 4`).

### Methods

A number of familiar methods are available for working with `geostan` models, including `fitted`, `resid`, and `predict`.

The `fitted` method returns a `data.frame` with summaries of the fitted values. The probability distribution for each fitted value is summarized by its posterior mean, standard deviation, and quantiles:

```{r}
fdf <- fitted(fit_lm)
head(fdf)
```

The `resid` method behaves similarly. Examining the Moran scatter plot using the residuals shows a moderate degree of positive spatial autocorrelation (SA) as well as some skewness:

```{r fig.width = 4.5, fig.height = 4, fig.align = 'center'}
rdf <- resid(fit_lm)
moran_plot(rdf$mean, A)
```

## Spatial regression

Options for spatial regression models currently include conditional autoregressive (CAR), simultaneous autoregressive (SAR/spatial error), and eigenvector spatial filtering (ESF). For count data, common variations on the intrinsic autoregressive (ICAR) model are also available.
All of the spatial models require at least a spatial weights matrix as input. All additional requirements for data preparation are handled by `geostan`'s `prep_` functions: `prep_car_data`, `prep_icar_data`, `prep_sar_data`.

The `make_EV` function is used to create Moran's eigenvectors for ESF regression; if you want to create your own eigenvectors (say, following the PCNM method) you can provide those directly to the ESF model (see `?stan_esf`).

For the CAR model, we always provide the binary adjacency matrix as input to `prep_car_data`. See the `prep_car_data` documentation for options. Here we will fit an intercept-only CAR model to the life expectancy data:

```{r message = FALSE}
cars <- prep_car_data(A)
fit_car <- stan_car(lifeExp ~ 1, data = world, car_parts = cars, iter = 1e3, quiet = TRUE)
```

```{r}
print(fit_car)
```

Notice that using `iter = 1000` was more than adequate for inference in this case.

The CAR model has a spatial dependence parameter `car_rho`. This parameter does *not* have an interpretation similar to a correlation coefficient, and it is often near 1; this is not a problem unless one misinterprets it or desires a value similar to the correlation coefficient. The spatial dependence parameter in the SAR model does provide that kind of interpretation.

### A filtering approach

Returning to the correlation coefficient estimated above, one way to adjust our estimate for spatial dependence is to filter out the spatial trend from each of the two variables and then calculate the correlation coefficient using the detrended values [@chun_2012, p. 71]. This spatial 'filtering' or 'pre-whitening' method is not particularly common in practice but it's a good trick to know given the familiarity of the correlation coefficient. We will use it here to demonstrate some basic features of the software.

The spatial trend term can be extracted from any spatial `geostan` model using the `spatial` method.
```{r}
theta <- spatial(fit_car)$mean
```

```{r}
pars <- map_pars(theta)
ogpar <- par(mar = rep(0, 4))
plot(st_geometry(world),
     col = pars$col,
     lwd = .2)
legend("left",
       fill = pars$pal,
       title = 'Spatial trend (LE)',
       legend = pars$lbls,
       bty = 'n'
)
par(ogpar)
```

We can obtain detrended values most simply by taking the residuals from an intercept-only spatial model:

```{r}
# life expectancy, detrended
dy <- resid(fit_car)$mean

# log per capita GDP, detrended
fit_carx <- stan_car(log(gdpPercap) ~ 1, data = world, car_parts = cars, iter = 1e3, quiet = TRUE)
dx <- resid(fit_carx)$mean
```

Using `cor.test` with those provides an estimate of correlation adjusted for spatial autocorrelation:

```{r}
# adjusted correlation
cor.test(dx, dy)
```

The adjusted estimate of .59 is considerably different from the naive estimate of .80 and is outside the naive confidence interval. (The adjusted estimate is .62 if we use SAR models.)

### A bivariate model

Here we will illustrate use of the SAR model. Fitting the spatial error or SAR model requires nearly the same steps as above.

Unlike with `prep_car_data`, be sure to row-standardize the adjacency matrix before passing it to `prep_sar_data`.

```{r message = FALSE, warning = FALSE}
W <- row_standardize(A)
sars <- prep_sar_data(W)
```

When fitting the model, we are going to add `centerx = TRUE` to center the covariate. (Internally this will call `center(x, center = TRUE, scale = FALSE)`.) This will not change coefficient estimates but it does often improve MCMC sampling efficiency, sometimes considerably so. It does change interpretation of the intercept: the intercept will be an estimate of the average life expectancy (or the expected life expectancy when all covariates are at their average values).
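
The effect of centering can be checked with ordinary least squares on a built-in data set. This is only a sketch: it uses `lm` and `mtcars` rather than `geostan`, and the `_demo` names are made up for illustration. The slope should be unchanged by centering, while the intercept becomes the mean of the outcome:

```{r}
# illustration with base R only; not a geostan model
x_demo <- log(mtcars$hp)
y_demo <- mtcars$mpg
xc_demo <- x_demo - mean(x_demo)   # centered covariate

fit_raw_demo <- lm(y_demo ~ x_demo)
fit_ctr_demo <- lm(y_demo ~ xc_demo)

# slopes are identical with and without centering
all.equal(unname(coef(fit_raw_demo)[2]), unname(coef(fit_ctr_demo)[2]))

# with a centered covariate, the intercept equals the mean of the outcome
all.equal(unname(coef(fit_ctr_demo)[1]), mean(y_demo))
```

The SAR model below applies the same kind of centering through `centerx = TRUE`.
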
```{r}
fit_sar <- stan_sar(lifeExp ~ log(gdpPercap), data = world, sar_parts = sars, centerx = TRUE,
                    iter = 1e3, quiet = TRUE)
```

Let's plot the results this time:

```{r}
plot(fit_sar)
```

The spatial dependence parameter is around 0.7, which indicates moderately strong spatial autocorrelation. The mean life expectancy is about 71 (probably somewhere between about 69 and 74). And the coefficient for log GDP is around 4 (or somewhere between 3 and 5). The residual variation has a standard deviation of around 3.6 years.

If we scale both variables before fitting the bivariate spatial regression model (so that their variances both equal 1) then we get approximately the same estimate as the adjusted correlation coefficient (above). The credible interval is slightly wider because uncertainty in rho is (appropriately) incorporated here:

```{r}
world <- transform(world,
                   sx = scale(log(gdpPercap), scale = TRUE, center = TRUE),
                   sy = scale(lifeExp, scale = TRUE, center = TRUE)
)

fit_scaled <- stan_sar(sy ~ sx, data = world, sar_parts = sars, iter = 1e3, quiet = TRUE)
print(fit_scaled)
```

### Predicted values

We can visualize the model results by plotting the expected life expectancy across the full range of GDP per capita. We use the `predict` function for this. As input, it requires our fitted model and a `data.frame` with covariate values.

We will start by creating a `data.frame` with GDP per capita values that span from the minimum to maximum values in the `world` data:

```{r fig.width = 5, fig.height = 5}
gdp <- range(world$gdpPercap)
min_gdp <- gdp[1]
max_gdp <- gdp[2]
pdf <- data.frame(gdpPercap = seq(min_gdp, max_gdp, length.out = 200))
```

The column names in this `data.frame` have to match the variable names that were present in the data that we first provided to the model. In this case, the names of the columns should match those from the `world` data.
Likewise, we provide the new GDP data on its original (un-transformed) scale, just as we did when we fit the model using `stan_sar` (the log transformation will be applied by `predict` because it is specified in the model formula). Because we centered the covariate using the `centerx = TRUE` argument, we will also allow the `predict` function to handle the centering automatically using information that is stored with the fitted model (`fit_sar$x_center`).

Now we pass this new data to `predict`:

```{r}
preds <- predict(fit_sar, newdata = pdf)
```

The output includes our `pdf` data plus some new columns. The new columns provide a summary of the predicted values. As usual, the `mean` is the estimate, and it is accompanied by other values that can be taken as credible intervals for the predicted value. The output reflects uncertainty in the model parameter estimates.

```{r}
head(preds)
```

These 'predicted' values represent the expectation of the outcome variable at the given level of the covariates. So we would expect actual observations to form a cloud of points around the 'predicted' values. To calculate these predicted values, the `predict` function only includes covariates and the intercept; it does not include any spatial autocorrelation components. Its purpose is to examine implications of the coefficient estimates on recognizable scales of variation, not to predict values for particular places. (The log-linear model can also be interpreted in terms of percent changes in the covariate, such as 'a 10% increase in GDP per capita, e.g., from 10,000 to 11,000, is associated with around 4 * log(11/10) = 0.38 additional years of life expectancy on average'.)
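
The arithmetic in that example can be reproduced directly. This sketch assumes a coefficient of 4, the approximate value reported above, rather than extracting the estimate from `fit_sar`; the `_demo` names are made up for illustration:

```{r}
# approximate coefficient on log GDP per capita, as reported in the text
beta_gdp_demo <- 4

# expected change in life expectancy for a 10% increase in GDP per capita
gain_demo <- beta_gdp_demo * log(11 / 10)
round(gain_demo, 2)
```

The plot below shows the predicted values across the range of GDP per capita, with dotted lines for the 95% credible interval.
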
```{r fig.width = 4.5, fig.height = 4, fig.align = 'center'}
# scale GDP
preds <- transform(preds, gdpPercap = gdpPercap / 1e3)

yrange <- c(min(preds$`2.5%`), max(preds$`97.5%`))
plot(preds$gdpPercap, preds$mean,
     type = 'l',
     ylim = yrange,
     axes = FALSE,
     xlab = "GDP per capita ($1,000s)",
     ylab = "Life expectancy")
axis(1)
axis(2)

# add credible intervals
lines(preds$gdpPercap, preds$`2.5%`, lty = 3)
lines(preds$gdpPercap, preds$`97.5%`, lty = 3)
```

Per this dataset, about 50% of the world population lives in countries with GDP per capita below $12,300.

## Future work and support

You can submit any questions, requests, or issues on the package [issues page](https://github.com/connordonegan/geostan/issues) or the [discussions page](https://github.com/connordonegan/geostan/discussions). `geostan` is still actively being developed so users are encouraged to check the package [news page](https://connordonegan.github.io/geostan/news) for updates.

If you are interested in contributing to the package, you are encouraged to send an e-mail to the author or use the discussions page. You can submit a pull request with any bug fixes. Contributions that would make the package more useful to fields other than `geostan`'s current focus (human geography and public health), such as ecology, would be especially welcome.

## Appendix

```{r echo = TRUE}
# function for getting colors, breaks, and labels for mapping
map_pars <- function(x,
                     brks = quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                     cols = c("#A8554EFF", "gray95", "#5D74A5FF")) {
  # put x values into bins
  x_cut <- cut(x, breaks = brks, include.lowest = TRUE)
  # labels for each bin
  lbls <- levels( cut(x, brks, include.lowest = TRUE) )
  # colors
  rank <- as.numeric( x_cut )
  max_rank <- max( rank , na.rm = TRUE )
  pal_fun <- colorRampPalette( cols )
  pal <- pal_fun( max_rank )
  colors <- pal[ rank ]
  # return list
  ls <- list(brks = brks, lbls = lbls, pal = pal, col = colors)
  return( ls )
}
```

## References