---
layout: post
title: "Scalable Earth-Observation analytics with R and SciDB"
date: 2016-05-09 0:00:00 +0100
comments: true
author: Marius Appel, Edzer Pebesma
categories: r
---

TOC

[DOWNLOADHERE]

The analysis of non-trivial amounts of Earth Observation (EO) data is complicated by the need to first download imagery scene by scene, then mosaic the scenes, then correct them, and finally de-cloud them or select images without clouds. Only when these steps have been completed can a sensible analysis, e.g. for detecting environmental change such as forest loss, be carried out. Many large research groups and private companies (including [Google](https://earthengine.google.com/)) have set up an infrastructure for this, but do not share this infrastructure with other researchers. As a consequence, many of the decisions and assumptions made during this process are not communicated, making Earth Observation science predominantly non-reproducible.

We developed a workflow that can be used to carry out many of these steps:

* compose multiple spatial and temporal scenes into a 3- (x/y/t) or 4-dimensional (x/y/t/band) array,
* give direct access to the composed arrays from high-level scripting languages (R, Python, Julia),
* use a shared computing environment that can scale computations from many cores to large clusters,
* use 100% open source software, and
* create a 100% reproducible workflow.

Reproducing this workflow requires a bit more than running an R script, as the database back-end also has to be set up and filled. For this, we resorted to building the database in a [docker image](https://www.docker.com/), in order to prove reproducibility, ease installation, and fully separate it from other running system components.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# required packages:
# install.packages(c("gdalUtils","rgdal", "rgeos", "MODIS", "devtools"))
# devtools::install_github("Paradigm4/SciDBR", ref="laboratory") # only this experimental version seems to work properly

## GLOBAL VARIABLES FOR SCIDB
SCIDB_HOST = "localhost"
SCIDB_PORT = 35551
SCIDB_USER = "scidb"
SCIDB_PW   = "marius"

# We don't want to pass this information in every single gdal_translate call and thus set it as environment variables
Sys.setenv(SCIDB4GDAL_HOST=paste("https://",SCIDB_HOST,sep=""),
           SCIDB4GDAL_PORT=SCIDB_PORT,
           SCIDB4GDAL_USER=SCIDB_USER,
           SCIDB4GDAL_PASSWD=SCIDB_PW)
```

## Introduction to SciDB

[SciDB](http://www.paradigm4.com/) is a data management and analytics system for multidimensional array data. It scales from single machines to large clusters by distributing storage and computational load over several instances. For this purpose, arrays are evenly divided into smaller sub-arrays called chunks, which are then distributed over the instances. SciDB comes in two editions: an open-source community edition and an enterprise edition. Here, we use the community edition. With regard to complex scientific analytics, SciDB offers the following features:

* a sparse array data model,
* distributed matrix operations based on ScaLAPACK,
* high-level interfaces to R, Python, and Julia, and
* extensibility by user-defined functions.

![](fig1-architecture.png)

## Reproducing this blog

To reproduce this blog, you need a running SciDB instance and need to download quite a bit of data. To do both in a clean fashion, we use a docker image, which is built from a Dockerfile.
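As a minimal sketch of how this could look (the image tag and container name below are placeholders of our choosing; only the Shim port matches the settings in the setup chunk above), building and starting such a container could be done as follows:

```{r,engine="bash", eval=FALSE}
# build the image from the Dockerfile in the current directory
docker build -t scidb-eo .
# start a container and expose SciDB's Shim web service on port 35551
docker run -d --name scidb-eo -p 35551:35551 scidb-eo
```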
The material (Dockerfile, scripts) to reproduce everything in this blog is found [here](http://ifgi.uni-muenster.de/~m_appe01/scidb-gis/15.7/scidb-gis-15.7.tar.gz).

## Spacetime Arrays: The scidb4geo plugin

SciDB natively treats space and time no differently from other dimensions: dimensions are described by an integer index with a start value, the number of cells, and the chunk and chunk overlap sizes. This information is stored as array metadata in SciDB's system catalog, which is held in a PostgreSQL database. To maintain the geographic reference of arrays (what are the coordinates of the spacetime raster origin, what are the cell sizes), we added the necessary metadata to this PostgreSQL database. For spatial reference, we store

* which dimensions correspond to latitude (northing) and longitude (easting),
* an affine transformation that relates integer array indexes to spatial coordinates, which is often just an offset and cell size vector, and
* the reference system as an authority name, id pair and its [WKT](https://en.wikipedia.org/wiki/Well-known_text) and [proj4](https://trac.osgeo.org/proj/) definitions.

Notice that the plugin comes with all [EPSG](http://www.epsg.org/) definitions based on the PostGIS `SPATIAL_REF_SYS` table and that the affine transformation corresponds to that used by [GDAL](http://www.gdal.org/). Similarly, the temporal reference stores

* which dimension refers to date or time,
* what the date / time at array cell 0 is, and
* what the temporal resolution, i.e. the temporal interval between successive array cells, is.

Dates and time (periods) are given as specified in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601). Since time is usually irregular, this definition might seem very restrictive, but using SciDB's support for sparse arrays, a higher temporal resolution (e.g. of hours or days when images appear roughly monthly) can often regularize time without storage overhead.

The scidb4geo plugin adds new operators to SciDB's query language AFL. These include basic functionality for getting and setting geographic reference and general metadata, as well as simple operations to compute the extent of an array and to overlay two arrays based on their spatiotemporal footprint. Table 1 shows a complete list of the new AFL operators. The plugin is available at [github](https://github.com/mappl/scidb4geo). Building from sources requires linking against the SciDB development libraries.

| **Operator**     | **Description**                                              |
| ---------------- | ------------------------------------------------------------ |
| `eo_arrays()`    | Lists geographically referenced arrays                        |
| `eo_setsrs()`    | Sets the spatial reference of existing arrays                 |
| `eo_getsrs()`    | Gets the spatial reference of existing arrays                 |
| `eo_regnewsrs()` | Registers custom spatial reference systems                    |
| `eo_extent()`    | Computes the geographic extent of referenced arrays           |
| `eo_cpsrs()`     | Copies the spatial reference from one array to another array  |
| `eo_settrs()`    | Sets the temporal reference of arrays                         |
| `eo_gettrs()`    | Gets the temporal reference of arrays                         |
| `eo_setmd()`     | Sets key-value metadata of arrays and array attributes        |
| `eo_getmd()`     | Gets key-value metadata of arrays and array attributes        |
| `eo_over()`      | Overlays two geographically referenced arrays                 |

Table: Table 1: Array Functional Language (AFL) operators of the scidb4geo plugin that add spacetime references to a SciDB database.
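To give a flavour of how these operators are used, the following sketch sets and inspects the spatial reference of a hypothetical two-dimensional array `my_array` from R (not evaluated here; the connection to SciDB is established further below, and the `eo_setsrs()` signature follows the call we use at the end of this post):

```{r, eval=FALSE}
require(scidb)
# assign a WGS84 reference to dimensions 'x' and 'y': authority/id EPSG:4326,
# plus an affine transformation given as origin (x0, y0) and cell sizes (a11, a22)
iquery("eo_setsrs(my_array,'x','y','EPSG',4326,'x0=30 y0=15 a11=0.01 a22=-0.01 a12=0 a21=0')")
iquery("eo_getsrs(my_array)", return = TRUE)  # inspect the spatial reference
iquery("eo_extent(my_array)", return = TRUE)  # compute the geographic extent
```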
Details including parameters, return values, and minimal examples can be found at the [code repository](https://github.com/mappl/scidb4geo/tree/master/doc/operators).

## A GDAL driver for SciDB arrays

SciDB's `load` and `input` operators can create arrays from files in CSV-like text and a custom binary format. The same is true for the `save` operator, which exports arrays to files. To support reading and writing SciDB arrays from and to files in a variety of raster formats, we implemented an extension to the open-source [Geospatial Data Abstraction Library](http://www.gdal.org/) (GDAL). GDAL natively supports over 100 multiband raster image formats and is used internally by nearly all GIS software, open source and commercial.

To create a two-dimensional array from a GDAL dataset, our driver carries out the following steps:

1. Convert pixel data to SciDB binary format
2. Transport the data over the web to SciDB's web service Shim
3. Load the data as a one-dimensional array using `load`
4. Reshape the array to two dimensions
5. Set the spatial reference

Notice that pixel data are processed in chunks. The download procedure works similarly:

1. Fill empty cells with a default NA value
2. Convert the array to a SciDB binary file using `save`
3. Load the data over the web from SciDB's web service Shim
4. Create a GDAL dataset from the pixel data

With the driver installed, the `gdal_translate` utility program can automatically convert files in all formats supported by GDAL to and from SciDB, and thereby subset arrays by spatial subwindow and band numbers, as well as modify metadata and rescale imagery (see the [gdal_translate manual](http://www.gdal.org/gdal_translate.html) for details).

The driver is available under an open-source license at [github](https://github.com/mappl/scidb4gdal). Notice that since it is not yet integrated into the GDAL source code tree, it currently needs manual compilation and installation. Dependencies are [libcurl](http://curl.haxx.se/) and some [Boost](http://www.boost.org/) header-only libraries. The driver works on Windows, Mac, and Linux operating systems. For Linux, a simple bash script to build and install GDAL is available (see below).

### Installation

The following system call installs GDAL version 2.0.1 with the new SciDB driver. Note that this may overwrite previous GDAL installations. The [github repository](https://github.com/mappl/scidb4gdal) contains more detailed installation instructions and an automated build check with TravisCI.

```{r,engine="bash", eval=FALSE}
# Running make takes some time (up to 30 minutes)
sudo ./install_gdal.sh # build and install GDAL
```

GDAL lives on the client side, so it must be installed wherever R, or a GDAL utility program, is run as a client to SciDB. After installing the SciDB-enabled GDAL, the R package rgdal needs to be re-installed, so that it links to the modified GDAL library.

### Simple 2D arrays

Ingesting and downloading single raster files to single two-dimensional arrays is done by the `gdal_translate` program. The following example downloads a sample GeoTIFF file, uploads the file as a two-dimensional array, calls `gdalinfo` directly on the SciDB array, and finally downloads the data as a PNG image and loads this into R, to plot it with `spplot`. For calling the `gdal_translate` and `gdalinfo` binaries, we use the `gdalUtils` package, but this could be replaced by `system()` calls.
```{r, eval=TRUE, cache=TRUE}
require(gdalUtils)

# Create 2d array from single GeoTIFF
# delete array to avoid errors if it already exists
gdalmanage(mode = "delete", datasetname = "SCIDB:array=chicago confirmDelete=Y")
download.file("http://download.osgeo.org/geotiff/samples/spot/chicago/UTM2GTIF.TIF", destfile = "chicago.tif")
gdal_translate(src_dataset = "chicago.tif", dst_dataset = "SCIDB:array=chicago", of = "SciDB")
gdalinfo("SCIDB:array=chicago")

# download 2d array as a png and plot in R
gdal_translate(src_dataset = "SCIDB:array=chicago", dst_dataset = "chicago.png", of = "PNG")
require(rgdal)
img = readGDAL("chicago.png")
spplot(img)
```

### Multi-tile 2D arrays

Earth-observation imagery usually comes in spatially tiled files, i.e. a single file covers a limited, rectangular region of interest. The GDAL driver for SciDB supports loading tiled datasets into a single two-dimensional array. For this, we call `gdal_translate` iteratively and append tiles to an array, creating a mosaic. Based on their spatial extent, array coordinates are then computed automatically such that tiles are ingested at the correct position. For this, the spatial reference system of all tiles must be identical. In the R code example, we download 7 SRTM tiles covering Ethiopia and ingest them into a single SciDB array called "srtm".

```{r srtm_1, eval=TRUE, cache=TRUE}
# SRTM example over Ethiopia
require(gdalUtils)

# download files, this might take some time(!)
source("download.srtm.R")

# find files
files = list.files(path = "srtm", pattern = "*.tif", full.names = TRUE)

# delete array to avoid errors if it already exists
gdalmanage(mode = "delete", datasetname = "SCIDB:array=srtm confirmDelete=Y")

# Create a 2d SciDB array with given extent and add first image
# this may produce an error if array already exists
gdal_translate(src_dataset = files[1], dst_dataset = "SCIDB:array=srtm", of = "SciDB",
               co = list("bbox=30 5 50 15", "srs=EPSG:4326", "type=S"))

# Iteratively add further images to this array
for (i in 2:length(files)) {
  # takes around 2 minutes each on my machine
  gdal_translate(verbose = T, src_dataset = files[i], dst_dataset = "SCIDB:array=srtm",
                 of = "SciDB", co = list("type=S", "srs=EPSG:4326"))
}
```

Running `gdalinfo` shows that the created array has 24001 x 12001 pixels.

```{r srtm_2, eval=TRUE, cache=TRUE}
require(gdalUtils)
gdalinfo("SCIDB:array=srtm")
```

Using the rgdal package, we can directly download the array to R. However, rgdal does not support subsetting by spatial coordinates, so the following command uses array indexes to define a window of interest.

```{r srtm_3, eval=TRUE, cache=TRUE}
require(rgdal)
srtm.sp = readGDAL("SCIDB:array=srtm", offset = c(8000,3000), region.dim = c(1000,1000))
spplot(srtm.sp, scales = list(TRUE))
```

### Multi-temporal, 3-D arrays

The GDAL driver also allows successively adding images as time slices to a three-dimensional spacetime array. During the ingestion of the first image, we must specify that the image should be rearranged as a three-dimensional array, define the temporal resolution of the target array as an ISO 8601 string for [time periods](https://en.wikipedia.org/wiki/ISO_8601#Time_intervals), and provide the date or time of the image. For subsequent images, only the date or time must be provided for ingestion. The correct temporal index is then derived automatically. Since there is no standard metadata field for time and date of datasets in the GDAL API, the date and time must be provided manually or derived, e.g., from filenames of individual images.
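For MODIS files, for instance, the acquisition date is encoded in the filename as a year and day-of-year string, which can be parsed as sketched below (the filename is made up for illustration; the same `substr()` call is used in the ingestion loop further down):

```{r, eval=FALSE}
# characters 10-16 of a MODIS filename give the acquisition date in "%Y%j" format
fname = "MOD13A3.A2000032.h12v09.005.2007110101610.hdf"  # illustrative filename
d = strptime(substr(fname, 10, 16), format = "%Y%j")
format(d, "%Y-%m")  # "2000-02"
```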
The R code below demonstrates how to download one MODIS tile at different dates using the [MODIS R package](https://r-forge.r-project.org/R/?group_id=1252) and how to build a 3-D SciDB array from these files. Since the MODIS sinusoidal projection is not in the EPSG registry, we use a parameter of `gdal_translate` to override the projection with the authority and ID from [spatialreference.org](http://spatialreference.org). The database plugin then automatically downloads the proj4 and WKT definitions.

```{r modis1, eval=TRUE, cache=TRUE}
require(MODIS)
require(gdalUtils)

MODISoptions(localArcPath = paste(getwd(), "MODIS", sep="/"))
hdf.download = getHdf("MOD13A3", begin="2000-01-01", end="2005-01-01", tileH = 12, tileV = 9, collection = "005")

# MODIS HDF files have subdatasets for bands, we only want NDVI
filenames = basename(hdf.download$MOD13A3.005)
datasets = paste0("HDF4_EOS:EOS_GRID:", hdf.download$MOD13A3.005, ":MOD_Grid_monthly_1km_VI:1 km monthly NDVI")

gdalmanage(mode = "delete", datasetname = "SCIDB:array=MOD13A3 confirmDelete=Y")

# MODIS sinusoidal is not in SPATIAL_REF_SYS and must be added
wkt = "PROJCS[\"Sinusoidal\",GEOGCS[\"GCS_Undefined\",DATUM[\"Undefined\",SPHEROID[\"User_Defined_Spheroid\",6371007.181,0.0]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Sinusoidal\"],PARAMETER[\"False_Easting\",0.0],PARAMETER[\"False_Northing\",0.0],PARAMETER[\"Central_Meridian\",0.0],UNIT[\"Meter\",1.0],AUTHORITY[\"SR-ORG\",\"6842\"]]"

# Create a 3d SciDB spacetime array and add first image
# this may produce an error if array already exists
gdal_translate(src_dataset = datasets[1], dst_dataset = "SCIDB:array=MOD13A3", of = "SciDB",
               a_srs = wkt, co = list("t=2000-01", "dt=P1M", "type=STS"))

# Iteratively add further images to this array
for (i in 2:length(datasets)) {
  d = strptime(substr(filenames[i],10,16), format="%Y%j")
  gdal_translate(src_dataset = datasets[i], dst_dataset = "SCIDB:array=MOD13A3", of = "SciDB",
                 a_srs = wkt, co = list("type=ST", "dt=P1M", paste("t=",format(d,"%Y-%m"),sep="")))
}
```

Running `gdalinfo` shows that even important metadata fields like the scale and unit of bands are maintained. Additional metadata entries about the temporal coverage are added.

```{r modis2, eval=TRUE, cache=TRUE}
gdalinfo("SCIDB:array=MOD13A3")
```

```{r modis3, eval=TRUE, cache=TRUE}
# Download temporal slice
gdal_translate(src_dataset = "SCIDB:array=MOD13A3[t,2001-04-01]", dst_dataset = "mod.tif", of = "GTiff")
require(rgdal)
img = readGDAL("mod.tif")
spplot(img)
```

## Scalable in-database analytics with R

So far, we have used the database as a data store only. The effort of the previous steps pays off as soon as we want to scale complex analyses under the following principles: (i) we want to move the analysis to the data instead of moving the data to our analysis, (ii) we want to exploit parallelism in our computations such that execution times decrease linearly with increasing hardware resources, and (iii) we want to reuse code from previous analyses on a local machine instead of rewriting methods in a new programming or query language.

With regard to (iii), SciDB offers two different ways to interface with R. On the one hand, R can be used as a database client, where essential operations use R syntax but are automatically translated to database operations by the [scidb R package](https://github.com/Paradigm4/SciDBR).
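A minimal sketch of this client-side approach could look as follows (not evaluated here; we assume the `MOD13A3` array created above, and the exact set of supported methods depends on the SciDBR version — recall that we use the experimental "laboratory" branch):

```{r, eval=FALSE}
require(scidb)
scidbconnect(host = SCIDB_HOST, port = SCIDB_PORT, username = SCIDB_USER,
             password = SCIDB_PW, auth_type = "digest", protocol = "https")
x = scidb("MOD13A3")  # lazy reference to the database array, no data transferred
dim(x)                # dimension metadata, resolved in the database
head(x)               # materializes only a small part of the array in R
```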
On the other hand, R scripts can be executed within database queries using the [r_exec](https://github.com/Paradigm4/r_exec) database plugin. The key here is that the same R script is called independently for each array chunk. As a result, chunk sizes must be adjusted to fit the analysis. Running R functions on pixel time series, for instance, requires that individual array chunks hold the complete time series.

The script below demonstrates how time-series analyses can be scaled up from R and still meet the aforementioned principles. We first rearrange the three-dimensional MODIS array such that individual chunks hold the complete time series of blocks of 64x64 pixels, then apply a simple centered mean filter, and finally fit a simple harmonic model to capture the annual variability of the NDVI. The resulting parameters (intercept, amplitude, and phase shift), together with the sum of squared residuals, are returned as attributes of a two-dimensional array.

```{r, cache=TRUE}
# r_exec should be already installed in the Docker container (scripts are provided)
# devtools::install_github("Paradigm4/SciDBR", ref="laboratory")

# remove previously created arrays
gdalmanage(mode = "delete", datasetname = "SCIDB:array=MOD13A3_T confirmDelete=Y")
gdalmanage(mode = "delete", datasetname = "SCIDB:array=MOD13A3_MODEL_OUTPUT confirmDelete=Y")
gdalmanage(mode = "delete", datasetname = "SCIDB:array=MOD13A3_MODEL_SP confirmDelete=Y")

require(scidb)
scidbconnect(host = SCIDB_HOST, port = SCIDB_PORT, username = SCIDB_USER,
             password = SCIDB_PW, auth_type = "digest", protocol = "https")

# 1. Rearrange chunks to contain complete time series and convert integers to NDVI doubles
query.preprocess = "store(
  merge(
    repart(
      project(apply(MOD13A3, ndvi, double(band1) / 10000.0), ndvi),
      [y=0:1199,64,0, x=0:1199,64,0, t=0:*,256,0]),
    build([y=0:1199,64,0, x=0:1199,64,0, t=0:60,256,0], -1)),
  MOD13A3_T)"
iquery(query.preprocess)

# 2. Apply R function over individual time series
query.R = "store(
  unpack(
    r_exec(
      project(apply(MOD13A3_T, X, double(x), Y, double(y), T, double(t)), ndvi, X, Y, T),
      'output_attrs=6',
      'expr=
        dim1 = length(unique(Y))
        dim2 = length(unique(X))
        dim3 = length(unique(T))
        ndvi = array(ndvi, c(dim3, dim2, dim1))
        t = 1:dim3
        ndvi.fitted = apply(ndvi, c(3,2), function(x) {
          x[which(x < -0.29)] = NA  # mask fill values
          x = filter(x, c(1,1,1)/3, circular=TRUE)
          if (all(is.na(x))) return(c(0,0,0,-1))
          ndvi.seasonal = lm(x ~ sin(t/6) + cos(t/6))
          intercept = coef(ndvi.seasonal)[1]
          ampl = sqrt(coef(ndvi.seasonal)[2]^2 + coef(ndvi.seasonal)[3]^2)
          phase = atan2(coef(ndvi.seasonal)[2], coef(ndvi.seasonal)[3])
          ssr = sum(residuals(ndvi.seasonal)^2)
          return(c(intercept, ampl, phase, ssr))
        })
        coords = expand.grid(unique(Y), unique(X))
        list(as.double(coords[,1]), as.double(coords[,2]),
             ndvi.fitted[1,,], ndvi.fitted[2,,],
             ndvi.fitted[3,,], ndvi.fitted[4,,])'),
    i),
  MOD13A3_MODEL_OUTPUT)"
iquery(query.R)

# 3. Reshape the array to two dimensions
query.postprocess = "store(
  redimension(
    project(
      apply(MOD13A3_MODEL_OUTPUT,
            y, int64(expr_value_0), x, int64(expr_value_1),
            p0, expr_value_2, p1, expr_value_3, p2, expr_value_4, ssr, expr_value_5),
      y, x, p0, p1, p2, ssr),
    [y=0:1199,2048,0, x=0:1199,2048,0]),
  MOD13A3_MODEL_SP)"
iquery(query.postprocess)

iquery("eo_setsrs(MOD13A3_MODEL_SP,'x','y','SR-ORG',6842,'x0=-6671703.118 y0=0 a11=926.625433055833 a22=-926.625433055833 a12=0 a21=0')")
```

The result array `MOD13A3_MODEL_SP` has a spatial reference and can be downloaded using GDAL as below.

```{r, cache=TRUE}
# 4. download and plot result array
require(gdalUtils)
gdal_translate(src_dataset = "SCIDB:array=MOD13A3_MODEL_SP", dst_dataset = "ndvi.tif", of = "GTiff")

require(rgdal)
img = readGDAL("ndvi.tif")
spplot(img[1], scales = list(T), at = seq(quantile(img$band1,0.05), quantile(img$band1,0.95), length.out = 21), main = "Fitted intercept")
spplot(img[2], scales = list(T), at = seq(0, quantile(img$band2,0.999,na.rm = T), length.out = 21), main = "Fitted seasonal amplitude")
spplot(img[4], scales = list(T), at = seq(0, quantile(img$band4,0.95), length.out = 21), main = "Sum of squared residuals")
```

The analyses required some manual SciDB query language operations to reshape arrays, change attribute data types, or remove attributes. Most of the pre- and postprocessing, except for the actual R query, could also be done using the SciDB R package. We acknowledge that plain AFL queries might be a hurdle from a data analyst's perspective, but we see many possibilities to simplify this in the future.

## Discussion and Conclusion

In this blog post, we presented an approach to scale up Earth Observation analytics with open source software, using R and SciDB. From a data scientist's perspective, the approach requires relatively little learning effort, as all necessary steps are done in R. However, there are still possibilities to improve and simplify the procedure. Running R scripts within SciDB queries using the `r_exec` plugin still requires manual bookkeeping of coordinates: it is up to the user to relate input chunks to space and time. An automatic conversion to classes of the R packages `sp`, `spacetime`, or `raster` would significantly help to translate local R analyses to scalable SciDB queries. Furthermore, core SciDB operations still work on integer array indexes only. We believe that SciDB has a useful set of operations, but overloading them to take spatiotemporal coordinates as input and to produce referenced arrays would improve the usability of SciDB. A simple approach to achieve this could be to extend the SciDB R package with an S4 class for Earth Observation arrays and to override its methods.

In our demonstration, we exclusively used R. However, since our extensions work directly in the database and in GDAL, there is no reason not to use Python or Julia, which have similar interfaces to SciDB. The approach could thus be the basis for future activities towards open and reproducible large-scale Earth Observation science.