--- title: "Reading Zarr files with R package stars" author: "Edzer Pebesma" date: "Sep 13, 2022" comments: false layout: post categories: r --- TOC [DOWNLOADHERE] **Summary**: Zarr files are the new cloud native NetCDF, and are here to stay. This blog post will explore how they can be read in R using the GDAL "classic" raster API and the (relatively) new GDAL multidimensional array API, and how we can read sub-arrays or strided (subsampled, lower resolution) arrays. CMIP6 data is being released as Zarr data on the Google Cloud. ```{r include = FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` ## What is Zarr? `Zarr` is a data format; it does not come in a single file as `NetCDF` or `HDF5` does but as a directory with chunks of data in compressed files and metadata in JSON files. Zarr was developed in the Python numpy and xarray communities, and was quickly taken up by the Pangeo community. A Python-independent specification of Zarr (in progress, V3) is found [here](https://zarr-specs.readthedocs.io/en/latest/). [GDAL](https://gdal.org/) has a [Zarr driver](https://gdal.org/drivers/raster/zarr.html), and can read single (spatial, raster) slices without time reference through its classic [raster API](https://gdal.org/api/index.html#raster-api), and full time-referenced arrays through its newer [multidimensional array API](https://gdal.org/api/index.html#multi-dimensional-array-api). In this blog post we show how these can be used through R and package [stars](https://r-spatial.github.io/stars) for raster and [vector data cubes](https://r-spatial.org/r/2022/09/12/vdc.html). We will start with an attempt to reproduce what [Ryan Abernathey](https://twitter.com/rabernat) did with Python, xarray and geopandas, published [here](https://discourse.pangeo.io/t/conservative-region-aggregation-with-xarray-geopandas-and-sparse/2715). ## Aggregating global precipitation time series to countries Ryan's blog post aggregates a global, 1-degree "square" gridded daily precipitation dataset (NASA GPCP) to country values, where the aggregation should be weighted according to the amount of overlap of grid cells (as "square" polygons) and country polygons. We can read the rainfall data in R using ```{r} library(stars) dsn = 'ZARR:"/vsicurl/https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr"' bounds = c(longitude = "lon_bounds", latitude = "lat_bounds") r = read_mdim(dsn, bounds = bounds) r st_bbox(r) r[r < 0 | r > 100] = NA ``` where * the dsn is constructed such that GDAL knows this is a Zarr file (`ZARR:`) and that it is a web resource (`/vsicurl/`). 
We can identify the duplicate time steps by

```{r}
time(r) |> diff() |> table()
ti = time(r)
ti[which(diff(ti) == 0)]
```

### 50m Natural Earth country boundaries

We can load the 50m Natural Earth country boundaries from the `rnaturalearth` package:

```{r}
library(rnaturalearth)
ne = ne_countries(scale = 50, returnclass = "sf")
st_crs(r) = st_crs(ne)
ne = st_make_valid(ne)
```

where

* we set the CRS of the precipitation data cube equal to that of the Natural Earth data (WGS84)
* we call `st_make_valid()` to make the country boundaries [valid _on the sphere_](https://r-spatial.org/book/04-Spherical.html#validity-on-the-sphere)

## Simple aggregation

`aggregate` takes the grid cells as points:

```{r}
(a = aggregate(r, ne, mean, na.rm = TRUE))
```

and returns a _vector data cube_: an array with 241 country geometries x 9226 time steps (days). We can select the time series for Italy and plot it, using the time series package `xts` and its aggregation to monthly mean values:

```{r zarr1}
library(xts)
a[, which(ne$admin == "Italy"),] |> # select Italy
  adrop() |>                        # drop geometry dimension
  as.xts() -> a.xts                 # convert to xts (time series class)
a.xts |>
  aggregate(as.Date(cut(time(a.xts), "month")), mean, na.rm = TRUE) |> # temporal aggregation
  plot(ylab = "mean daily precip [mm/d]")
```

The "exact" (or "conservative region") aggregation in Ryan's blog post is obtained by [area-weighted interpolation](https://r-spatial.org/book/05-Attributes.html#sec-area-weighted); we have `s2` switched on so that intersections are computed using spherical coordinates; doing this with Cartesian coordinates would not work for half of the globe unless we recentered the precipitation longitudes from [0,360] to [-180,180]:

```{r zarr2}
sf_use_s2() # the default
ne = st_make_valid(ne)
it = ne[ne$admin == "Italy",]
st_interpolate_aw(r, it, extensive = FALSE) |>
  st_set_dimensions("time", time(r)) -> a2
a2 |> adrop() |> as.xts() |>
  aggregate(as.Date(cut(time(a2), "month")), mean, na.rm = TRUE) |> # monthly means
  plot(ylab = "mean daily precip [mm/d]")
```

We can compare the outcomes of both computations, e.g. in a scatter plot:

```{r zarr3}
a2 |> adrop() |> as.xts() -> a2.xts
plot(as.vector(a.xts), as.vector(a2.xts), xlab = "pixels as points",
  ylab = "pixels as polygons")
abline(0, 1)
```

## CMIP6 data

[CMIP6](https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6), the Coupled Model Intercomparison Project (phase 6), will share its output data (simulations of climate projections) not only as NetCDF but also as Zarr files on Google Cloud Storage. A (static) STAC interface to these data is found [here](https://github.com/pangeo-data/pangeo-datastore-stac/blob/master/master/climate/cmip6_gcs/collection.json).
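The static STAC collection linked above can also be browsed from R. A minimal sketch (not run); it assumes the collection JSON follows the STAC spec (a `links` array whose entries have `rel` and `href` fields) and that the raw file is reachable at the `raw.githubusercontent.com` counterpart of the GitHub page linked above:

```{r eval=FALSE}
# sketch: fetch the static STAC collection for the CMIP6 Zarr stores and list
# the hrefs of its links (children, items, parent, self, ...)
library(jsonlite)
stac_url = paste0("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore-stac/",
  "master/master/climate/cmip6_gcs/collection.json")
col = fromJSON(stac_url, simplifyVector = FALSE)
sapply(col$links, function(x) x$href)
```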
One of these files, containing pressure at sea level at 6-hour intervals, was created by the following Python script:

```
import fsspec
import xarray
import zarr
m = fsspec.get_mapper("https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706")
ds = xarray.open_zarr(m)
ds.sel(time=slice("1948-01-01","1955-12-31")).to_zarr("./psl.zarr")
```

That file is provided [here](https://uni-muenster.sciebo.de/s/seyc31AxNzGzot9), but we will read the same data directly from the cloud storage using R:

```{r}
dsn = 'ZARR:"/vsicurl/https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706"/:psl.zarr/'
r = read_mdim(dsn, count = c(NA, NA, 11680))
st_crs(r) = 'OGC:CRS84'
r
st_bbox(r)
```

where

* in addition to the `ZARR:` and `/vsicurl/` prefixes, we see the array name (`psl.zarr`) specified
* the count for time, corresponding to the number of time slices, is given, leaving the counts for longitude and latitude at `NA`
* dimension information can be obtained with the `gdalmdiminfo` binary, or from R using

```{r}
ret = gdal_utils("mdiminfo", dsn, quiet = TRUE)
jsonlite::fromJSON(ret)$dimensions
```

* bounds are being read, as they are specified as an attribute of the coordinate variables:

```{r}
jsonlite::fromJSON(ret)$arrays$lon$attributes$bounds
jsonlite::fromJSON(ret)$arrays$lat$attributes$bounds
```

* we can see from the print summary that the `latitude` bounds are irregular: the cells touching the poles are half the latitude width of all the others. Reading this as a regular GDAL file creates problems.
* we also see that the `refsys` (reference system) of the time dimension is `PCICt_365`, indicating that a "365-day" calendar is used: leap days are ignored, and a regular sequence of 6-hour values is provided for the non-leap days. Using package `PCICt` we can convert this to `POSIXct` time (the one we live in) by

```{r}
as.POSIXct(r)
```

after which we see that we no longer have a regular time dimension, but a semi-regular one with gaps (the leap days); `POSIXct` values are now in the `values` field of the `time` dimension. We can plot the first 49 time slices e.g. by

```{r zarr4}
plot(r[,,,1:49]) # first 49 time steps
```

### Reading sub-arrays or strided arrays

Using the multidimensional array API we can read lower resolution versions of the imagery by setting

* `offset`: the offset (start) of reading in each dimension (in pixels; 0, 0, 0 by default)
* `count`: the number of values to read in each dimension
* `step`: the step size (stride) in each dimension

As an example, let's try to read one value per year:

```{r zarr5}
(r = read_mdim(dsn, step = c(1, 1, 365 * 4)))
plot(r)
```

We can for instance read yearly data at a quarter of the spatial resolution by

```{r zarr6}
(r = read_mdim(dsn, step = c(4, 4, 365 * 4)))
plot(r)
```

### Using the classic GDAL raster API

We can explore the Zarr store by omitting the array name, e.g. by

```{r}
dsn = 'ZARR:"/vsicurl/https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706"/'
gdal_utils("info", dsn)
```

which lists the subdatasets; if we ask for info on a subdataset name, we see an error:

```{r}
dsn = 'ZARR:"/vsicurl/https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706/":/psl'
gdal_utils("info", dsn)
```

indicating that we need to add an index for the additional (time) dimension.
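The subdataset names printed by `gdal_utils("info", ...)` above can also be collected programmatically, which is convenient when constructing dsn strings in a script. A minimal sketch (not run), using sf's `gdal_subdatasets()` on the dsn of the full store (the one without the `:/psl` suffix):

```{r eval=FALSE}
# sketch: list the subdataset names of the Zarr store as an R object instead of
# reading them off the printed info output; dsn_store mirrors the dsn used above
dsn_store = 'ZARR:"/vsicurl/https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706"/'
sf::gdal_subdatasets(dsn_store)
```

With such a name in hand, a time index can then be appended, as done next.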
If we add e.g. index 10, we get

```{r}
dsn = 'ZARR:"/vsicurl/https://storage.googleapis.com/cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/6hrPlev/psl/gn/v20170706/":/psl:10'
gdal_utils("info", dsn)
```

which describes a 288 x 192 pixel raster layer. It also reveals that GDAL thinks this raster has a latitude extent exceeding [-90,90]. We can read such a slice with `read_stars` and plot the result:

```{r zarr7}
(r0 = read_stars(dsn))
plot(r0, axes = TRUE)
```

## Software versions

To use the features shown in this blog post for reading sub-arrays or lower resolution arrays, you need stars >= 0.5-7 and sf >= 1.0-9, which you can get by installing both from source:

```{r eval=FALSE}
remotes::install_github("r-spatial/sf") # requires rtools42 on windows
remotes::install_github("r-spatial/stars")
```

You need both, since `stars` uses `sf`'s bindings to GDAL: `stars` is a pure R package, and all linking to GDAL happens through `sf`. Alternatively, and in particular if you use Windows or macOS, you can install the GitHub versions as binary packages from r-universe:

```{r eval=FALSE}
install.packages("sf", repos = "https://r-spatial.r-universe.dev")
install.packages("stars", repos = "https://r-spatial.r-universe.dev")
```

**Update**: users have reported [failure](https://github.com/r-spatial/stars/issues/564) to reproduce the steps in this blog post. Note that this post was run with a version of `sf` linked against GDAL 3.4.3 from the [ubuntugis-unstable](https://github.com/r-spatial/sf#ubuntu) ppa. If your `sf` is linked against an older GDAL version, or against a GDAL built without BLOSC support, then this post (or parts of it) may not work. Zarr, and Zarr support in GDAL, are very recent developments, and it takes time for modern library support to propagate through e.g. CRAN build toolchains -- we are working on it!
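To check whether your own setup meets these requirements, a quick sketch:

```{r eval=FALSE}
# check the installed package versions and the GDAL version sf is linked against
packageVersion("sf")             # >= 1.0-9 needed for this post
packageVersion("stars")          # >= 0.5-7 needed for this post
sf::sf_extSoftVersion()["GDAL"]  # this post was run against GDAL 3.4.3
```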