---
title: "Vector Data Cubes"
author: "Edzer Pebesma"
date:  "Sep 12, 2022"
comments: false
layout: post
categories: r
---

```{r echo=FALSE}
knitr::opts_chunk$set(comment = "")
```

TOC

[DOWNLOADHERE]

**Summary**:

Where raster data cubes refer to data cubes with raster (x- and y-,
or lon- and lat-) dimensions, **vector data cubes are n-D arrays
that have (at least) a single spatial dimension that maps to a set of
(2-D) vector geometries.** This post explains what they are used for,
and how they can be handled with data science and GIS software.

# Vector data cubes

Vector data cubes are n-D arrays with (at least) one dimension that
maps to a set of typically 2-D vector geometries (points, lines
or polygons). A simple example is a time series of precipitation
values for a set of stations. With 5 time steps and 3 stations this
could look like
```{r echo=FALSE}
set.seed(131)
m = matrix(round(runif(15, max = 5), 2), 5, 3)
dimnames(m) = list(time = c("2022-08-15", "2022-08-16", "2022-08-17", 
							"2022-08-18", "2022-08-19"), 
				   station = LETTERS[1:3])
t(m)
```
This contains station labels (A, B, C) but not geometries; we could encode
the stations with their WKT notation `POINT(x y)`, as in
```{r echo=FALSE}
# m = matrix(round(runif(15, max = 5), 2), 5, 3)
dimnames(m) = list(time = c("2022-08-15", "2022-08-16", "2022-08-17", 
							"2022-08-18", "2022-08-19"), 
				   station = c("POINT(5 7)", "POINT(1.3 4)", "POINT(8 3)"))
t(m)
```

A second example is a time series of RGB brightness values sampled
at four locations, which gives a 4 x 5 x 3 cube that can be printed
as three 4 x 5 tables:
```{r echo=FALSE}
i = array(round(runif(60, max = 255), 0), c(4, 5, 3))
dimnames(i) = list(
		station = c("POINT(5 7)", "POINT(1.3 4)", "POINT(8 3)", "POINT(2 6)"),
		time = c("2022-08-15", "2022-08-16", "2022-08-17", "2022-08-18", "2022-08-19"), 
		color = c("R", "G", "B"))
i
```

## Where do vector data cubes come from?

Naturally, in situ sensor data, where at regular time intervals
data are collected at a number of stations, are vector data cube
candidates. In the Earth Observation world, sampling raster data
cubes at point locations leads to vector data cubes - an example
would be to sample Sentinel-5P data cubes at the locations of air
quality monitoring stations, in order to compare both - S5P values
and in situ sensor values.

Other applications involve time series of land use (change) values
observed over time periods at fixed locations, which are
input to ML models for the classification of time series
of land use: as opposed to classifying land use scene by
scene [dynamic world ref](https://dynamicworld.app/), from
observations of land use time series a better approach might
be to predict land use change from observed dynamics [sits book
ref](https://e-sensing.github.io/sitsbook/).

Another case where vector cubes arise is when (polygon) area
statistics are calculated from raster data cube imagery, e.g.
the deforested area (fraction, or ha) by year and by state or
country.


## Representing vector data cubes in software

In principle, any software that can handle labeled arrays (arrays
with named dimensions, and labels for dimension values) can handle
vector data cubes. However, the handling is rather clumsy: labels
are character (string) vectors, and do not reveal

* where time, geometries, or other dimensions are involved, and 
* what dimension values mean: measurement units, or reference systems for
  time (origin and unit in case of numeric values; time zone,
  calendar) or space (coordinate reference system: datum, projection
  parameters)

More dedicated software takes care of this, e.g. R package
[`stars`](https://r-spatial.github.io/stars/) summarizes the above data like this:
```{r echo=FALSE}
suppressPackageStartupMessages(library(stars))
st_as_stars(brightness = i) |> 
  st_set_dimensions(1, values = st_as_sfc(dimnames(i)[[1]], crs = 4326)) |>
  st_set_dimensions(2, values = as.Date(dimnames(i)[[2]])) -> st
st[[1]] = units::set_units(st[[1]], cd)
st
```

which

* recognizes the regularity of the time dimension, and its `Date` class
* adds a reference system to the station geometries, and recognizes these are points

## File formats for vector data cubes

### array formats

Multidimensional arrays with a vector geometry dimension can well be saved in formats like NetCDF
or Zarr. For instance a NetCDF representation, as printed by `ncdump`,
would look like

```{r echo=FALSE, eval=FALSE}
write_mdim(st, "a.nc")
```

```
netcdf a {
dimensions:
	color = 3 ;
	time = 5 ;
	station = 4 ;
variables:
	double brightness(color, time, station) ;
		brightness:grid_mapping = "crs" ;
		brightness:coordinates = "lat lon" ;
		brightness:units = "cd" ;
	char crs ;
		crs:grid_mapping_name = "latitude_longitude" ;
		crs:long_name = "CRS definition" ;
		crs:longitude_of_prime_meridian = 0. ;
		crs:semi_major_axis = 6378137. ;
		crs:inverse_flattening = 298.257223563 ;
		crs:spatial_ref = "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]" ;
		crs:crs_wkt = "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]" ;
	double station(station) ;
	double time(time) ;
		time:units = "days since 1970-01-01" ;
	string col(color) ;
	double lon(station) ;
		lon:units = "degrees_north" ;
		lon:standard_name = "longitude" ;
		lon:axis = "X" ;
	double lat(station) ;
		lat:units = "degrees_east" ;
		lat:standard_name = "latitude" ;
		lat:axis = "Y" ;
	double geometry ;
		geometry:geometry_type = "point" ;
		geometry:grid_mapping = "crs" ;

// global attributes:
		:Conventions = "CF-1.6" ;
data:

 brightness =
  74, 10, 175, 197,
  120, 55, 162, 159,
  102, 149, 83, 235,
  87, 201, 38, 61,
  107, 17, 29, 95,
  31, 152, 86, 79,
  179, 162, 162, 55,
  240, 179, 249, 7,
  151, 229, 116, 148,
  26, 53, 34, 73,
  12, 33, 250, 186,
  79, 23, 229, 26,
  64, 53, 47, 29,
  244, 165, 85, 249,
  232, 182, 87, 253 ;

 station = 1, 2, 3, 4 ;

 time = 19219, 19220, 19221, 19222, 19223 ;

 col = "R", "G", "B" ;

 lon = 5, 1.3, 8, 2 ;

 lat = 7, 4, 3, 6 ;
}
```

Such files can be read and written using [GDAL's multidimensional array API](https://gdal.org/api/index.html#multi-dimensional-array-api), and transformed into other multidimansional array formats using `gdalmdimtranslate`. The utility `gdalmdiminfo` can print the dimension metadata, or the entire information (including values) as JSON.


## I/O: GIS formats

The two common GIS formats (as supported by GDAL) are

* vector tables: a set of vector geometries with zero or more attributes
* raster data: a raster images with 1 or more layers

Clearly, for vector data cubes the the raster data format will not work because the array dimensions do not correspond to two spatial dimensions (x and y). For vector tables, there are essentially two options:

### long table form

The long table form can be illustrated by showing six records of the array above:

```{r echo=FALSE}
st_as_sf(st, long = TRUE) |> 
  structure(class = "data.frame") |>
  subset(select = c(4, 1, 2, 3)) |>
  head()
```

In this form, the complete set of array values ends up in a single
column, and all dimensions are recycled appropriately. This is the
least ambiguous form because it

* keeps dimension and variable names, 
* keeps data types (like variable time being of class `Date`) 
* keeps the array values in a single column.

On the other hand, it replicates dimension values and can lead to very large tables. 

When a table of this kind is provided by a user, it is not immediately clear

* what the (unique) dimension values are, in particular for geometries
* whether all array values are present, 
* whether it contains multiple records with identical dimension values

all this needs to be sorted out before one can recreate a multidimensional array
from its long table form.

### wide table forms

There are different ways in which we can use the column space to distribute our array values. The most extreme would _not_ replicate geometries, so end up with four rows and combine the other dimensions (time, color) into columns, creating column names that paste the information togehter, as in the 4 rows x 15 columns table

```{r echo=FALSE}
names(st) = "br"
st_as_sf(st) -> st.sf
row.names(st.sf) = NULL
names(st.sf)[1:15] = substr(names(st.sf)[1:15], 4, 15)
structure(st.sf, class = "data.frame")
```

Other forms would borrow from the long form, and for instance create 5 x 4 records with all station and time combinations and have variables `R`, `G` and `B`, or have 5 x 3 records combining the station and color, having the time values as column names.

The disadvantages are obvious:

* column names collated from combinations of dimension values are cludgy, and are ugly to read or process
* the dimension values that end up in column names loose there type, units and reference system
* this may lead to tables with extremely many columns, which may be not practical or break software
* software my require that column names start with a letter, increasing the name cludge

Recreating the array from wide table forms may be relatively
straightforward in data science languages (Python, R) but harder
in databases.

It should be noted that vector data cubes can also have line or polygon geometries. In that case, WKT representations of geometries can become very long and not useful as colunn names.

### Multiple (long) table forms (database normalization)

The repetition of dimension values in the long table form would
be prohibitive if individual geometries are large. One way to cope
with that is to create a geometry table with the unique geometries,
and put a index (foreign key) to these geometries in the long
table. This can be done with all recycling dimensions, and would
make recreating the original array easier.