# VirtualiZarr with Kerchunk for RiOMar data


## Context

### Purpose

The goal is to create a virtual Zarr dataset for all RiOMar data with VirtualiZarr and to store it as Kerchunk references. Kerchunk is used because Icechunk does not currently work on Pangeo-EOSC or for data on Datarmor (HTTPS access).

### Description

In this notebook, we will:
- List all the RiOMar data available online on Datarmor
- Create a virtual Zarr dataset of the RiOMar data
- Save it as Kerchunk references in Parquet format
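
To make the idea of a virtual dataset more concrete, here is a minimal sketch of what a Kerchunk-style reference set looks like and how a chunk is resolved from it. It uses only the standard library and a small local file as a stand-in for a remote NetCDF file. The file name `example.bin`, the variable name `temp` and the helper `read_reference` are hypothetical and are not part of Kerchunk or VirtualiZarr.

```python
import json
import os
import tempfile

# Stand-in "data file": a 6-byte header followed by two 8-byte chunks.
# In the notebook this role is played by the remote NetCDF files.
payload = b"HEADER" + b"chunk-0!" + b"chunk-1!"
data_path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(data_path, "wb") as f:
    f.write(payload)

# Reference set in the spirit of the Kerchunk version 1 layout:
# metadata is inlined as a string, chunks are [path, offset, length]
# triples that point into the original file.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0": [data_path, 6, 8],
        "temp/1": [data_path, 14, 8],
    },
}


def read_reference(refs, key):
    """Hypothetical helper: return the bytes a reference key points to."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        # Inlined content (typically metadata).
        return entry.encode()
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


chunk0 = read_reference(references, "temp/0")
chunk1 = read_reference(references, "temp/1")
```

No data is copied when the references are built: only byte ranges are recorded, which is why the original files must stay reachable at the same location.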

## Contributions

### Notebook

- Justus Magin (author), CNRS-LOPS (France), @keewis

## Bibliography and other interesting resources

- [Kerchunk](https://fsspec.github.io/kerchunk/)
- [VirtualiZarr](http://virtualizarr.readthedocs.io)
- [RiOMar](https://coast.ifremer.fr/Laboratoires-Environnement-Ressources/LER-Pertuis-Charentais-La-Tremblade/Projets/RIOMAR-2024-2030)

In [None]:
from functools import partial

import fsspec
import virtualizarr
import xarray as xr

fs = fsspec.filesystem("http")

In [None]:
inroot = "https://data-fair2adapt.ifremer.fr/riomar/GAMAR"
urls = fs.glob(f"{inroot}/*.nc")
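
The order of the globbed URLs determines the order in which the files are later concatenated along time. The sketch below shows one way to make that order explicit, assuming the file names sort chronologically (this is an assumption, not something the notebook states). The helper `sort_urls_by_filename` and the example URLs are hypothetical.

```python
from posixpath import basename
from urllib.parse import urlsplit


def sort_urls_by_filename(urls):
    """Hypothetical helper: sort URLs by the file name at the end of their path."""
    return sorted(urls, key=lambda url: basename(urlsplit(url).path))


# Hypothetical URLs, deliberately out of order.
example_urls = [
    "https://example.org/data/run_2004.nc",
    "https://example.org/data/run_2002.nc",
    "https://example.org/data/run_2003.nc",
]
sorted_urls = sort_urls_by_filename(example_urls)
```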

In [None]:
import distributed

# Start a local Dask cluster; adjust n_workers to the available cores.
cluster = distributed.LocalCluster(n_workers=24)
client = cluster.get_client()
client

In [None]:
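# Pre-fill the arguments shared by every file so that only the URL varies.
func = partial(
    virtualizarr.open_virtual_dataset,
    backend=virtualizarr.readers.hdf.HDFVirtualBackend,
    indexes={},  # do not build in-memory indexes
    # Load these variables as real arrays instead of virtual references.
    loadable_variables=[
        "time_counter",
        "time_instant",
        "x_rho",
        "y_rho",
        "x_u",
        "x_v",
        "y_u",
        "y_v",
        "axis_nbounds",
    ],
    decode_times=True,
)

# Open every file in parallel on the cluster, then collect the results.
futures = client.map(func, urls)
dss = client.gather(futures)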
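
The pattern used above is to freeze the shared keyword arguments with `partial` and then map the resulting one-argument function over the URLs. The sketch below illustrates it with the standard library's `ThreadPoolExecutor` swapped in for the Dask client, so that it runs without a cluster. The function `open_one` is a hypothetical stand-in for `virtualizarr.open_virtual_dataset`, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def open_one(url, *, loadable_variables, decode_times):
    """Hypothetical stand-in for an opener: just records its arguments."""
    return {
        "url": url,
        "loadable_variables": tuple(loadable_variables),
        "decode_times": decode_times,
    }


# Freeze the arguments that are identical for every file.
opener = partial(open_one, loadable_variables=["time_counter"], decode_times=True)

# Hypothetical URLs.
example_urls = [f"https://example.org/file_{i}.nc" for i in range(3)]

# Map the one-argument function over the URLs; results come back in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(opener, example_urls))
```

As with `client.map` followed by `client.gather`, the results are returned in the same order as the inputs, which matters for the concatenation along time.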

In [None]:
grid_url = "https://data-fair2adapt.ifremer.fr/riomar/misc/croco_grd_hdf5.nc"
grid = virtualizarr.open_virtual_dataset(
    grid_url,
    filetype="netcdf4",
    indexes={},
    loadable_variables=["lon_rho", "lat_rho"],
)
grid

In [None]:
ds = (
    # Concatenate the per-file virtual datasets along the time dimension.
    xr.concat(
        dss,
        dim="time_counter",
        compat="override",
        coords="minimal",
        combine_attrs="drop_conflicts",
    )
    .set_coords(["time_counter_bounds", "time_instant_bounds"])
    # Replace the navigation coordinates with the values from the grid file.
    .assign_coords(
        {
            "nav_lon_rho": lambda ds: ds["nav_lon_rho"].copy(data=grid["lon_rho"].data),
            "nav_lat_rho": lambda ds: ds["nav_lat_rho"].copy(data=grid["lat_rho"].data),
        }
    )
)
ds
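
The effect of the `xr.concat` options used above can be checked on small in-memory datasets. The sketch below is only an illustration with regular NumPy-backed arrays rather than virtual ones; the helper `make_part`, the variable `temp`, the coordinate `nav_lon` and the attribute names are hypothetical.

```python
import numpy as np
import xarray as xr


def make_part(start, lon_offset, title):
    """Hypothetical helper: build a small dataset covering two time steps."""
    return xr.Dataset(
        data_vars={
            "temp": (("time_counter", "x"), np.full((2, 3), float(start))),
        },
        coords={
            "time_counter": np.arange(start, start + 2),
            # Non-index coordinate without the time dimension.
            "nav_lon": ("x", np.arange(3) + lon_offset),
        },
        attrs={"title": title, "model": "demo"},
    )


parts = [make_part(0, 0.0, "file 0"), make_part(2, 0.5, "file 1")]

combined = xr.concat(
    parts,
    dim="time_counter",
    compat="override",  # take non-concatenated variables from the first dataset
    coords="minimal",  # only concatenate coordinates that have the time dimension
    combine_attrs="drop_conflicts",  # keep attributes that agree, drop the rest
)
```

In this sketch, `nav_lon` is taken from the first dataset without being compared, and the conflicting `title` attribute is dropped while `model` is kept.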

In [None]:
ds.virtualize.to_kerchunk("riomar.parquet", format="parquet")

In [None]:
ds.virtualize

In [None]:
reopened = xr.open_dataset("riomar.parquet", engine="kerchunk", chunks={})
reopened

In [None]:
(reopened["nav_lat_rho"] == -1).sum().compute()

In [None]:
# Reopen the saved Kerchunk references as a virtual dataset.
virtualizarr.open_virtual_dataset(
    "riomar.parquet",
    filetype="kerchunk",
    indexes={},
    loadable_variables=[
        "time_counter",
        "time_instant",
        "x_rho",
        "y_rho",
        "x_u",
        "x_v",
        "y_u",
        "y_v",
        "axis_nbounds",
    ],
)