{ "cells": [ { "cell_type": "markdown", "id": "75bb4e55-8c88-4208-b32e-937ab90bc26b", "metadata": {}, "source": [ "# VirtualZarr with Kerchunk for RiOMar data\n", "\n", "\n", "## Context\n", "\n", "### Purpose\n", "\n", "The goal is to create a virtualzarr for all RiOMar data using Kerchunk (since Icechunk does not work at the moment on Pangeo-EOSC or for data on datamor (https access).\n", "\n", "### Description\n", "\n", "In this notebook, we will:\n", "- list all the RiOMar data available online on Datamore\n", "- Create a virtualzarr of the RiOMar data\n", "- Save it as kerchunk in parquet format\n", "\n", "## Contributions\n", "\n", "### Notebook\n", "\n", "- Justus Magin (author), CNRS-LOPS (France), @keewis\n", "\n", "## Bibliography and other interesting resources\n", "\n", "- [Kerchunk](https://fsspec.github.io/kerchunk/)\n", "- [Virtualzarr](http://virtualizarr.readthedocs.io)\n", "- [RiOMar](https://coast.ifremer.fr/Laboratoires-Environnement-Ressources/LER-Pertuis-Charentais-La-Tremblade/Projets/RIOMAR-2024-2030)" ] }, { "cell_type": "code", "execution_count": null, "id": "d36a399b-1a07-4429-8e0b-d04c2ba45ccb", "metadata": {}, "outputs": [], "source": [ "from functools import partial\n", "\n", "import fsspec\n", "import virtualizarr\n", "import xarray as xr\n", "\n", "fs = fsspec.filesystem(\"http\")" ] }, { "cell_type": "code", "execution_count": null, "id": "caec31f8-3d88-4a4c-a5c0-2ee1d9f812f3", "metadata": { "scrolled": true }, "outputs": [], "source": [ "inroot = \"https://data-fair2adapt.ifremer.fr/riomar/GAMAR\"\n", "urls = fs.glob(f\"{inroot}/*.nc\")" ] }, { "cell_type": "code", "execution_count": null, "id": "46f3356b-a5f9-4ab5-b310-0ba67b5c4d3a", "metadata": {}, "outputs": [], "source": [ "import distributed\n", "\n", "cluster = distributed.LocalCluster(n_workers=24)\n", "client = cluster.get_client()\n", "client" ] }, { "cell_type": "raw", "id": "5d9dd299-cff8-4bdb-a572-af6a2384786c", "metadata": {}, "source": [ "virtualizarr.open_virtual_dataset(\n", " urls[0],\n", " backend=virtualizarr.readers.hdf.HDFVirtualBackend,\n", " indexes={},\n", " loadable_variables=[\"time_counter\", \"time_instant\", \"x_rho\", \"y_rho\", \"x_u\", \"x_v\", \"y_u\", \"y_v\", \"axis_nbounds\"],\n", " decode_times=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "866f38a0-852d-4ba9-9736-b6dcc4da070c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "func = partial(\n", " virtualizarr.open_virtual_dataset,\n", " backend=virtualizarr.readers.hdf.HDFVirtualBackend,\n", " indexes={},\n", " loadable_variables=[\n", " \"time_counter\",\n", " \"time_instant\",\n", " \"x_rho\",\n", " \"y_rho\",\n", " \"x_u\",\n", " \"x_v\",\n", " \"y_u\",\n", " \"y_v\",\n", " \"axis_nbounds\",\n", " ],\n", " decode_times=True,\n", ")\n", "\n", "futures = client.map(func, urls)\n", "dss = client.gather(futures)" ] }, { "cell_type": "code", "execution_count": null, "id": "98bb7859-f7af-4f92-8a6a-448924f21bf0", "metadata": {}, "outputs": [], "source": [ "grid_url = \"https://data-fair2adapt.ifremer.fr/riomar/misc/croco_grd_hdf5.nc\"\n", "grid = virtualizarr.open_virtual_dataset(\n", " grid_url, filetype=\"netcdf4\", indexes={}, loadable_variables=[\"lon_rho\", \"lat_rho\"]\n", ")\n", "grid" ] }, { "cell_type": "code", "execution_count": null, "id": "d18ecc10-7e73-40fb-b2a1-b74a260cc72b", "metadata": {}, "outputs": [], "source": [ "ds = (\n", " xr.concat(\n", " dss,\n", " dim=\"time_counter\",\n", " compat=\"override\",\n", " coords=\"minimal\",\n", " combine_attrs=\"drop_conflicts\",\n", " )\n", " .set_coords([\"time_counter_bounds\", \"time_instant_bounds\"])\n", " .assign_coords(\n", " {\n", " \"nav_lon_rho\": lambda ds: ds[\"nav_lon_rho\"].copy(data=grid[\"lon_rho\"].data),\n", " \"nav_lat_rho\": lambda ds: ds[\"nav_lat_rho\"].copy(data=grid[\"lat_rho\"].data),\n", " }\n", " )\n", ")\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "id": "bed74e5c-ba6f-4705-be93-dd9edc78e028", "metadata": {}, "outputs": [], "source": [ "ds.virtualize.to_kerchunk(\"riomar.parquet\", format=\"parquet\")" ] }, { "cell_type": "code", "execution_count": null, "id": "922e1e4c-2ab6-4b79-a800-a5b0b7d272a6", "metadata": {}, "outputs": [], "source": [ "ds.virtualize" ] }, { "cell_type": "code", "execution_count": null, "id": "a05d0dcc-66b8-4f63-abf4-7fcf4acc7bc9", "metadata": {}, "outputs": [], "source": [ "reopened = xr.open_dataset(\"riomar.parquet\", engine=\"kerchunk\", chunks={})\n", "reopened" ] }, { "cell_type": "code", "execution_count": null, "id": "8f5f3881-a406-4051-ba2a-abde76731c30", "metadata": {}, "outputs": [], "source": [ "(reopened[\"nav_lat_rho\"] == -1).sum().compute()" ] }, { "cell_type": "code", "execution_count": null, "id": "600ae30e-f486-4890-966b-5faa21aff2d2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0758be63-bb39-415f-8159-8ab5593022b5", "metadata": {}, "outputs": [], "source": [ "virtualizarr.open_virtual_dataset(\n", " \"riomar.parquet\",\n", " filetype=\"kerchunk\",\n", " indexes={},\n", " loadable_variables=[\n", " \"time_counter\",\n", " \"time_instant\",\n", " \"x_rho\",\n", " \"y_rho\",\n", " \"x_u\",\n", " \"x_v\",\n", " \"y_u\",\n", " \"y_v\",\n", " \"axis_nbounds\",\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "84753f3b-cbda-41a0-af6e-5191f587cc3a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }