{ "cells": [ { "cell_type": "markdown", "id": "b38bafd6-2e37-46b0-b0a7-e6744f9eb0e7", "metadata": {}, "source": [ "---\n", "title: Example to load data from NRDA\n", "subtitle: Learn how to load data from the Norwegian Research Data Archive with rocrate\n", "authors:\n", " - name: Anne Fouilloux\n", " orcid: 0000-0002-1784-2920\n", " github: annefou\n", " affiliations:\n", " - id: Simula Research Laboratory\n", " institution: Simula Research Laboratory\n", " ror: 00vn06n10\n", "date: 2025-05-18\n", "keywords: earth and related environmental sciences\n", "---" ] }, { "cell_type": "markdown", "id": "b71d4ba8-5f12-4265-9e46-501613058c0b", "metadata": {}, "source": [ "(Introduction)=\n", "## Introduction\n", "\n", "The Norwegian Research Data Archive (NIRD RDA), managed by Sigma2, is the Norwegian national open-access repository for research data, built on the open-source CKAN platform. It aims to support the FAIR principles, enabling the discovery, access, and reuse of datasets across scientific domains. With nearly 1,000 TB of data, the archive facilitates Open Science by providing persistent identifiers and rich metadata. 
\n", "\n", "\n", "Below, we demonstrate how to access a dataset from the NIRD RDA with the `rocrate` Python library, using its RO-Crate metadata to locate and process a NetCDF file.\n", "\n" ] }, { "cell_type": "markdown", "id": "13b64621-633e-4a2a-bee4-0604ac68b226", "metadata": {}, "source": [ ":::{hint} Overview\n", "**Questions**\n", "- How can we access and extract metadata from a dataset in the Norwegian Research Data Archive using RO-Crate?\n", "- How can we programmatically retrieve and analyze NetCDF data files from an RO-Crate dataset?\n", "\n", "**Objectives**\n", "- Read and parse RO-Crate metadata from a local file to extract dataset details such as name, DOI, and geospatial coverage.\n", "- Access and process NetCDF data files referenced in the RO-Crate using `xarray` for scientific analysis.\n", ":::" ] }, { "cell_type": "markdown", "id": "48022fe9-4c4d-4f6e-b6fa-469892e7c419", "metadata": {}, "source": [ "(Setup)=\n", "## Setup\n", "\n", "- Install the required Python packages.\n", "- Import the necessary libraries." 
] }, { "cell_type": "code", "execution_count": 15, "id": "89991a2f-1206-4118-9a86-872707fd38b8", "metadata": { "collapsed": true, "hide-output": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: rocrate in /srv/conda/envs/notebook/lib/python3.12/site-packages (0.13.0)\n", "Collecting cmcrameri\n", " Using cached cmcrameri-1.9-py3-none-any.whl.metadata (4.6 kB)\n", "Requirement already satisfied: requests in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.32.3)\n", "Requirement already satisfied: arcp==0.2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (0.2.1)\n", "Requirement already satisfied: jinja2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (3.1.5)\n", "Requirement already satisfied: python-dateutil in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.9.0)\n", "Requirement already satisfied: click in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (8.1.8)\n", "Requirement already satisfied: matplotlib in /srv/conda/envs/notebook/lib/python3.12/site-packages (from cmcrameri) (3.10.0)\n", "Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.12/site-packages (from cmcrameri) (2.0.2)\n", "Requirement already satisfied: packaging in /srv/conda/envs/notebook/lib/python3.12/site-packages (from cmcrameri) (24.2)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from jinja2->rocrate) (3.0.2)\n", "Requirement already satisfied: contourpy>=1.0.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (1.3.1)\n", "Requirement already satisfied: cycler>=0.10 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in 
/srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (4.55.4)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (1.4.8)\n", "Requirement already satisfied: pillow>=8 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (11.1.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from matplotlib->cmcrameri) (3.2.1)\n", "Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil->rocrate) (1.17.0)\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.4.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.10)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (1.26.19)\n", "Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (2024.12.14)\n", "Using cached cmcrameri-1.9-py3-none-any.whl (277 kB)\n", "Installing collected packages: cmcrameri\n", "Successfully installed cmcrameri-1.9\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install rocrate cmcrameri" ] }, { "cell_type": "code", "execution_count": 27, "id": "07ac67a3-28cc-4a20-a57e-d4d91dda7daa", "metadata": {}, "outputs": [], "source": [ "import requests\n", "import tempfile\n", "import json\n", "import os\n", "from rocrate.rocrate import ROCrate\n", "from rocrate.model.person import Person\n", "\n", "import xarray as xr\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import cartopy.crs as ccrs\n", "import cartopy.feature as cfeature\n", 
"import cmcrameri" ] }, { "cell_type": "markdown", "id": "c2e2394d-c0e3-4849-9e43-230e84687959", "metadata": {}, "source": [ "(Parameters)=\n", "## Input Parameters\n", "\n", "Currently, only temporary credentials are available. You need to access the archive to obtain a valid link and download the RO-Crate promptly, as the credentials expire after a few minutes.\n", "\n", "Visit the dataset at [https://data.archive-sandbox.sigma2.no/dataset/relative-humidity-over-small-sub-region2](https://data.archive-sandbox.sigma2.no/dataset/relative-humidity-over-small-sub-region2), and check the metadata to get the credentials." ] }, { "cell_type": "code", "execution_count": 3, "id": "8e479333-b3b1-4bd9-80fa-13c2a44a613a", "metadata": {}, "outputs": [], "source": [ "# URL of the RO-Crate metadata file\n", "url = \"https://s3.nird.sigma2.no/archive-sandbox-ro/3888d382-6269-479c-9448-701ad6e3fa74_metadata/dataset_metadata_10.82969_2025.pitlyjt0.json?response-content-disposition=attachment&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=TRnFBoNN9N9QfuRYw5mX%2F20250518%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20250518T175123Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host&X-Amz-Signature=f089662d8f3c95f9fcde3fc9da0d00b7ff8a7ce2d0c2b8598a94d6f06f9ef30a\"\n", "# Output directory and file path\n", "data_dir = \"../data\"\n", "metadata_path = os.path.join(data_dir, \"ro-crate-metadata.json\")" ] }, { "cell_type": "markdown", "id": "c7de8a4d-9143-480b-b5fd-807626312d38", "metadata": {}, "source": [ "(Retrieve)=\n", "## Retrieve RO-Crate for a given dataset" ] }, { "cell_type": "code", "execution_count": 7, "id": "9dbc558e-3ff7-476d-b779-b840d5248870", "metadata": {}, "outputs": [], "source": [ "# Create data directory if it doesn't exist\n", "os.makedirs(data_dir, exist_ok=True)\n", "\n", "if not os.path.exists(metadata_path):\n", " # Download metadata\n", " response = requests.get(url)\n", " if response.status_code != 200:\n", " raise Exception(f\"Failed to download metadata: 
{response.status_code} - {response.text}\")\n", "\n", " # Save as data/ro-crate-metadata.json\n", " with open(metadata_path, \"wb\") as f:\n", " f.write(response.content)\n", " print(f\"Metadata saved to: {metadata_path}\")" ] }, { "cell_type": "markdown", "id": "393e7dc0-b897-419e-8e40-65686c26f2ad", "metadata": {}, "source": [ "(Load)=\n", "## Load RO-Crate to access collection" ] }, { "cell_type": "code", "execution_count": 6, "id": "32d265f9-7711-47ad-b5d4-3060ebe9a5c4", "metadata": {}, "outputs": [], "source": [ "crate = ROCrate(data_dir)" ] }, { "cell_type": "code", "execution_count": 13, "id": "45d4d8ae-f498-4b1c-b791-ee793e574979", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Dataset Metadata ===\n", "Name: Relative humidity over small sub-region\n", "Description: rh_mean_july_1980_2018_small.nc\n", "\n", "Relative humidity (%) monthly values for July (year 1980 and 2018).\n", "DOI: https://doi.org/10.82969/2025.pitlyjt0\n", "Author: ['annef@simula.no']\n", "License: https://creativecommons.org/licenses/by/4.0/\n", "Date Published: 2025-05-18\n", "Temporal Coverage: 1980-06-30T21:00:00.000Z/2018-06-30T21:00:00.000Z\n", "Geospatial Coverage: Unknown\n", "\n", "=== Data Files ===\n", "File part of the collection: https://data.archive-sandbox.sigma2.no/dataset/3888d382-6269-479c-9448-701ad6e3fa74/download/rh_mean_july_1980_2018_small.nc\n", "\n", "=== Dataset found and opened with xarray (https://data.archive-sandbox.sigma2.no/dataset/3888d382-6269-479c-9448-701ad6e3fa74/download/rh_mean_july_1980_2018_small.nc) ===\n" ] } ], "source": [ "root_dataset = crate.root_dataset\n", "\n", "# Print metadata\n", "print(\"=== Dataset Metadata ===\")\n", "print(f\"Name: {root_dataset.get('name', 'Unnamed')}\")\n", "print(f\"Description: {root_dataset.get('description', 'No description')}\")\n", "print(f\"DOI: {root_dataset.get('identifier', 'No DOI')}\")\n", "print(f\"Author: {root_dataset.get('author', 'Unknown')}\")\n", 
"print(f\"License: {root_dataset.get('license', {}).get('@id', 'No license')}\")\n", "print(f\"Date Published: {root_dataset.get('datePublished', 'Unknown')}\")\n", "print(f\"Temporal Coverage: {root_dataset.get('temporalCoverage', 'Unknown')}\")\n", "location_entity = crate.dereference(root_dataset.get('location'))\n", "print(f\"Geospatial Coverage: {location_entity.get('polygon', 'Unknown') if location_entity else 'Unknown'}\")\n", "\n", "# Process hasPart files\n", "print(\"\\n=== Data Files ===\")\n", "files = []\n", "for part in root_dataset.get('hasPart', []):\n", " part_id = part if isinstance(part, str) else getattr(part, 'id', None)\n", " if not part_id:\n", " continue\n", " part_entity = crate.dereference(part_id)\n", " if not part_entity or \"File\" not in part_entity.type:\n", " continue\n", " file_url = part_entity.id\n", " if not (file_url.endswith(\".nc\") or file_url.endswith(\".zarr\")):\n", " continue\n", " print(f\"File part of the collection: {file_url}\")\n", " try:\n", " files.append(file_url)\n", " print(f\"\\n=== Dataset found and opened with xarray ({file_url}) ===\")\n", " except Exception as e:\n", " print(f\"Failed to open {file_url}: {e}\")" ] }, { "cell_type": "markdown", "id": "2b1fc97d-a741-4c0a-a728-455aa913ba3b", "metadata": {}, "source": [ "(Open)=\n", "## Access netCDF file from the dataset collection" ] }, { "cell_type": "code", "execution_count": 17, "id": "d2a86755-dec2-4d87-b585-0c5ca05e3ffd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset> Size: 4kB\n",
"Dimensions: (time: 2, lon: 17, lat: 24)\n",
"Coordinates:\n",
" * time (time) datetime64[ns] 16B 1980-07-01 2018-07-01\n",
" * lon (lon) float64 136B 2.0 2.25 2.5 2.75 3.0 ... 5.0 5.25 5.5 5.75 6.0\n",
" * lat (lat) float64 192B 44.62 44.88 45.12 45.38 ... 49.88 50.12 50.38\n",
"Data variables:\n",
" R (time, lat, lon) float32 3kB ...\n",
"Attributes:\n",
" CDI: Climate Data Interface version 2.4.4 (https://mpimet.mpg.de...\n",
" Conventions: CF-1.6\n",
" institution: European Centre for Medium-Range Weather Forecasts\n",
" history: Tue May 06 13:47:23 2025: cdo -f nc -t ecmwf copy tmpg.grib...\n",
" CDO: Climate Data Operators version 2.4.4 (https://mpimet.mpg.de...