{ "cells": [ { "cell_type": "markdown", "id": "7e8c941f-784b-46a4-b596-bd0ee3c140a4", "metadata": {}, "source": [ "# Create RO-Crate from RiOMar dataset\n", "\n", "\n", "## Context\n", "\n", "### Purpose\n", "\n", "This notebook shows how to create an RO-Crate for a dataset using the `rocrate` Python library. It is a simple example that uses no specific RO-Crate profile and follows the RO-Crate 1.1 specification.\n", "\n", "Packaging a dataset as an RO-Crate has several benefits:\n", "\n", "- **Standardized Metadata Packaging**: RO-Crates provide a standardized way to bundle datasets with rich metadata, making it easier to understand, share, and reuse the data.\n", "- **Enhanced FAIRness**: By including machine-readable metadata, RO-Crates improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the dataset.\n", "- **Improved Discoverability**: Metadata in an RO-Crate allows datasets to be easily indexed and discovered through search engines and data repositories.\n", "- **Documentation and Provenance**: RO-Crates document essential information about the dataset, such as its source, authorship, and creation process, ensuring transparency and traceability.\n", "- **Facilitates Integration**: The structured metadata makes it easier to integrate the dataset with other tools, workflows, or datasets, enhancing its usability.\n", "- **Compliance with Standards**: Many funding agencies and journals now require datasets to be published with detailed metadata. RO-Crates align with these expectations and promote best practices in data management.\n", "\n", "\n", "### Description\n", "\n", "In this notebook, we will learn how to create a simple RO-Crate from the RiOMar data. 
We will then identify any missing metadata that needs to be added to the original dataset's metadata.\n", "\n", "## Contributions\n", "\n", "### Notebook\n", "\n", "- Anne Fouilloux (author), Simula Research Laboratory (Norway), @annefou\n", "- XX (reviewer)\n", "\n", "## Bibliography and other interesting resources\n", "\n", "- [rocrate](https://pypi.org/project/rocrate/) Python package\n", "- [Research Object documentation](https://www.researchobject.org)" ] }, { "cell_type": "markdown", "id": "2b875a8b-74b1-4e5e-9448-7045d0494358", "metadata": {}, "source": [ "## Install and Import libraries" ] }, { "cell_type": "code", "execution_count": 42, "id": "0b1dff75-a254-4eae-9d03-fc4b01838052", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: rocrate in /srv/conda/envs/notebook/lib/python3.12/site-packages (0.13.0)\n", "Collecting rocrateValidator\n", " Downloading rocrateValidator-0.2.15-py3-none-any.whl.metadata (228 bytes)\n", "Requirement already satisfied: requests in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.32.3)\n", "Requirement already satisfied: arcp==0.2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (0.2.1)\n", "Requirement already satisfied: jinja2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (3.1.4)\n", "Requirement already satisfied: python-dateutil in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (2.8.2)\n", "Requirement already satisfied: click in /srv/conda/envs/notebook/lib/python3.12/site-packages (from rocrate) (8.1.7)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from jinja2->rocrate) (2.1.5)\n", "Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil->rocrate) (1.16.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in 
/srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (1.26.19)\n", "Requirement already satisfied: certifi>=2017.4.17 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests->rocrate) (2024.7.4)\n", "Downloading rocrateValidator-0.2.15-py3-none-any.whl (11 kB)\n", "Installing collected packages: rocrateValidator\n", "Successfully installed rocrateValidator-0.2.15\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install rocrate rocrateValidator" ] }, { "cell_type": "code", "execution_count": 2, "id": "982e2483-b0b5-4fe7-b9e4-47be7fcb83f0", "metadata": {}, "outputs": [], "source": [ "import requests\n", "import json\n", "from rocrate.rocrate import ROCrate\n", "from rocrate.model.person import Person\n", "import pandas as pd\n", "from datetime import datetime\n", "import geopandas\n", "import shapely\n", "import xarray as xr\n", "import numpy as np\n", "import s3fs" ] }, { "cell_type": "markdown", "id": "0631c4e0-0e5c-4eb2-a850-44b49fa0c084", "metadata": {}, "source": [ "## Open RiOMar data to get metadata" ] }, { "cell_type": "code", "execution_count": 3, "id": "26d6ed4e-49bb-40c6-ae00-0b7009c7be31", "metadata": {}, "outputs": [], "source": [ "url_data = \"https://data-fair2adapt.ifremer.fr/riomar/small.zarr\"" ] }, { "cell_type": "code", "execution_count": 4, "id": "5b48eb64-383b-4eae-b3dd-812eb5c469c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 498MB\n",
       "Dimensions:       (y_rho: 838, x_rho: 727, s_rho: 40, time_counter: 5)\n",
       "Coordinates:\n",
       "    nav_lat_rho   (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>\n",
       "    nav_lon_rho   (y_rho, x_rho) float64 5MB dask.array<chunksize=(838, 727), meta=np.ndarray>\n",
       "  * s_rho         (s_rho) float32 160B -0.9875 -0.9625 ... -0.0375 -0.0125\n",
       "  * time_counter  (time_counter) datetime64[ns] 40B 2004-01-01T00:58:30 ... 2...\n",
       "    time_instant  (time_counter) datetime64[ns] 40B dask.array<chunksize=(1,), meta=np.ndarray>\n",
       "Dimensions without coordinates: y_rho, x_rho\n",
       "Data variables:\n",
       "    ocean_mask    (y_rho, x_rho) bool 609kB dask.array<chunksize=(838, 727), meta=np.ndarray>\n",
       "    temp          (time_counter, s_rho, y_rho, x_rho) float32 487MB dask.array<chunksize=(1, 40, 838, 727), meta=np.ndarray>\n",
       "Attributes: (12/45)\n",
       "    CPP-options:    REGIONAL GAMAR MPI TIDES OBC_WEST OBC_NORTH XIOS USE_CALE...\n",
       "    Conventions:    CF-1.6\n",
       "    Cs_r:           have a look at variable Cs_r in this file\n",
       "    Cs_w:           have a look at variable Cs_w in this file\n",
       "    SRCS:           main.F step.F read_inp.F timers_roms.F init_scalars.F ini...\n",
       "    Tcline:         15.0\n",
       "    ...             ...\n",
       "    title:          GAMAR_GLORYS\n",
       "    tnu4_expl:      biharmonic mixing coefficient for tracers\n",
       "    units:          meter4 second-1\n",
       "    uuid:           06f6b784-fcc0-4422-aceb-17da2a5aa9fa\n",
       "    v_sponge:       0.0\n",
       "    x_sponge:       0.0
" ], "text/plain": [ " Size: 498MB\n", "Dimensions: (y_rho: 838, x_rho: 727, s_rho: 40, time_counter: 5)\n", "Coordinates:\n", " nav_lat_rho (y_rho, x_rho) float64 5MB dask.array\n", " nav_lon_rho (y_rho, x_rho) float64 5MB dask.array\n", " * s_rho (s_rho) float32 160B -0.9875 -0.9625 ... -0.0375 -0.0125\n", " * time_counter (time_counter) datetime64[ns] 40B 2004-01-01T00:58:30 ... 2...\n", " time_instant (time_counter) datetime64[ns] 40B dask.array\n", "Dimensions without coordinates: y_rho, x_rho\n", "Data variables:\n", " ocean_mask (y_rho, x_rho) bool 609kB dask.array\n", " temp (time_counter, s_rho, y_rho, x_rho) float32 487MB dask.array\n", "Attributes: (12/45)\n", " CPP-options: REGIONAL GAMAR MPI TIDES OBC_WEST OBC_NORTH XIOS USE_CALE...\n", " Conventions: CF-1.6\n", " Cs_r: have a look at variable Cs_r in this file\n", " Cs_w: have a look at variable Cs_w in this file\n", " SRCS: main.F step.F read_inp.F timers_roms.F init_scalars.F ini...\n", " Tcline: 15.0\n", " ... ...\n", " title: GAMAR_GLORYS\n", " tnu4_expl: biharmonic mixing coefficient for tracers\n", " units: meter4 second-1\n", " uuid: 06f6b784-fcc0-4422-aceb-17da2a5aa9fa\n", " v_sponge: 0.0\n", " x_sponge: 0.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = xr.open_zarr(url_data)\n", "ds" ] }, { "cell_type": "markdown", "id": "ff1c2632-c124-41e2-acb1-e70d8b8ffd31", "metadata": {}, "source": [ "## Get metadata from RiOMAR\n", "\n", "### Get the title\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "5055ddd0-0dbf-40fb-9c13-7353c833f4c2", "metadata": {}, "outputs": [], "source": [ "title = ds.attrs[\"name\"]" ] }, { "cell_type": "markdown", "id": "215f4d4e-1b33-4ea4-a662-73f8287ccf3c", "metadata": {}, "source": [ "### Need to have better description available in the metadata. 
It could be constructed from the existing attributes if they were better structured" ] }, { "cell_type": "code", "execution_count": 6, "id": "1c3d0764-eecd-4258-a5b0-2e6e80a895ac", "metadata": {}, "outputs": [], "source": [ "description = \"RiOMar dataset \" + title" ] }, { "cell_type": "markdown", "id": "99e3f949-2476-4931-a244-12e3a35522b5", "metadata": {}, "source": [ "### Get bounding box in WKT\n", "- Latitude values of -1 denote missing data and are masked as NaN before computing the minimum" ] }, { "cell_type": "code", "execution_count": 7, "id": "010ca1f3-d450-4b7b-af96-690266b1386f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43.285 50.867471190931404 -8.0 1.6800000000000015\n" ] } ], "source": [ "minlat = ds.nav_lat_rho.where(ds.nav_lat_rho > -1, np.nan).min().values\n", "maxlat = ds.nav_lat_rho.max().values\n", "minlon = ds.nav_lon_rho.min().values\n", "maxlon = ds.nav_lon_rho.max().values\n", "print(minlat, maxlat, minlon, maxlon)" ] }, { "cell_type": "code", "execution_count": 8, "id": "0cbb4e28-be31-4626-9f68-06889c040d3d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'POLYGON ((1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285))'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "geometry_wkt = shapely.geometry.box(minlon, minlat, maxlon, maxlat).wkt\n", "geometry_wkt" ] }, { "cell_type": "markdown", "id": "2edeba4f-ed1b-4369-af5b-e01d178ff2c2", "metadata": {}, "source": [ "- Time range" ] }, { "cell_type": "code", "execution_count": 9, "id": "abbfaf9b-376a-4231-b001-d6fb979cf55d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('2004.01.01', '2004.01.01')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts = pd.to_datetime(str(ds.time_counter.min().values))\n", "te = pd.to_datetime(str(ds.time_counter.max().values))\n", "date_start = ts.strftime('%Y.%m.%d')\n", "date_end = 
te.strftime('%Y.%m.%d')\n", "date_start, date_end" ] }, { "cell_type": "markdown", "id": "0a99c5ba-3309-4389-bb1a-f6e0cc10bc15", "metadata": {}, "source": [ "- Creation date (we assume `timeStamp` contains this information; to be confirmed)" ] }, { "cell_type": "code", "execution_count": 10, "id": "45e09520-f431-490a-b164-dca23f1347a2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2024-Apr-01 10:49:18 GMT'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dateCreated = ds.attrs[\"timeStamp\"]\n", "dateCreated" ] }, { "cell_type": "code", "execution_count": 11, "id": "b382efec-d9c2-4e4e-a57a-0183487f311a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Today's date: 2025.01.19\n" ] } ], "source": [ "from datetime import date\n", "\n", "today = date.today().strftime('%Y.%m.%d')\n", "print(\"Today's date:\", today)\n", "\n", "sdDatePublished = today # could be the date corresponding to the creation of the DOI (publishing)\n", "dateModified = today # could be the date of creation of the DGGS regridded data; it would need to be added to the Zarr attributes when regridding" ] }, { "cell_type": "markdown", "id": "373bb143-72f0-49d9-9e79-be8271eca7f6", "metadata": {}, "source": [ "### Get the size of the dataset\n", "- This information can usually be obtained from the metadata (it needs to be added here)" ] }, { "cell_type": "code", "execution_count": 12, "id": "7d41e5e2-5805-4314-8d61-dec91cfa99cc", "metadata": {}, "outputs": [], "source": [ "contentSize = 0 # We need to get the total size in bytes" ] }, { "cell_type": "markdown", "id": "52c4412b-f9fb-452e-b198-805c7efb35ec", "metadata": {}, "source": [ "### Get the persistent identifier\n", "- Dataset should have a persistent identifier e.g. 
DOI (currently it does not have one)\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "24b37254-2060-4afa-86fa-d0a4e08c079f", "metadata": {}, "outputs": [], "source": [ "doi_data = \"NONE\" # placeholder: the dataset has no DOI yet, which is a problem" ] }, { "cell_type": "markdown", "id": "e1aa2fc7-9b6c-4344-82a4-01bbb52b1b16", "metadata": {}, "source": [ "### StudySubject and keywords\n", "\n", "- The study subject is a URL from the INSPIRE topic category code list; keywords are free-text tags." ] }, { "cell_type": "code", "execution_count": 14, "id": "c4084757-8620-40f5-9bc2-9f71ba809662", "metadata": {}, "outputs": [], "source": [ "studySubject_urls = [ \"http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment\"]\n", "keywords = [\"riomar\", \"croco\"]" ] }, { "cell_type": "markdown", "id": "9b96b57b-74df-474d-9215-3c21dfa8bd61", "metadata": {}, "source": [ "### Version of the dataset" ] }, { "cell_type": "code", "execution_count": 15, "id": "5c1ac837-3d31-4bc5-854c-69bcd81443ce", "metadata": {}, "outputs": [], "source": [ "version_data = \"1.0\"" ] }, { "cell_type": "markdown", "id": "adcdefad-1ede-452d-bbf5-3c6e7c713473", "metadata": {}, "source": [ "### Prepare information for the provenance" ] }, { "cell_type": "code", "execution_count": 16, "id": "abf88a98-3c85-4c88-b39f-55a8ff7ca356", "metadata": {}, "outputs": [], "source": [ "prov = {\n", " \"@id\": \"https://doi.org/10.5281/zenodo.13898339\",\n", " \"@type\": \"SoftwareApplication\",\n", " \"url\": \"https://www.croco-ocean.org\",\n", " \"name\": \"CROCO, Coastal and Regional Ocean COmmunity\",\n", " \"version\": \"CROCO GAMA model v2.0.1 https://doi.org/10.5281/zenodo.13898339\"\n", "}" ] }, { "cell_type": "markdown", "id": "7b452d92-85ce-47ff-b1fc-3acee12d7f5e", "metadata": {}, "source": [ "## Create a new RO-Crate" ] }, { "cell_type": "code", "execution_count": 17, "id": "875d831c-08d4-4435-8349-d904570bf2d1", "metadata": {}, "outputs": [], "source": [ "crate = ROCrate()" ] }, { "cell_type": "markdown", "id": "8e3a0d3f-b0ce-44ab-b858-a392c4c65f01", "metadata": {}, "source": [ "## Add the 
license for the RO-Crate\n", "\n", "- The license of the Research Object (RO-Crate) may not be the same as the licenses of the data bundled in the RO-Crate.\n", "- Our RO-Crate is open and distributed under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.\n", "- The value of the license must be a URL (here `https://creativecommons.org/licenses/by/4.0/`)." ] }, { "cell_type": "code", "execution_count": 18, "id": "deb96d5a-c7b2-4148-8b46-e3757ce46889", "metadata": {}, "outputs": [], "source": [ "RO_license_id = \"CC-BY-4.0\"\n", "RO_license_url = \"https://creativecommons.org/licenses/by/4.0/\"\n", "RO_license_title = \"Creative Commons Attribution 4.0\"" ] }, { "cell_type": "markdown", "id": "11cd5d1a-0457-47a8-9f7b-9e6d7d578d41", "metadata": {}, "source": [ "### Add the selected license to the RO-Crate" ] }, { "cell_type": "code", "execution_count": 19, "id": "2450c7f7-427f-4364-a80a-bb81e644480c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld(\n", "{\n", " \"@id\": \"./\",\n", " \"license\": { \"@id\": RO_license_url},\n", "})\n", "license = {\n", " \"@id\": RO_license_url,\n", " \"@type\": \"CreativeWork\",\n", " \"name\": RO_license_id,\n", " \"description\": RO_license_title,\n", " }\n", "crate.add_jsonld(license)" ] }, { "cell_type": "markdown", "id": "e2bcc825-d82b-49e0-a010-b3759cef049e", "metadata": {}, "source": [ "## Add creators and their Organizations\n", "\n", "- Add here the list of creators of the RO-Crate.\n", "- You can go to `https://ror.org` and search for the organisation you would like to add. In this notebook, we create this information \"manually\", but it can be better streamlined in the future (for instance using [RoHub](https://rohub.org)).\n", "- If there are several authors, add each of them to the RO-Crate following the same approach." 
] }, { "cell_type": "markdown", "id": "01712a6c-18ee-4c51-bdef-11255384f82a", "metadata": {}, "source": [ "### Add Persons and organisations" ] }, { "cell_type": "code", "execution_count": 20, "id": "dadd5404-eca0-4ed1-80d8-716b7c8942ab", "metadata": {}, "outputs": [], "source": [ "list_authors = []" ] }, { "cell_type": "code", "execution_count": 21, "id": "8e97d385-197b-4fa0-a2a2-ba9cd2ac597b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'id': 'https://orcid.org/0000-0002-1784-2920',\n", " 'email': 'annef@simula.no',\n", " 'givenName': 'Anne',\n", " 'familyName': 'Fouilloux',\n", " 'affiliation': {'@id': 'https://ror.org/00vn06n10'}}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "organisation_1 = {\n", " \"name\": \"Simula Research Laboratory\",\n", " \"id\": \"https://ror.org/00vn06n10\",\n", " \"url\" : \"https://www.simula.no\"\n", "}\n", "creator_1 = {\n", " \"id\": \"https://orcid.org/0000-0002-1784-2920\", # The id is the ORCID of the author\n", " \"email\": \"annef@simula.no\",\n", " \"givenName\": \"Anne\", \n", " \"familyName\": \"Fouilloux\", \n", " \"affiliation\": {\"@id\": organisation_1[\"id\"]}\n", " \n", "}\n", "creator_1" ] }, { "cell_type": "code", "execution_count": 22, "id": "51f17d61-d636-40b6-bb96-09b0b19e2d3a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'id': 'https://orcid.org/0000-0002-1500-0156',\n", " 'email': 'tina.odaka@ifremer.fr',\n", " 'givenName': 'Tina Erica',\n", " 'familyName': 'Odaka',\n", " 'affiliation': {'@id': 'https://ror.org/044jxhp58'}}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "organisation_2 = {\n", " \"name\": \"Ifremer\",\n", " \"id\": \"https://ror.org/044jxhp58\",\n", " \"url\" : \"https://www.ifremer.fr\"\n", "}\n", "creator_2 = {\n", " \"id\": \"https://orcid.org/0000-0002-1500-0156\", # The id is the ORCID of the author\n", " \"email\": \"tina.odaka@ifremer.fr\",\n", " \"givenName\": 
\"Tina Erica\", \n", " \"familyName\": \"Odaka\", \n", " \"affiliation\": {\"@id\": organisation_2[\"id\"]}\n", " \n", "}\n", "creator_2" ] }, { "cell_type": "code", "execution_count": 23, "id": "b6110c53-0465-4a4d-a329-47de54478ec8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['https://orcid.org/0000-0002-1784-2920',\n", " 'https://orcid.org/0000-0002-1500-0156']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_orcids = [ creator_1[\"id\"], creator_2[\"id\"]]\n", "list_orcids" ] }, { "cell_type": "markdown", "id": "af2cfa81-1bbb-4dd2-bd9b-5f4b3cfe1ed3", "metadata": {}, "source": [ "### Adding all the authors" ] }, { "cell_type": "code", "execution_count": 24, "id": "fbadc428-861d-4dc5-b44d-18c8112dd125", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Anne Fouilloux', 'Tina Erica Odaka']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_authors.append(creator_1['givenName'] + \" \" + creator_1['familyName'])\n", "list_authors.append(creator_2['givenName'] + \" \" + creator_2['familyName'])\n", "list_authors" ] }, { "cell_type": "markdown", "id": "31357bad-6c74-46c9-9733-16587eca470a", "metadata": {}, "source": [ "Add the 2 creators as Person in the RO-Crate" ] }, { "cell_type": "code", "execution_count": 25, "id": "da2383de-f5be-44c4-9dc4-438fa327be03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.add(Person(crate, creator_1.pop(\"id\"), properties=creator_1))\n", "crate.add(Person(crate, creator_2.pop(\"id\"), properties=creator_2))" ] }, { "cell_type": "markdown", "id": "5394d0fc-e5a9-4840-a921-f384bcf6b045", "metadata": {}, "source": [ "Add the list of authors in the RO-Crate" ] }, { "cell_type": "code", "execution_count": 26, "id": "811cb618-bdb0-4e81-9c80-ebc1748de220", "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "<./ Dataset>" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld({\n", "    \"@id\": \"./\",\n", "    # Reference the Person entities added above by their @id (ORCID)\n", "    \"author\": [{\"@id\": orcid} for orcid in list_orcids],\n", "})" ] }, { "cell_type": "markdown", "id": "8b0497c1-ac3c-4035-a544-2407e027a54e", "metadata": {}, "source": [ "### Add information about data bundled in the RO-Crate" ] }, { "cell_type": "markdown", "id": "c6b779aa-b8d5-4360-a627-06d6fe5a076c", "metadata": {}, "source": [ "#### Prepare Temporal coverage if available" ] }, { "cell_type": "code", "execution_count": 27, "id": "d7e99b8c-f027-47ed-b2bf-429d0a0a8e4b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2004.01.01/2004.01.01'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temporal_coverage = date_start + \"/\" + date_end\n", "temporal_coverage" ] }, { "cell_type": "markdown", "id": "94595220-ffad-4cf1-b1bb-3081522342c2", "metadata": {}, "source": [ "### Prepare Spatial coverage if available" ] }, { "cell_type": "code", "execution_count": 28, "id": "71378cb7-8fc1-45ec-ac2b-4c3cd715544b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'@type': 'GeoShape',\n", " '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285',\n", " 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_geoshape(geometry):\n", "    # We assume wkt geometry\n", "    geo = shapely.wkt.loads(geometry)\n", "    if hasattr(geo, 'geoms'):\n", "        # take the first one\n", "        geo = geo.geoms[0]\n", "    geo = geo.wkt.replace(\"POLYGON\", \"\").replace(\"(\",\"\").replace(\")\",\"\").strip()\n", "    geolocation = { \"@type\": \"GeoShape\", \"@id\": geo, \"polygon\": geo}\n", "    return geolocation\n", "\n", "\n", "geolocation = 
get_geoshape(geometry_wkt)\n", "geolocation" ] }, { "cell_type": "markdown", "id": "a45c7795-416a-44e3-bc50-29e0dc063dab", "metadata": {}, "source": [ "### Go through each dataset and add it to the RO-Crate\n", "- In this example, we add only one dataset" ] }, { "cell_type": "code", "execution_count": 29, "id": "19167758-0e7f-4b56-8737-ad570bc212b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "properties = {'modified_date': '2025.01.19', 'name': 'https://data-fair2adapt.ifremer.fr/riomar/small.zarr', 'location': {'@type': 'GeoShape', '@id': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285', 'polygon': '1.6800000000000015 43.285, 1.6800000000000015 50.867471190931404, -8 50.867471190931404, -8 43.285, 1.6800000000000015 43.285'}, 'temporalCoverage': '2004.01.01/2004.01.01', 'sdDatePublished': '2025.01.19', 'dateCreated': '2024-Apr-01 10:49:18 GMT', 'dateModified': '2025.01.19', 'encodingFormat': ' text/html; charset=us-ascii '}\n" ] } ], "source": [ "properties = {\n", "    \"modified_date\": dateModified,\n", "    \"name\": url_data,\n", "    \"location\": geolocation,\n", "    \"temporalCoverage\": temporal_coverage,\n", "    \"sdDatePublished\": sdDatePublished,\n", "    \"dateCreated\": dateCreated,\n", "    \"dateModified\": dateModified, # could be the date of creation of the DGGS regridded data\n", "    # \"contentSize\": contentSize, # TBC: the total size in bytes is not available yet\n", "    \"encodingFormat\": ' text/html; charset=us-ascii '\n", "}\n", "\n", "print(\"properties = \", properties)\n", "\n", "resource = crate.add_file(url_data, fetch_remote = False, properties=properties)" ] }, { "cell_type": "markdown", "id": "dc365352-e78d-4f9a-85fd-b1f093b5f5b2", "metadata": {}, "source": [ "## Add metadata to the RO-Crate\n", "\n", "### Add the title and description" ] }, { "cell_type": "code", "execution_count": 30, "id": "e0b68bbc-23a8-477f-9d86-646ac8cc1220", "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"<./ Dataset>" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld({\n", " \"@id\": \"./\",\n", " \"description\": description,\n", " \"title\": title,\n", " \"name\": title,\n", "})" ] }, { "cell_type": "markdown", "id": "14fb8d03-43ef-468c-8b0e-180fa2b932b4", "metadata": {}, "source": [ "### Add the publisher and creator" ] }, { "cell_type": "code", "execution_count": 31, "id": "78144e35-e155-4da8-8c80-37529a667bc8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<./ Dataset>" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "publisher_name = \"Sigma2 AS\"\n", "publisher_url = \"https://www.wikidata.org/wiki/Q12008197\"\n", "publisher = {\n", " \"@id\": publisher_url,\n", " \"@type\": \"Organization\",\n", " \"name\": publisher_name,\n", " \"url\": publisher_url\n", " }\n", "crate.add_jsonld(publisher)\n", "crate.update_jsonld(\n", "{\n", " \"@id\": \"./\",\n", " \"publisher\": { \"@id\": publisher_url },\n", "})" ] }, { "cell_type": "markdown", "id": "64e4f181-8916-44fc-bf5c-9cb44a438e5a", "metadata": {}, "source": [ "### Add the creator of the RO-Crate" ] }, { "cell_type": "code", "execution_count": 32, "id": "42a366f5-a8be-4879-94e6-1b552f0ad1ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld(\n", "{\n", " \"@id\": \"ro-crate-metadata.json\",\n", " \"creator\": { \"@id\": publisher_url },\n", "})" ] }, { "cell_type": "markdown", "id": "8a7a26ed-00c1-4ee5-b662-515a60e41166", "metadata": {}, "source": [ "### Add Publication date" ] }, { "cell_type": "code", "execution_count": 33, "id": "5dea564e-f752-49c4-9e7b-72dda72977c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<./ Dataset>" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_published = 
datetime.strptime(sdDatePublished, \"%Y.%m.%d\")\n", "\n", "crate.update_jsonld({\n", " \"@id\": \"./\",\n", " \"datePublished\": date_published.strftime(\"%Y-%m-%d\") ,\n", "})" ] }, { "cell_type": "markdown", "id": "bdf6caf3-1a5d-4670-af82-7942c2759d98", "metadata": {}, "source": [ "### Add citation" ] }, { "cell_type": "code", "execution_count": 34, "id": "ed4d2062-cf55-4fe9-a550-e621860855d5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<./ Dataset>" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doi = \"https://doi.org/\" + doi_data\n", "cite_as = \" and \".join(list_authors) + \", \" + title + \", \" + publisher_name + \", \" + date_published.strftime(\"%Y\") + \". \" + doi_data + \".\"\n", "\n", "crate.update_jsonld({\n", " \"@id\": \"./\",\n", " \"identifier\": doi_data,\n", " \"url\": doi_data,\n", " \"cite-as\": cite_as ,\n", "})\n" ] }, { "cell_type": "markdown", "id": "f34d561f-0692-4470-a282-1a7d96386da3", "metadata": {}, "source": [ "### Add studySubject, keywords, etc.\n", "\n", "The studySubject is from `http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/`.\n", "Go to the URL and select the studySubject that is most relevant for your data" ] }, { "cell_type": "code", "execution_count": 35, "id": "ab8e7036-78dd-4cb7-b1b1-380badc155f3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'@id': 'http://inspire.ec.europa.eu/metadata-codelist/TopicCategory/environment'}]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "study_subjects = []\n", "for subject_url in studySubject_urls:\n", " study_subjects.append({\n", " \"@id\": subject_url\n", " })\n", "study_subjects" ] }, { "cell_type": "code", "execution_count": 36, "id": "c4195737-95cf-4bc0-b9fa-ed75dac9ceed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'riomar, croco'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"keywords = \", \".join(keywords)\n", "keywords" ] }, { "cell_type": "code", "execution_count": 37, "id": "15fbb73a-bd36-4178-a09a-884007c83a78", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<./ Dataset>" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld({\n", " \"@id\": \"./\",\n", " \"about\": study_subjects,\n", " \"keywords\": keywords,\n", "})" ] }, { "cell_type": "markdown", "id": "ff473b0b-f643-49ce-bd7f-38bab4423034", "metadata": {}, "source": [ "### Add version" ] }, { "cell_type": "code", "execution_count": 38, "id": "8af264c8-20f8-44b0-b3de-ee21ad8f3ac8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<./ Dataset>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crate.update_jsonld({\n", " \"@id\": \"./\",\n", " \"version\": version_data,\n", "})" ] }, { "cell_type": "markdown", "id": "b7d19574-c98d-4ba2-8b97-205aee76630a", "metadata": {}, "source": [ "### Add Language" ] }, { "cell_type": "code", "execution_count": 39, "id": "c2a43b07-98a7-4c11-8978-43b4e145e3b2", "metadata": {}, "outputs": [], "source": [ "#crate.update_jsonld({\n", "# \"@id\": ,\n", "# \"@type\": \"Language\",\n", "#})" ] }, { "cell_type": "markdown", "id": "310d83cc-9b3a-4756-8844-90d4934e9068", "metadata": {}, "source": [ "## Write to disk" ] }, { "cell_type": "code", "execution_count": 40, "id": "5436ef67-3b3c-4cc9-9f75-803da492e984", "metadata": {}, "outputs": [], "source": [ "crate.write(\"ro-crate\")" ] }, { "cell_type": "code", "execution_count": 43, "id": "5d418015-8725-41de-8adb-954aec57f34c", "metadata": {}, "outputs": [], "source": [ "from rocrateValidator import validate as validate" ] }, { "cell_type": "code", "execution_count": 44, "id": "6a86fd4b-6482-4640-8844-425e81e5a07b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is an INVALID RO-Crate\n", "{\n", " \"File existence\": [\n", " true\n", " 
],\n", " \"File size\": [\n", " true\n", " ],\n", " \"Metadata file existence\": [\n", " true\n", " ],\n", " \"Json check\": [\n", " true\n", " ],\n", " \"Json-ld check\": [\n", " true\n", " ],\n", " \"File descriptor check\": [\n", " true\n", " ],\n", " \"Direct property check\": [\n", " true\n", " ],\n", " \"Referencing check\": [\n", " true\n", " ],\n", " \"Encoding check\": [\n", " true\n", " ],\n", " \"Web-based data entity check\": [\n", " false,\n", " \"Semantic Error: Invalid ID at https://data-fair2adapt.ifremer.fr/riomar/small.zarr. It should be a downloadable url\"\n", " ],\n", " \"Person entity check\": [\n", " true\n", " ],\n", " \"Organization entity check\": [\n", " true\n", " ],\n", " \"Contact information check\": [\n", " true\n", " ],\n", " \"Citation property check\": [\n", " true\n", " ],\n", " \"Publisher property check\": [\n", " true\n", " ],\n", " \"Funder property check\": [\n", " true\n", " ],\n", " \"Licensing property check\": [\n", " false,\n", " \"Semantic Error: Invalid ID Value at https://creativecommons.org/licenses/by/4.0/. 
It must be an URL.\"\n", " ],\n", " \"Places property check\": [\n", " true\n", " ],\n", " \"Time property check\": [\n", " true\n", " ],\n", " \"Scripts and workflow check\": [\n", " false,\n", " \"Semantic Error: Scripts and Workflow is Wrong\"\n", " ]\n", "}\n" ] } ], "source": [ "v = validate.validate(\"ro-crate\")\n", "v.validator()" ] }, { "cell_type": "code", "execution_count": null, "id": "640e656d-42e2-49be-b8b9-113b5b582845", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }